Workshop Description
The overall goal of this workshop is to bring researchers, academics, professionals, and policymakers under a single umbrella to innovate data engineering methods that make the best use of the limited data available in the medical domain. In our past two editions, we focused on addressing several key issues in data engineering, including:
- Generating task-specific biomedical synthetic data and augmenting it with real data.
- Developing principled methods to identify diverse and discriminative subsets of training examples to label for downstream tasks, minimizing annotation budget and time without compromising performance.
- Designing image-specific augmentation policies instead of relying on random policies. For instance, mirroring a chest X-ray about its vertical axis swaps the left and right lungs, introducing anatomically incorrect labels.
- Designing suitable pretext tasks for specific downstream medical tasks, learning from noisy data, and learning from distributed data without sharing raw content.
The contributions are compiled and published in the workshop proceedings. Some contributions from earlier editions, such as Khanal et al. (2023), proposed training models with self-supervised learning, followed by methods to learn from noisy labels. Similarly, Poudel et al. (2024) and Thrasher et al. (2024) proposed task-aware active learning methods to sample the most informative unlabeled data, reducing the number of training examples required by 50%. Contributions from Dener et al. (2023), Reyes-Amezcua et al. (2024), and Rau et al. (2023) focused on curating both task-aware and task-unaware synthetic data and on addressing biases in synthetic data. Babu et al. (2024) and Pokhrel et al. (2024) highlighted biases in data augmentation and its use in out-of-distribution detection, respectively.
These contributions have developed innovative and principled methods that integrate different aspects of data engineering to maximize the benefits of available data, which is central to this workshop's theme. As Ilya Sutskever remarked in his NeurIPS 2024 talk, "Computers are advancing with better hardware, algorithms, and clusters." However, data, the "fossil fuel" of AI, has reached its growth limit. His statement underscores the growing importance of AI tools and techniques for effectively utilizing limited data.
Data-driven deep learning architectures such as UNet, VGGNet, ResNet, V-Net, DenseNet, and Vision Transformers are widely used in downstream tasks such as detection, classification, and 3D reconstruction. These architectures require large volumes of annotated data to train their millions of parameters, and such data are difficult and expensive to collect and annotate. In medical image analysis, standard sensors are often unsuitable for in vivo use. When they are suitable, data collection requires patient consent and lengthy acquisition procedures, and the large inter-rater variability in medical image analysis makes it even harder to obtain high-quality labels.
To address these challenges, data engineering methods such as geometric transformations (e.g., rotation, flipping, cropping), MixUp, and Cutout, though limited to a single modality, have been introduced over the past decade to expand training datasets. Such contributions remain far less frequent than architectural engineering, yet they have proven effective in improving model generalization.
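To make concrete how lightweight such augmentations can be, the following is a minimal MixUp sketch for a batch of images with one-hot labels; the function name, array shapes, and Beta parameter alpha are illustrative assumptions rather than a reference implementation from any cited work.

```python
import numpy as np

def mixup_batch(images, labels, alpha=0.4, rng=None):
    """Minimal MixUp sketch: convexly combine a batch with a shuffled copy of itself.

    images: (N, H, W, C) float array; labels: (N, K) one-hot float array.
    alpha controls the Beta distribution; values around 0.2-0.4 are common choices.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)             # mixing coefficient in (0, 1)
    perm = rng.permutation(len(images))      # pair each sample with a random partner
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_images, mixed_labels
```

In practice, such mixing is applied on the fly to each training batch, so the stored dataset never grows; it is the effective training distribution that expands.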
In recent years, there has been a growing trend in leveraging multimodal data. Large Language Models, Vision-Language Models, and multimodal generative models have been used to synthesize multimodal content, expanding training datasets. Despite the ability to generate large volumes of synthetic data, the fidelity and quality of this data are often insufficient, leading to either modest improvements or even deterioration in model generalization. Existing data engineering methods are predominantly designed for unimodal datasets, emphasizing the need to extend them to handle multimodal data effectively.
Workshop Themes
Data Augmentation in the Medical Domain: This subtheme covers data augmentation through geometric transformations, simulated data from phantoms and generative models in the medical domain, large language models, and multimodal data. It also investigates methods for designing application-aware data augmentation policies.
Active Learning and Active Synthesis: This subtheme focuses on methods for identifying the most discriminative and diverse subsets of unlabeled unimodal and multimodal data to train models for various clinical applications (a minimal selection sketch is given after the list of themes). Active synthesis involves generating synthetic data relevant to the target application.
Self-Supervised Learning: This subtheme explores methods for designing application-specific pretext tasks for pre-training models in a self-supervised manner. Generic pretext tasks are often suboptimal for downstream tasks, highlighting the need for tailored approaches for both unimodal and multimodal datasets.
Datasets and Benchmarking for Data Engineering: This subtheme explores datasets and benchmarks specifically designed for developing, assessing, and validating data engineering methods in both unimodal and multimodal setups, including the validation of newly generated samples. Standard metrics are known to be suboptimal, and expert ratings are highly subjective and dependent on the medical application.
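As a concrete and deliberately simple illustration of the selection step referenced under Active Learning and Active Synthesis, the sketch below ranks unlabeled samples by predictive entropy and keeps the most uncertain ones for annotation; the function name, probs array, and budget parameter are illustrative assumptions, and task-aware or diversity-aware criteria such as those in the cited contributions go well beyond this baseline.

```python
import numpy as np

def select_most_uncertain(probs, budget):
    """Minimal uncertainty-sampling sketch for active learning.

    probs: (N, K) array of softmax outputs from the current model on the unlabeled pool.
    budget: number of samples to send for annotation.
    Returns indices of the highest-entropy samples, most uncertain first.
    """
    eps = 1e-12                                             # avoid log(0)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)  # predictive entropy per sample
    return np.argsort(entropy)[::-1][:budget]               # top-budget by uncertainty
```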
Imaging Themes (not limited to):
Optical imaging, Endoscopy, OCT, Histopathology, Hyperspectral imaging, Opto-acoustics, Fundus imaging, CT, PET, MRI, X-ray, Ultrasound, New imaging biomarkers, Multimodal imaging, Synthetic data of various imaging types, Other imaging
Clinical Applications (not limited to):
Surgical data science, Classification, detection, and diagnosis in medical image analysis, Organ/instrument/lesion segmentation, Image registration and data fusion, Image reconstruction, Prognosis and prediction, Tissue characterisation, Biology image analysis
Organs (not limited to):
Brain, Head and neck, Liver, Gastrointestinal tract diseases, Lungs
Keynote speakers
Coming Soon
Important Dates
Coming Soon
Submission
Coming Soon
Proceedings
Accepted papers will be published in LNCS as a separate DEMI 2025 (MICCAI Workshop) proceedings (TBD).
We are seeking additional academic/industrial sponsorships. Please contact us for more details: demiworkshop23@gmail.com
Organising Committee
Binod Bhattarai
University of Aberdeen, UK
Anita Rau
Stanford University, USA
Razvan Caramalau
Digital Technologies, Medtronic, UK
Annika Reinke
German Cancer Research Center (DKFZ), Germany
Anh Nguyen
University of Liverpool, UK
Ana Namburete
University of Oxford, UK
Prashnna Gyawali
West Virginia University, USA
Danail Stoyanov
University College London, UK
Technical Program Committee
Web and Publicity Chair
Sponsors
Past Iterations