Collaborative Learning with Medical Data: Making High-Stakes Decisions without Information Leakage

Project summary

Several studies underscore the potential of machine learning to identify diagnostic and prognostic biomarkers in medical records. However, the large, diverse, labelled and balanced datasets required for training are difficult to assemble in medicine and can rarely be found within an individual institution. Multi-institutional collaborations based on centrally shared patient data face privacy and ownership challenges. Collaborative learning is a novel paradigm for data-private multi-institutional collaboration. This project, undertaken with industry partner Royal Brisbane & Women’s Hospital (RBWH), aims to investigate privacy-preserving collaborative weakly supervised learning techniques in the medical domain, addressing three problems: data privacy preservation, class imbalance, and the scarcity of labelled data.

Project description

Rationale

Deep learning models show promising results in medical diagnosis, annotation, and treatment recommendation tasks, but to be broadly effective they rely on substantial amounts of diverse, high-quality labelled data. A recent study by Zech et al. [1] argues that most current deep learning models overfit to subtle institutional data biases, so their performance deteriorates on data unlike those seen in training. Such models may achieve good accuracy when tested against data from a single institution but do not generalize well to multi-centre settings.

The current paradigm for multi-centre collaboration in the medical domain requires patient data to be shared in a centralized location for model training. However, this setting does not scale well to large numbers of collaborators, especially in inter-state configurations, because of security, privacy, regulatory, and data ownership concerns. Collaborative learning is a data-private learning method in which the features of the same sample are distributed among multiple data owners. Collaboration among the owners can improve model accuracy because each owner benefits from the additional features held by the others. Ideally, the collaboration should expose neither the dataset of any individual owner nor the model parameters trained on it.

Although collaborative learning can address information leakage, other factors still hinder model performance. In the medical domain these are the imbalanced nature of the data and the high cost of labelling it. Class imbalance arises in classification problems when the number of instances per class is far from uniform. This project aims to investigate privacy-preserving collaborative weakly supervised learning techniques in the medical domain, together with our industry partner RBWH, to address data privacy preservation, class imbalance, and the scarcity of labelled data.
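To make the notion of class imbalance concrete, the following minimal sketch measures how skewed a label set is and derives per-class weights that a training loss could use to compensate. The function names and the toy label counts are illustrative assumptions, not project data, and inverse-frequency weighting is only one common heuristic rather than the method this project commits to.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive proportionally larger weights, so that every
    class contributes equally to a weighted loss in aggregate.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical label set: 90 negative records and 10 positive records.
labels = [0] * 90 + [1] * 10
print(imbalance_ratio(labels))
print(inverse_frequency_weights(labels))
```

With these weights, each positive record counts nine times as much as each negative record, which offsets the 9:1 skew in the toy label set.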

Method

We will address the above-mentioned problems through the following tasks:

Task 1: Weakly supervised learning - a small set of high-quality data will be annotated using domain expert knowledge, and a weakly supervised learning method will be developed that copes with class-imbalanced medical data for which few labels are available.
Task 2: Collaborative learning - a collaborative learning framework will be developed that uses the weakly supervised learning method from Task 1 to produce a privacy-preserving learning model by leveraging RBWH's partner network.
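As a rough illustration of the collaborative setting targeted in Task 2, the sketch below trains a logistic regression model over features that are split between two data owners. Each owner keeps its feature columns local and shares only per-sample partial scores with a coordinator, which returns the prediction residuals. Everything here is an assumption made for illustration: the `Owner` class, the `train` and `predict` functions, the toy data, and the class weights are hypothetical and are not the framework the project will develop. Exchanging scores and residuals in plain text can itself leak information, so a real system would need additional protection such as encryption or secure aggregation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class Owner:
    """Hypothetical data owner holding a private block of feature columns."""

    def __init__(self, features):
        self._features = features  # one row per shared sample; stays local
        self.weights = [0.0] * len(features[0])

    def partial_scores(self):
        # Only these per-sample scalars are shared, not the feature values.
        return [sum(w * x for w, x in zip(self.weights, row))
                for row in self._features]

    def update(self, residuals, lr):
        # Local gradient step using the residuals sent by the coordinator.
        n = len(residuals)
        for j in range(len(self.weights)):
            grad = sum(r * row[j] for r, row in zip(residuals, self._features)) / n
            self.weights[j] -= lr * grad


def predict(owners, bias):
    """Combine the owners' partial scores into per-sample probabilities."""
    parts = [owner.partial_scores() for owner in owners]
    return [sigmoid(bias + sum(p[i] for p in parts))
            for i in range(len(parts[0]))]


def train(owners, labels, class_weights=None, epochs=1000, lr=0.5):
    """Coordinator loop for binary labels (0 or 1); returns the learned bias."""
    class_weights = class_weights or {0: 1.0, 1: 1.0}
    bias = 0.0
    for _ in range(epochs):
        probs = predict(owners, bias)
        # Gradient of the class-weighted logistic loss w.r.t. each score.
        residuals = [class_weights[y] * (p - y) for p, y in zip(probs, labels)]
        bias -= lr * sum(residuals) / len(residuals)
        for owner in owners:
            owner.update(residuals, lr)
    return bias


# Toy data: two owners each hold one feature for the same eight samples.
owner_a = Owner([[0.0], [0.2], [0.1], [0.3], [0.2], [0.0], [0.9], [1.0]])
owner_b = Owner([[0.1], [0.0], [0.3], [0.2], [0.2], [0.4], [0.8], [0.9]])
labels = [0, 0, 0, 0, 0, 0, 1, 1]  # imbalanced: six negatives, two positives
bias = train([owner_a, owner_b], labels, class_weights={0: 8 / 12, 1: 8 / 4})
print([round(p, 2) for p in predict([owner_a, owner_b], bias)])
```

The class weights passed to `train` up-weight the minority class, which is one simple way the imbalance handling from Task 1 could feed into the collaborative model.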

Innovation

Analysis of medical data such as time series, images, and notes involves two important research problems: disease grading (annotation, diagnosis) and fine-grained segmentation. Although the former often relies on the latter, the two are usually studied separately. Disease severity grading can be treated as a classification problem, but annotating medical data is highly time-consuming and requires domain experts. In this work, we propose a collaborative learning method in which semi-supervised learning is used to jointly improve disease grading performance. This promising approach has not yet been explored in either the cybersecurity or the medical domain. This project aims to fill this gap by addressing several key challenges, including data annotation, weakly supervised learning, and collaborative learning models. We expect that this project will accelerate the realization of automated multi-centre electronic health record (EHR) data analysis.