Josh Miller, co-founder & CEO of Gradient Health, focuses on the regulatory hurdles medtech companies face and the different options for overcoming them from a data perspective.
After investment in biotech reached a record high in Q1 2021, it is becoming increasingly difficult to secure the funding needed to bring a healthtech product to market: venture funding in biotech has fallen by 61%, according to Bay Bridge Bio. Investors are seeking to reduce both risk and time to return on their investments. It is also often significantly harder to raise money before receiving a green light from regulators than afterward. Regulatory bodies define criteria that promote the responsible development and distribution of safe and effective medical devices, so meeting regulatory milestones is a necessary step for device developers on the way to commercial success. All of this highlights the importance of understanding what regulators are looking for, and the best practices for meeting their criteria in a cost- and time-effective manner. This article discusses some key qualities of good datasets for developing artificial intelligence (AI) and machine learning (ML) health technology, as well as tips for evaluating tools that can accelerate development, validation, and regulatory clearance for such technologies.
Qualities of a strong dataset
The FDA, MHRA, and Health Canada have jointly identified 10 guiding principles for Good Machine Learning Practice (GMLP), which act as guidance for developers of medical devices that leverage AI/ML technology:
- Multi-disciplinary expertise is leveraged throughout the total product life cycle
- Good software engineering and security practices are implemented
- Clinical study participants and data sets are representative of the intended patient population
- Training data sets are independent of test sets
- Selected reference datasets are based upon best available methods
- Model design is tailored to the available data and reflects the intended use of the device
- Focus is placed on the performance of the human-AI team
- Testing demonstrates device performance during clinically relevant conditions
- Users are provided clear, essential information
- Deployed models are monitored for performance and re-training risks are managed
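In particular, the third, fourth, and fifth principles in this list (representative participants and data sets, independence of training and test sets, and reference datasets based upon best available methods) define characteristics of high-quality datasets used for training and validation of AI/ML models.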
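One practical reading of the principle that training data sets are independent of test sets is that no patient should contribute records to both. The sketch below is a minimal illustration under that assumption, not a method prescribed by the regulators; the `patient_id` field, the record layout, and the function name are hypothetical.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split records so that no patient appears in both training and test sets.

    `records` is a list of dicts with a (hypothetical) "patient_id" key.
    """
    # Group every record under the patient it belongs to.
    by_patient = defaultdict(list)
    for record in records:
        by_patient[record["patient_id"]].append(record)

    # Sort before shuffling so the split is reproducible for a given seed.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    # Hold out whole patients, never individual records.
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test = [r for pid in patient_ids[:n_test] for r in by_patient[pid]]
    train = [r for pid in patient_ids[n_test:] for r in by_patient[pid]]
    return train, test


# Hypothetical example: 10 patients with 3 images each.
records = [
    {"patient_id": f"P{p:02d}", "image": f"img_{p}_{i}"}
    for p in range(10)
    for i in range(3)
]
train, test = split_by_patient(records)
train_patients = {r["patient_id"] for r in train}
test_patients = {r["patient_id"] for r in test}
print(len(train), len(test), train_patients.isdisjoint(test_patients))
```

Splitting at the record level instead could place two images of the same patient on both sides of the split, which tends to inflate reported test performance.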
Bias in medical AI may emerge when the data used to train a model lacks diversity in patient demographics and disease presentation. Developers must objectively evaluate the performance of their technology in the intended patient population. Thoroughly and accurately labelled data is especially useful here, because the labels make it possible to evaluate how representative a dataset is of a given patient population.
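As a rough illustration of how labels support such a check, the sketch below compares the share of each subgroup in a dataset with its share in the intended patient population. The age bands, target shares, and the 10-percentage-point threshold are hypothetical values chosen for the example, not regulatory figures.

```python
from collections import Counter


def representation_gaps(labels, target_shares):
    """Return (dataset share - target share) for each subgroup in target_shares."""
    counts = Counter(labels)
    total = len(labels)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in target_shares.items()
    }


# Hypothetical dataset labels: one age band per patient.
dataset_age_bands = ["18-40"] * 50 + ["41-65"] * 40 + ["66+"] * 10

# Hypothetical make-up of the intended patient population.
intended_population = {"18-40": 0.30, "41-65": 0.40, "66+": 0.30}

gaps = representation_gaps(dataset_age_bands, intended_population)

# Flag subgroups that fall more than 10 percentage points short of the target.
underrepresented = [group for group, gap in gaps.items() if gap < -0.10]
print(underrepresented)
```

The same grouping can then be reused to report model performance per subgroup rather than as a single aggregate figure.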
Fast-tracking AI/ML device development and regulatory clearance with access to on-demand, quality data
Fortunately for innovators in the health tech space, there is a growing number of increasingly high-quality options for gaining near-instantaneous access to curated data. Below are some routes to obtaining data:
- Clinical study with data sharing agreement. This is the traditional route of generating data. However, this approach introduces challenges: there is typically a significant lead time for study design and patient recruitment, and the study can be very costly. The diversity of the patient population is often dictated by the location and number of sites where data is collected, so some bias may be introduced into models trained with the data collected in the clinical study.
- Federated learning. Rather than training a model on a centralised data set, developers share their model with other institutions, which then train it on their own data sets. The result is a model trained on a diverse array of data, without the sensitive data ever leaving the institutions’ servers.
- Synthetic data. This approach is gaining popularity as it can be relatively inexpensive and fast to generate large datasets from a smaller subset of real data. However, developers should be mindful that errors may be introduced by synthetic data.
- Data partners. Data partners take on the task of sourcing, de-identifying, and labelling data so that developers don’t have to. Typically, by working with data partners, developers can instantaneously access large data sets while avoiding the risks associated with handling patient information as the data will already be de-identified.
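To make the federated learning route above more concrete, here is a minimal sketch of one common scheme, federated averaging: each site trains on its own data and only the model parameters are sent back and averaged. The article does not name a specific algorithm, so this choice is an assumption, and the three sites, their data, and the one-variable linear model are all hypothetical.

```python
def local_update(weights, data, lr=0.1, epochs=20):
    """Train a linear model y = w*x + b on one site's private data."""
    w, b = weights
    for _ in range(epochs):
        # Gradients of the mean squared error over this site's data only.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(updates, sizes):
    """Average the sites' parameters, weighted by how much data each site holds."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)


# Hypothetical sites, each holding private (x, y) pairs that follow y = 2x + 1.
site_data = [
    [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)],
    [(0.2, 1.4), (0.8, 2.6)],
    [(0.1, 1.2), (0.4, 1.8), (0.6, 2.2), (0.9, 2.8)],
]

global_weights = (0.0, 0.0)
for _ in range(50):
    # Only model parameters cross institutional boundaries, never the raw data.
    updates = [local_update(global_weights, data) for data in site_data]
    global_weights = federated_average(updates, [len(d) for d in site_data])

print(global_weights)
```

In a real deployment each `local_update` call would run on the institution's own servers, and a production system would usually add safeguards such as secure aggregation, which this sketch leaves out.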
Healthtech innovators may wish to look for the following qualities in data sets:
- Responsibly sourced
- Searchable, to easily retrieve and sort data based upon specific criteria
- De-identified to minimise the risk of HIPAA violations
- Enables flexible data labelling
- Built with FDA and WHO framework compliance in mind
There is a tremendous opportunity to advance healthcare by leaps and bounds by leveraging artificial intelligence and machine learning. Data intermediaries and partners can help innovators of medical technology gain access to large amounts of diverse data, removing a barrier that may previously have been nearly insurmountable for small, lightweight start-ups.