We’re big fans of community projects at RedPixie, especially those with real-life benefits. Following on from our recent participation in Microsoft’s Cortana Intelligence Competition – Decoding Brain Signals, we had another chance to go for gold.
This article covers our involvement in the latest competition, Women’s Health Risk Assessment. Several RedPixie entries did very well, and I am delighted to say that my entry came first out of 493 participants.
Below, we’ll explain the process of utilising Data Science for Women’s Health.
According to a 2011 World Health Organisation report, about 820,000 young people aged 15-24 in developing countries were newly infected with HIV; more than 60% of them were women.
Developing countries face serious reproductive health problems such as sexually transmitted infections (STIs), unintended pregnancies, and complications from childbirth. One of the top priorities for policymakers, researchers, and healthcare providers is to emphasize prevention and the provision of information about STIs and other reproductive tract infections (RTIs).
To help achieve the goal of improving women’s reproductive health outcomes in underdeveloped regions, Microsoft’s Cortana Intelligence Gallery created a competition that called for optimized machine learning solutions so that a patient can be accurately categorized into different health risk segments and subgroups.
Based on the category a patient falls into, healthcare providers can offer an appropriate education and training program. Such customised programs have a better chance of reducing patients’ reproductive health risk.
The competition and the data
The objective of this competition was to build machine learning models that assign young female subjects in each of the nine underdeveloped regions to a risk segment, and to a subgroup within that segment.
With accurate risk segment and subgroup assignments in each region, a healthcare practitioner can deliver services that protect the subject from sexual and reproductive health risks.
The dataset used in this competition was collected via a survey in 2015, as part of a Bill & Melinda Gates Foundation-funded project exploring the wants, needs, and behaviours of women and girls with regard to their sexual and reproductive health in nine geographies. The data was collected from around 9,000 young subjects (15 to 30 years old) when they visited clinics in the nine regions, with around 1,000 subjects in each.
Each subject was asked a series of questions by clinical practitioners, and her answers were recorded together with her demographic information. The practitioners then evaluated her sexual and reproductive health risks and assigned her to a risk segment and subgroup.
The data science for women’s health
The collected dataset contains a wide variety of health, social, economic, and other properties of each young woman who participated in the survey. With it, we can build strong predictive models that accurately classify new subjects into the correct categories, based on the similarity of their properties to those of subjects already classified.
Our Data team here at RedPixie saw a great opportunity to gamify this interesting and potentially useful Data Science problem, so each of us independently joined the competition and started building and testing models on the data.
I chose to build my solution locally in R Studio, and submitted it through Microsoft’s Azure ML Studio, which is a great platform for Machine Learning experts and enthusiasts, providing powerful cloud-based analytics right at your fingertips.
The first step for building such a model is loading, pre-processing, and cleaning the data for modelling.
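The solution itself was built in R, but the shape of this step can be sketched in Python. The column names and sample values below are invented for illustration and are not the actual survey schema:

```python
import csv
import io

# Hypothetical sample mirroring the survey data: demographic fields plus
# a risk label. Column names are illustrative, not the competition schema.
RAW = """patient_id,age,region,literacy,label
1,17,2,yes,seg1_sub1
2,,3,no,seg2_sub1
3,24,1,yes,seg1_sub2
"""

def load_and_clean(text):
    """Parse CSV rows, drop records with missing age, and encode fields."""
    rows = list(csv.DictReader(io.StringIO(text)))
    cleaned = []
    for r in rows:
        if not r["age"]:            # drop records missing a key numeric field
            continue
        r["age"] = int(r["age"])    # cast numeric fields
        r["literacy"] = 1 if r["literacy"] == "yes" else 0  # encode categoricals
        cleaned.append(r)
    return cleaned

rows = load_and_clean(RAW)
print(len(rows))  # 2 -- the record with a missing age was dropped
```

Real survey data needs more care than this (encoding every categorical answer, handling "prefer not to say" responses, and so on), but the overall step is the same: parse, drop or impute missing values, and encode everything numerically for the model.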
Then feature selection and engineering are performed to find the features that best relate each subject to the correct health-risk category – in this case, I concluded that all features were actually important for classification.
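One crude way to screen features is to ask how well each one predicts the label on its own. The toy Python example below (with invented features and labels) scores a feature by the accuracy of predicting the majority label for each of its values:

```python
from collections import Counter, defaultdict

# Toy labelled records; the feature names and labels are invented.
sample = [
    {"literacy": 1, "urban": 0, "label": "high"},
    {"literacy": 1, "urban": 1, "label": "high"},
    {"literacy": 0, "urban": 0, "label": "low"},
    {"literacy": 0, "urban": 1, "label": "low"},
]

def single_feature_score(rows, feature):
    """Accuracy of predicting the majority label for each value of one
    feature -- a rough measure of how informative the feature is alone."""
    by_value = defaultdict(Counter)
    for r in rows:
        by_value[r[feature]][r["label"]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(rows)

print(single_feature_score(sample, "literacy"))  # 1.0: perfectly predictive here
print(single_feature_score(sample, "urban"))     # 0.5: no better than chance
```

In practice a tree-based model like the one used here also reports feature importances after training, which is often how the "are all features useful?" question gets answered.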
Once the data was in the right format, I started testing the performance of several supervised machine learning classification algorithms, by building models on a portion of the training data and testing their predictive accuracy on another smaller portion of that data.
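In outline, that validation step looks like the following plain-Python sketch; the real work was done in R, and the 80/20 split ratio here is illustrative:

```python
import random

def split(data, frac=0.8, seed=42):
    """Shuffle deterministically, then cut into train/validation parts."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    return shuffled[:cut], shuffled[cut:]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

train, valid = split(list(range(100)))
print(len(train), len(valid))  # 80 20
```

Each candidate algorithm is trained on the larger portion and scored with `accuracy` on the held-out portion, so the comparison reflects performance on data the model has not seen.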
The algorithm that performed best for this particular problem was XGBoost, an optimised implementation of gradient boosting built on decision trees – a family of machine learning algorithms I had initially suspected would be ideal for this type of problem, due to their structure.
You can see an example of an extremely simplified decision tree below:
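In code, such a toy tree is just a chain of nested conditionals; the features and thresholds below are invented for illustration:

```python
def classify(subject):
    """A toy decision tree with hypothetical features and thresholds.
    A boosted model like XGBoost combines hundreds of such trees."""
    if subject["age"] < 18:
        if subject["attends_school"]:
            return "low_risk"
        return "medium_risk"
    if subject["has_clinic_access"]:
        return "medium_risk"
    return "high_risk"

print(classify({"age": 16, "attends_school": True}))  # low_risk
```

Each internal node tests one feature, and each leaf assigns a category – which is why trees handle the mix of demographic and survey features in this dataset so naturally.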
After carefully tuning the algorithm’s parameters using grid search and a little instinct, and once the XGBoost model’s classification accuracy was satisfactory, I re-trained the model on the entire training dataset and exported it to a file.
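A grid search simply tries every combination of candidate parameter values and keeps the best-scoring one. The sketch below uses real XGBoost parameter names (`max_depth`, `eta`, `subsample`), but the grid values and the scoring function are invented stand-ins; a real run would train and validate a model inside `evaluate`:

```python
from itertools import product

# Candidate values for real XGBoost parameters; this particular grid
# is illustrative, not the one actually searched in the competition.
grid = {
    "max_depth": [4, 6, 8],
    "eta": [0.05, 0.1, 0.3],
    "subsample": [0.8, 1.0],
}

def evaluate(params):
    # Stand-in for "train on one portion, score on the holdout";
    # a real run would fit and validate an XGBoost model here.
    return -abs(params["max_depth"] - 6) - params["eta"]

# Try every combination and keep the highest-scoring parameter set.
best = max(
    (dict(zip(grid, vals)) for vals in product(*grid.values())),
    key=evaluate,
)
print(best["max_depth"], best["eta"])
```

Grid search is exhaustive, so its cost grows multiplicatively with each parameter added – which is where the "little bit of instinct" comes in, narrowing the candidate values before the search starts.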
I then loaded that file into ML Studio to deploy a web service that classifies the unlabelled test set, which contains new female subjects whose health-risk categories were unknown to the algorithm.
The algorithm’s predictions were then tested against the women’s actual health-risk categories. My best submission on the public dataset achieved a classification accuracy of 86.5%, meaning that out of every 100 women, about 86 would be assigned to the correct health-risk category, allowing a care plan to be developed accordingly.
A few weeks after submitting my solution, I was notified that it had actually scored the highest in the private dataset on which the competition results were evaluated – in other words, I managed to come first out of the 493 participants, which of course made me and my colleagues at RedPixie very happy.
What seemed strange to me was that I had initially ranked 24th on the public test dataset, so I jumped 23 positions in the final evaluation, which – as explained in the competition rules – was performed on a different, private dataset. This means my machine learning model generalized well: a model’s generalization is how strong or robust it is, or in other words, how well it performs on new, previously unseen datasets.
Automating such healthcare processes can be extremely beneficial, especially where the number of doctors, staff, and infrastructure is insufficient to handle large numbers of patients. Automatically categorizing patients into health-risk segments helps provide them with the appropriate customised services, lowering the risks of disease and infection and enabling the healthcare system to run more effectively.
Seeing tangible value in technology is sometimes difficult, but this example of data science for women’s health shows what is possible.
You can find the documentation and code of my solution here.
Written by Ion Kleopas | Data Scientist, RedPixie | See his LinkedIn Profile