Predict an outcome using machine learning
In this blog, we will discuss the use of machine learning to predict an outcome for a given business problem.
As before, we will use an example from the health care domain to illustrate how we would go about solving such a problem. Say we want to assess whether a patient has heart disease and, if so, how severe it is.
But before we begin, let us take a step back and assess why we would even need machine learning in this case.
In the medical domain, as in others, we know that certain outcomes depend on whether an influencing factor is above or below a threshold. For instance, for diabetes, if the fasting glucose level is > 110 mg/dl, then the patient is flagged as "diabetic". Here, the glucose level is the one factor that strongly influences the outcome.
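A single-factor rule like this is easy to write by hand. The Python sketch below is purely illustrative: the function name flag_diabetic is made up, and the 110 mg/dl cutoff is simply the example value from the paragraph above, not clinical guidance.

```python
# Illustrative only: the cutoff is the example value used in the text,
# not a clinical recommendation.
FASTING_GLUCOSE_CUTOFF_MG_DL = 110.0


def flag_diabetic(fasting_glucose_mg_dl: float,
                  cutoff: float = FASTING_GLUCOSE_CUTOFF_MG_DL) -> bool:
    """Return True when the reading is strictly above the cutoff."""
    return fasting_glucose_mg_dl > cutoff


# Hypothetical readings for three patients.
readings = [95.0, 110.0, 126.5]
flags = [flag_diabetic(r) for r in readings]
print(flags)
```

With one factor, the whole "model" is a single comparison. The difficulty discussed next comes from having many factors whose cutoffs and interactions nobody knows in advance.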
But what happens when a disease is influenced not by just one or two factors but by a dozen or so values? Prior to the application of machine learning to this problem, medical professionals would draw up a flow chart or a decision tree. The caveats of this approach are:
The medical professional must be very knowledgeable about the problem domain in order to construct a good tree.
As the saying goes, "change is the only constant in life". As new data is collected, the professional has to check that the decision tree still computes the outcome accurately, so manual tweaks are needed to keep the model in top shape.
Now what if the healthcare provider observes a new condition, but there is no textbook knowledge of what exactly contributes to the disease symptoms? This is very common in medical research, where practitioners study health conditions that have never occurred before, or study how complicated health conditions can be cured.
In these cases, the patient has typically taken several medical tests, some 30 to 50 test result values are available, and it is up to the medical provider to figure out what caused the condition. The point is that, here, the practitioner does not have the answers needed to construct a decision tree.
This is where machine learning plays a very useful role.
Machine learning can analyze a data set and construct a model (e.g., a decision tree) that can be used to compute the outcome.
In our example above, the machine learning algorithm can be fed the data for 50 patients, containing 30 to 50 test result values for each patient. Using math and computations behind the scenes, the algorithm constructs the decision tree that best fits this data and persists it as a model. This is referred to as "training the model".
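To make "training" concrete, here is a minimal Python sketch that learns the simplest possible decision tree: a single cutoff on one test value. It tries candidate cutoffs and keeps the one that makes the fewest mistakes on the training data. The function name train_stump and the six data points are made up for illustration. A real algorithm searches across all test values and builds deeper trees.

```python
from typing import List, Tuple


def train_stump(values: List[float], labels: List[int]) -> Tuple[float, int]:
    """Learn a one-level decision tree ("stump") on a single test value.

    The learned rule is: predict 1 when value > threshold, otherwise 0.
    Returns (threshold, number_of_training_errors).
    """
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values to pick a cutoff")
    # Candidate cutoffs: midpoints between neighbouring distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best_threshold, best_errors = candidates[0], len(values) + 1
    for t in candidates:
        # Count the rows where the rule disagrees with the known outcome.
        errors = sum((v > t) != bool(y) for v, y in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


# Made-up toy data: one test value per patient and a 0/1 outcome.
values = [88.0, 95.0, 101.0, 118.0, 124.0, 140.0]
labels = [0, 0, 0, 1, 1, 1]
threshold, errors = train_stump(values, labels)
print(threshold, errors)
```

Nobody told the code where the cutoff should be. It was found from the data, which is the part a subject matter expert would otherwise have to supply.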
The model is usually trained on 75% of the available data set, and the remaining 25% is used to test whether the model predicts the correct outcome. This is referred to as "model testing".
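The split itself is simple bookkeeping. The Python sketch below shuffles the rows, holds out a quarter of them for testing, and scores predictions against known outcomes. The helper names train_test_split and accuracy, the seed, and the 40 stand-in patient ids are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(rows: Sequence[T], test_fraction: float = 0.25,
                     seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle the rows and return (training rows, held-out test rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of test rows where the prediction matches the known outcome."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Stand-in patient ids; real rows would carry the test result values.
patients = list(range(40))
train_rows, test_rows = train_test_split(patients)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting matters when the rows are stored in some meaningful order, for example by admission date. Without it, the test set may not resemble the training set.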
For any new patient that comes in, their test result values are input to the trained model, and the model outputs the predicted disease outcome. Another perk is that the model can be retrained periodically (daily, for example) on newly collected data, so the model "adapts" to changes in the data. This alleviates the need to manually tweak the model based on new data.
The more diverse the data used to train the model is, the more scenarios the model is aware of. In other words, the model tends to be less "biased" towards one particular data set if it is trained on different types of data sets.
And last but not least, the model is only as good or as accurate as the data fed to it.
Some benefits of using machine learning models
Machine learning models reduce the need to rely solely on a subject matter expert (SME) to manually construct the decision tree. This is not to say we do not need the subject matter expert at all. Rather, the model can complement subject matter expertise and actually help the SME do their job even better.
The constant manual update of the solution logic is avoided, as the model itself can be updated periodically with new incoming data.
Approach for predicting the occurrence of heart disease
In our implementation we used the heart disease data set from the University of California, Irvine (UCI) Machine Learning Repository. This data set has an outcome that indicates the presence or absence of heart disease based on 13 factors. A value of 0 means no presence, whereas values from 1 to 4 indicate the presence of heart disease, with 1 being the lowest level of severity and 4 being the highest.
Our approach was to train the model using the random forests algorithm, which constructs many decision trees, each using a subset of the predictor variables, and has each "tree" vote towards the final outcome. This algorithm is widely used in medical diagnosis.
We implemented this with the machine learning libraries in Apache Spark and persisted the model to a network file location.
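The voting step of a random forest can be sketched in a few lines of plain Python. This is not the Spark implementation we used. The three "trees" below are hand-written stand-ins with made-up cutoffs, whereas a real forest learns its trees from random samples of the training data. The column names (age, chol, thalach, oldpeak) are assumed here to follow the attribute names in the UCI data set, and the outcome is simplified to 0 or 1.

```python
from collections import Counter
from typing import Callable, Dict, List

Patient = Dict[str, float]
Tree = Callable[[Patient], int]


# Hand-written stand-in "trees", each looking at a different subset of
# predictors. The cutoffs are invented for illustration only.
def tree_a(p: Patient) -> int:
    return 1 if p["age"] > 55 and p["chol"] > 240 else 0


def tree_b(p: Patient) -> int:
    return 1 if p["thalach"] < 140 else 0


def tree_c(p: Patient) -> int:
    return 1 if p["oldpeak"] > 1.5 else 0


def forest_predict(trees: List[Tree], patient: Patient) -> int:
    """Let every tree vote and return the most common answer."""
    votes = Counter(tree(patient) for tree in trees)
    return votes.most_common(1)[0][0]


forest = [tree_a, tree_b, tree_c]
# Hypothetical patients, not rows from the real data set.
patient_1 = {"age": 60, "chol": 250, "thalach": 150, "oldpeak": 2.0}
patient_2 = {"age": 40, "chol": 200, "thalach": 170, "oldpeak": 0.0}
print(forest_predict(forest, patient_1), forest_predict(forest, patient_2))
```

For the first patient, two of the three trees vote 1, so the forest answers 1 even though one tree disagrees. This is why a forest tends to be more robust than any single tree.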
Next, we implemented the front-end piece of the application, which allows the end user to input patient data values. For this we used a Node.js server, which posts the input values entered by the user to a Kafka broker. A Spark Streaming application listens to this broker and, when data is received, processes the input values against the persisted machine learning model and generates a result that is saved to a table in a database such as MySQL.
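The flow of a request through these pieces can be imitated in a single Python process. In the sketch below everything is a stand-in and an assumption: queue.Queue plays the role of the Kafka topic, an in-memory SQLite table plays the role of the MySQL table, and score() is a placeholder for the persisted model. It only shows the order of the steps, not how Kafka or Spark Streaming are actually called.

```python
import json
import queue
import sqlite3

# Stand-ins (assumptions): a queue for the Kafka topic, SQLite for MySQL.
topic: "queue.Queue[str]" = queue.Queue()
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE predictions (request_id TEXT PRIMARY KEY, severity INTEGER)"
)


def score(features: dict) -> int:
    """Placeholder for the persisted model; the rule here is invented."""
    return 1 if features.get("oldpeak", 0.0) > 1.5 else 0


def publish(request_id: str, features: dict) -> None:
    """Web server side: serialise the form values and post them."""
    topic.put(json.dumps({"request_id": request_id, "features": features}))


def consume_once() -> None:
    """Streaming side: take one message, score it, and save the result."""
    message = json.loads(topic.get_nowait())
    severity = score(message["features"])
    db.execute(
        "INSERT INTO predictions VALUES (?, ?)",
        (message["request_id"], severity),
    )
    db.commit()


def lookup(request_id: str):
    """Front-end side: return the saved result, or None if not ready yet."""
    row = db.execute(
        "SELECT severity FROM predictions WHERE request_id = ?", (request_id,)
    ).fetchone()
    return None if row is None else row[0]


publish("req-1", {"oldpeak": 2.3})
before = lookup("req-1")   # nothing saved until the consumer has run
consume_once()
after = lookup("req-1")
print(before, after)
```

Passing a request id along with the input values is one way to let the front end find its own result later. The original application may have matched requests to results differently.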
The front end app queries the MySQL database table once the result is available, and displays it on the screen. The screenshot below shows the front-end piece of this use case implementation and result for the data entered on the screen.
The above is an example of a "classification" problem. A useful extension of this use case would be to predict the condition early, so the patient can be put on preventive therapy. One approach is to take a given patient's historical data and use linear regression to predict the future values of the above data points, say, over the next couple of years. Once the future values are available, we would run the above machine learning model on them to assess whether the patient is likely to have the heart disease condition in the future.
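The extrapolation step can be sketched with an ordinary least-squares line fit in plain Python. The cholesterol readings below are made up, and the function name fit_line is ours. In practice each of the 13 factors would be projected in a similar way, and a straight line may be too crude for some of them.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up history for one patient: years since first visit, cholesterol.
years_since_first_visit = [0, 1, 2, 3]
cholesterol = [200.0, 210.0, 220.0, 230.0]

slope, intercept = fit_line(years_since_first_visit, cholesterol)
projected = slope * 5 + intercept  # projected value two years after the last visit
print(round(projected, 1))
```

The projected values for all factors would then be assembled into one input row and passed to the trained classifier, exactly as if they were a new patient's test results.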
Data used for this use case is obtained from http://www.uci.edu, and is available at https://archive.ics.uci.edu/ml/datasets/Heart+Disease