by Scotty Skidmore
Earlier this year, I participated in a data science hackathon hosted by Macquarie University and Microsoft. My classmates and I put together a team of four to compete against others to analyse, understand and solve a problem with unseen data.
And the problem was a big one. We were given the challenge by CHE Proximity client TAL Insurance to predict future cases of self-harm using results from life insurance policy questionnaires. The hackathon lasted 24 hours, culminating in a presentation for the judges and a final prize of $4000, which my team and I were fortunate enough to win.
Here are the four lessons that allowed us to do it.
Not all data sources are perfect, and with a small dataset for prediction, it’s essential to structure the data effectively.
When it came to the hackathon, our team realised that self-harm events made up less than 1% of the overall responses. This made it difficult to identify any trends. To work around this, we used an oversampling technique that picks up the small trends among the rare cases and adds new rows based on them. This approach can introduce bias by forcing trends that might not exist, but in this case it allowed us to build usable models later down the track.
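As a rough illustration of this kind of oversampling, here is a minimal sketch that assumes a SMOTE-style approach: each new row is placed somewhere on the line between a rare-class row and one of its nearest rare-class neighbours. The exact technique used at the hackathon isn't named here, so treat the function name `smote_like_oversample`, its parameters and the toy data as hypothetical.

```python
import math
import random


def smote_like_oversample(minority_rows, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows.

    Hypothetical helper for illustration only; needs at least two minority rows.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        # Find the k closest other minority rows by Euclidean distance.
        neighbours = sorted(
            (row for row in minority_rows if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment between base and neighbour.
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (made up): 200 common rows and only 4 rare ones.
majority = [[float(i % 10), float(i % 7)] for i in range(200)]
minority = [[20.0, 20.0], [21.0, 19.0], [22.0, 21.0], [20.5, 22.0]]

new_rows = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_rows
```

Because every synthetic row sits between two real rare-class rows, the method can only reinforce patterns already present in those few rows, which is where the risk of forced trends comes from. It is also generally safer to oversample only the training portion of the data, so that synthetic rows never end up in the data used to check the model.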
It’s easy to get bogged down in the data, but a good data scientist never loses sight of their goal. This means every model should be built to work on new data in future predictions, not just on the data it was trained on.
Once we had wrestled the TAL data into the format we needed, we clustered it into artificial groups to see whether any trends would emerge. After a few hours of testing, this proved unsuccessful. Returning to our goal, we got to work on predicting the cases of self-harm by running the data through machine learning algorithms to build an accurate prediction model.
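One caveat about "accurate" is worth spelling out when the event being predicted is this rare. The sketch below uses made-up labels with a 1% positive rate (an assumption chosen to mirror the class balance described above, not the TAL data) and shows that a model which never predicts the event still scores 99% accuracy while catching none of the real cases. The metric the team actually used isn't stated here; recall is shown only as one common alternative.

```python
def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of true positive cases that the model flagged."""
    flagged = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(flagged) / len(flagged) if flagged else 0.0


# Made-up labels: 10 positive cases out of 1,000 (1%).
y_true = [1] * 10 + [0] * 990
# A "model" that always predicts the negative class.
always_negative = [0] * 1000

baseline_accuracy = accuracy(y_true, always_negative)
baseline_recall = recall(y_true, always_negative)
```

Here `baseline_accuracy` comes out at 0.99 and `baseline_recall` at 0.0, so accuracy alone says little about whether the rare cases are being found.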
Machine learning is one of the most effective weapons in the data scientist’s arsenal – but that doesn’t mean it can do all the work on its own.
While fitting machine learning models to the TAL data, we tweaked each model until we were as happy as we could be with the result. From there, our team had to come up with a presentation and ideas on how to apply what we’d learned. We built some amazing visualisations from the data and put together a handful of suggestions that TAL could take away to improve its response to important health issues.
Some people respond to deadlines better than others, but a tight turnaround is often all the motivation you need to apply yourself fully to a problem-solving challenge.
After a gruelling 24 hours and very little sleep, our hackathon ended with a science fair-style judging process, followed by a series of final presentations. Living and breathing the data for the previous 24 hours gave us confidence when it came to answering the judges’ questions – nobody knew the data as well as we did, and this enabled us to convey our discoveries as meaningful insights for our client.