Garbage In, Garbage Out: The Role of Existing Data in Predictive Analytics
November 13, 2015 | NCCD
As tools that use patterns in data to determine the likelihood of something happening in the future, predictive analytics algorithms are highly dependent on the quality of the data they use. The current emphasis on predictive tools is made possible by a huge increase in the amount of data we can store and process. However, that mass of data is useless unless it accurately represents what is happening in the real world.
Predictive algorithms will find whatever patterns exist in a data set, even if those patterns are based on inaccurate, mis-entered, missing, or otherwise invalid data. The quality of the underlying data set has an enormous impact on the accuracy and efficacy of the resulting predictive tool.
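To make this concrete, here is a minimal sketch in Python, using made-up numbers rather than real case data. It fits a simple least-squares trend line twice: once to cleanly recorded values and once to the same values with a single entry mistyped (100 instead of 10). The variable names and the data are assumptions chosen purely for illustration.

```python
def fit_slope(xs, ys):
    """Return the ordinary least-squares slope of ys against xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: a predictor (x) and a recorded measure (y).
contacts = [1, 2, 3, 4, 5]
recorded_clean = [2, 4, 6, 8, 10]
recorded_typo = [2, 4, 6, 8, 100]  # last value mis-entered as 100 instead of 10

clean_slope = fit_slope(contacts, recorded_clean)
typo_slope = fit_slope(contacts, recorded_typo)
print(clean_slope, typo_slope)  # prints: 2.0 20.0
```

One mistyped value out of five changes the fitted trend from 2 to 20. The algorithm has no way of knowing the entry is wrong; it simply reports the pattern it finds.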
Most successful predictive tools rely on highly structured data, such as financial transactions, online orders, and streaming video views. In most cases these data are generated by the computer system as a by-product of another action, such as processing a credit card transaction or serving up a video. This automatic data generation creates very clean, structured data that are amenable to analysis.
Social services agencies also collect a lot of data as staff work with clients and track their actions in the case management system. There is a strong temptation to use these data to power predictive analytics. In some cases this can be very helpful, providing needed information to help prioritize work, allocate resources, or prevent tragedies. However, case management data are entered by human beings and are subject to the mistakes, biases, and other flaws that come with manual entry.
This doesn’t mean that we can’t use case management data for predictive analytics. We just need to be very vigilant about the quality of the underlying data and understand how errors and biases in those data can affect the results. Regular review of case-level data with business intelligence tools like SafeMeasures® can help improve the quality of the underlying data, which will lead to more accurate and useful predictions.
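As a rough illustration of what a routine case-level review might check, the sketch below flags records with missing fields, impossible date orderings, or unrecognized category values. It is not how SafeMeasures® or any particular product works; the field names (`birth_date`, `contact_date`, `risk_level`), the list of valid risk levels, and the records themselves are all hypothetical.

```python
from datetime import date

# Hypothetical case records; field names and values are illustrative only.
records = [
    {"case_id": 1, "birth_date": date(2010, 5, 1),
     "contact_date": date(2015, 3, 2), "risk_level": "high"},
    {"case_id": 2, "birth_date": None,
     "contact_date": date(2015, 3, 9), "risk_level": "low"},
    {"case_id": 3, "birth_date": date(2016, 1, 1),
     "contact_date": date(2015, 4, 1), "risk_level": "moderate"},
    {"case_id": 4, "birth_date": date(2008, 7, 15),
     "contact_date": date(2015, 4, 20), "risk_level": "hgih"},
]

# Assumed set of allowed values for the risk_level field.
VALID_RISK_LEVELS = {"low", "moderate", "high"}

def find_issues(record):
    """Return a list of data-quality problems found in one record."""
    issues = []
    # Missing values
    for field, value in record.items():
        if value is None:
            issues.append(f"missing {field}")
    # Impossible ordering: a contact recorded before the client was born
    birth, contact = record["birth_date"], record["contact_date"]
    if birth is not None and contact is not None and contact < birth:
        issues.append("contact_date before birth_date")
    # Mis-entered category value
    risk = record["risk_level"]
    if risk is not None and risk not in VALID_RISK_LEVELS:
        issues.append(f"unrecognized risk_level {risk!r}")
    return issues

# Keep only the cases that have at least one problem.
flagged = {}
for record in records:
    issues = find_issues(record)
    if issues:
        flagged[record["case_id"]] = issues

for case_id, issues in flagged.items():
    print(case_id, issues)
```

Checks like these catch only the errors that can be stated as rules. Biases in what staff choose to record, or how they record it, still need human review.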