UC Berkeley students find a better way to fill empty data set values

Introduced by Jake Mainwaring, Chase Smith, Cyril Tamraz and Lawrence Yan

This December, a team of four UC Berkeley students presented a tool that could help data scientists deal with one of the most frustrating parts of their work: missing data set values.

Data scientists commonly address this problem by dropping rows that have missing values or replacing missing values with column means, but these methods can result in a loss of valuable information. The students involved in this project believe there is a better way to deal with missing values, and their new Python library, Integrull, is their answer.
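For reference, the two common approaches mentioned above look roughly like this in pandas. This is a minimal sketch on a made-up toy table (the column names and values are assumptions for illustration, not data from the project):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from the Integrull project.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50.0, 60.0, np.nan, 80.0],
})

# Baseline 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Baseline 2: replace each missing value with the mean of its column.
mean_filled = df.fillna(df.mean())

print(dropped)
print(mean_filled)
```

On this toy table, dropping rows discards half of the data, while mean filling gives every missing entry in a column the same value regardless of what else is known about that row.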

In their words, the library is “a tool to help you fill missing values in a more intelligent way.” So how exactly does it work? Integrull trains a different model for each column with missing values.

First, the user loads her CSV file via pandas and selects the columns where she wants to fill null values. For a specific column (e.g., gender), the data is split into rows where the gender value is empty and rows where it’s not.

On the rows where gender information is included, the gender value is the response variable and all of the other values in that row are the model’s features. Once this model is trained on all the complete rows, it can be used to make predictions for the missing values. This process repeats for every column until the user has a full dataset. Additionally, the original data remains untouched, so the data scientist can cross-reference the raw data against the predictions at any point.
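The per-column procedure described above can be sketched as follows. This is not Integrull's actual code or API: the function name `fill_missing` is hypothetical, and because the article does not say which model the library trains, a plain least-squares linear fit on numeric columns stands in for it. Patching gaps in the feature columns with column means is also an assumption made only to keep the sketch short.

```python
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Fill nulls in `columns` with per-column model predictions.

    Hypothetical helper; assumes every column is numeric.
    """
    filled = df.copy()  # work on a copy so the original frame stays untouched
    for target in columns:
        missing = df[target].isna().to_numpy()
        if not missing.any() or missing.all():
            continue  # nothing to fill, or no complete rows to train on

        # Features are all the other columns. Gaps in the features are
        # patched with column means so the stand-in model can be fit
        # (an assumption, not something the article describes).
        features = df.drop(columns=[target])
        features = features.fillna(features.mean())
        X = np.column_stack([np.ones(len(df)), features.to_numpy(dtype=float)])
        y = df[target].to_numpy(dtype=float)

        # Train on the rows where the target value is present...
        coef, *_ = np.linalg.lstsq(X[~missing], y[~missing], rcond=None)

        # ...then predict the target for the rows where it is missing.
        filled.loc[missing, target] = X[missing] @ coef
    return filled


# Toy data in which y = 2 * x + 1, with one y value missing.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [3.0, 5.0, np.nan, 9.0, 11.0],
})
result = fill_missing(df, ["y"])
print(result)
```

Because the function returns a filled copy, the raw `df` still contains its original null and can be compared against `result` at any point.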

This new tool can save data scientists time and may give them more accurate filled-in values than dropping rows or substituting column means, particularly when they're working with data sets that have many empty values.

Try the program out for yourself here.
