Quality blog content since 2006
- the successor to imjtk.com

Kaggle DataSets Are Not Real World Data

January 10, 2019 Dev, words

While I was working through my latest data-related project, it occurred to me that, yet again, I was spending most of my time getting the data into the shape I wanted.

While the dataset I was working with was small compared to most Kaggle datasets, I was dealing with a real world problem that Kaggle competition folks don’t have to deal with.

Cleaning data…

Now, I am not slamming Kaggle.  They do a great job providing data for people from all walks of life to experiment with.  The problem, such as it is, is that someone has already done the dirty work of getting that data ready to experiment on.

In the real world, if you want to do some machine learning with your company’s data, or data that you scrape yourself, you are going to have to scrub it before you can find the k nearest neighbors…

In fact, with the latest project I was working on, more than eighty percent of my code was for cleaning the data.

For example, the data I was using had a column for salary data.  It contained dollar signs and commas, so pandas stored it as a non-null object type (strings).  I wanted to do math on it, so I needed it to be numeric.

There are many ways to accomplish this.  I decided to go with a regular expression:

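# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching
df_salaries['salary'] = df_salaries['salary'].str.replace(r"\D", '', regex=True)
df_salaries['salary'] = df_salaries['salary'].astype(np.int64)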

The first line uses a regex to replace anything that is not a digit with the empty string ( the r before the quotes marks a raw string ).  The second line actually changes the column type from a string to an integer.
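
Here is a minimal, self-contained sketch of the same idea.  The salary values are made up, since the original data isn't shown here, and regex=True is spelled out because newer pandas versions treat the pattern as a literal string by default:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real dataset is not shown in the post.
df_salaries = pd.DataFrame({"salary": ["$1,500,000", "$838,464", "$25,000"]})

# Strip every character that is not a digit (dollar signs, commas, spaces).
cleaned = df_salaries["salary"].str.replace(r"\D", "", regex=True)

# Convert the digit-only strings to 64-bit integers so math works on them.
df_salaries["salary"] = cleaned.astype(np.int64)

total = df_salaries["salary"].sum()
```

One caveat with this pattern: \D also strips decimal points and minus signs, so it only suits whole, non-negative dollar amounts.  A value like "$12.50" would come out as 1250.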

Side note:  The current guidance from the people behind Pandas is to not use the “inplace” argument, as it may be deprecated in the future.  While I find the re-assigning clunkier, I don’t want my code to quit working in the future either…

One of the column names was giving me trouble.  Maybe it was the “-” in it?  I'm not sure, but I wanted to rename it:

df_salaries.rename(columns = {'2018-19' : 'salary'}, inplace = True)

This time I ended up using the inplace argument anyway… Nothing is simple 🙂
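
For what it's worth, rename also returns a new DataFrame, so the re-assignment style should work here too.  The sketch below uses a made-up two-row frame.  My guess about the original trouble, and it is only a guess, is that a name like 2018-19 is not a valid Python identifier, so the df_salaries.column shortcut can't reach it:

```python
import pandas as pd

# Hypothetical frame with a season-style column name like the one in the post.
df_salaries = pd.DataFrame({"player": ["A", "B"], "2018-19": [100, 200]})

# rename returns a new DataFrame; assigning it back avoids inplace=True.
df_salaries = df_salaries.rename(columns={"2018-19": "salary"})

# With a plain identifier as the column name, attribute access works.
mean_salary = df_salaries.salary.mean()
```

Bracket access, df_salaries['2018-19'], works with any column name, so renaming is a convenience rather than a requirement.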

I could go on and on with examples of cleaning data, but others have already done it better than I ever could.  The point is that data in the wild is ( almost ) never ready to use, so if you want to do data science, get comfortable cleaning data.

 Or you could always hire me to clean your data for you so that you can concentrate on the sexy stuff.
