A New Wealth of Data Cleaning Tools

This should start with “back in my day” in order to convey a curmudgeonly feeling towards new-fangled tools that allow us to clean data more easily. The problem with that approach is that these new tools are amazing. They allow us to prepare data for analysis more safely, quickly, repeatably, and reliably.

These are the tools I have currently experimented with:

Before I was introduced to OpenRefine, I would immediately reach for Python, Scala, or some other nice language for building a pipeline to transform the data. In the class on data prep that I’ve developed and taught, I start with Python and Pandas and run the gamut of data sources that you are likely to encounter, with the goal of creating a data set that can be analyzed. Of course, this process is similar to building an ETL workflow.

Along the way, we become familiar with looking for incorrect data types, using RegEx to transform strings, and generally looking at data to figure out what questions it can answer and the best format for it to be in to answer those questions.
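To make those two habits concrete, here is a minimal sketch of what checking data types and transforming strings with RegEx can look like. It is written in plain Python rather than Pandas so that it runs without any installs, and the records, field names, and helper functions (clean_phone, parse_age, clean_row) are all made up for illustration, not taken from the class material.

```python
import re

# Hypothetical raw records: every value arrives as a string, as it would from a CSV.
raw_rows = [
    {"name": "  Alice ", "phone": "(555) 123-4567", "age": "34"},
    {"name": "Bob", "phone": "555.987.6543", "age": "unknown"},
]


def clean_phone(value):
    """Strip every non-digit character so phone numbers share one format."""
    return re.sub(r"\D", "", value)


def parse_age(value):
    """Return the age as an int, or None when the string is not a whole number."""
    return int(value) if re.fullmatch(r"\d+", value.strip()) else None


def clean_row(row):
    """Apply the string and type fixes to a single record."""
    return {
        "name": row["name"].strip(),
        "phone": clean_phone(row["phone"]),
        "age": parse_age(row["age"]),
    }


clean_rows = [clean_row(row) for row in raw_rows]
```

Turning a bad age into None instead of raising an error is just one possible choice here. It keeps the row around so you can decide later whether to drop it, fix it, or go ask where the data came from.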

I, naturally, started the class with the unscientific statistic that 80% of a data scientist's time is spent preparing data, but it takes a bit before the students are on board with the idea that this class represents the bulk of the work rather than the flashy visualization class or the statistics and machine learning classes. It just boils down to the garbage-in, garbage-out principle. The more effort you put into preparing the data, the better your results will be.

But I am preaching to the choir, right? You came here to learn about the new data cleaning tools that are starting to appear. As I complete my reviews of the tools listed above, I will add links to the reviews below.

Data Preparation Tool Reviews: