Open Refine. A free, open source, powerful tool for working with messy data
OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data. You can download it from here.
In the Cognitive Class tutorial on OpenRefine it talks about 60-85% of time of data scientists spending on cleaning the data. So this tool could be very useful.
There are 3 videos worth watching on the Home page. It shows it to be a pretty powerful tool. I was pretty impressed by its capabilities.
On my first brief tinker I did not find it intuitive and will need to do some tutorials to see how you need to programme it. There seem to be a number of tutorials HERE.
In some ways it is a bit like the Power BI setup with it remembering the steps you used to filter your data, so you can go backwards through the steps if you have made errors.
In one of the videos it shows a demonstration of how it setup filters for an “Actresses” Table, then used the process to replicate all the steps on an “Actors” table that was in the same format.
Process
I am importing a 3D PDF export to CSV file. I want to filter out Room Data that I can put in a table to show Room information. So this is a data cleaning exercise.
The imput from CSV gets a bit garbled, there are 2 lines of data on one line. So I tried saving the file as an XLS file and importing it. See the results below. It comes in correctly
Note also , in image above I took away the header by changing Line(s) of Column header from 1 to 0.
I have since found, that I can bring this file in correctly in CSV format by using custom import (red arrow on left at bottom).
This is the preview window, you next need to create the project.
Once the project created, it is then ready to “Refine”.
Filter
I want to filter “Column 1” by 3D Room Tags , so use the text filter on pull-down.
Using just 3d, we get the some data not actualy related to Rooms, this is the 3D_RoomTag(56), which we don’t want, so we want to filter that out too.
So use 3DRm in filter and only get room data.
Other resources
The General Refine Expression Language( GREP) information can be found here.
End comment
This could work very well on the 3D PDF extract process to filter out specific types of data. Mainly for exploration as it could not be used for dispersal for use to others (they’d need to load the programme & learn to use it). Better with current Excel Macro, but good for a learning exercise.
I have spent a lot of time trying (and failing a lot) in doing data cleaning and I am excited to try this tool out. Being able to template some processes could be great too.
From my brief play, it is a part of the process of data cleaning to analysis. It’ll clean data up but then you have to use something like Pyton to actually Manipulate/Analyse the data. I look forward to finding a nice data-set to let it work its magic on.