{"id":3274,"date":"2018-05-23T17:11:49","date_gmt":"2018-05-23T17:11:49","guid":{"rendered":"https:\/\/max-drake.cc\/?p=3274"},"modified":"2018-05-23T17:17:21","modified_gmt":"2018-05-23T17:17:21","slug":"open-refine-a-free-open-source-powerful-tool-for-working-with-messy-data","status":"publish","type":"post","link":"https:\/\/max-drake.cc\/?p=3274","title":{"rendered":"Open Refine. A free, open source, powerful tool for working with messy data"},"content":{"rendered":"<p>OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data. You can download it from <a href=\"http:\/\/openrefine.org\/download.html\" target=\"_blank\" rel=\"noopener\"><strong>here<\/strong><\/a>.<\/p>\n<p>In the Cognitive Class tutorial on OpenRefine it talks about 60-85% of time of data scientists spending on cleaning the data. So this tool could be very useful.<\/p>\n<p>There are 3 videos worth watching on the <a href=\"http:\/\/openrefine.org\/\" target=\"_blank\" rel=\"noopener\">Home page<\/a>. It shows it to be a pretty powerful tool. I was pretty impressed by its capabilities.<\/p>\n<p>On my first brief tinker I did not find it intuitive and will need to do some tutorials to see how you need to programme it. There seem to be a number of tutorials <a href=\"https:\/\/github.com\/OpenRefine\/OpenRefine\/wiki\/External-Resources\" target=\"_blank\" rel=\"noopener\">HERE<\/a>.<\/p>\n<p>In some ways it is a bit like the Power BI setup with it remembering the steps you used to filter your data, so you can go backwards through the steps if you have made errors.<\/p>\n<p><img decoding=\"async\" class=\"wp-image-3275 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imj05-1024x1019.jpg\" alt=\"\" width=\"1256\" height=\"1250\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imj05-1024x1019.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imj05-150x150.jpg 150w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imj05-300x300.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imj05-768x764.jpg 768w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imj05.jpg 2004w\" data-sizes=\"(max-width: 1256px) 100vw, 1256px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 1256px; --smush-placeholder-aspect-ratio: 1256\/1250;\" \/><\/p>\n<p>In one of the videos it shows a demonstration of how it setup filters for an &#8220;Actresses&#8221; Table, then used the process to replicate all the steps on an &#8220;Actors&#8221; table that was in the same format.<\/p>\n<p>&nbsp;<\/p>\n<h3>Process<\/h3>\n<p>I am importing a 3D PDF export to CSV file. I want to filter out Room Data that I can put in a table to show Room information. So this is a data cleaning exercise.<\/p>\n<p><img decoding=\"async\" class=\"size-large wp-image-3285 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr01-1024x510.jpg\" alt=\"\" width=\"678\" height=\"338\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr01-1024x510.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr01-300x149.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr01-768x383.jpg 768w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr01.jpg 1283w\" data-sizes=\"(max-width: 678px) 100vw, 678px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 678px; --smush-placeholder-aspect-ratio: 678\/338;\" \/><\/p>\n<p><img decoding=\"async\" class=\"wp-image-3284 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr02-1024x640.jpg\" alt=\"\" width=\"905\" height=\"566\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr02-1024x640.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr02-300x187.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr02-768x480.jpg 768w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr02-200x125.jpg 200w\" data-sizes=\"(max-width: 905px) 100vw, 905px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 905px; --smush-placeholder-aspect-ratio: 905\/566;\" \/><\/p>\n<p><img decoding=\"async\" class=\"wp-image-3282 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr03-749x1024.jpg\" alt=\"\" width=\"813\" height=\"1112\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr03-749x1024.jpg 749w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr03-219x300.jpg 219w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr03-768x1051.jpg 768w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr03.jpg 897w\" data-sizes=\"(max-width: 813px) 100vw, 813px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 813px; --smush-placeholder-aspect-ratio: 813\/1112;\" \/><\/p>\n<p><img decoding=\"async\" class=\"wp-image-3283 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr04-1024x627.jpg\" alt=\"\" width=\"881\" height=\"539\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr04-1024x627.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr04-300x184.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr04-768x470.jpg 768w\" data-sizes=\"(max-width: 881px) 100vw, 881px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 881px; --smush-placeholder-aspect-ratio: 881\/539;\" \/><\/p>\n<p>The imput from CSV gets a bit garbled, there are 2 lines of data on one line. So I tried saving the file as an XLS file and importing it. See the results below. It comes in correctly<\/p>\n<p><img decoding=\"async\" class=\"size-full wp-image-3281 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr05.jpg\" alt=\"\" width=\"925\" height=\"488\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr05.jpg 925w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr05-300x158.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr05-768x405.jpg 768w\" data-sizes=\"(max-width: 925px) 100vw, 925px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 925px; --smush-placeholder-aspect-ratio: 925\/488;\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><img decoding=\"async\" class=\"wp-image-3280 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr06-1024x649.jpg\" alt=\"\" width=\"851\" height=\"540\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr06-1024x649.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr06-300x190.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr06-768x487.jpg 768w\" data-sizes=\"(max-width: 851px) 100vw, 851px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 851px; --smush-placeholder-aspect-ratio: 851\/540;\" \/><\/p>\n<p>Note also , in image above I took away the header by changing Line(s) of Column header from 1 to 0.<\/p>\n<p>I have since found, that I can bring this file in correctly in CSV format by using custom import (red arrow on left at bottom).<\/p>\n<p>This is the preview window, you next need to create the project.<\/p>\n<p><img decoding=\"async\" class=\"wp-image-3288 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr07-1024x625.jpg\" alt=\"\" width=\"911\" height=\"556\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr07-1024x625.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr07-300x183.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr07-768x469.jpg 768w\" data-sizes=\"(max-width: 911px) 100vw, 911px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 911px; --smush-placeholder-aspect-ratio: 911\/556;\" \/><\/p>\n<p><img decoding=\"async\" class=\"wp-image-3294 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr08-1024x402.jpg\" alt=\"\" width=\"841\" height=\"330\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr08-1024x402.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr08-300x118.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr08-768x301.jpg 768w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr08.jpg 1598w\" data-sizes=\"(max-width: 841px) 100vw, 841px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 841px; --smush-placeholder-aspect-ratio: 841\/330;\" \/><\/p>\n<p>Once the project created, it is then ready to &#8220;Refine&#8221;.<\/p>\n<p><img decoding=\"async\" class=\"wp-image-3293 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr09-1024x672.jpg\" alt=\"\" width=\"849\" height=\"557\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr09-1024x672.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr09-300x197.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr09-768x504.jpg 768w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr09.jpg 1906w\" data-sizes=\"(max-width: 849px) 100vw, 849px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 849px; --smush-placeholder-aspect-ratio: 849\/557;\" \/><\/p>\n<h3>Filter<\/h3>\n<p>I want to filter &#8220;Column 1&#8221;&nbsp; by 3D Room Tags , so use the text filter on pull-down.<\/p>\n<p><img decoding=\"async\" class=\"wp-image-3292 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr10-1024x867.jpg\" alt=\"\" width=\"822\" height=\"696\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr10.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr10-300x254.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr10-768x650.jpg 768w\" data-sizes=\"(max-width: 822px) 100vw, 822px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 822px; --smush-placeholder-aspect-ratio: 822\/696;\" \/><\/p>\n<p>Using just 3d, we get the some data not actualy related to Rooms, this is the 3D_RoomTag(56), which we don&#8217;t want, so we want to filter that out too.<\/p>\n<p><img decoding=\"async\" class=\"wp-image-3291 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr11-1024x863.jpg\" alt=\"\" width=\"839\" height=\"707\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr11-1024x863.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr11-300x253.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr11-768x647.jpg 768w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr11.jpg 1300w\" data-sizes=\"(max-width: 839px) 100vw, 839px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 839px; --smush-placeholder-aspect-ratio: 839\/707;\" \/><\/p>\n<p>So use 3DRm in filter and only get room data.<\/p>\n<p><img decoding=\"async\" class=\"wp-image-3290 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr12-1024x846.jpg\" alt=\"\" width=\"824\" height=\"681\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr12-1024x846.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr12-300x248.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr12-768x635.jpg 768w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr12.jpg 1396w\" data-sizes=\"(max-width: 824px) 100vw, 824px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 824px; --smush-placeholder-aspect-ratio: 824\/681;\" \/><\/p>\n<h3>Other resources<\/h3>\n<p>The General Refine Expression Language( GREP) information can be found <a href=\"https:\/\/github.com\/OpenRefine\/OpenRefine\/wiki\/General-Refine-Expression-Language\" target=\"_blank\" rel=\"noopener\"><strong>here<\/strong><\/a>. &nbsp;<\/p>\n<p><iframe title=\"Clean Your Data: Getting Started with OpenRefine [workshop]\" width=\"678\" height=\"381\" data-src=\"https:\/\/www.youtube.com\/embed\/wGVtycv3SS0?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" class=\"lazyload\" data-load-mode=\"1\"><\/iframe><\/p>\n<p><iframe title=\"OpenRefine Tutorials: How To Join Two Data Set\" width=\"678\" height=\"509\" data-src=\"https:\/\/www.youtube.com\/embed\/ogE2xZk7354?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" class=\"lazyload\" data-load-mode=\"1\"><\/iframe><\/p>\n<h3>End comment<\/h3>\n<p>This could work very well on the 3D PDF extract process to filter out specific types of data. Mainly for exploration as it could not be used for dispersal for use to others (they&#8217;d need to load the programme &amp; learn to use it). Better with current Excel Macro, but good for a learning exercise.<\/p>\n<p>I have spent a lot of time trying (and failing a lot) in doing data cleaning and I am excited to try this tool out. Being able to template some processes could be great too.<\/p>\n<p>From my brief play, it is a part of the process of data cleaning to analysis. It&#8217;ll clean data up but then you have to use something like Pyton to actually Manipulate\/Analyse the data. I look forward to finding a nice data-set to let it work its magic on.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data. You can download it from here. In the Cognitive Class tutorial on OpenRefine it talks about 60-85% of time of data scientists spending on [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":3290,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-3274","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-extraction"],"featured_image_src":"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr12.jpg","featured_image_src_square":"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/05\/imr12.jpg","author_info":{"display_name":"Max Drake","author_link":"https:\/\/max-drake.cc\/?author=1"},"_links":{"self":[{"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/posts\/3274","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3274"}],"version-history":[{"count":0,"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/posts\/3274\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/media\/3290"}],"wp:attachment":[{"href":"https:\/\/max-drake.cc\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3274"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3274"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3274"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}