{"id":3403,"date":"2018-06-06T21:46:56","date_gmt":"2018-06-06T21:46:56","guid":{"rendered":"https:\/\/max-drake.cc\/?p=3403"},"modified":"2018-06-07T15:12:25","modified_gmt":"2018-06-07T15:12:25","slug":"web-searching-scraping-free-tools-free-extract-tables-tool-from-pdfs","status":"publish","type":"post","link":"https:\/\/max-drake.cc\/?p=3403","title":{"rendered":"Google searching, free Web Scraping tools and free Extract Tables from PDF tool."},"content":{"rendered":"<p>I have been looking at some of the free <strong><a href=\"https:\/\/cognitiveclass.ai\/\" target=\"_blank\" rel=\"noopener\">Data Science and Cognitive Computing Courses<\/a><\/strong> and was following the <span class=\"course-name\"><strong>Data Journalism: First Steps, Skills and Tools<\/strong> course.<br \/>\n<\/span><\/p>\n<p>The <a href=\"https:\/\/player.vimeo.com\/video\/90266811\" target=\"_blank\" rel=\"noopener\"><strong>Google Searching video<\/strong><\/a> covered improving your Google searches using the following:<\/p>\n<ul>\n<li>quotes to match an exact phrase, e.g. &#8220;data science on construction&#8221;<\/li>\n<li>using the &#8211; sign to exclude certain words, e.g. &#8220;data science on construction&#8221; -mbie -nz (<em>note: no space between the &#8211; and the word you want excluded<\/em>)<\/li>\n<li>using the wildcard &#8220;*&#8221; in a search, e.g. &#8220;data on construction * 2018&#8221;<\/li>\n<li>restricting the search to a specific site, e.g. <strong>site<\/strong>:police.uk statistics, or more specifically site:dorset.police.uk statistics, or site:*.nhs, or by country site:nz &#8220;poverty statistics&#8221; (<em>note: no space after the colon<\/em>)<\/li>\n<li>restricting the file type, e.g. site:nl <strong>filetype<\/strong>:pdf<\/li>\n<li>for a database, where searches could vary, e.g. site:nz database &#8220;search by&#8221; (<em>the &#8220;search by&#8221; text would most probably mark a place for putting a query to the database<\/em>)<\/li>\n<\/ul>\n<p>There was a <a 
href=\"https:\/\/player.vimeo.com\/video\/90266809\" target=\"_blank\" rel=\"noopener\"><strong>web scraping video<\/strong><\/a> that I thought was great. I had previously made a couple of attempts with Python &amp; the Beautiful Soup package, with limited results so far.<\/p>\n<p>I particularly liked the <strong>Google Spreadsheets example<\/strong> (from 3.40 to 6.30 in the video above). This required a formula in a cell of the form =importhtml(&#8220;URL&#8221;, query, index), where the query is either &#8220;table&#8221; or &#8220;list&#8221; and the index picks which one on the page; in the example he used, it was a table.<\/p>\n<p>I also liked the free <a href=\"https:\/\/www.outwit.com\/\" target=\"_blank\" rel=\"noopener\"><strong>Outwit Hub tool<\/strong><\/a> that he demonstrated, which works over several pages (from 9.10 to 12.40 in the video above).<\/p>\n<h3>Outwit Example also using Google Search site:nz database &#8220;search by&#8221;<\/h3>\n<p>I tried it out on a database search that I found. The web site did not include the actual search in the URL, so I had to run the search in the web page within Outwit Hub, and then it crawled through the pages to get the first 100 lines (there were some messy lines I had to clear). As the results exceeded the 100-row limit of the free version, I could run the search again from a later date to grab more data. 
I am still impressed by the tool.<\/p>\n<p><img decoding=\"async\" class=\"wp-image-3407 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp01-1024x659.jpg\" alt=\"\" width=\"1623\" height=\"1044\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp01-1024x659.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp01-300x193.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp01-768x494.jpg 768w\" data-sizes=\"(max-width: 1623px) 100vw, 1623px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 1623px; --smush-placeholder-aspect-ratio: 1623\/1044;\" \/><\/p>\n<p><img decoding=\"async\" class=\"wp-image-3406 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp02-1024x661.jpg\" alt=\"\" width=\"1461\" height=\"944\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp02-1024x661.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp02-300x194.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp02-768x496.jpg 768w\" data-sizes=\"(max-width: 1461px) 100vw, 1461px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 1461px; --smush-placeholder-aspect-ratio: 1461\/944;\" \/><\/p>\n<p><img decoding=\"async\" class=\"wp-image-3408 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp03-883x1024.jpg\" alt=\"\" width=\"1168\" height=\"1354\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp03-883x1024.jpg 883w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp03-259x300.jpg 259w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp03-768x890.jpg 768w, 
https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp03.jpg 1331w\" data-sizes=\"(max-width: 1168px) 100vw, 1168px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 1168px; --smush-placeholder-aspect-ratio: 1168\/1354;\" \/><\/p>\n<p>I did a search on the web for some other tools and came across <a href=\"https:\/\/www.quora.com\/What-are-some-of-the-best-web-data-scraping-tools\" target=\"_blank\" rel=\"noopener\"><strong>this article<\/strong><\/a> which then referenced a <a href=\"https:\/\/www.octoparse.com\/blog\/best-data-scraping-tools-for-2018-top-10-reviews\/?qu\" target=\"_blank\" rel=\"noopener\"><strong>later article<\/strong><\/a> highlighting other web scraping tools, some free.<\/p>\n<h3>Google Spreadsheet importhtml()<\/h3>\n<p>Using the Google Spreadsheets as a test, I also got this&nbsp; ( although I couldn&#8217;t get the table from https:\/\/www.stats.govt.nz\/topics\/building as I think it is in a separate tab (graph tab and table tab) )<\/p>\n<p>&nbsp;<img decoding=\"async\" class=\"wp-image-3411 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp05-622x1024.jpg\" alt=\"\" width=\"1177\" height=\"1938\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp05-622x1024.jpg 622w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp05-182x300.jpg 182w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp05-768x1264.jpg 768w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp05.jpg 1003w\" data-sizes=\"(max-width: 1177px) 100vw, 1177px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 1177px; --smush-placeholder-aspect-ratio: 1177\/1938;\" \/><\/p>\n<h3>PDF Table Extract using Tabula<\/h3>\n<p>And on a similar subject another tool I 
would like to mention is <a href=\"https:\/\/tabula.technology\/\" target=\"_blank\" rel=\"noopener\"><strong>TABULA<\/strong><\/a>, which can extract tables from PDFs and export them to, say, CSV files. It runs as a server on your computer and opens in your browser. <strong><a href=\"https:\/\/flowingdata.com\/\" target=\"_blank\" rel=\"noopener\">FlowingData<\/a><\/strong> recommended this tool.<\/p>\n<p><img decoding=\"async\" class=\"wp-image-3412 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp06-1024x621.jpg\" alt=\"\" width=\"1467\" height=\"889\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp06-1024x621.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp06-300x182.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp06-768x466.jpg 768w\" data-sizes=\"(max-width: 1467px) 100vw, 1467px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 1467px; --smush-placeholder-aspect-ratio: 1467\/889;\" \/> <img decoding=\"async\" class=\"wp-image-3413 aligncenter lazyload\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp07-1024x637.jpg\" alt=\"\" width=\"1501\" height=\"934\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp07-1024x637.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp07-300x187.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp07-768x478.jpg 768w, https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp07-200x125.jpg 200w\" data-sizes=\"(max-width: 1501px) 100vw, 1501px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 1501px; --smush-placeholder-aspect-ratio: 1501\/934;\" \/><\/p>\n<p>On this subject I came across an 
on-line version<a href=\"https:\/\/pdftables.com\/\" target=\"_blank\" rel=\"noopener\"><strong> here<\/strong><\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I have been looking at some of the free Data Science and Cognitive Computing Courses and was following the Data Journalism: First Steps, Skills and Tools course. The Google Searching video was on improving your searches in Google using the following: quotes to get specific key words eg &#8220;data science on construction&#8221; using &#8211; sign [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":3407,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,36,29],"tags":[],"class_list":["post-3403","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-extraction","category-databases","category-web"],"featured_image_src":"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp01.jpg","featured_image_src_square":"https:\/\/max-drake.cc\/wp-content\/uploads\/2018\/06\/impp01.jpg","author_info":{"display_name":"Max 
Drake","author_link":"https:\/\/max-drake.cc\/?author=1"},"_links":{"self":[{"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/posts\/3403","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3403"}],"version-history":[{"count":0,"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/posts\/3403\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/media\/3407"}],"wp:attachment":[{"href":"https:\/\/max-drake.cc\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3403"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3403"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3403"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}