{"id":6134,"date":"2020-04-12T08:12:50","date_gmt":"2020-04-12T08:12:50","guid":{"rendered":"https:\/\/max-drake.cc\/?p=6134"},"modified":"2020-04-12T08:12:56","modified_gmt":"2020-04-12T08:12:56","slug":"web-scraping-with-browser-development-console-javascript","status":"publish","type":"post","link":"https:\/\/max-drake.cc\/?p=6134","title":{"rendered":"Web Scraping with Browser development console &#038; JavaScript"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">After pulling the data from an API and displaying it in a web page chart, and developing a crude selection list for different countries and dates, I started to think about creating my own API as a project. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As I looked around for data I came across Dixon Cheng&#8217;s <strong><a rel=\"noreferrer noopener\" aria-label=\"Github repository of Covid-19 data (opens in a new tab)\" href=\"https:\/\/github.com\/dixoncheng\/covid19map\" target=\"_blank\">GitHub repository of Covid-19 data<\/a><\/strong>, scraped from the MoH&#8217;s website and in JSON format. I can play with that rather than relying on my Excel scraping, which will also allow me to revisit what I&#8217;ve done already. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">On reviewing my <strong>Google Sheets importXML() <\/strong>and <strong>Excel get from Web<\/strong> I wondered if I could do that in JavaScript. Then, when I asked Mr YouTube, up popped all these packages with jQuery &amp; React etc. 
I only wanted to use native JS, as it doesn&#8217;t break when you neglect it and you don&#8217;t have to bloat your server with packages to stop them breaking when updating (<em>still an issue when the browser won&#8217;t read older stored code<\/em> <em>packages<\/em>).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Then I saw this cool video. It is really raw and a bit manual, but I love its simplicity:<\/p>\n\n\n\n<figure class=\"wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe title=\"JavaScript Web Scraper Using Google Chrome Console (Part 1\/2)\" width=\"678\" height=\"381\" data-src=\"https:\/\/www.youtube.com\/embed\/0NC9_R9TON4?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" class=\"lazyload\" data-load-mode=\"1\"><\/iframe>\n<\/div><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">I really like this simple solution, so I will explore trying to capture information from the MoH site. This is my next project. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I tried to set up the process in Firefox Dev Tools, but could only get an XPath, and I needed <strong>Copy JS path<\/strong>, so I had to use the Chrome browser to do that. 
<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1024\" height=\"572\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV-1024x572.jpg\" alt=\"\" class=\"wp-image-6137 lazyload\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV-1024x572.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV-300x168.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV-768x429.jpg 768w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV-1536x858.jpg 1536w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV-50x28.jpg 50w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV-90x50.jpg 90w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV-100x56.jpg 100w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV-179x100.jpg 179w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV-1146x640.jpg 1146w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV-640x357.jpg 640w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV-1375x768.jpg 1375w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV.jpg 1594w\" data-sizes=\"(max-width: 1024px) 100vw, 1024px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 1024px; --smush-placeholder-aspect-ratio: 1024\/572;\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Also, when I built up the script to extract information, it was as follows: <strong><span style=\"color:#cf2e2e\" class=\"color\">document.querySelector<\/span>(&#8220;#content > 
article:nth-child(2) > header > h2&#8221;)<\/strong>, so the copied path already has <strong>document.querySelector<\/strong> at the front of it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For the actual code, after identifying the element we want to extract data from it, so it becomes <strong>document.querySelector(&#8220;#content > article:nth-child(2) > header > h2&#8221;)<span style=\"color:#cf2e2e\" class=\"color\">.innerText<\/span><\/strong>. In the video he adds <strong>innerText<\/strong> (you could also use <strong>innerHTML<\/strong>), so what is returned is text. As this is pulling numbers from the table, it is returning an array with only text, so the numbers aren&#8217;t usable in that form.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So you need to do some sort of conversion on the returned array. I tried doing <strong>document.querySelector(&#8220;#content > article:nth-child(2) > header > h2&#8221;)<\/strong><strong>.innerText<\/strong>.<strong><span style=\"color:#cf2e2e\" class=\"color\">parseINT()<\/span><\/strong> but that didn&#8217;t work, because parseInt() is a global function that takes the string as its argument rather than a method on the string, and its name is case-sensitive. I got partial success with <strong>document.querySelector(&#8220;#content > article:nth-child(2) > header > h2&#8221;)<\/strong><strong>.innerText<\/strong>*1, multiplying the resulting text by 1. 
This worked on numbers such as &#8220;24&#8221; and &#8220;243&#8221;, but the numbers in the table were formatted as &#8220;1,234&#8221; and this just returned NaN, as it doesn&#8217;t recognise the comma in the number.<\/p>\n\n\n\n<p class=\"has-background has-very-light-gray-background-color wp-block-paragraph\">Two things I can try are: first, replacing <strong>.innerText<\/strong> with <strong>.innerHTML<\/strong> to see if that returns a number; second, trying the parsing before the innerText, as <strong>document.querySelector(&#8220;#content > article:nth-child(2) > header > h2&#8221;)<\/strong><strong>.parseINT(innerText)<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I tried both the above suggestions; neither worked. The innerHTML still gave a string, and parseINT(), whether wrapping innerHTML or placed before it with a dot, didn&#8217;t work.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If getting numbers out of a table is going to be difficult, then this may not be a particularly useful tool, as there would be a whole lot of data manipulation to change type. I&#8217;ll do the tests above, but if it&#8217;s a major job I may not follow it up further. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I thought there was real potential in this method: I could get the data, add it to a JSON object and then use that as a data source, and there would not be the issue of tables moving on the page, which is an issue with the Excel from Web scrape method. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Using Firefox<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You could use Firefox, but querySelector() takes a CSS selector rather than an XPath, so copy the CSS Selector instead and then wrap it: <strong><span style=\"color:#cf2e2e\" class=\"color\">document.querySelector(&#8220;<\/span>CSS Selector from Firefox<span style=\"color:#cf2e2e\" class=\"color\">&#8221;).innerText;<\/span><\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Creating an API<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">After that I want to see if I can set up my own API. There is this video below that uses a database and PHP. That is sort of my level of coding, so I may try that project after the web scraping one.<\/p>\n\n\n\n<figure class=\"wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe title=\"Coding a Simple API [PHP, SQL]\" width=\"678\" height=\"381\" data-src=\"https:\/\/www.youtube.com\/embed\/bwBbypKq61I?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" class=\"lazyload\" data-load-mode=\"1\"><\/iframe>\n<\/div><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">What I was thinking of was to have a JSON object held somewhere that I could call to retrieve data. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I&#8217;d initially downloaded Dixon Cheng&#8217;s repository above to use the JSON data, but found that I could call the file from a JS fetch() command if I called the file in its RAW format, <strong>https:\/<span style=\"color:#cf2e2e\" class=\"color\">\/raw<\/span>.githubusercontent.com\/dixoncheng\/covid19map\/master\/data\/summary.json<\/strong>, and this would return a JSON object. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It is calling all the data, rather than just a portion of it as I was doing with the Johns Hopkins API on GitHub, which allows you to pull data for a specific date range and only a specific country.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Although it&#8217;s not as flexible, for my purposes this method would work fine for small datasets, as you pull it down and then create arrays in the date range that you want. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I tried putting it onto my server site and calling it from there with VS Code and Live Server, but I get &#8220;(Reason: CORS header \u2018Access-Control-Allow-Origin\u2019 missing)&#8221;.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img decoding=\"async\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_15_34-Films-TV-1024x200.jpg\" alt=\"\" class=\"wp-image-6138 lazyload\" width=\"580\" height=\"113\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_15_34-Films-TV-1024x200.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_15_34-Films-TV-300x59.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_15_34-Films-TV-768x150.jpg 768w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_15_34-Films-TV-1536x300.jpg 1536w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_15_34-Films-TV-50x10.jpg 50w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_15_34-Films-TV-256x50.jpg 256w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_15_34-Films-TV-100x20.jpg 100w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_15_34-Films-TV-512x100.jpg 512w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_15_34-Films-TV-640x125.jpg 640w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_15_34-Films-TV.jpg 1909w\" 
data-sizes=\"(max-width: 580px) 100vw, 580px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 580px; --smush-placeholder-aspect-ratio: 580\/113;\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">But if I put the file on my server and call it from there, with the JSON file on my server also, it works fine and I can reach the array data, which is great. <\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1024\" height=\"505\" data-src=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_13_36-Chart-from-API-1024x505.jpg\" alt=\"\" class=\"wp-image-6139 lazyload\" data-srcset=\"https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_13_36-Chart-from-API-1024x505.jpg 1024w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_13_36-Chart-from-API-300x148.jpg 300w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_13_36-Chart-from-API-768x379.jpg 768w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_13_36-Chart-from-API-1536x758.jpg 1536w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_13_36-Chart-from-API-50x25.jpg 50w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_13_36-Chart-from-API-101x50.jpg 101w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_13_36-Chart-from-API-100x49.jpg 100w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_13_36-Chart-from-API-203x100.jpg 203w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_13_36-Chart-from-API-1297x640.jpg 1297w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_13_36-Chart-from-API-640x316.jpg 640w, https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_13_36-Chart-from-API-1556x768.jpg 1556w, 
https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-07_13_36-Chart-from-API.jpg 1882w\" data-sizes=\"(max-width: 1024px) 100vw, 1024px\" src=\"data:image\/svg+xml;base64,PHN2ZyB3aWR0aD0iMSIgaGVpZ2h0PSIxIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPjwvc3ZnPg==\" style=\"--smush-placeholder-width: 1024px; --smush-placeholder-aspect-ratio: 1024\/505;\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The solution is to bring the JSON file onto the PC so that I&#8217;m calling a local file to set up and test, then move the information onto the server, along with the JSON file. Then it&#8217;s a matter of updating the JSON file if it&#8217;s a time series that is changing.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I tried putting RAW in the URL for my server site, as per the GitHub one, but that did not work either. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The plain solution at this time is to use the GitHub repository to call the info too. Actually, I&#8217;m a bit confused now. With the JSON file on my server I didn&#8217;t need it in raw format, but when I was in GitHub, because there is a lot of other HTML info on the page, you do.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>After pulling the data from an API and displaying it in a web page Chart, and developing a crude selection list for different countries and dates, I started to think about creating my own API as a project. 
As I looked around for data I came across Dixon Cheng&#8217;s Github repository of Covid-19 data scraped [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":6137,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[35,204,3,29],"tags":[],"class_list":["post-6134","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-api_json","category-automation","category-data-extraction","category-web"],"featured_image_src":"https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV.jpg","featured_image_src_square":"https:\/\/max-drake.cc\/wp-content\/uploads\/2020\/04\/2020-04-13-05_43_28-Films-TV.jpg","author_info":{"display_name":"Max Drake","author_link":"https:\/\/max-drake.cc\/?author=1"},"_links":{"self":[{"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/posts\/6134","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6134"}],"version-history":[{"count":0,"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/posts\/6134\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=\/wp\/v2\/media\/6137"}],"wp:attachment":[{"href":"https:\/\/max-drake.cc\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6134"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6134"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/max-drake.cc\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6134"}],"curies":[{"name"
:"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}