Web Scraping With Node.js & Cheerio

This is based on the video by Brad Traversy below:

Just as a side note, there is a wiki API page here. I’m putting it here instead of in the previous article because the website is running extremely slowly.

I liked the idea of this, and the beginning part, where he pulls the whole page, worked fine. But when I got to the const $ part (jQuery-style syntax, apparently) it didn’t work for me: the server would start and nothing would be returned.

I also had trouble with the Terminal in VS Code. It would not run nodemon, so I had to go into PowerShell in admin mode and change the configuration, and it would only work after I restarted the computer. All a bit time consuming, and you tend to lose sight of what you are doing. I watched another video of someone doing scraping, below. I’m in awe of the way both these people fly through it; they must be very familiar with Node.js programming.

I tried axios, but still had problems with $, so I went back to the first video.

I’m trying to scrape the MoH webpage, which is static: the data is fixed on the site. So I’ll work away at that.

Testing now lets me use $, but I’m still not drilling down to the right level to get what I need.

I also got confused because I was trying to start a webpage to view on localhost:3000, but nothing was happening. On reflection: 1/ I don’t have the express package installed, and 2/ I’m scraping a site, not generating a web page, so all the action occurs in the console. So I need to focus on using console.log() to see what is being returned.

I have a dilemma in that two tables have exactly the same class name, so I’m having difficulty extracting the data from the first table. Finally found a solution on StackOverflow here.

So not the most productive day, but I can now scrape the fields of the first table.

I need to get this into a JSON file, and get the date, and later be able to append it to an existing JSON file.
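A rough sketch of how that could look, using only Node's built-in `fs` module. The helper name `appendRecord`, the file name `scrape.json` and the sample rows are all hypothetical, and I am assuming "the date" means the date of the scrape. Since a JSON array cannot really be appended to in place, the sketch reads the existing file, pushes a new record and writes the whole lot back.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: add one dated record to a JSON file that holds an array.
function appendRecord(filePath, data) {
  // Start from the existing array, or an empty one if the file is not there yet.
  const records = fs.existsSync(filePath)
    ? JSON.parse(fs.readFileSync(filePath, 'utf8'))
    : [];

  // Stamp the record with today's date (YYYY-MM-DD, in UTC).
  const date = new Date().toISOString().slice(0, 10);
  records.push({ date, data });

  // Write the whole array back, pretty-printed with 2 spaces.
  fs.writeFileSync(filePath, JSON.stringify(records, null, 2));
  return records;
}

// Usage with made-up rows, writing into a fresh temp folder.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'scrape-'));
const file = path.join(dir, 'scrape.json');
appendRecord(file, [['Confirmed', '10']]);
const saved = appendRecord(file, [['Confirmed', '12']]);
console.log(saved.length); // 2
```

Rewriting the whole file is fine for a small daily scrape; a much bigger data set would probably want a different format or a database.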

So, in cheerio, to get a class you use $('.rabbit'). The full stop denotes class=, and for id= you use $('#rabbit'). I was trying to use $('#rabbit').attr('tr') to get the tr tag, but with only partial success. Also tried .children() and .contents().

Static & Dynamic webpages

I started watching a few videos on web scraping; I’d heard of Cheerio and also of Puppeteer for the purpose. The video below is impressive and talks about how a site loads and then JS scripts bring in data dynamically. It also covers Puppeteer driving a headless Chrome browser, so it can log in for you and then scrape dynamic sites once the JS data has loaded. This is a future step I need to explore. I take my hat off to the person in the video below: he’s fast and knowledgeable and explains well.

An interesting thing in the tutorial below is how he grabs the data that the JS loads, to capture incoming data from a dynamic site. He also doesn’t seem to use $ as much in his code. The other interesting thing is his use of the table library to present information; I will have to play with that too.

End comment

I’m going to stop this post here as I’ve sort of got the initial grab of table data.

I’ve been doing a bit of research and Code Train have a series on building an API in Node, so I’m going to follow that.

I have picked up a bit of learning from this one, like how to grab items and test in a web scrape with cheerio, but as that is only one part of the whole process I’ll start afresh in the next post.