Python Selenium Webscrape Family History
In this project I posed the question of which sibling lives the longest. The question popped into my head after two of my siblings passed away last year. I looked on the interwongle and couldn’t find any data on the subject, so I decided to do my own research.
I used the NZ Births Deaths & Marriages website and did some web scraping using Selenium and Firefox to scrape data from the tables to store for analysis.
As I’ve mentioned in previous videos, I usually don’t have much success scraping from the front end, but I decided to persevere with the process.
Two things helped me immensely:
1. from selenium.webdriver.support.wait import WebDriverWait and from selenium.webdriver.support import expected_conditions as EC.
With these the script was able to wait for the next page to load in the browser. This ensured that each page had fully loaded before it was scraped, as the script worked through multiple pages of the table.
2. try/except, which catches the exception if the code in the try block fails, so that the script can simply carry on with the next pass of the loop.
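Here is a rough sketch of how those two pieces can fit together. It is not the actual script: the "table" selector and the "Next" link text are assumptions about how the results page is built, and scrape_all_pages and scrape_names are names made up for illustration.

```python
def scrape_all_pages(driver, timeout=10):
    """Collect the rows from every page of a paginated results table."""
    # Selenium is imported inside the function so the plain-Python helper
    # below can be used (and tested) without a browser installed.
    from selenium.common.exceptions import NoSuchElementException, TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait

    rows = []
    while True:
        # Wait until the results table is present before reading it.
        # ASSUMPTION: the results live in the first <table> on the page.
        table = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "table"))
        )
        for tr in table.find_elements(By.CSS_SELECTOR, "tbody tr"):
            rows.append([td.text for td in tr.find_elements(By.TAG_NAME, "td")])
        try:
            # ASSUMPTION: paging is done with a link labelled "Next".
            driver.find_element(By.LINK_TEXT, "Next").click()
            # The old table goes stale once the next page has replaced it.
            WebDriverWait(driver, timeout).until(EC.staleness_of(table))
        except (NoSuchElementException, TimeoutException):
            # No "Next" link, or the page never changed: treat as the last page.
            break
    return rows


def scrape_names(names, scrape_one):
    """Run scrape_one(name) for each name and keep going when one fails."""
    results, failed = {}, []
    for name in names:
        try:
            results[name] = scrape_one(name)
        except Exception as exc:
            # Note the failure and move on to the next name in the loop.
            print(f"Skipping {name}: {exc}")
            failed.append(name)
    return results, failed
```

Catching a broad Exception is deliberate here: a single bad search should not stop a long run, and the list of failed names can be retried afterwards.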
I expanded the code to work on a list of names and also broke the time period down into smaller segments, in the hope of not reaching the maximum number of matches on a query (500, or 17 pages). Because there may be cases where I do hit that maximum, I created a log file to record those queries so that I could go back and run them again as smaller searches.
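The segment-and-log idea can be sketched like this. It is an illustration under assumptions, not the original code: search stands in for whatever function runs one query and returns its rows, and year_segments and search_in_segments are hypothetical names. Only the 500-match cap comes from the text above.

```python
import logging

MAX_MATCHES = 500  # the per-query cap mentioned above

log = logging.getLogger("scrape")


def year_segments(start, end, step):
    """Yield inclusive (first_year, last_year) pairs covering start..end."""
    for first in range(start, end + 1, step):
        yield first, min(first + step - 1, end)


def search_in_segments(name, start, end, step, search):
    """Query one name segment by segment and note any segment that hit the cap.

    `search(name, first_year, last_year)` is a stand-in (an assumption) for
    the function that runs a single query and returns its rows.
    """
    rows, capped = [], []
    for first, last in year_segments(start, end, step):
        found = search(name, first, last)
        if len(found) >= MAX_MATCHES:
            # The results were probably cut off, so record the query
            # to be re-run later over a shorter period.
            log.warning("%s %d-%d hit the %d-match cap", name, first, last, MAX_MATCHES)
            capped.append((name, first, last))
        rows.extend(found)
    return rows, capped
```

Calling logging.basicConfig(filename="capped_queries.log", level=logging.WARNING) once at the start of the script would send those warnings to a file; the file name is only an example.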
Sample group and framing the project
For the actual scope of the project there were a number of issues surrounding the sample group:
- A lot of births are registered but far fewer deaths, as many people, one presumes, went overseas and passed away outside NZ borders.
- There are also a lot fewer matching deaths because many of the female offspring would have married and taken their husband’s name.
- So there is the question of whether to include females among the siblings in the sample group, as finding when they passed away could first require tracking down their married name.
- As an added twist, females generally live longer than males.
- Also, supposedly we are looking at natural deaths, so what about the World Wars? They would have had an effect on the country’s demographics. One way to allow for that is to only look at siblings who were 15 years old or younger in 1945, as they would have been too young to enlist, so the sample would start with births from 1930 onwards.
- Ethnicity is another issue. While scraping the most popular names in districts, Asian names like Singh and Chow show up, but historically these are migrants who have arrived in the last few generations, so there are fewer generations of records and the historic comparison is skewed.
- Maori names are a particular challenge, as it was only after a specific date that Maori marriages needed to be registered. There is also the problem of finding enough names to scrape to get a reasonable body of data for comparison.
- Pacific Island comparisons are also a challenge as, similar to Asian families, they really only came into the country in the 70’s to work in the factories, so there is not a lot of data on families.
- Then there is the number of siblings: whether to look only at families with 2 brothers, or also at families with more than 2 brothers, and how to manage those. As a separate exercise and sample to investigate independently?
All of these things narrow down the group that you can compare. So is the process sound, or is it too skewed in the first place?
I think the framing of what is trying to be achieved needs to be specified up front so that it is clear what the comparison and the data show.
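One way to pin a frame down is to write the selection rules as code. This is a sketch of just one possible frame (males only, exactly two brothers, both born 1930 or later), not a decision I have made. The record fields family_id, sex and birth_year are hypothetical and are not what the scrape actually returns.

```python
from collections import defaultdict

WAR_END_YEAR = 1945
TOO_YOUNG_AGE = 15  # the age used above as too young to enlist
EARLIEST_BIRTH_YEAR = WAR_END_YEAR - TOO_YOUNG_AGE  # births from 1930 onwards


def brother_pairs(records):
    """Return families with exactly two brothers, both born 1930 or later.

    Each record is a dict with the hypothetical keys
    'family_id', 'sex' and 'birth_year'.
    """
    brothers = defaultdict(list)
    for rec in records:
        if rec["sex"] == "M":
            brothers[rec["family_id"]].append(rec)
    return {
        family: bros
        for family, bros in brothers.items()
        if len(bros) == 2
        and all(b["birth_year"] >= EARLIEST_BIRTH_YEAR for b in bros)
    }
```

Families with three or more brothers are simply left out here, which matches the idea of treating them as a separate sample.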
I’m still thinking about this at present and will take it up again at a later date to try to get some results, but I feel I need a good frame of reference, or need to narrow it down to something very simple to begin with and then expand it later.
End comment
After starting on this I got a bit distracted and started to search my own family tree, so I wandered off onto another project.
Then I got hold of some of my historic electricity data and focused on that.
So I do need to revisit this project at some point.