AI web scraping: where to stand?
I saw the article "Sites scramble to block ChatGPT web crawler after instructions emerge" and have some feelings about this emerging technology.
For ChatGPT to learn and improve, it needs to be fed data.
It gets that data from open-source locations and also by scraping websites to train its models.
I’m using Bing Crosby (aka Bing Chat) & Bard, both free at the moment. But once they have improved, will they start charging for use?
So on one hand I’m profiting from the use of AI. I usually use it when coding, to get a quicker solution than doing the research on StackOverflow and Google. So I’m benefiting from lots of other people’s experience, and their knowledge is being shared around to the benefit of others.
GitHub & Paid Copilot
On the other hand, Microsoft, through GitHub, charges for GitHub Copilot, yet it has been learning from the repositories on GitHub. So people have posted code to GitHub, in my case for free, whilst others have subscription accounts, and Microsoft is selling their work back to them after taking a copy of it for their AI tool.
Regarding GitHub, I’m using the service for free, so I suppose it is fair for them to use my data (most probably to their detriment) and on-sell it with Copilot. That said, I don’t recall anything in the Terms & Conditions saying that they can grab my code regardless. Then again, most of my repositories are public, so anyone can have them.
Now this is my work that I’m posting up there, some of it taken from others and tweaked to meet my needs, e.g. my AHK timesheet script. But that was my effort.
Microsoft scrape and use the data and share it around, and could possibly be making a profit from my efforts. So if the code is to accumulate worth, how come Microsoft get the reward rather than the creator of the code? Is there any way they are sharing the reward with others?
As it is, I only post repositories on GitHub for things I want to share, so most of my code is not on GitHub.
Photographers, Artists and AI art
The same goes for photographers and artists: the AI art tools are learning from their work and creativity, and the companies doing the scraping to train the AI are the ones making the profit from it.
This does not seem fair.
My Websites on free hosting sites
I’m not sure if I still have a Google site, but I do have 2 or 3 Blogger sites, which I think are also run by Google.
That information is on a site that someone else hosts, so they manage the hardware & software; all I have on them are posts. If those posts are scraped, there is nothing I can do about it.
The content is not that great: mostly a COVID diary kept over a few years and a recipe site.
The only choice I have is to migrate that data to a different site.
My Websites on my paid server
I’ve been blogging, writing articles and developing websites and tools, and these I keep on a server that I pay for. I want to share this information with the world: the sort of beginner and intermediate learnings that I want to pass around.
I have spent a lot of time and effort doing things to the point where I write articles on topics, so I don’t believe I want ChatGPT to scrape my site for their own profit.
ChatGPT
I tried ChatGPT at the beginning when it was free and easy to use, but they slowly made it more difficult to use and also wanted me to pay for tokens to use it. Although the cost was minimal, I found accessing it a bit of a nuisance, so I moved on to Bing Crosby (aka Bing Chat) as my go-to tool.
ChatGPT is a commercial enterprise and it wants to make money. No issue with that. But it is mining other people’s efforts and selling those efforts back to them. I don’t think that is fair.
Google & Google Drive
I also have a couple of free Google accounts, so I get Google Drive with those, as well as the free email accounts. Will Bard be scraping all of that stuff too? I’m not sure what the user agreement with them says.
Microsoft Hotmail & OneDrive
A similar situation to Google for free email accounts and drive.
Free Accounts? Move data to NAS
If there is a lot of data on Google Drive & OneDrive, should I move it onto my NAS (Network Attached Storage)?
Current View
For the moment I’ve put robots.txt files on all my websites to ask ChatGPT’s crawler not to scrape them for free.
Maybe this is the state of play for the future, in that the AI companies are making a profit off other people’s creativity and work. If that’s the case, then I’ll be very conscious of not giving these people access to my data unless I get some reward from it.
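As a rough sketch of what such a file might contain and how to check it: GPTBot is the user-agent token OpenAI publishes for its crawler, while the exact rule layout, the example.com URLs and the is_allowed helper below are illustrative assumptions rather than the files described above. The check uses Python's standard urllib.robotparser module. Bear in mind that robots.txt is only a request, and a crawler has to choose to honour it.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content (illustrative only): ask OpenAI's GPTBot
# crawler to stay out of the whole site, and leave other crawlers alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com stands in for a real site.
    for agent in ("GPTBot", "SomeOtherBot"):
        allowed = is_allowed(ROBOTS_TXT, agent, "https://example.com/blog/post")
        print(agent, "allowed" if allowed else "blocked")
```

The file has to be served as /robots.txt at the root of each site. Further crawlers could be covered by adding another User-agent block per crawler, using whatever token that crawler's operator documents.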
For open-source work, there is nothing I can do apart from close it down or move it. But I want people to have access to that data.
Maybe in the future ChatGPT and others will look to reward some of the creators and share the rewards one way or another. So we’ll watch that space too.
End Comment
They are coming and grabbing what they can, but that’s no reason to make it easy for them.
So for the time being I have my robots.txt files, hopefully stopping them from scraping my sites indiscriminately.
I may need to enhance these files to protect against other things too, like Bing Crosby (aka Bing Chat) and Bard, and other private scrapers that are doing the same thing.
There is a bit of hubris in this, in that I have the arrogance to assume that my sites are of any relevance at all to these big companies. I’m most probably the smallest, most irrelevant microbe, but better to stand by my values and not give them everything.