Knowledge

Heigh-Ho

By Paul

June 23, 2024

If you want to go mining in Canada, say for iron ore, there’s a lot to be done. First, you need a prospector’s licence to survey the land. You’ll then need to secure the mineral rights to the area. You’ll need access to the area, and that might involve negotiating agreements with landowners and engaging with Indigenous groups for their approval. You’ll have to provide an environmental assessment of the impact on the land, wildlife and people affected. You’ll have to say how long the mine will operate and have a plan for eventually closing it down. You’ll also need to pay taxes and royalties on the ore you take out of the ground.

Pretty complex, eh? There are a lot of steps. I’ve left out about a dozen of them. But you get the idea.

There’s an old Internet term for gathering information called “data mining”. People would go online, do Google searches, visit company and educational web sites, and make notes on what they found. It was time-consuming and labour-intensive. It was also one-off. It was difficult to search many things in a short period of time. On the other hand, if the information wasn’t subject to copyright, there was nothing wrong with doing this.

But what got me thinking about this was an article in the Globe (paywall) about the public consultation on changes to Canada’s Copyright Act, run by Innovation, Science and Economic Development Canada along with Canadian Heritage.

Companies like Meta, Microsoft and a few home-grown ones like Cohere argue that internet content should be exempt from copyright when it is used to train their Artificial Intelligence (AI) models. These large language models (LLMs) depend on vast amounts of training data. The more data, the more accurate the output is likely to be (despite Google’s AI model recently recommending using glue to keep cheese from sliding off a pizza, but I digress).

The thing is, this is just mining without a permit. No one is being consulted. There is no government regulation. No negotiation with the rights holders who own the data. There are now lawsuits pending from Getty Images and the New York Times against AI companies for using their content to train AI systems.

How is AI any different than mining for iron ore? Shouldn’t there be a consultation process? Shouldn’t licences be granted and royalties paid? Just because it’s “easy” to scrape web sites, scientific papers and books, does that somehow make it okay?

From the article:

Cohere, which builds large language models, which underlie chatbots and other applications, said in its submission that AI training does not infringe on copyright, making licensing unnecessary. “Remuneration would not be appropriate,” according to the submission, which was posted online recently.

Google similarly argued that requiring licensing or permission would be “essentially impossible given the large amount of data needed to train AI models and the lack of comprehensive data about copyright ownership,” the company said in its submission. “It would effectively block the development and use of large language models and other types of cutting-edge AI.” Google added that it has introduced tools to allow web publishers to opt out of having their content used to train future AI models.

Well, boo-hoo. Something is hard. They propose an “opt-out” model. Remember how well that worked for Rogers a few years back? As usual, it comes down to money. Why pay someone for their work when you can take it for free?

Likely everything that’s out there has already been scraped, and the AI companies are backfilling their potential liability.

Anyway, it’s all a big fat mess, and it’ll be interesting to see how it settles. Will the Federal Government put some teeth in the legislation, or do the usual and roll over, tummy up for a rub by the tech companies?

In the interest of full disclosure, I used AI to help me write this blog post. It gave me the steps for how to get a mine up and running in Canada. The difference is, the AI I used fully attributes all of its data to the web sites that it retrieves it from, unlike the opaque results from OpenAI, Google and Microsoft. The system I use is essentially Google search without the advertising. Kind of like how Google used to be, once upon a time. Call me a hypocrite for using AI, but there are spectacular uses for AI that don’t infringe on copyright.

You never know who or what you'll find when you take a stroll in downtown Toronto.

P.S. I had stopped posting to this blog for a while. Most of my posts seemed to be on the "grumpy-old-man" side of things. Well, I've decided, if the shoe fits...


Be sure to check out Dana's blog, Time to Write. I like to think I'm a pretty good writer. Dana is an AMAZING writer.
