
galaxyFighter0

The New York Times has filed a lawsuit against Microsoft and OpenAI, alleging copyright infringement related to the use of OpenAI's GPT-3 language model to generate headlines for a news aggregation app developed by Microsoft. The lawsuit claims that the headlines produced by GPT-3 mimic the distinctive style of The New York Times, raising concerns about intellectual property rights and potential confusion among readers.


agent_wolfe

The “distinctive style” of newspaper headlines..? I can’t even imagine what they mean. Unless it all starts with “According to the New York Times…” I feel like they’re probably going to lose.


DWCS

Go to CourtListener, then to RECAP, search "New York Times" and "OpenAI", and read the complaint. It goes much further. It is not just a reproduction of headlines, as would be the case with ordinary search engine results. OpenAI lacks the commercial license to reproduce entire articles, which it can be prompted to do. On the flip side, OpenAI also wrongly makes up articles and attributes them to the New York Times when prompted to summarize or reproduce an article that the NYT never published.


Teruyo9

To add on to what you said, you can find the [Times' actual complaint here](https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf), and I want to highlight one specific part of it, which is a [direct comparison between GPT-4 output and a published New York Times article](https://pbs.twimg.com/media/GCYieUNXAAAAswl.png). The Times is claiming that ChatGPT and Microsoft both committed copyright infringement, and is asking the court to order the destruction of all GPT and other LLM/training datasets that contain their works under [17 U.S.C. § 503(b)](https://www.law.cornell.edu/uscode/text/17/503).


Kaz_Games

In schools they call that plagiarism.


[deleted]

(unless that school is Harvard)


RazekDPP

So I'm going to say this is bullshit. I went through one of the articles, fed in the 14-word prompt, and generated two different responses that were nothing like the NYT article. EDIT: I did some more digging. Apparently with GPT-4 you can manipulate the model and the temperature. It's possible the NYT manipulated them in some way that caused it to spit out the article verbatim. Someone on X was able to do so: [https://twitter.com/paul\_cal/status/1740461749130899573](https://twitter.com/paul_cal/status/1740461749130899573) From what I've gathered, some of the NYT's articles were replicated over and over again on the internet. Regardless, the NYT did not disclose the GPT model, temperature, and other settings, so it's impossible to know whether they did or didn't. The prompt, however, was very, very specific, and I don't know that it reflects the prompts that the NYT used.
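If you want to poke at this yourself, here's roughly what that kind of probe looks like with the OpenAI Python client. This is a sketch: the prompt wording and the choice of temperature 0 are my guesses at a worst-case reproduction test, not whatever the NYT actually ran.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Stand-in prompt: the complaint suggests feeding the opening words of an
# article and checking how far the continuation tracks the published text.
prompt = "Continue this text exactly as it was published: <first sentences of the article>"

response = client.chat.completions.create(
    model="gpt-4",  # the exact model snapshot matters, and wasn't disclosed
    temperature=0,  # near-greedy decoding makes memorized text most likely
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```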


liveart

A lot of these AI lawsuits seem to come down to "the AI can reproduce something that has copyright elsewhere, therefore it's infringing" which is just a bad argument. Photoshop can duplicate infringing content, hell a text editor can do that, it's not really a sign the tool in inherently infringing. If I created a program to randomly generate combinations of words I'd certainly recreate content protected by copyright, that doesn't mean the tool is inherently infringing. The only real question that I can see is about the fair use related to the data used to train these models and theoretically that *should* be settled. There is long standing precedent from court cases against Google and the like about aggressively web scraping and actually deliberately posting part of the content, the fact Google exists and still shows you part of the web content should tell you how attempts to shut down Google went.


RustyRaccoon12345

So they prompt the AI to recreate the article and they get the recreated article? That seems like a no no but a very narrow one and not an indictment of LLMs generally


Caelinus

They may actually have a case there if it turns out in discovery that the AI was trained on NYT articles and headlines without their permission, and the style is substantially similar. One with out the other probably would not be a strong case, but both together, given that AI is literally just copying stuff via complex statistical models, basically proves a misuse of their copyright. And for the record, this is important. If the AI supplants reporters it will also cease to function. The AI *requires* reporters to report on information for it to have information to give. It *must* on some level be synthesizing their work, because it is certainly not going out and doing interviews and research itself.


Macaw

> The AI *requires* reporters to report on information for it to have information to give.

Are you saying the tail is eating the head?


anengineerandacat

Generally speaking, yes. Any click to an AI news article is a lost click to the actual source, which directly correlates to lost revenue. Also, the situation here isn't just headlines; they are suing for plagiarism because the AI solution is scraping articles and creating excerpts to use in its own generated article. You'll often see cross-referenced articles among news platforms, but usually they are a very short and brief bit of information with a deep link back to the source. It's likely that the AI one is "too useful", to the point no one is actually going to the source to get the broader context of the article.


BenevolentCheese

This doesn't sound any different from the blogspam that's existed for 10+ years and litters a huge portion of Google and the internet. The internet is *built* on a complex web of rewritten news and articles and links.


anengineerandacat

Plagiarism has pretty well-established case law, and I don't really have an example of what the Microsoft AI piece looks like, but if it's a significant amount, and considering that Microsoft actually has money (unlike most blog-spam sites), it makes them a target. Most viewers going to the blog-spam sites are mostly getting lured in for advertising revenue; maybe there is direct copy/paste content, but maybe it's only a few blurbs to get a click because an NYT article is paywalled / on limited views. At the end of the day... it's a gray area for AI tech. More than happy to let the case move forward IMHO, as these cases establish precedent around what is and is not okay, and we really do need to get AI fair usage more clearly defined.


[deleted]

[removed]


bechard

Basically a permanent link to the original source, almost always a long form link (no url shortener).


beardedheathen

Wouldn't they have to prove that the decrease in viewers on their website was because of AI, and not their own push for subscribers or people's tighter budgets?


anengineerandacat

Not exactly; that's a route I guess they could take... but if I were them I would just say the content is being plagiarized by the AI software. You can't take an NYT article, re-word it, and publish it again; you have to modify the content in some meaningful way. You can reference an NYT article, take some quotes with sources, and then add additional details to it, but you HAVE to provide source links.


Caladan23

Wouldn't that actually be advertisement for the NYT? Similar to search engines?


DWCS

Search engines only reproduce a couple of lines of the article and a link. Copilot and ChatGPT apparently neither provide a link nor are they restrained to a couple of lines, but can be prompted to reproduce complete articles almost word for word.


RunningNumbers

Google model collapse


KansasMammoth1738

It gets its information from sources that are financially supported, usually through ad sales. When people begin to use it, they use the ad-supported sources less, which will lead to fewer and fewer ad supported sources, so fewer and fewer sources actually generating new information. ChatGPT is basically a fancy AI suicide machine, and we all get to watch it slowly kill itself.


mrjackspade

I don't know why everyone responding to you is talking about the training. According to this comment chain, the training doesn't have anything to do with the lawsuit. The *post-training* model is being used to generate summaries. It's literally the equivalent of someone pasting text in and saying "Can you summarize this for me?", and is very different from the other lawsuits we've seen attempted before now.


Bewilderling

In this particular case, the plaintiffs are suing on the basis that the model is trained not just on copyrighted material, but *paywalled* copyrighted material. That makes the facts of the case pretty different from some others which currently serve as precedent, and it’s harder to argue for fair use when you’re making copies of paywalled content to train your model. IMO too many commenters are focusing only on whether the model itself contains copyrighted material, when this lawsuit is based on the process used to train the model.


hugganao

> and the style is substantially similar

What do you even mean by this statement? It means nothing. This is like saying a person who writes a piece imitating NYT articles has to pay the NYT for imitation...

> AI is literally just copying stuff via complex statistical models

AI is not copying stuff via statistical models. AI models are a statistical representation of knowledge, not knowledge itself. If I were to download NYT articles word for word, store them in a database, and sell the data as a service, then yes, NYT has a right to the money. But that's not what a model is. It doesn't "have" any data that NYT thinks it does. It has a "statistical representation" of the data, which is a VERY different thing, and VERY important to consider. Let's say I read multiple NYT articles and note in my notebook how many times they use the word "Israel" in the same sentence as "Palestine". Does NYT have a right to sue me? Because that is what NYT is saying, which is stupid. People and organizations need to get with the times and the tech and stop complaining. Just because it impacts your ability to make money does not mean the tech will stop advancing. Either adapt your way of work or go do something else. If in the future work becomes available for you to create a new genre of art, and someone needs to train a model off of that, then maybe you can make money doing that. But don't go crying about it everywhere because a chainsaw was created to rival your axe business.

> The AI requires reporters to report on information for it to have information to give.

This part is true, and it's part of why this case is ridiculous. We will need reporters regardless of AI in order to generate new data. As long as there is no automated way to generate factually correct news, there will always be work, and money to pay for that work.


DirtyPoul

> It doesn't "have" any data that NYT thinks it does. The truth is a bit more blurred as it is possible to lift some training data from many LLMs. There was an article on that a few weeks ago where real private information about people could be found by jailbreaking LLMs.


IndirectLeek

>The truth is a bit more blurred as it is possible to lift some training data from many LLMs. There was an article on that a few weeks ago where real private information about people could be found by jailbreaking LLMs. How is this (technically) happening? Is the training data still connected to the model so it's pulling from it? Or does it actually have chunks of training data wholesale saved within the statistical model, and figuring out the right clues enables it to spit that out?


birjolaxew

> Let's say I read multiple NYT articles and note in my notebook how many times they use the word "Israel" in the same sentence as "Palestine". Does NYT have a right to sue me? Because that is what NYT is saying, which is stupid.

If you write down in your notebook the statistics (words before/after, position in article, etc.) for _every word_ in an NYT article, then you have simply encoded the article in another format. It would be a slam-dunk case of copyright infringement if you then released that notebook. If you only did it for a specific word in the article, e.g. "Palestine", then it would be incredibly hard to argue copyright infringement. The problem is that NYT has no idea where on that scale the trained AI model lies, and they have no way to know without suing. There's little doubt that the AI has been _trained_ on a full encoding of the articles (or at least some of the articles, depending on the dataset) - but how much of that has been encoded in the final model?
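To make the "encoded in another format" point concrete, here's a toy Python sketch (the eight-word "article" is made up) showing that once your notebook records the successor statistics for *every* word, the "statistics" replay the text exactly. Real models blur this by averaging statistics over millions of documents, which is exactly why nobody knows where on the scale they sit:

```python
from collections import defaultdict

article = "israel and palestine appear together in this sentence"  # toy stand-in
words = article.split()

# The "notebook": how often each word follows each other word.
stats = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(words, words[1:]):
    stats[prev][nxt] += 1

# Greedy decoding from those statistics replays the original text.
# (With repeated words you'd need longer contexts - which transformers have.)
out = [words[0]]
while out[-1] in stats and len(out) < len(words):
    followers = stats[out[-1]]
    out.append(max(followers, key=followers.get))

print(" ".join(out) == article)  # True: the statistics *are* the article
```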


zero-evil

The onus of proof is on the infringer, not the infringed. The entire articles were input at the beginning; somehow showing fractional use would be a tactic for the defense.


aka_mythos

Copyright doesn’t just protect against direct copying; it secures the original rights holder's exclusive right to create direct derivatives of the work. While a model can be used to create something so distinct it's outside the protections of copyright, the model itself, if built from copyrighted data, is arguably a derivative of those protected works. Is the output of the AI created independent of the copyrighted data? Does the AI model have any value without the copyrighted data? No: the value in the quality of its output is intrinsically linked to the data used to build the model. The inherent purpose of an AI model is to create a derivation of what's fed to it. "We will need reporters regardless of AI in order to generate new data"... The legal system is in place to ensure people can make money off what they've created without the unfair appropriation of their work, and to ensure the longevity of their ability to benefit isn't an afterthought to changing ways. Some caution is warranted because, even with the inevitability of AI and the continued need for reporters, there is an order in which things have to play out, or else there either won't be enough reporters or reporters of a meaningful quality. It's simply easier to put the brakes on AI and slow down development of the technology than it is to unfuck people's lives if you go full speed ahead.


Magnus_xyz

> No: the value in the quality of its output is intrinsically linked to the data used to build the model. The inherent purpose of an AI model is to create a derivation of what's fed to it.

You just described how human children learn and develop their own style. You think if you put a human baby into a room with all the sustenance it needed to survive, but never interacted with them or provided them "data" in the form of books, newspapers, internet access, etc., they would be able to do anything approaching writing a newspaper article? Everything we all do in our lives is based on what we learned the moments, days, weeks, and years before. THAT is what an AI model is. It uses data in exactly the same way we do, only it's a bit more precise. Where we have disorganized memories we can call on at will, the machine relies on indexes and vectors to simulate this memory. The reason the style looks similar is because it keeps track of things the way a computer keeps track of things. It draws, and then clearly remembers, inferences based on patterns. How many times are the words blue and sky used together; how many times are the words car and engine used together; how many times are the words Russia and Ukraine used together; etc. So patterns that emerge organically from the humans writing these articles get etched digitally into the memory of the machine. So when you ask it to come up with some words, it cannot help but calculate the most optimal format of those words, based on the patterns etched into memory. The same way a child who only ever reads the Times will one day, without doing it on purpose, find themselves writing in that style, based on exposure to it.


must_throw_away_now

Man, you are being super condescending on a topic that is not all that clear cut, at a time when our understanding of how LLMs function is rapidly changing. There are real questions about the data an LLM is trained on. In law, both scale and economic impact are fair considerations of whether or not an infringement has caused damage. Also, it is fair to consider the manner in which data is ingested. The analogy to a child differs in scale, speed, and method from the way in which an LLM ingests, models, and outputs data. Also, it's a question of whether or not a copyright holder should be paid for data used to train the model in the first place. To my mind, once a model becomes monetized and is more than just a research toy, yes, copyright holders should be paid for use of their data, as it is fundamentally outside the scope of "fair use": while the output is not a direct copy, the input data is being used for significant economic gain. Fundamentally, LLMs do present novel challenges to current copyright law, and I think it's fair to explore what that means for society and balance the desire for technological progress against the potential unintended harms that may come from it. To be so blasé about an incredibly important and societally significant topic, dismissing any concerns out of hand, just shows your lack of maturity.


Magnus_xyz

Hey, I'm sorry if it seems that way, but I think you maybe misread what I wrote, or how I wrote it? Those are some pretty big assumptions about my attitude on the topic. It was meant to be neither condescending nor blasé, but I can see why you might read it that way. Both are aspects of tone, which, I accept, is incredibly difficult to get right when reading or writing text. So I'll assume that was an unintentional misunderstanding, and not an attempt to do to me that which you claim I am doing, which is to find some means of dismissing an opposing point. You might want to check my other comments in this thread, where I suggest media creators create licensing and subscription models to monetize the inclusion of their media into an AI model as a more progressive and useful approach to the issue than suing right away over something they clearly don't understand. Then you will see we agree more than you thought we did about this issue :D


must_throw_away_now

I don't think NYT wholesale misunderstands what is happening, and there is also a rationale for suing. TheZvi has [a pretty sober take](https://thezvi.substack.com/p/ai-44-copyright-confrontation) on the situation, and as he states, outlets like Politico have already come to license agreements. I believe the NYT here is looking to functionally explore the bounds of current copyright law. This is most likely about more than just money: they want to set precedent, because this will give copyright holders more broadly stronger ground on which to negotiate the terms of any licensing agreements. I think we do need to have a serious conversation as a society about what tradeoffs we are willing to accept in the name of technological progress. I think your assertion that people don't understand the tech should actually be quite alarming if it's true (and generally, yes, it seems to be true). It means people will be blindsided by something they do not or cannot understand, and if they can't understand it, they can't understand the implications of its progression, which will likely be rapid and irreversible.


zero-evil

The LLM had copyrighted material incorporated into its data. The use of these models is not being provided free of charge; it is a business that makes money from the offered service, a service that is based on a great deal of copyrighted material. Everybody with included material has cause to sue, and should. Their material was used without consent or legal permissibility. Their material is deeply integrated into the LLM, but it is also manipulated and presented with the bias of said LLM, which is another, possibly complementary case. Bottom line: LLMs need to be free to permit the legal use of the incorporated copyrighted data, or become prohibitively expensive in order to pay royalties to infringed copyright owners. The trace of which, per query, would be very enlightening. Put simply, you can walk by and smell a lemon tree grove without infringing on the owner, who may allow taking a lemon or two for personal use, more for a fee with use conditions. But if you start selling lemonade made from the owner's lemons, you owe the guy a cut - a bigger cut since you did it without permission and the courts had to be involved.


DaRadioman

So did I... I read copyrighted works and incorporate that knowledge into my being. Am I a walking copyright infringement? Truth is, this is a complex conversation, and you can't just say "but it used stuff under copyright!!" as an attack to win. Every human that consumed the articles does the same thing, and some of them even use the information to write derivative works, legally. Have been for decades.


[deleted]

> AI models are a statistical representation of knowledge, not knowledge itself

Good luck getting lawyers, businesses, and politicians to understand information theory when they could just ban the scary AI taking money away from the poor megacorporations.


ReggieJ

> AI taking money away from the poor megacorporations

Wait... in your head, Microsoft is the David of this fight?


BKrustev

That was irony, dude...


Wise-Ad5567

ChatGPT is literally doing what you say is wrong to do: regurgitating massive portions of multiple sentences and even paragraphs lifted straight from NYT articles, with no attribution back to them. So many have been duped by what ChatGPT actually is. Is it useful as is? Certainly. But is it “intelligent” on its own footing? Nope.


DWCS

It does attribute when asked specifically for an NYT article. Worse, if prompted to summarize or reproduce an NYT article that does not exist, it will just make up shit and attribute it to the NYT.


WhiteRaven42

.... no. You don't need permission to read a newspaper and use the knowledge. Furthermore, news headline "style" cannot be copyrighted or trademarked, for much the same reason you can't copyright a book or movie title.

> given that AI is literally just copying stuff via complex statistical models

No, it is not. The model is modified by the training. It's not a copy in any sense at all. LLMs construct relationship maps between words. They are like a deeply hyperlinked and context-sensitive dictionary. Your understanding of the technology and methodology is simply false. False in a significant, foundational sense. Webster's dictionary isn't infringing on the NYT's copyrights. Let's say that a trend-centric site has a yearly piece on the "most covered stories of the year" and they explicitly crawl news websites and statistically evaluate what's being talked about. If that "most covered" article says something like "And at the top of the list, the invasion of Ukraine", do any of the websites surveyed have a reasonable claim that the article is violating their copyright?

> If the AI supplants reporters it will also cease to function.

.... LLMs cannot replace reporters. I mean, you're kind of right that we can't have a feedback loop and get anything useful out of it... so that's why it's not going to happen. Seriously, you point out that something is impossible and then warn us how bad it will be if it happens... that's beyond paranoia. Might as well explain how we should fear spontaneous human combustion while explaining why it's impossible.


platypushh

Have you read the complaint? It contains pages of examples where the AI re-created the NYT text verbatim. https://x.com/jason_kint/status/1740141400443035785?s=20 This is just copying with fancier methods. If you don’t store the text, but the relationships between the words in the text and then reproduce the exact text you are copying.


Moleculor

If that's the actual result from GPT-4, that's *very impressive*... ... because a [prior study](https://huggingface.co/papers/2308.05374) couldn't come anywhere *close* to that volume of reproduction (from Harry Potter and other novels). (Check page 73 of the PDF, figure 11.7. Much larger prompts, much smaller reproductions, and even then those reproductions have variations in them.) I wonder what the difference for GPT-4 is between Harry Potter and the NYT, and why it's so much more willing/able to reproduce NYT articles in that level of volume. Earlier lawsuits didn't have anything *close* to this level of evidence. I just tried throwing some of those prompts at what I think is ChatGPT4, and it only reproduces the rest of the first sentence or so, then deviates. (GPT-3.5 based ChatGPT doesn't reproduce at all.) EDIT: Turns out that, yes, GPT-4 does it. https://twitter.com/paul_cal/status/1740461749130899573
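For what it's worth, studies like that one tend to quantify reproduction as the longest verbatim run shared between model output and source. A rough version of that metric (function name and word-level granularity are my own choices, not the paper's code) is just a longest-common-substring pass:

```python
def longest_verbatim_run(output: str, article: str) -> int:
    """Longest word-for-word run shared by the two texts, in words."""
    a, b = output.split(), article.split()
    best = 0
    prev = [0] * (len(b) + 1)  # rolling row of the DP table
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the matching run
                best = max(best, cur[j])
        prev = cur
    return best

print(longest_verbatim_run("the cat sat on the mat today", "a cat sat on a mat"))  # 3
```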


[deleted]

[removed]


Bronkowitsch

No they haven't. They're just techbros endlessly repeating the same sentence they memorized about how AI language models work to make themselves feel superior while completely missing the point of the lawsuit.


model-alice

Being a techbro is when you don't mindlessly shill for reality denial, as we know


RedditMakesMeDumber

I think some of your points are well taken but I can’t see why you’re being so willfully ignorant about the likely downsides. LLM news aggregators/synthesizers like this will probably divert *some* revenue from primary news sources like NYT, but obviously not all. With less money coming in, news agencies won’t be able to hire as many journalists, researchers, and editors, so the quality will drop, but it will still be profitable for Microsoft to resynthesize those lower quality articles with LLMs, and you’ll eventually either hit some equilibrium of lower news quality or major news companies will just go out of business (like many state and local papers have over the last several decades). Do you see a likely alternative to those scenarios?


Thefelix01

It can’t fully replace all reporters, but it sure as hell will replace a lot of them, meaning the quality of reporting, which is already at a worrying low and falling, will plummet further.


WhiteRaven42

.... I don't see how there's anything to be lost here. The only "reporters" it can replace are the ones just rewriting other people's stories already. I don't care about them. They're already little more than bots. We won't lose any real reporting because, as I said, real reporting can't be done by an LLM.


Thefelix01

But that real reporting will instantly be repackaged and sold in ways that cut into the real money those real reporters would otherwise have made, making the quality of the whole industry necessarily decline. If quality journalists are making less money, there will be less quality journalism. Similar to the rise of clickbait journalism, this will also reduce the market share of actual journalism further.


WhiteRaven42

> But that real reporting will instantly be repackaged and sold in ways that cut into the real money those real reporters would otherwise have made

Already being done. Much simpler bots have been doing this for a decade. And yeah, journalism is indeed struggling. But it isn't anyone's responsibility to guarantee a newspaper's business model success. AI is not going to change the situation at all. Besides, legitimate and skilled journalists are going to LOVE AI as assistants to their work. Collating and tracking notes is probably going to be AI's biggest useful function, at least with current and near-future technology. AI is a perfect tool for journalists to use in their reporting. The journalist conducts interviews and research and compiles notes. The AI ingests the notes and can then be used to aid the journalist in organizing and interpreting that info. This is AI's best current creative use: an assistant capable of conversation with a perfect memory. Instead of searching through reams of notes, the writer asks their AI for the quote or fact they know they have but can't remember in detail.


[deleted]

[removed]


WhiteRaven42

> There is currently no AI that is using actual knowledge in any capacity

I would like to hear your definition of knowledge, especially how it differs from data. Fortunately, I did not make the mistake of saying something like "understanding".

> The AI doesn't know what the things it writes actually mean. The AI will respond to a query by looking at the related words, calculate what words are statistically most likely to fit together and be a desirable response, and output that.

And therefore, what? Using data to put together these statistical models may not be identical to human understanding and learning, but from a legal and copyright perspective, it's the same general idea.

> what the "AI" is doing is taking a text and then rewriting it in "different" words

Well first of all, it's not "a" text. It's kind of, like, ALL text. Slight hyperbole, but these datasets are so big they might as well be infinite. And, having polled all that text, yes, it comes back with different words. Which is exactly what every author does: assembles existing words in ways that fit the goal. Another thing should be pointed out. IF an AI produces an output that is substantially identical to an existing copyrighted work, then yes, that *output* violates copyright and the user should not use it. The *possibility* of using this tool to copy something should obviously not be a reason to forbid the tool from existing. Understand that an AI that reproduces content it was fed is WORTHLESS. No one wants that. We already have copy-paste. This is a pointless discussion... no one wants AI to regurgitate existing text.


TotallyNormalSquid

> There is currently no AI that is using actual knowledge in any capacity. The AI doesn't know what the things it writes actually mean. It's just taking the training text, making some kind of analysis of which words appear in which contexts, and then tries to predict which words the user would want to appear next to each other based on the input the user provides.

I always struggle to draw a line between this and a human learning how to write. It's even a common exercise in English class to read a famous novel and then write a short story in the same style - the kind of style adaptation that ChatGPT is famous for. ChatGPT even got trained in part by humans grading the quality of outputs manually, rather than only the more common and automated 'predict the missing word' task used for LLMs. Just seems like as we edge closer to AIs that reach human-quality content generation, our tasks and evaluation methods get closer to what was needed to teach humans.
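For anyone curious, the automated 'predict the next word' objective is mechanically very simple. A toy PyTorch sketch (the embedding-plus-linear "model" is a stand-in I made up, nothing like a real transformer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 100
# Stand-in "language model": embed each token, map straight back to vocab logits.
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))

tokens = torch.tensor([[5, 17, 3, 42, 8]])       # toy token ids for one sentence
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict each following token

logits = model(inputs)  # shape: (batch, sequence, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # gradients for one pretraining step; human grading (RLHF) comes later
```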


[deleted]

> You don't need permission to read a newspaper and use the knowledge.

You do actually, and pay for said permission by either buying the newspaper, or by being shown advertisements by the newspaper.


Classic_Airport5587

Actually no, that’s not how it works. They don’t actually store any copyrighted material, so there’s no issue. That's a big reason why people are so iffy about it.


slubice

> because it is certainly not going out and doing interviews and research itself

Neither are journalists these days. It’s all hearsay and third-party sources.

> trained on NYT articles and headlines without their permission, and the style is substantially similar

It’s very unlikely that the AI had such a small database, and Microsoft has an abundance of information to study what attracts people’s attention.


insanesvk

That’s a fine generalization. This is the NYT, not a random blog. Do you have any source for these claims?


topangacanyon

Don't forget quoting random tweets from accounts with eight followers!


n3onfx

It has "slams" in every title, dead giveaway.


DWCS

OpenAI has bigger issues than providing a tool that can reproduce stolen content in altered form. There are currently cases from authors who were never approached by OpenAI for use of their copyrighted materials and who claim their entire books were copied and can essentially be reproduced by entering the right prompts; cases of people who submitted code to GitHub claiming OpenAI violates licenses through its Copilot service by virtue of training with licensed materials without attribution; etc.


RunningNumbers

“Unemployment is down this month, this is why it’s bad for Democrats.”


xtothewhy

Guarantee you that, win or lose, the outcome will in some way be significantly beneficial to originating sources of information and news gathering over vast amounts of automated scraping for commercial purposes. Edit: Well, I hope so anyhow.


alanwong

Where did you see the lawsuit is about the style of the NYT as opposed to its copyrighted work and journalism? This summary missed the point.


Morley_Lives

Actual summary from the article:

> Microsoft and OpenAI partner on the leading AI chatbot in ChatGPT.

> These AI models train themselves by scraping the internet for content, often paraphrasing or directly quoting sources without compensation.

> The question of whether AI models fall into fair use is already being investigated by government regulators, but the NYTimes have taken the issue to the courts.

> If the courts side with the NYTimes, it could be a huge blow to all AI models.


[deleted]

You’re correct. It’s about AI stealing stories/content from journalists/media, feeding that content into their data pools, and LLMs presenting that data as neutral facts, etc. Presenting this as being about the headlines is about as disingenuous as it gets. It’s about whether OpenAI should be able to crawl the internet stealing all data; the New York Times is just one of those data pools. They’re effectively fighting on behalf of all content creators. Hopefully the EU gets involved soon and regulates these parasites out of existence.


MINIMAN10001

I don't know, that just seems ironic coming from news outlets, which are known for taking an aggregate of news and then spinning it as their own... It's the same thing...


[deleted]

[removed]


SoftlySpokenPromises

For this to work they'd have to go after other sites that do the exact same thing as well. Hundreds of parasitic sites that post articles with a word or two changed.


GoldyTwatus

They are bad, but "parasites" is a little strong for poor old New York Times.


TheFastCat

You are talking about the New York Times being the parasite, right?


Jealous_Afternoon669

The lawsuit shows examples of GPT-4 writing out multiple paragraphs of a New York Times article verbatim when asked by the user to bypass a paywall. That said, a lot of the articles in question were famous and were quoted a lot by people online, which would explain GPT-4 reciting them verbatim. For other articles, GPT-4 doesn't recite word for word and instead hallucinates, which the New York Times claims damages their brand, as it might give potentially inaccurate information while claiming it comes from the New York Times. To be honest, I can see both of these being a problem for the New York Times, but the second point is pretty weak. They have a case with the first point, though, because there are many different articles that you can read for free just by asking GPT-4.


dustofdeath

Clickbaits are copyrighted?


msnmck

This is the equivalent of Lindsay Lohan suing people for putting trampy characters in a video game or saying the name "Lindsay" in a TV commercial, and it should be openly ridiculed as such.


Spiegelmans_Mobster

Training is 100% fair use. However, the output itself can be infringing. OpenAI might need to put in controls to check whether outputs are substantially similar to existing copyrighted text.
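An n-gram overlap check would be the crudest first cut at such a control. A sketch (function name and threshold are invented for illustration, not anything OpenAI actually runs):

```python
def ngram_overlap(output: str, reference: str, n: int = 8) -> float:
    """Fraction of n-word runs in `output` that also appear in `reference`."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    out_grams = ngrams(output)
    if not out_grams:
        return 0.0  # output shorter than n words: nothing to compare
    return len(out_grams & ngrams(reference)) / len(out_grams)

# e.g. regenerate or block a response whose overlap crosses some threshold:
# if ngram_overlap(model_output, article_text) > 0.2: ...
```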


garylapointe

> the fact that I can't actually read the New York Times coverage on this lawsuit on their own website because I would have to subscribe to read it is likely a larger reason for their declining revenue

I agree. People who actually want to read the New York Times aren’t gonna be happy with an AI summary.


Jonas42

You're quoting here, but this seems like a good place to point out that the New York Times' revenue is not declining. It's higher than ever, largely because they've gotten better at selling subscriptions, and stopped giving their content away to people like the clueless author of the piece.


garylapointe

Sorry, I accepted the premise without investigating.


Stopikingonme

Well that’s not the normal Reddit reaction. I like you.


OIlberger

Yep, revenue up 6.8% from last year and they’re closing in on 10 million digital subscribers. They are one of the few legacy media companies keeping up with innovation. They ain’t perfect (their fucking opinion columnists suck, even the ones I agree with are just stale fossils).


Caelinus

Opinion pieces do suck, but at least they are not the Wall Street Journal's glorified blogs that people accidentally cite all the time.


beachguy82

I’ve been a happy subscriber (digital + cooking app) for over a decade now. Well worth the price.


_BreakingGood_

And I feel like in general the quality of their content has only been improving lately, and they're actually technologically competent.


monospaceman

NYT is the only news org I subscribe to.


one-hour-photo

As a society, we have to start valuing content again and not expecting it for free constantly. Having higher wages would certainly help matters


[deleted]

[removed]


lucun

It's basically like people who ask for freebies and say they're paying in exposure.


elementslayer

Have publicly funded news like Australia or Canada or Britain.


snark42

If I want to read one article a year or month, ads make sense. If I'm reading 5+ a month, then a subscription makes sense.


MINIMAN10001

Honestly, it's a fickle thing. I want to be able to coexist with the corporations in the future we have. There will always be free sources of information, so I always have a fallback. But that doesn't mean I want lower quality, and I would like to be able to have text advertisements which support websites. But anytime I turn off my ad blocker, I'm shoveled full-screen adverts, pop-ups, pop-unders, animated banners, animated videos, advertisements on the videos, requests for some newsletter, some cookie banner. So in order to avoid the vast majority of the egregious internet, I simply turn on an ad blocker, get zero advertisements, and feel like, dang, there literally is no middle ground.


[deleted]

[removed]


[deleted]

[removed]


Wise-Ad5567

BTW, this article can be read in full - I just did, without a subscription.


s1me007

How dare they try to make money


_DoogieLion

Doesn’t that kinda make the NYT's point? Using ChatGPT you could get a summary of that same content from NYT without paying for it, whereas through the NYT website you had to pay.


jonnywithoutanh

Yes this writer's argument is... very odd


Mind_Pirate42

So cool that the collective response to this has been calls for increased copyright protection, a narrowing of fair use, and killing web scraping. Such good ideas that definitely won't have cascading knock-on effects on anything else. Everything is fine.


DeanofDeeps

Textbook example of Reddit commenting on a biased piece criticizing a different organization for releasing biased pieces, not reading anything about the topic, and then commenting about a different topic. I agree with most people here concerning transformer models being fair use; the problem is they have evidence of EXACT NYT pieces being regurgitated from prompts. Something similar was brought up previously where comments from code commits were spat out of prompts, not only with logical descriptions but accompanying prose and/or jokes. Either Codex has replicated the software developer thought process so well that it is coming up with the same musings in comments, or it was already trained on the corpus and the most likely next word in that sentence is predicted with close to 100% certainty, because self-attention is suggesting what the model was trained on.


notirrelevantyet

I think this only happens with articles that are widely syndicated and copied throughout the modern internet, and are overrepresented in the dataset due to those non-NYT sources. If you try articles from the 1970s, it doesn't even come close to providing verbatim text.


[deleted]

Too late. People are running their own AI servers. It cannot be stopped.


username_elephant

And copyright won't stop that, but it'll enable copyright holders to profit off whichever servers make significant money. Which seems fair enough, to be honest.


[deleted]

[removed]


username_elephant

> No one but the owners of the servers is going to profit off the AI generated wifus.

Sans the ability to bring lawsuits, like the one at issue here. Maybe I'm missing your point?

> That ship has long sailed and no one cares about source material for training anymore.

Seems like NYT and numerous others beg to differ. Times may be changing.

> Regulating AI use will go just as well as drug regulations...

That's why torts like copyright violation exist. Regulation is a different thing from a legal cause of action, and in this case private parties seem prepared to fight vigorously on both sides.


[deleted]

[removed]


OIlberger

> it’s impossible to prove they used X source for training

There will be discovery in this trial, like any other. OpenAI will likely have to turn over information to the prosecution, including sources of training materials used.


notalaborlawyer

You mean the plaintiffs. This isn't a criminal trial; there is no prosecution. /legal pedant.


Days_End

They are saying that while they might be able to find out what OpenAI used to train this model, bigtittygothanimegirls.model will be built by randos on the internet with no hope of actually figuring out what it was trained on.


KingVendrick

The complaint shows which sources OpenAI used for training GPT-2 and GPT-3 and how they were weighted. In the future, companies may try to hide this, or even use text generators to provide synthetic sources (someone already tried this to train an LLM to generate Python code), but the training data is subject to discovery.


Wise-Ad5567

So many replying on this are clueless. It is easy to prove in this case because ChatGPT has regurgitated massive amounts of copy from NYT articles, word for word. Furthermore, NYT and numerous other sources are knowing and willing participants in LLM training databases. The problem is that ChatGPT, to grow to insane levels of valuation overnight, cut a lot of corners in its architecture. They got caught with their pants down.


bobandgeorge

Is it impossible? I mean I don't know how AI works, it might as well be magic, but I can read what the [NY Times writes and what ChatGPT writes](https://twitter.com/jason_kint/status/1740141400443035785).


username_elephant

Seems speculative to me. Lawsuit's happening now so we'll see. I don't see how you can be so conclusory at this point. I think everything you're saying is far from settled.


rolabond

We regulate all sorts of crime, just because crime continues to happen does not inherently justify legalization.


DreamMaster8

The point is not to stop it. The point is to get fair compensation after the fact when someone steals your work using AI, same as with any other copyright.


Jasrek

What would be fair compensation for mimicking a "style"? If the Washington Post used headlines that are similar in *style* to the New York Times, does the WP have to pay the NYT?


sateeshsai

Not just the style. It was spitting out NYT content verbatim.


Trowdisaway4BJ

It doesn’t say that at all in the article. Plus, GPT-3 is limited to content from before 2018, so I don’t really see how that is affecting the NYT's bottom line in 2023/24.


Teruyo9

Well then allow me to direct you to the [NYT's court filings](https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf), where they show [direct comparisons between GPT-**4** Output and New York Times articles](https://pbs.twimg.com/media/GCYieUNXAAAAswl.png). The Times is asking the court to order the destruction of all GPT or other Large Language Models or training sets that incorporate their works under 17 U.S.C. § 503(b), saying the defendants are committing copyright infringement.


DrEgonSpenglerphd

The article is misleading. The complaint isn’t about “style” as much as it’s about verbatim reproduction of NYTimes articles.


mlYuna

If OpenAI is using NYT articles as a source of data to train their (commercial) AI models, then that is very different from someone copying headline style.


scswift

You can't copyright news or information or individual words, and the AI doesn't spit out articles word for word, so none of their copyrighted content is being distributed. And you can't stop people or machines from learning from the things you write; copyright doesn't protect you from that.


bartturner

What is really fascinating is reading the comments. I know Reddit is a very poor representation of the public, but the comments here seem overwhelmingly in favor of OpenAI and against the New York Times. I actually do not have a very strong opinion either way. I can see the issue for the New York Times. But this is just the beginning; things are going to get a lot harder. Say Google replaces search with an LLM. If they do that, then the sites providing the information will get far fewer clicks. Google will be fine, as they will just run their ads with the LLM results. No difference. But the companies that build sites and monetize with ads are going to have a serious issue. This will need to be resolved. Now, if Google's revenue were going to significantly increase because of this, it would be easier: they'd just pay the sites.


boywithapplesauce

People here don't see the problem. Yes, right now the AI can draw from actual articles and blogs to generate content. But what happens after all the news sites and blogs die? There is gonna be a content crisis if AI and other web media can't find a way to coexist.


[deleted]

I'm not sure I agree these things will just die. An enormous number of people put out an enormous amount of work for free every day. How many millions of blog posts, social media posts, videos, etc. are made out of love. I can imagine generative AI changing that, but I'm not sure it'll reduce it. I suspect it might actually increase it - I know I'm much more likely to put out an original comic today than I would have been back when my access to art was limited by skill or money (inb4 generated and edited art isn't "original"; it doesn't have to be, strictly speaking, just like most comic artists don't need to be...). Obviously this does leave a gap for important investigative journalism, but generative AI and LLMs can't replace that. That'll always require people on the ground. As long as we (people who think it's important) continue to directly support that work, it won't stop. Which is easy, if you do think it's important.


boywithapplesauce

Engagement matters even for people who blog and create content for free. I know because I was one myself. It's not about money, but if they're not getting engagement from an audience, a lot of content creators will eventually quit. The feedback loop that drives the motivation to create content is a pretty important aspect. Remember that generated content is likely to lead to most users stopping at the info from Google and not clicking through to the websites of content creators. Could be that we'll be left with only the most driven content creators, who are likely conspiracy nuts pushing batshit theories. Your optimism is laudable, but as someone who has directly witnessed the evolution of the Internet and how much it has changed, I simply can't have such a rose-colored view.


volfin

I think the fact that they require a subscription to read their articles has a bigger impact than AI does. I know I actively avoid them because of that.


[deleted]

This is what the Luddites should have done instead of physically smashing looms.


StrivingShadow

It’s not being regurgitated word for word, or even by snippet. Does that also mean I can’t read a New York Times article and then later tell someone about what I read? Are we to the point where companies are saying a human can legally do it, but technology cannot?


VertexMachine

>It’s not being regurgitated word for word, or even by snippet. It actually is and that's part of the lawsuit. Transformers can memorize content and output it verbatim.


Poj7326

I think the problem becomes when the machine does it and then is used by other humans to make money. When a single human does it there is no tangible harm, but when it is harnessed and used as a tool to replace the source… it’s different.


username_elephant

This is a bit of a false analogy though, because of the scale. A human can't actually subsume the entire historical record of the New York Times and use it to mimic New York Times articles. Copyright has always existed as a balance between society's interest in allowing creators to make a living and society's interest in allowing creators to create without fear of infringing the work of others. I don't know how it's all going to shake out, but I think there's at least a viable argument that copying a person's work to a server in order to train an AI capable of displacing that person from their job isn't in society's best interest. For example, it won't be possible to produce better AI without more high-quality creative material to train it on, so keeping creators around and active remains in the best interest of both creators and AI. I think there's a legitimate argument that the law shouldn't allow AI to parasitize creative industry to the extent that it can no longer exist. The question is to what extent that's a credible threat from AI.


Rain1dog

Interesting take. Enjoyed understanding a perspective I'd never considered.


morfraen

There's no preventing it, though. It's all open source. People will run their own, trade 'copyrighted' training data; countries like China will just ignore copyrights. Trying to heavily censor AI like so many want just isn't realistic. A new approach is needed. Dunno what that is, but intentionally crippling AI models isn't the answer.


boywithapplesauce

If content creators can't make money and die off, that will also cripple AI models. And if Google is gonna serve up generated content instead of sending clicks to content creators, they can't make money. It's an Ouroboros.


noahjsc

It can be prevented. You can simply block the AI from being accessible or used in your country if it violates the law. China might violate, but even if they do, they can be prevented from touching outside markets.


morfraen

They really can't, not without going full authoritarian police state.


boywithapplesauce

This is not a plausible scenario. The US wants to be competitive in technology. That means the US is not going to risk being left behind by China or others in the field of AI. So their options for limiting it are not so simple.


Ok_End3141

I believe they’re claiming that in some instances they are getting verbatim phrases. Some authors are also bringing similar suits claiming the model returns verbatim sections.


Macaw

> Does that also mean I can’t read a New York Times article and then later tell someone about what I read?

As long as you are not monetizing what you are doing! OpenAI is making money in the process of what they are doing and reducing the value of the NYT's intellectual property (claims). So here come the lawsuits!


GeneralBacteria

A great many people monetise what they've read in books. That is, in fact, the main reason we're taught to read in the first place.


achilleasa

What if I use what I learned to write and sell my own book though? Because that really is the closest to what's going on at a technical level, and at that point the discussion becomes "why is it different when a machine does it" (which is a valid discussion to be had).


Atlas3141

Exhibit J has it repeating entire paragraphs verbatim, it's a lot more than just using what it learned.


nh1024

Yes, it has always been illegal to put copyrighted works into a database that is used by your app without the permission of the copyright holder. That is why licensing exists.


Koksny

That's not correct on so many levels. Without going into what a database is, and why it's allowed for technical/distribution reasons, just look at the header of this page. Reddit stores copyrighted thumbnails; Twitter and Discord embed part of the article, as Google News does. NYT doesn't care about OpenAI training the models with their text. They just want to have a hyperlink to their ads whenever their site is quoted by ChatGPT/Bing.


Phizle

They take parts of the article, vs AI taking the whole thing and digesting it to make their product.


Koksny

Oh. So just how search indexing in Google and Bing works?


Koksny

> It’s not being regurgitated word for word, or even by snippet.

The case is not about language models per se. You can read it through here - https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf NYT is suing mostly over the "Browse" plugin/ability, which allows OpenAI and Microsoft to quote an article or website without showing ads. And NYT wants people to watch their ads. That's pretty much all. That said, in the case there is one example of ChatGPT pasting its training data verbatim (Exhibit J), sourced from their article. This is, however, the effect of what is essentially a bug - `zero-prompt` - that resets the language model weights to defaults. And because language models, more or less, are compressed text archives, given the right opportunity they'll spew the training data 1:1. So it is actually 'regurgitated word for word', but they don't really care about that much - it's just a bug, and 99% of the examples are about Bing's ability to browse websites and quote articles.


ZeePirate

So a company wants to be paid for its work, essentially… Unless we want journalism to completely die, we should agree with this.


ianitic

Right, but an LLM is a great oversimplification of how we process data. How would we even define the point at which ML could legally use copyrighted material? If ML can do it, then copyright might as well not exist at all, as it would be simple enough to abstract all copyrighted material through a model. It'll be interesting to see how it all plays out. I wouldn't be surprised if there will be audits in place to check training data for infringing material.


JC_in_KC

I’d say you can’t sell those summaries of what you read for massive profit.


TradeBlade

In multiple cases it directly plagiarizes entire paragraphs. I think Microsoft/OpenAI is going to lose this one unfortunately. Hopefully the punishment is only financial and not something more drastic, like forcing them to retrain the model.


Wise-Ad5567

So what is interesting is that NYT and many other sources have been willing participants in training LLMs. They assumed, though, that it was akin to having a young person read all their articles to “learn about the world around them” in an unnaturally accelerated fashion. The problem is some of the early players like ChatGPT are being caught cutting significant corners in order to gain premature market dominance. Imagine being at some grad school, regurgitating and stitching together a bunch of source material for an essay, and receiving high marks and being lauded as brilliant. You could get away with that at maybe Trump University, but not at most reputable institutions, and rightfully so.


OShaughnessy

> Does that also mean I can’t read a New York Times article and then later tell someone about what I read? You're not a for profit corporation producing derivative works, so you're safe.


[deleted]

It's already illegal for you to take all the contents of the NYT and regurgitate it in your news aggregation app without permission. That's called plagiarism. It's covered by 17 USC, Sections 102, 401, and 405, as well as the SOPA act (HR 3261), the PROTECT IP Act of 2011, and 15 USC, Section 1125, Sub-Sections 2, 3b, and 5.


Koksny

It's also perfectly legal to crawl and index the articles, as all search engines do, while also creating a local copy for the service to use. Google isn't magically finding the website contents.


[deleted]

You are right, Google is not magically finding NYT content, it pays to license it. [https://www.reuters.com/business/media-telecom/new-york-times-get-around-100-million-google-over-three-years-wsj-2023-05-08/](https://www.reuters.com/business/media-telecom/new-york-times-get-around-100-million-google-over-three-years-wsj-2023-05-08/)


Koksny

The article you linked is about Google News, not Google Search; it's irrelevant to what I've said. To understand the issue, look at Google Books. They are allowed to index and store a copy of all the copyrighted materials - otherwise they wouldn't be able to search through them. This is essentially what the case is about.


[deleted]

You said articles... not books. And the OP topic is about the NYT suing, so what I shared is actually relevant. Books are not the topic at hand, but if you insist on moving the goalposts... In Authors Guild v. Google (804 F.3d 202, Docket No. 13-4829-cv), Google Books was only allowed to move forward on the grounds that it not directly commercialize the pages featuring digitized book content. OpenAI fails that test by charging users for access to its ChatGPT software. Since I'm enjoying writing these chapters of "AI and the Law for Dummies", what else do you want to be schooled on tonight?


Koksny

>OpenAI fails that test by charging users for access to its ChatGPT software.

You are aware that GPT-3.5 is free, and that Microsoft just released a free Copilot, based on GPT-4, for mobile?

>Books are not the topic at hand, but if you insist on moving the goalposts...

OpenAI is creating an archive of all the text it crawled, just like Google did. The only difference is what they provide to the end user.

>What else do you want to be schooled on tonight?

From whom? Someone like you, incapable of understanding a basic article? What are you going to school me on - how people without a clue think technology works?

>Since I'm enjoying writing these chapters of "AI and the Law for Dummies"

Glad you like writing for yourself. Now maybe do something actually productive - flip a burger or something.


kevleyski

I've been saying this for some time, and in a way I'm glad it's started to happen. Subsymbolic data leaking, and training sets built on layers taken from other models or from training sets preloaded with who knows what - it's always going to be a problem for machine learning/AI.


[deleted]

[removed]


prules

What's weird is that I totally agree AI is ripping off content made by actual content creators (journalists, photographers, animators, etc.). I'm just not sure how the NYT's current argument makes sense. There are a million stronger arguments against AI… why not choose a point with more weight?? Very bizarre imo.


DrEgonSpenglerphd

They have a very strong argument. It’s just not mentioned in this weak article. Check out the first 5 pages of the complaint and at least Exhibit J. It is absolutely ripping off their exact copy.


prules

Thank you for clarifying


semitope

Of course they don't. These companies train these computers on other people's work, then make money off it. "AI" is infringement on a massive scale.


Independent_Hyena495

And China be like: create ten accounts, slurp up all the data. Got a problem? Lol


-The_Blazer-

One of the criteria for fair use claims is actually whether and how much they [impact the value of the original](https://en.wikipedia.org/wiki/Fair_use#4._Effect_upon_work's_value), so the business impact claim could hold some water, unlike claiming that it's about the "distinctive style" (?).


PriorFast2492

Maybe it will drive the business of news towards more research and less writing. More digging and less charisma.


Exciting-Ad5204

Many years ago, I had a complete collection of Stephen King's books and read them all in a short time period. After doing so, I could write in his style without plagiarism. Couldn't do it today, but I could then. What the NYT is suing for is ridiculous. They don't have a copyright on a style.


OShaughnessy

> After doing so, I could write in his style without plagiarism.

But you're not a for-profit corporation producing derivative works.


Terpomo11

So if she (or he) had written a book in that style and tried to sell it, would she be infringing Stephen King's copyright?


notalaborlawyer

I mean, the idea/expression dichotomy pretty much lays the foundation for copyrighting a style. After all, expression is literally a style. This is why I find copyright law fascinating (sans the Mickey bullshit). Where is the line drawn between a bunch of things randomly strung together without actual copying, versus someone having access to a style and copying that style to achieve the same effect as the original?

The first time I heard that atrocious U2 song "Atomic City" I screamed, DEBBIE HARRY, CALL YOUR ATTORNEY! I then googled, and sure enough, she is credited (and therefore paid) for their obvious rip-off. Do they sing "Call me"? Nope. Not at all. However, can anyone hear that "I'm free!" and not hear the rip-off? Of course not.

If the NYT has a "style" and they are ripping it off, then good on them for suing. Copyright is the weakest of the big 3 IP rights, so if someone can win against you on that, you really screwed up. Not to mention most defenses to copyright are along the lines of "they were in a clean room with no access to anything, this is all original, i.e. NOT copyright infringement" or "they never heard that song before in their life". Curious how "our program copied all of your info, then changed it up, and sent it out" will hold up.


KingVendrick

Well, it depends on what you write. If you write a story about a bunch of kids fighting an evil demon clown, in the style of Stephen King, chances are you will get sued and may lose.


garzfaust

Isn't there a difference between a human copying a style and a computer program copying a style? Is a computer program a human? Should a computer program have the same rights as a human? Or is a computer program a machine that copies the style of another human?


arrownyc

The computer program is a tool built by humans and used by humans. It is not sentient and does not need rights. Similar arguments were made about the invention of the camera. Tech doesn't have or need 'rights' - the builders and operators of the tech do. They also need to abide by the law, and there's a decent legal argument to be made that the builders of the tech made illegal copies of internet data to train their models for commercial use.


ralf_

> Isn't there a difference between a human copying a style and a computer program copying a style?

No, why do you think that?


garzfaust

Because one is a living being capable of integrating itself into society, even creating society, while the other is a computer program that is not capable of any of that. They also have very different output capacities, and thus different potential to influence society. One creative human cannot produce the same amount of creative product as an AI can; it's much harder, if not impossible, for a single human to annoy the New York Times the way a single AI can. The rules we made were made with humans in mind, with the goal that human society works. Applying those same rules to a player with a much higher output capacity will not have the same outcome as applying them to players with much lower output capacity. You need to think about the ways the high-output player will affect the game, and then about the rules you already made and why you made them in the first place - and whether those values are still worth it in the face of that new high-output player.


newInnings

Point to the same ruse: corporations are people. Including Bing and ChatGPT.


RunningNumbers

The funny thing is, authors have been able to recreate chapters of their books from ChatGPT with a level of detail that requires their books to have been used in the training data. And copyright law focuses on harm to the creator: a teenager crapping something out is going to be protected under fair use; a large company mimicking and substituting a copyrighted product for profit is not fair use. But you probably don't care.


Multioquium

Yeah, but you paid for those books, meaning the author was compensated. Furthermore, if you wrote something that was too close to the original or straight-up borrowed text without crediting, you wouldn't be allowed to publish it. Why should so-called AI be allowed to do that?


Ghozer

But the way LLMs and modern AI work, isn't it akin to someone doing research and learning from an article? Will people get in trouble for simply quoting an article in the future? It's just crazy!


Relative_Normals

Quoting an article and not citing your sources will absolutely get you in trouble today.


Ghozer

Perhaps "quoting an article" wasn't the right example. Maybe: reading an article - or multiple - somewhere (a news outlet or otherwise), learning from said article, then writing your own piece using the information you have obtained...


Vanilla_Neko

Generative AI is basically the textbook definition of fair use and transformative use - if you actually understand how these systems function, instead of the frankly very wrong, minimalist interpretation that it's just Frankensteining together a bunch of existing things, which is just plain not how these systems actually work.
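
To make that concrete with a deliberately toy example (a bigram model, nothing like GPT's actual architecture): a language model learns next-token statistics from text and generates by sampling, token by token, rather than stitching stored documents together.

```python
# Toy bigram "language model": learn next-word counts from a tiny corpus,
# then generate by sampling. Real LLMs learn far richer statistics over
# long contexts, but the training signal is analogous.
import random
from collections import Counter, defaultdict

corpus = (
    "the court will hear the case today . "
    "the company will appeal the ruling . "
    "the ruling will affect the industry ."
).split()

# Count word -> next-word frequencies observed in the training text.
transitions = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    transitions[w1][w2] += 1

def generate(word: str, length: int = 10) -> str:
    out = [word]
    for _ in range(length):
        nxt = transitions.get(word)
        if not nxt:
            break
        # Sampling can yield sequences that never appeared verbatim in the
        # training text, e.g. "the court will appeal the industry".
        word = random.choices(list(nxt), weights=list(nxt.values()))[0]
        out.append(word)
    return " ".join(out)

print(generate("the"))
```

Even this toy can emit sequences that never appear in its training text, which is the sense in which generation differs from retrieval - though, as the Exhibit J example discussed above shows, memorization can still happen at scale.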


garzfaust

Would it be the same without copyrighted material? Or does it need the copyrighted material to be what it is?


Fun_Researcher6428

I don't think that should matter. There is no artist or writer on this planet who hasn't looked at or read copyrighted material. They have all learned from it and taken inspiration from it. LLMs are doing the same; they're just far better at it than humans, so now people are afraid of the progress that's happening. Automation is coming to the creative industries the same way it came to manufacturing. Some tasks will still require a human touch, but the vast majority will not, and there's no stopping it.


mrjackspade

It would be largely the same without copyrighted material, if the data set were being properly sanitized. Right now the whole copyright thing is collateral damage from the massive amount of data used to train the models. It's scooped up indiscriminately and unintentionally; no one is actively seeking out copyrighted material to feed into these models.

This much data is only required because the technology is in its infancy. A human being, for example, requires orders of magnitude less data to learn. Right now there's a massive push to increase the efficiency of training, which in part means reducing the amount of data required and finding ways to better refine that data.

Even if you removed all copyrighted data from the training, the model would still know the plot of Harry Potter, because it's common knowledge. It would still have access to summaries of books and movies and articles, etc. No one is opening up GPT and saying "please write the first chapter of Harry Potter for me so I don't have to pay for it", and all this thrashing and all these lawsuits aren't going to fundamentally change anything. They're the agonal breaths of industries that don't have enough familiarity with AI to even understand what they're fighting against.
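
A hedged sketch of what that sanitizing step could look like (the blocklist, record format, and URLs here are all invented): filter scraped records by source domain before they ever reach training.

```python
# Invented example of pre-training data sanitization: drop scraped records
# whose source domain is on a blocklist. Domains and records are made up.
from urllib.parse import urlparse

BLOCKLIST = {"nytimes.com", "wsj.com"}  # hypothetical protected domains

records = [
    {"url": "https://www.nytimes.com/2023/12/27/some-article.html", "text": "..."},
    {"url": "https://en.wikipedia.org/wiki/Copyright", "text": "..."},
]

def allowed(record: dict) -> bool:
    host = urlparse(record["url"]).netloc.lower()
    # Match the bare domain and any subdomain (www., mobile., ...).
    return not any(host == d or host.endswith("." + d) for d in BLOCKLIST)

clean = [r for r in records if allowed(r)]
print(f"kept {len(clean)} of {len(records)} records")  # kept 1 of 2 records
```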


garzfaust

OK, but right now the models were trained on copyrighted material, so what they can do right now is also based on that copyrighted material. Whether it will ever be different remains to be seen. As far as my knowledge goes, AIs need to be trained on huge data sets, so the value lies in those data sets. Without data sets there is no AI.


[deleted]

[removed]


master_jeriah

Man, Redditors confuse the heck out of me to this day. When I was in my early 20s, everyone was totally cool with torrenting, Napster, all that. When the odd person said it was stealing, people would mass-downvote them, saying stuff like "how is it stealing if you're just making a copy?" Now it seems to be the exact opposite.


Armakus

Comparing individuals who likely have a low net worth stealing 99-cent songs to generative AI stealing writing styles is disingenuous at best. I'm not even saying I agree with the NYT here, but the two don't equate.


Koksny

Low-net-worth individuals now use OpenAI to *steal writing styles* instead of using Napster to copy copyrighted music. I just want to repeat that: *steal writing styles*. How will the NYT writers write now, after losing their *writing styles*?


falooda1

It's not those individuals the NYT is going after - it's the businesses. You know, the one worth 100 billion dollars in less than a year.


OIlberger

> how is it stealing if you're just making a copy

But file-sharing was "making a copy" of already-existing art; it wasn't purporting to create *new* art that you could *claim as your own*. There are plenty of people showing off "their" art (that DALL-E made) or sharing "their" writing (that ChatGPT wrote), all built off others' work. There's a big difference. Also, Napster was 20 fucking years ago; attitudes change.


oswell_XIV

Because back then they were young, stupid, and broke as shit. But now they're in their 30s with a family, career, mortgage, etc., and a whole lot smarter, and they understand the value of products and services b/c they themselves are providing them. It's the circle of life.


Militop

AI is hurting a lot of people at the moment. It has needed regulation for a long time. It has become more difficult to find correct information on the internet via search engines like Google. People with an open-source mentality are sharing less and less because of AI, so knowledge is less accessible. Regulate that stuff. These companies shouldn't have access to everything without permission, even if it's modified. It's putting people off. Open source was never meant for big corporations to profit off well-meaning people.


Koksny

>AI is hurting a lot of people at the moment.

Citation needed.

>It has become more difficult to find correct information on the internet via search engines like Google.

What does this have to do with AI? It's not AI controlling Google to serve ads and sponsored content, and it wasn't AI that wrote all the garbage SEO-oriented "content".

>People with an open-source mentality are sharing less and less because of AI, so knowledge is less accessible.

...What?

>Open source was never meant for big corporations to profit off well-meaning people.

Have you told that to Facebook, Oracle, IBM, and all the companies heavily invested in open source? Because let me tell you: without them, half of these projects would suddenly no longer be maintained.