ACRE

OpenAI training data ‘contains millions’ of NYT and Daily News works

PUBLISHED

8 November 2024

By:

José Gomes - From the Amazon to the World! contato@acre.com.br

New York Times, OpenAI and Bing apps on phone. Picture: Shutterstock/Tada Images

Millions of stories published by sites including The New York Times and The New York Daily News have been found in three weeks of searching OpenAI’s training dataset.

The news publishers are currently trawling through data to find instances of their copyrighted work being used to train OpenAI’s models – but they say the tech company should be forced to provide the information itself.

They are now asking for a court order requiring OpenAI to “identify and admit” which of their copyrighted content was used to train each of its large language models between GPT-1 and GPT-4o.

According to the ChatGPT creator, which objected to the request, the publishers have asked for information about almost 20 million pieces of content mentioned in the case, “effectively resulting in almost 500 million requests”.

The publishers told the court on Friday that their requests to the AI company for help with inspecting the data “would be significantly reduced if OpenAI admitted that they trained their models on all, or the vast majority, of News Plaintiffs’ copyrighted content”.

A letter to the court also stated: “While they have already found millions of News Plaintiffs’ works in the training datasets, they do not know how many more works are yet to be uncovered – information that OpenAI, as the party that chose to copy these works, should be ordered to provide.”

The New York Times became the first major news publisher to file a copyright case against OpenAI and its partner Microsoft, in December 2023.

The New York Daily News and seven sister publications, all owned by Alden Global Capital, followed suit in April 2024. The two cases have since been combined after OpenAI and Microsoft argued they “involve nearly identical allegations relating to the same new technology”.

In the new letter, the news publishers argued that identifying which of their copyrighted work was taken and used to train the GPT models is “foundational to these cases and informs the scope” of their claims.

“But News Plaintiffs and OpenAI have a fundamental disagreement about who is responsible for identifying this information.”

The publishers said they have served numerous requests since February for information about what’s in OpenAI’s training datasets, to which the tech company replied: “OpenAI will make available for inspection, pursuant to an inspection protocol to be negotiated between the parties, the pretraining data for models used for ChatGPT that it locates after a reasonable search.”

After long-running negotiations, since last month the news publishers have been inspecting OpenAI’s training data under strict conditions, previously described by the court as a “sandbox” (meaning a highly controlled environment in which only certain applications can be run).

But the news publishers said they initially faced “severe and repeated technical issues” stopping them from being able to “effectively and efficiently” carry out the search and “ascertain the full scope of OpenAI’s infringement”.

They complained that the process is “time-consuming, burdensome, and hugely expensive” and said their lawyers and experts had spent the equivalent of 27 days in the OpenAI sandbox but were “nowhere near done”.

The New York Times Company results published on Monday revealed it has so far spent at least $7.6m on the case against OpenAI and Microsoft.

OpenAI: Training data searches are ‘uncharted waters’

OpenAI responded within the same letter that the publishers’ complaints about the inspection have either been resolved or are being actively discussed. It blamed the issues on consultants for the publishers “overwhelming the file system with malformed searches”.

OpenAI added: “Taking a step back, everyone agrees the parties are navigating uncharted waters with training-data discovery.

“There are no precedents for such discovery, where Plaintiffs seek access to several hundred terabytes of unstructured textual data. OpenAI cannot easily identify the specific content that Plaintiffs are interested in, so it did exactly what Rule 34 allows: it invited Plaintiffs to inspect the data as it is kept in the ordinary course. There is no ‘sandbox’. Rather, because the data is far too voluminous to produce, OpenAI built the hardware and software that Plaintiffs need to inspect.

“Specifically, OpenAI organised hundreds of terabytes of training data in an object-storage file system for Plaintiffs’ exclusive use; it built an enterprise-grade virtual machine with the computing power to access, search, and analyse the datasets; it installed hundreds of software tools and tens of gigabytes of Plaintiffs’ data upon their request; and it managed the necessary firewalls and secure virtual private network to support the inspection.”

OpenAI said it would continue to help the publishers overcome technical challenges provided they “engage in good faith” but added: “Unfortunately, this has not always been the case,” accusing them of delaying the process for months and submitting “hundreds of irrelevant requests”.

Representatives for the Authors Guild and progressive newsbrand Raw Story Media have also viewed the OpenAI training data for their own cases.

OpenAI previously asked a judge to force The New York Times to hand over its journalists’ confidential notes. The publisher warned the move would have “serious negative and far-reaching consequences”, and the request was ultimately denied in September.

ACRE

White Coat Ceremony marks start of the journey for Nutrition class XVII – Universidade Federal do Acre

PUBLISHED

31 March 2026

By:

José Gomes - From the Amazon to the World! contato@acre.com.br

On 28 March 2026, the White Coat Ceremony was held for class XVII of the Nutrition course at the Universidade Federal do Acre. The event symbolised the start of the students’ academic journey, marking a moment of commitment to ethics, responsibility and health care.

ACRE

Ufac holds inaugural class of the MPCIM in Epitaciolândia – Universidade Federal do Acre

PUBLISHED

31 March 2026

By:

José Gomes - From the Amazon to the World! contato@acre.com.br

Ufac held the inaugural class for the special cohort of its professional master’s degree in Science and Mathematics Teaching (MPCIM) in the municipality of Epitaciolândia (AC), which also serves residents of Brasileia (AC) and Assis Brasil (AC). This cohort and other initiatives to extend the university into the state’s interior are supported by a parliamentary amendment from federal deputy Socorro Neri (PP-AC). The ceremony took place on Friday the 27th.

The event brought together professors, students and representatives of the local community. The initiative aims to expand and democratise access to postgraduate education in the interior of the state, contributing to regional development, training qualified professionals and strengthening the university’s presence beyond the capital.

The pro-rector for Research and Postgraduate Studies, Margarida Lima Carvalho, stressed that the cohort grew out of stories, commitments and values built up over time. “Today we are not just opening a class. We are opening paths, dreams and futures for the interior of Acre, because when commitment spans generations, it becomes a legacy. And legacy transforms lives.”

ACRE

Ufac receives visit from the RFB for presentation of the NAF project – Universidade Federal do Acre

PUBLISHED

26 March 2026

By:

José Gomes - From the Amazon to the World! contato@acre.com.br

On Wednesday the 25th, Ufac received representatives of the Receita Federal do Brasil (RFB), Brazil’s federal revenue service, in the Rector’s office for a presentation of the Núcleo de Apoio Contábil e Fiscal (NAF) project, an accounting and tax support centre. The coordinators of the Accounting Sciences course took part in the meeting, which focused on the proposal to set up the centre at the university.

The acting rector and pro-rector for Planning, Alexandre Hid, highlighted the initiative’s importance for students and its connection to the integration of extension activities into the curriculum. According to him, the proposal is an opportunity for students and could strengthen the university’s extension work.

RFB tax analyst and Fiscal Citizenship representative Marta Furtado explained that the NAF is a national project aimed at training Accounting Sciences students, with a focus on tax rules, legislation and ancillary obligations. According to her, the centre serves low-income taxpayers and micro-entrepreneurs, and also brings students closer to professional practice.

During the meeting, it was announced that the university and the RFB will sign a technical cooperation agreement. Under the model presented, Ufac will provide space for the centre to operate, while the RFB will offer a training platform, training courses and ongoing support for its activities.

As a next step, the RFB handed over the NAF reference document, with guidance on setting up the space and specifying the necessary equipment. The matter will be forwarded to Ufac’s Institutional Cooperation Office. The expectation expressed at the meeting is that the centre will be integrated into the university’s extension activities.

Also taking part in the meeting were Accounting Sciences professor and deputy course coordinator Cícero Guerra, and tax auditor and RFB delegate in Rio Branco Claudenir Franklin da Silveira.