The Future of Writing: Postplagiarism & Hybrid Writing?

The image depicts two masked thieves inside what appears to be a bank vault. Instead of the expected cash or gold bars, they are shoveling large stacks of books into burlap sacks. Their literary haul represents the valuable data that large language models are trained on, often copyrighted texts appropriated without permission from the rights holders. The image symbolizes the allegation that AI companies have effectively ‘robbed’ this intellectual property vault of human knowledge to build their language models.

Summary

This assignment is the fifth of eight creative challenges that students complete for Writing with Artificial Intelligence, an undergraduate writing course. It historicizes copyright laws and plagiarism conventions, introduces students to recent copyright controversies, and challenges students to analyze whether society needs to redefine intellectual property standards in response to the ubiquity of GAI tools and the move toward hybrid writing — writing coauthored by machines and humans.

Building on the past four creative challenges, this writing assignment maintains the course’s focus on the critical AI literacies students need in order to use AI thoughtfully rather than offshoring thinking, research, writing, and design to GAI (generative artificial intelligence) tools. Students using powerful AI writing tools — and especially students who aspire to be content creators and knowledge workers — need to understand the key intellectual property debates surrounding how those tools are built.

Introduction

In this creative challenge, you will research the controversy over whether AI companies violate copyright by training LLMs (large language models) on copyrighted materials. More specifically, you will examine the high-stakes lawsuit brought by The New York Times against OpenAI and Microsoft over alleged copyright infringement in developing their breakthrough AI system. You will review changes to copyright laws since their inception in Great Britain’s Statute of Anne in 1710, and you will learn about the concept of “fair use,” which is at the heart of the Times’s lawsuit. You will also watch Lawrence Lessig’s TED Talk on “open copyright” and consider his argument that licenses like those of Creative Commons empower authors and societies to be more creative. Finally, you will write a brief editorial for the college newspaper, exploring two questions: Should companies like OpenAI be forced to discard their LLMs and create new ones based solely on licensed content? Or should society remediate and redefine intellectual property laws and academic integrity conceptions to allow for “hybrid writing” — i.e., writing coauthored by humans and generative artificial intelligence tools? In other words, even if LLMs are founded on unethical behavior, given their usefulness, should they be permitted or prohibited, assuming prohibition is even possible?

A Brief Outline of the Debate: The New York Times vs. OpenAI

To summarize the case succinctly: while OpenAI claims that ingesting copyrighted data qualifies as fair use, The New York Times alleges that OpenAI’s bots snuck behind its paywall and stole thousands of articles written by its journalists. The Times and other content creators assert that this scraping of humanity’s creative works — published articles, books, and much of the internet — constitutes unlawful infringement and misuse of authors’ and inventors’ works.

In 2020, OpenAI released GPT-3, its groundbreaking language model that could generate human-like text. But this technological feat also sparked heated controversy. GPT-3 was trained on a colossal dataset of over 300 billion words, comprising millions of copyrighted books, articles, websites, and other creative works, compiled without explicit permission from rights holders. While OpenAI argued this data ingestion fell under fair use provisions, The New York Times and the Writers Guild of America argued it amounted to widespread copyright violation and intellectual property theft on a massive scale.

OpenAI’s move proved pioneering, as other AI startups soon followed suit by releasing their own large language models trained on similar corpora of copyrighted data. At the heart of this growing furor was a fundamental question: can existing copyright policies and laws accommodate the modern era of artificial intelligence, where language technologies attain human-like abilities by consuming and learning from vast troves of proprietary writings and intellectual property?

Source: In December of 2023, The New York Times sued Microsoft, OpenAI, and others for copyright infringement.

What is Copyright? How Have Conceptions of Copyright Evolved Over Time?

Copyright refers to intellectual property laws that grant creators an exclusive legal right to control the copying and public exhibition of their original creative works. The origins of modern copyright law trace back to Great Britain’s Statute of Anne in 1710 — the first legal framework recognizing authors’ rights over their works. In the U.S., the Copyright Act of 1790 established the first federal copyright system, granting authors exclusive rights for a period of 14 years, with the possibility of a 14-year renewal so long as the creator published the work with a copyright notice. The Copyright Act of 1909 expanded these rights, extending the initial term to 28 years with a 28-year renewal term. The Copyright Act of 1976 replaced the renewal system with a single term of the life of the author plus 50 years. Most recently, in 1998, the Sonny Bono Copyright Term Extension Act extended copyright to life plus 70 years.
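The successive term lengths above can be made concrete with a small arithmetic sketch. The function below is purely illustrative (and certainly not legal advice): it computes a simplified expiration year under the regime in force when a work was published, assuming maximum renewal and glossing over many real-world complications, such as the 1831 amendment, registration formalities, and the retroactive reach of the 1976 and 1998 extensions.

```python
def public_domain_year(publication_year, author_death_year=None):
    """Simplified sketch of when a work's U.S. copyright would expire
    under the regime in force at publication, assuming maximum renewal.
    Ignores many edge cases and retroactive term extensions."""
    if publication_year < 1909:
        # Copyright Act of 1790: 14 years plus an optional 14-year renewal
        return publication_year + 14 + 14
    elif publication_year < 1978:
        # Copyright Act of 1909: 28 years plus a 28-year renewal term
        return publication_year + 28 + 28
    elif publication_year < 1998 and author_death_year:
        # Copyright Act of 1976: life of the author plus 50 years
        return author_death_year + 50
    elif author_death_year:
        # Sonny Bono Act (1998): life of the author plus 70 years
        return author_death_year + 70
    raise ValueError("author_death_year required for post-1977 works")

# Under the 1909 Act's maximum 28 + 28-year term, a work published
# in 1928 would have expired in 1984.
print(public_domain_year(1928))  # 1984
```

Note what the example illustrates: under the 1909 Act’s maximum term, a 1928 work such as Steamboat Willie would have expired in 1984; in reality, the retroactive 1976 and 1998 extensions kept it under copyright until the start of 2024.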

Critics of modern copyright law argue the 1998 extension unduly restricts transformative use of existing works. Legal scholar Lawrence Lessig champions “open copyright” — a flexible system in which creators can permit certain reuses of their work in advance through licenses such as those from Creative Commons, fostering a more open and collaborative creative environment while still respecting the original rights holders.

Before modern copyright, many iconic works were born from the practice of remediation. Take Disney, for example. The company built its legacy by transforming classic fairy tales and folk stories — such as those by the Brothers Grimm and Hans Christian Andersen — into beloved animated films. These stories, originally in the public domain, were given new life and reached new audiences through Disney’s creative adaptations. Ironically, Disney later played a significant role in promoting the Sonny Bono Act to keep its own creations from entering the public domain.

What is Fair Use?

Fair use is a legal doctrine that allows limited use of copyrighted material without obtaining permission from the rights holders. It is designed to enable activities such as criticism, comment, news reporting, teaching, scholarship, and research. In the context of AI, fair use becomes a critical factor in determining whether the use of copyrighted materials to train models like GPT-4 falls under permissible activities.

The four factors considered in fair use analysis are:

  1. Purpose and Character of the Use: This factor examines whether the use is for commercial or nonprofit educational purposes. Transformative uses that add new expression, meaning, or purpose to the original work, rather than merely duplicating it, are more likely to be considered fair use. Noncommercial, educational, and transformative uses tend to favor fair use.
  2. Nature of the Copyrighted Work: This factor considers whether the work is primarily factual or creative/fictional. Using factual or informational works is more likely to be deemed fair use compared to highly creative or imaginative works. The scope of fair use is generally broader for factual works.
  3. Amount and Substantiality of the Portion Used: This factor assesses both the quantity and the qualitative importance of the material used in relation to the copyrighted work as a whole. While there are no strict legal rules, using a small portion of the work is more likely to be fair use; as a rough guideline, using less than 10% or 250 words from a larger text may favor fair use, but this is not a bright-line rule. Using the “heart” or most valuable portion of a work, even if quantitatively small, could weigh against fair use.
  4. Effect of the Use on the Potential Market: This factor evaluates whether the use could reasonably be expected to harm the existing or future market for the original work. If the new work can serve as a substitute for the original, reducing its market value or potential sales, it is less likely to be considered fair use.

Regarding satire, while it is generally viewed as a transformative use that can favor fair use when it reasonably parodies the original, it is not an outright exception; the other fair use factors still must be weighed for satire or parody.

It’s important to note that even if a use is potentially fair use based on the above factors, proper attribution and citation of the original work are still required.

Writing Prompt

For this thought experiment, let’s assume OpenAI/Microsoft doesn’t settle the case by sending boatloads of cash to The New York Times. Instead, assume the Times plays the long game and wins big: the newspaper prevails in its lawsuit over OpenAI’s alleged copyright infringement in training its language model on a massive corpus of text data, some of which included Times content. And let’s assume a Hollywood ending — OpenAI is required to pay billions to fund content creators moving forward.

In the wake of this legal battle, we find ourselves navigating the evolving landscape of “postplagiarism” and “hybrid writing.” “Postplagiarism,” as illustrated in the video below, refers to Sarah Eaton’s conception of a remediated era in which conventional notions of intellectual property and plagiarism are reimagined and redefined to ethically accommodate AI’s capabilities. “Hybrid writing” describes the increasingly inevitable practice of humans and AI language models effectively co-authoring content, blending machine-generated and human-authored elements in seamless, indistinguishable ways.

The pivotal question is whether society should radically reshape copyright, IP, and plagiarism policies to explicitly allow AI language models to train on existing data and generate remixed content, or whether companies should be forced to discard current models trained on unlicensed data and start fresh, leaving the established legal frameworks around copyright and plagiarism intact. Even if large language models originated from datasets compiled through legally dubious means, given their immense potential benefits to creators and students, should their existence be permitted moving forward, assuming restitution is made? Or are the copyright violations an unacceptable breach that cannot be overlooked, regardless of AI’s usefulness?

In 500 words, writing for a college newspaper or blog read by student content creators — bloggers, social media managers, videographers, artists, and more — stake out a reasoned position on this dilemma. Your goal is to provide nuanced insights into the ethical complexities while answering one of the following questions:

  1. Should companies like OpenAI be forced to discard their LLMs and create new ones based solely on licensed content?
    Or
  2. Should society remediate and redefine intellectual property laws and academic integrity conceptions to allow for “hybrid writing” — i.e., writing coauthored by humans and generative artificial intelligence tools? In other words, even if LLMs are founded on unethical behavior, given their usefulness, should they be permitted or prohibited, assuming prohibition is even possible?

Your 500-word analysis should not aim to judge definitively whether fair use applies, but rather evaluate the trade-offs and competing principles at stake. Draw from sources covering the NYT lawsuit, OpenAI’s counterarguments, Lessig’s proposals for “open copyright” policies, and your own anecdotal experiences with AI tools. Because your response will be published as an editorial in a college student paper, you do not need a References section in APA 7. However, if you quote, paraphrase, or cite these sources, make that clear in the text of your editorial.

Required Resources/Readings

  1. Lessig’s TED Talk, “Laws That Choke Creativity”
  2. New York Times Lawsuit
  3. OpenAI’s response to The New York Times lawsuit: https://openai.com/index/openai-and-journalism/
  4. Postplagiarism: transdisciplinary ethics and integrity in the age of artificial intelligence and neurotechnology

Related Readings

  1. Authors Alliance. (2023, February 24). Fair use week 2023: Looking back at Google Books eight years later. https://www.authorsalliance.org/2023/02/24/fair-use-week-2023-looking-back-at-google-books-eight-years-later/
  2. Creative Commons. (2023, February 17). Fair use: Training generative AI. https://creativecommons.org/2023/02/17/fair-use-training-generative-ai/
  3. Lee, D. (2024, May 24). AI and the Reddit goldmine: Protecting user-generated content. The Trademark Lawyer. https://trademarklawyermagazine.com/ai-and-the-reddit-goldmine-protecting-user-generated-content/

Schedule

Week 8, Tuesday, 10/15/2024: Writing Workshop
  1. In class, read the introduction to Creative Challenge #5
  2. Use Perusall to annotate “Postplagiarism: transdisciplinary ethics and integrity in the age of artificial intelligence and neurotechnology”
  3. Collaborative work: Working in groups, complete Step 1 of Creative Challenge #5 — experimenting with summary tools on the lawsuit and OpenAI’s response

Wednesday, 10/16: Homework
  1. Watch Lessig’s “Laws That Choke Creativity”
  2. Review the summaries the groups produced and shared at the Course Sandbox
  3. Write a draft of your response to the writing prompt: Should companies like OpenAI be forced to discard their LLMs and create new ones based solely on licensed content? Or should society remediate and redefine intellectual property laws and academic integrity conceptions to allow for “hybrid writing” — i.e., writing coauthored by humans and generative artificial intelligence tools? In other words, even if LLMs are founded on unethical behavior, given their usefulness, should they be permitted or prohibited, assuming prohibition is even possible?

Week 8, Thursday, 10/17/2024: Peer Collaboration (not necessarily critique)

Break into small groups. Each author should share their arguments/drafts. If they want, they can seek critical feedback.

Sunday, 10/20/24: Project Due

Step 1 – Collaborative Summary

Working in groups, each group member should choose a different summarizing tool to summarize both the NYT lawsuit against OpenAI and OpenAI’s response:

  1. Ilene Frank’s List of Summarizing Tools
  2. The 10 Best AI Summarizers Tested and Compared for 2024

Once each group member has created an independent summary, collaborate in Google Docs to combine the summaries into a single, one-page summary of the NYT’s lawsuit and OpenAI’s response. At the top of each summary, provide the relevant bibliographic information in APA 7. Do this for both documents. Link your group’s summary at the Course Sandbox.

Grading Criteria for Summaries

  1. Accuracy: How accurately does the summary represent the main points of both the NYT lawsuit and OpenAI’s response?
  2. Clarity: Is the summary clear and easy to understand?
  3. Conciseness: Does the summary effectively condense the information into 250 words?

Step 2 – Writing @ Home

Write a draft of your response to the writing prompt: Should companies like OpenAI be forced to discard their LLMs and create new ones based solely on licensed content? Or should society remediate and redefine intellectual property laws and academic integrity conceptions to allow for “hybrid writing” — i.e., writing coauthored by humans and generative artificial intelligence tools? In other words, even if LLMs are founded on unethical behavior, given their usefulness, should they be permitted or prohibited, assuming prohibition is even possible? To substantiate your analysis, reflect on what you’ve learned from completing the first four challenges. Recall, e.g., your argument about the benefits of writing without AI. Remember how you used Adobe Firefly or Express to create the infographic. Think about all of the custom chatbots you explored and your own efforts to create a chatbot.

Step 3 – Peer Collaboration

Working in groups, share your draft of your response to the question, “Are potential copyright violations an acceptable cost for realizing the societal benefits of advanced AI language technologies?”

Step 4 – Submission Instructions – Deliverables

  1. Upload to Canvas a Google Docs link to your two collaboratively authored summaries. Be sure your link enables edit-view privileges.
  2. Upload to Canvas a .pdf version of your analysis of the question, “Are potential copyright violations an acceptable cost for realizing the societal benefits of advanced AI language technologies?”

Related Resources

OpenAI, Microsoft, and the NYT will likely reach a settlement. CNBC Television
AI Meets Copyright: The Federalist Society