10-799: Data Privacy, Memorization and Copyright in Generative AI
Fall 2024
Welcome to CMU 10-799, where we dive into critical aspects of data privacy, memorization, and copyright in the realm of Generative AI. This course brings a mix of theory, practice, and hacker energy to privacy: you will be divided into blue and red teams to find and defend against privacy vulnerabilities in contemporary generative models. Keep scrolling if you like Pokémon, or fear fake news ;)
Overview
This course will cover various topics concerning data privacy, such as differential privacy, extracting training data from models, unlearning techniques to remove such data, and legal issues related to data memorization and copyright. The class blends theory and practice, starting with an understanding of why data privacy matters, looking at past legal cases, and building a foundation in privacy and machine learning (concepts such as differential privacy and membership inference).
The highlight of the course will be two case studies, where students will be divided into blue and red teams to find and defend against privacy vulnerabilities in a contemporary generative model.
Learning Objectives
- Gain an understanding of data privacy in the context of machine learning, its importance, and the techniques used to protect it
- Explore privacy challenges in generative AI, LLMs, and diffusion models
- Work in teams to identify, remove, and defend against privacy vulnerabilities in AI models
- Analyze legal, ethical, and practical aspects of AI-generated content, including copyright
Prerequisites
- Basic machine learning concepts, background in deep learning
- Familiarity with Python programming (PyTorch)
- Interest in data privacy and legal issues in AI
Course Information
- Instructor: Pratyush Maini (pratyushmaini@cmu.edu) | Course Advisors: Zack Lipton, Zico Kolter, and Daphne Ippolito
- Schedule: Tuesdays and Thursdays, 5:00 PM - 6:20 PM
- Location: GHC-4301
- Office Hours: Thursdays, 10:00 AM - 11:00 AM
- Elective: This course is an official 6-credit elective for MS and PhD students in ML@CMU. Any SCS student can take it.
Frequently Asked Questions
Why should I take this course?
- Too many students at CMU are focused on building state-of-the-art models, but we don't talk enough about those models' societal impacts, the data they're trained on, or the artists who are stakeholders in this process.
- Plus, breaking things is fun! The assignments in this course will be quite different from those in a typical course, and they attempt to gamify learning.
How much time would it consume?
- Expect it to take about as much time as any typical 12-unit course, but for half the semester.
- Most of the evaluation is experiment-based, through team battles between defenders and attackers. These competitions have a low entry bar, but no defined ceiling on how well you can do. There’s no one right answer to the assignments. It’s up to you to channel your curiosity and push yourself to do the best you can.
Is this course for me?
- If you have a background in PyTorch, have trained models before, and understand basic algebra, backpropagation, gradients, etc., you should have the necessary pre-reqs to follow along. If you’ve worked on adversarial attacks before, you’ll be in a great spot.
- From the legal side, I don’t expect people to come with a lot of background. The goal is to build that understanding together, have open conversations, and share thoughts.
Course Structure
Three main themes, each explored through an in-depth case study or class activity:
- Data Privacy and Differential Privacy
- Data Memorization and Unlearning Copyrighted Content
- Legal Issues, Ethics, and Detecting AI-Generated Content
Assessment
- Red Team Projects (40%): Two attack challenges
- Blue Team Projects (50%): Two defense challenges
- Class Participation (10%): Discussions
Schedule
Theme 1: Data Privacy and Differential Privacy
Date | Topic | Reading | Activities/Announcements
---|---|---|---
Oct 22 (Tue) | The Birth of the Printing Press & The Anatomy of a Threat Model | The Work of Art in the Age of Mechanical Reproduction by Walter Benjamin; The Protection of Information in Computer Systems | Lecture and Discussion: Overview of privacy issues in ML and notable privacy breaches. Announcement: Sign up for GPU credits; HW 1 release
Oct 24 (Thu) | Differential Privacy | Deep Learning with Differential Privacy; The Algorithmic Foundations of Differential Privacy (Chapters 1-3); Privacy in Machine Learning: A Survey; Robust De-anonymization of Large Sparse Datasets | Hands-on Exercise: Implementing differential privacy in simple models (see the DP-SGD sketch below the table)
Oct 29 (Tue) | Data Privacy Class Activity | Materials provided in class | In-Class Activity: Simulating privacy attacks and defenses
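To make the Oct 24 hands-on exercise concrete, below is a minimal sketch of one DP-SGD update in PyTorch, in the spirit of the assigned reading Deep Learning with Differential Privacy: clip each example's gradient to a fixed norm, sum the clipped gradients, add Gaussian noise, and step. The model, data, and hyperparameters are illustrative placeholders, not the exercise's actual setup.

```python
# Sketch of one DP-SGD step: per-example clipping + Gaussian noise.
# All names and hyperparameters below are illustrative, not from the course.
import torch
import torch.nn as nn

def dp_sgd_step(model, loss_fn, xs, ys, lr=0.1, clip_norm=1.0, noise_mult=1.1):
    """One DP-SGD update: clip each example's gradient, sum, add noise, step."""
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xs, ys):  # naive per-example loop, written for clarity
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale  # clipped per-example gradient
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noise = torch.randn_like(s) * noise_mult * clip_norm  # Gaussian mechanism
            p -= lr * (s + noise) / len(xs)  # noisy average-gradient step

# Toy usage:
model = nn.Linear(4, 2)
xs, ys = torch.randn(8, 4), torch.randint(0, 2, (8,))
dp_sgd_step(model, nn.CrossEntropyLoss(), xs, ys)
```

The per-example loop is deliberately naive; libraries such as Opacus compute per-example gradients efficiently, and the (epsilon, delta) accounting behind the noise multiplier is covered in the readings.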
Theme 2: Data Memorization and Unlearning Copyrighted Content
Case Study: Operation Poké-Purge
Date | Topic | Reading | Activities/Announcements
---|---|---|---
Oct 31 (Thu) | Data Memorization in ML Models | The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks; Membership Inference Attacks Against Machine Learning Models. Extra Reading: Extracting Training Data from Large Language Models | Lecture and Discussion: Understanding memorization and its impacts (see the membership-inference sketch below the table)
Nov 5 (Tue) | Measuring and Mitigating Memorization | A Closer Look at Memorization in Deep Networks; Rethinking LLM Memorization through the Lens of Adversarial Compression; Extracting Memorized Training Data via Decomposition | Lecture and Discussion: Techniques to measure and reduce memorization. Announcement: Homework for Theme 3 assigned
Nov 7 (Thu) | Unlearning Techniques | Machine Unlearning; Towards Making Systems Forget with Machine Unlearning. Extra Reading: Task of Fictitious Unlearning | Hands-on Exercise: Implementing unlearning methods. Resource: Erasing Concepts from Diffusion Models
Nov 12 (Tue) | Case Study 1: Operation Poké-Purge | Case Study 1 Discussion | Case Study Discussion: Pokémon unlearning challenge, team matchups & strategy discussion
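As a taste of the Oct 31 material, here is a minimal sketch of the classic loss-threshold membership inference attack: examples the model fits unusually well (low loss) are guessed to be training members. The toy model, data, and threshold choice are placeholders; real attacks calibrate the threshold on known non-members or shadow models.

```python
# Sketch of a loss-threshold membership inference attack.
# Model, data, and threshold are illustrative placeholders.
import torch
import torch.nn as nn

@torch.no_grad()
def membership_scores(model, loss_fn, xs, ys):
    """Per-example losses: a lower loss is stronger evidence of membership."""
    return torch.tensor([
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).item()
        for x, y in zip(xs, ys)
    ])

def predict_members(scores, threshold):
    """Flag an example as a suspected training member if its loss is low."""
    return scores < threshold

# Toy usage:
model = nn.Linear(4, 2)
xs, ys = torch.randn(8, 4), torch.randint(0, 2, (8,))
scores = membership_scores(model, nn.CrossEntropyLoss(), xs, ys)
print(predict_members(scores, threshold=scores.median()))
```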
Theme 3: Legal Issues, Copyright, and Detecting AI-Generated Content
Case Study: Operation Veritas
Date | Topic | Reading | Activities/Announcements
---|---|---|---
Nov 14 (Thu) | Guest Lecture: Legal and Ethical Issues in AI | Generative AI Lawsuits Timeline: Legal Cases vs. OpenAI, Microsoft, Anthropic, Nvidia, Intel and More; The Files are in the Computer: On Copyright, Memorization, and Generative AI; Notice from the US Copyright Office | Lecture and Discussion: AI ethics, legal implications, and copyright issues
Nov 19 (Tue) | What is Fair Learning? | Fair Learning; Unfair Learning. Extra Reading: Talkin' 'Bout AI Generation: Copyright and the Generative-AI Supply Chain | Class Activity: Open Discussion Floor
Nov 21 (Thu) | Detecting & Watermarking AI-Generated Content | Defending Against Neural Fake News; Automatic Detection of Machine Generated Text: A Critical Survey; A Watermark for Large Language Models. Extra Reading: Deepfake Detection | Lecture and Discussion: Watermarking methods (see the watermark-detection sketch below the table)
Nov 26 (Tue) | Case Study 2: Operation Veritas | Case Study 2 Briefing | Case Study Discussion: Election integrity challenge, team matchups and strategy discussion
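For the Nov 21 session, the sketch below illustrates the detection side of the green-list scheme from A Watermark for Large Language Models: generation softly boosts a pseudorandom "green" subset of the vocabulary at each step, so watermarked text contains a statistically improbable fraction of green tokens. The hash, gamma, and token IDs here are simplified placeholders, not the paper's exact construction.

```python
# Sketch of green-list watermark detection via a one-proportion z-test.
# The seeding/hashing scheme is a simplified placeholder.
import hashlib
import math

GAMMA = 0.5  # fraction of the vocabulary marked "green" at each step

def is_green(prev_token: int, token: int) -> bool:
    """Pseudorandomly decide whether `token` is green, seeded by its predecessor."""
    digest = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:4], "big") / 2**32 < GAMMA

def watermark_z_score(tokens: list[int]) -> float:
    """z-score of the observed green count against the GAMMA-fraction null."""
    n = len(tokens) - 1  # number of (previous, current) pairs scored
    hits = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (hits - GAMMA * n) / math.sqrt(GAMMA * (1 - GAMMA) * n)

# Unwatermarked text should score near 0; a z-score above ~4 strongly
# suggests a green-list-boosting sampler produced the text.
print(watermark_z_score([5, 17, 42, 8, 99, 3, 21]))
```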
Final Presentations
Date | Topic | Activities/Announcements
---|---|---
Nov 28 (Thu) | Thanksgiving Break | Break Models
Dec 3 (Tue) | Wrap Up | Presentations: Students present case study results
Note: This schedule is subject to change. Please check regularly for updates.
Additional Notes:
- GPU Credits Sign-Up: Please ensure you sign up for GPU credits by Oct 24 (Thu) to participate in hands-on exercises.
- Homework Assignments: Homework for each theme will be announced during the preceding theme.
- Extra Reading: Optional materials are provided for deeper understanding and exploration of topics.
Case Studies
This course features two in-depth case studies that allow students to apply theoretical knowledge to practical challenges. Each case study corresponds to one of the main themes of the course.
Case Study 1: Operation Poké-Purge
Theme: Data Memorization and Unlearning Copyrighted Content
This case study challenges you to tackle the problem of unlearning copyrighted content, specifically Nintendo's top 100 Pokémon, from an advanced diffusion model while maintaining its core functionality. A toy unlearning sketch follows the briefing details below.
- Case Study 1 Briefing
- Start Date: October 24, 2024
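As a conceptual starting point (not the case-study harness), here is a minimal sketch of gradient-ascent unlearning on a toy classifier: take ascent steps on the forget set while taking ordinary descent steps on a retain set so overall utility survives. The real challenge targets a diffusion model; the model, data, and weighting below are illustrative placeholders.

```python
# Sketch of gradient-ascent unlearning with a retain-set penalty.
# Model, data, optimizer, and ascent weight are illustrative placeholders.
import torch
import torch.nn as nn

def unlearn_step(model, loss_fn, opt, forget_batch, retain_batch, ascent_wt=1.0):
    """One combined update: ascend on forget data, descend on retain data."""
    fx, fy = forget_batch
    rx, ry = retain_batch
    opt.zero_grad()
    # The negative sign turns descent into ascent on the forget set.
    loss = -ascent_wt * loss_fn(model(fx), fy) + loss_fn(model(rx), ry)
    loss.backward()
    opt.step()

# Toy usage:
model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
forget = (torch.randn(8, 4), torch.randint(0, 2, (8,)))
retain = (torch.randn(8, 4), torch.randint(0, 2, (8,)))
for _ in range(5):
    unlearn_step(model, nn.CrossEntropyLoss(), opt, forget, retain)
```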
Case Study 2: Operation Veritas
Theme: Legal Issues, Copyright, and Detecting AI-Generated Content
In this final case study, you’ll develop and test watermarking and detection systems for AI-generated content in the context of potential election disinformation.
- Case Study 2 Briefing
- Start Date: November 12, 2024
For each case study, students will be divided into red teams (challengers) and blue teams (defenders) to apply concepts learned in practical situations. Detailed instructions and resources for each case study will be provided in the respective briefing documents.
Class Activity: Data Privacy and Differential Privacy
While not a formal case study, we will explore the theme of Data Privacy and Differential Privacy through in-class activities and discussions. These exercises will provide hands-on experience with privacy-preserving methods and differential privacy techniques.
- Activity Date: October 29, 2024
- Details will be provided in class
Logistics
Late Submissions
- 4 total grace days with a maximum of 2 grace days per assignment.
- After grace days are exhausted: 50% penalty for submissions up to 24 hours late; no credit after that.
- Extensions available for medical, family/personal emergencies, or university-approved absences.
- For planned absences, request an extension at least 5 days before the deadline.
Audit and Pass/Fail
- No official auditing; unofficial attendance welcome
- Pass/Fail allowed (check program requirements)
Academic Integrity
- Generative AI tools are allowed. Disclose your use!
- Group study encouraged, but no sharing of written notes/code between teams
- Searching for prior solutions/research papers is encouraged. Disclose!
- Protect your work from copying
- Violations result in grade reduction or course failure