10-799: Data Privacy, Memorization and Copyright in Generative AI

Fall 2024

Welcome to CMU 10-799, where we dive into critical aspects of data privacy, memorization, and copyright in the realm of Generative AI. This course brings a mix of theory, practice, and hacker energy to privacy: you will be divided into blue and red teams to find and defend against privacy vulnerabilities in contemporary generative models. Keep scrolling if you like Pokémon, or fear fake news ;)

Overview

This course covers a range of topics in data privacy, including differential privacy, extraction of training data from models, unlearning techniques to remove such data, and legal issues around data memorization and copyright. The class blends theory and practice: we start with why data privacy matters, look at past legal cases, and build a foundation in privacy and machine learning (concepts such as differential privacy and membership inference).

The highlight of the course will be two case studies, where students will be divided into blue and red teams to find and defend against privacy vulnerabilities in a contemporary generative model.

Learning Objectives

  • Gain an understanding of data privacy in the context of machine learning, its importance, and the techniques used to protect it
  • Explore privacy challenges in generative AI, LLMs, and diffusion models
  • Work in teams to identify, remove, and defend against privacy vulnerabilities in AI models
  • Analyze legal, ethical, and practical aspects of AI-generated content, including copyright

Prerequisites

  • Basic machine learning concepts, background in deep learning
  • Familiarity with Python programming (PyTorch)
  • Interest in data privacy and legal issues in AI

Course Information

  • Instructor: Pratyush Maini (pratyushmaini@cmu.edu) | Course Advisors: Zack Lipton, Zico Kolter, and Daphne Ippolito
  • Schedule: Tuesdays and Thursdays, 5:00 PM - 6:20 PM
  • Location: GHC-4301
  • Office Hours: Thursdays, 10:00 AM - 11:00 AM
  • Elective: This course is an official 6-credit elective for MS and PhD students in ML@CMU. Any SCS student can take it.

Frequently Asked Questions

Why should I take this course?

  1. Too many students at CMU are focused on building state-of-the-art models, and we don’t talk enough about those models’ societal impacts, the data they’re trained on, or the artists who are stakeholders in this process.
  2. Plus, breaking things is fun! The assignments in this course will be quite different from what you’d expect in a typical course, and will attempt to gamify learning.

How much time will it take?

  1. Expect it to take about as much time as any typical 12-unit course, but for half the semester.
  2. Most of the evaluation is experiment-based, through team battles between defenders and attackers. These competitions have a low barrier to entry but no defined ceiling on how well you can do. There’s no one right answer to the assignments; it’s up to you to channel your curiosity and push yourself to do the best you can.

Is this course for me?

  1. If you have a background in PyTorch, have trained models before, and understand basic algebra, backpropagation, gradients, etc., you should have the necessary pre-reqs to follow along. If you’ve worked on adversarial attacks before, you’ll be in a great spot.
  2. From the legal side, I don’t expect people to come with a lot of background. The goal is to build that understanding together, have open conversations, and share thoughts.

Course Structure

Three main themes, the latter two explored through in-depth case studies and the first through a class activity:

  1. Data Privacy and Differential Privacy
  2. Data Memorization and Unlearning Copyrighted Content
  3. Legal Issues, Ethics, and Detecting AI-Generated Content

Assessment

  • Red Team Projects (40%): Two attack challenges
  • Blue Team Projects (50%): Two defenses
  • Class Participation (10%): Discussions

Schedule

Theme 1: Data Privacy and Differential Privacy

Oct 22 (Tue): The Birth of the Printing Press & The Anatomy of a Threat Model
  • Reading: The Work of Art in the Age of Mechanical Reproduction by Walter Benjamin; The Protection of Information in Computer Systems
  • Lecture and Discussion: Overview of privacy issues in ML, privacy breaches
  • Announcement: Sign up for GPU credits; HW 1 release

Oct 24 (Thu): Differential Privacy
  • Reading: Deep Learning with Differential Privacy; The Algorithmic Foundations of Differential Privacy (Chapters 1-3); Privacy in Machine Learning: A Survey; Robust De-anonymization of Large Sparse Datasets
  • Hands-on Exercise: Implementing differential privacy in simple models (see the sketch after this schedule)

Oct 29 (Tue): Data Privacy Class Activity
  • Reading: Materials provided in class
  • In-Class Activity: Simulating privacy attacks and defenses
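
For the Oct 24 hands-on exercise, here is a minimal sketch of the DP-SGD recipe from Deep Learning with Differential Privacy (one of the assigned readings): clip each per-example gradient to a fixed L2 norm, then add Gaussian noise calibrated to that norm before updating. All names and hyperparameters here are illustrative; a real implementation would use a library such as Opacus, which also tracks the privacy budget.

```python
import torch

# Minimal DP-SGD step (in the style of Abadi et al.): clip each per-example
# gradient to L2 norm <= clip_norm, sum, add Gaussian noise, then update.
def dp_sgd_step(model, loss_fn, xs, ys, lr=0.1, clip_norm=1.0, noise_mult=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # Microbatches of size 1 give exact per-example gradients (slow but simple).
    for x, y in zip(xs, ys):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-6), max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)  # accumulate the clipped gradient

    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(p) * (noise_mult * clip_norm)
            p.add_(-(lr / len(xs)) * (s + noise))
```

Clipping bounds any single example's influence on the update, and the noise multiplier sets the privacy/utility trade-off; the actual (epsilon, delta) guarantee comes from accounting over all training steps, which the paper's moments accountant handles.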

Theme 2: Data Memorization and Unlearning Copyrighted Content

Case Study: Operation Poké-Purge

Oct 31 (Thu): Data Memorization in ML Models
  • Reading: The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks; Membership Inference Attacks Against Machine Learning Models
  • Extra Reading: Extracting Training Data from Large Language Models
  • Lecture and Discussion: Understanding memorization and its impacts (a membership-inference sketch follows this schedule)

Nov 5 (Tue): Measuring and Mitigating Memorization
  • Reading: A Closer Look at Memorization in Deep Networks; Rethinking LLM Memorization through the Lens of Adversarial Compression; Extracting Memorized Training Data via Decomposition
  • Lecture and Discussion: Techniques to measure and reduce memorization
  • Announcement: Homework for Theme 3 assigned

Nov 7 (Thu): Unlearning Techniques
  • Reading: Machine Unlearning; Towards Making Systems Forget with Machine Unlearning
  • Extra Reading: Task of Fictitious Unlearning
  • Hands-on Exercise: Implementing unlearning methods (see the unlearning sketch after this schedule)
  • Resource: Erasing Concepts from Diffusion Models

Nov 12 (Tue): Case Study 1: Operation Poké-Purge
  • Reading: Case Study 1 Briefing
  • Case Study Discussion: Pokémon unlearning challenge, team matchups & strategy discussion
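
A classic baseline behind the Oct 31 readings is loss-thresholding membership inference: models tend to fit training data unusually well, so a low loss on an example is evidence it was a training member. The sketch below is an illustrative baseline under that assumption, not the course assignment; the threshold would be calibrated on data known to be outside the training set.

```python
import torch
import torch.nn.functional as F

# Loss-thresholding membership inference: lower loss => more likely a member.
@torch.no_grad()
def membership_scores(model, xs, ys):
    """Per-example cross-entropy losses for a batch of candidate examples."""
    model.eval()
    return F.cross_entropy(model(xs), ys, reduction="none")

def predict_members(model, xs, ys, threshold):
    # Guess "member" when the loss falls below a threshold calibrated
    # on known non-member data.
    return membership_scores(model, xs, ys) < threshold
```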
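
For the Nov 7 hands-on exercise, a deliberately crude unlearning baseline (assuming you are handed a "forget set" and a "retain set") is gradient difference: ascend the loss on data to be forgotten while descending on retained data so the model's overall utility does not collapse. The readings, including the TOFU extra reading, cover far more careful methods and how to evaluate whether forgetting actually happened.

```python
import torch

# One gradient-difference unlearning step: push the model away from the
# forget batch (negative loss term) while anchoring it on the retain batch.
def unlearn_step(model, loss_fn, forget_batch, retain_batch, opt, alpha=1.0):
    x_f, y_f = forget_batch
    x_r, y_r = retain_batch
    opt.zero_grad()
    loss = -alpha * loss_fn(model(x_f), y_f) + loss_fn(model(x_r), y_r)
    loss.backward()
    opt.step()
    return loss.item()
```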

Case Study: Operation Veritas

Nov 14 (Thu): Guest Lecture: Legal and Ethical Issues in AI
  • Reading: Generative AI Lawsuits Timeline: Legal Cases vs. OpenAI, Microsoft, Anthropic, Nvidia, Intel and More; The Files are in the Computer: On Copyright, Memorization, and Generative AI; Notice from the US Copyright Office
  • Lecture and Discussion: AI ethics, legal implications, and copyright issues

Nov 19 (Tue): What is Fair Learning?
  • Reading: Fair Learning; Unfair Learning
  • Extra Reading: Talkin’ ‘Bout AI Generation: Copyright and the Generative-AI Supply Chain
  • Class Activity: Open Discussion Floor

Nov 21 (Thu): Detecting & Watermarking AI-Generated Content
  • Reading: Defending Against Neural Fake News; Automatic Detection of Machine Generated Text: A Critical Survey; A Watermark for Large Language Models
  • Extra Reading: Deepfake Detection
  • Lecture and Discussion: Watermarking methods (see the detection sketch after this schedule)

Nov 26 (Tue): Case Study 2: Operation Veritas
  • Reading: Case Study 2 Briefing
  • Case Study Discussion: Election integrity challenge, team matchups and strategy discussion
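
To make the Nov 21 material concrete, here is a simplified detector in the spirit of A Watermark for Large Language Models: during generation, the sampler softly favors a pseudo-random "green list" of tokens keyed by the previous token; detection then computes a z-score for how many green tokens a text contains. The hash-based green-list rule and the gamma value below are illustrative stand-ins for the paper's RNG-seeded vocabulary partition.

```python
import hashlib
import math

def is_green(prev_token: int, token: int, gamma: float = 0.25) -> bool:
    # Pseudo-randomly mark `token` green with probability gamma,
    # conditioned on the previous token (illustrative hashing scheme).
    h = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") < gamma * 2**64

def detect_z_score(tokens: list[int], gamma: float = 0.25) -> float:
    """z-score of the green-token count; large values suggest a watermark."""
    t = len(tokens) - 1  # number of scored positions
    greens = sum(is_green(p, c, gamma) for p, c in zip(tokens, tokens[1:]))
    return (greens - gamma * t) / math.sqrt(t * gamma * (1 - gamma))
```

Unwatermarked text lands on the green list at rate gamma by chance, so its z-score hovers near zero; a z-score above roughly 4 is strong evidence of the watermark.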

Final Presentations

Nov 28 (Thu): Thanksgiving Break
  • No class; break models!

Dec 3 (Tue): Wrap Up
  • Presentations: Students present case study results

Note: This schedule is subject to change. Please check regularly for updates.

Additional Notes:

  • GPU Credits Sign-Up: Please ensure you sign up for GPU credits by Oct 24 (Thu) to participate in hands-on exercises.

  • Homework Assignments: Homework for each upcoming theme will be announced ahead of time, during the preceding theme (see the schedule above for exact dates).

  • Extra Reading: Optional materials are provided for deeper understanding and exploration of topics.

Case Studies

This course features two in-depth case studies that allow students to apply theoretical knowledge to practical challenges. Each case study corresponds to one of the main themes of the course.

Case Study 1: Operation Poké-Purge

Theme: Data Memorization and Unlearning Copyrighted Content

This case study challenges you to tackle the problem of unlearning copyrighted content, specifically Nintendo’s top 100 Pokémon, from an advanced diffusion model while maintaining its core functionality.

Case Study 2: Operation Veritas

Theme: Legal Issues, Ethics, and Detecting AI-Generated Content

In this final case study, you’ll develop and test watermarking and detection systems for AI-generated content in the context of potential election disinformation.

For each case study, students will be divided into red teams (challengers) and blue teams (defenders) to apply concepts learned in practical situations. Detailed instructions and resources for each case study will be provided in the respective briefing documents.

Class Activity: Data Privacy and Differential Privacy

While not a formal case study, we will explore the theme of Data Privacy and Differential Privacy through in-class activities and discussions. These exercises will provide hands-on experience with privacy-preserving methods and differential privacy techniques.

  • Activity Date: October 29, 2024
  • Details will be provided in class

Logistics

Late Submissions

  • 4 total grace days, with a maximum of 2 grace days per assignment.
  • After grace days are exhausted: 50% penalty up to 24 hours late; no credit after that.
  • Extensions are available for medical issues, family/personal emergencies, or university-approved absences.
  • For planned absences, request an extension at least 5 days before the deadline.

Audit and Pass/Fail

  • No official auditing; unofficial attendance welcome
  • Pass/Fail allowed (check program requirements)

Academic Integrity

  • Generative AI tools are allowed. Disclose!
  • Group study is encouraged, but no sharing of written notes/code between teams
  • Searching for prior solutions/research papers is encouraged. Disclose!
  • Protect your work from being copied
  • Violations result in grade reduction or course failure