Publications

2025

  1. ICML
    Memorization Sinks: Isolating Memorization during LLM Training
    Gaurav R. Ghosal, Pratyush Maini, and Aditi Raghunathan
    In International Conference on Machine Learning, 2025
  2. ICML
    Oral
    Dig-BUGS Workshop
    Unlocking Post-hoc Dataset Inference with Synthetic Data
    Bihe Zhao, Pratyush Maini, Franziska Boenisch, and Adam Dziedzic
    In International Conference on Machine Learning, 2025
  3. arXiv
    OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics
    Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C. Lipton, J. Zico Kolter, and Pratyush Maini
    arXiv preprint arXiv:2506.12618, 2025
  4. ICML
    STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings
    Saksham Rastogi, Pratyush Maini, and Danish Pruthi
    In International Conference on Machine Learning, 2025
  5. Workshop
    Oral
    MemFM Workshop
    MAGIC: Diffusion Model Memorization Auditing via Generative Image Compression
    Gunjan Dhanuka, Sumukh K. Aithal, Avi Schwarzschild, Zhili Feng, J. Zico Kolter, Zachary C. Lipton, and Pratyush Maini
    In The Impact of Memorization on Trustworthy Foundation Models: ICML 2025 Workshop, 2025
  6. arXiv
    BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
    Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, and Matthew Leavitt
    arXiv preprint arXiv:2508.10975, 2025
  7. arXiv
    Safety pretraining: Toward the next generation of Safe AI
    Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, Andy Zou, Zachary C. Lipton, and J. Zico Kolter
    arXiv preprint arXiv:2504.16980, 2025
  8. ICLR Blogpost
    Peeking Behind Closed Doors: Risks of LLM Evaluation by Private Data Curators
    Harshay Bansal and Pratyush Maini
    The Fourth Blogpost Track at ICLR 2025, 2025

2024

  1. ACL
    Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
    Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly
    In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024
  2. ICLR Blogpost
    Reassessing EMNLP 2024’s Best Paper: Does Divergence-Based Calibration for MIAs Hold Up?
    Pratyush Maini and Anshuman Suri
    The Fourth Blogpost Track at ICLR 2025, Aug 2024
  3. NeurIPS
    Oral
    Private-NLP Workshop
    LLM Dataset Inference: Did you train on my dataset?
    Pratyush Maini, Hengrui Jia, Nicolas Papernot, and Adam Dziedzic
    In The Thirty-eighth Annual Conference on Neural Information Processing Systems, Aug 2024
  4. NeurIPS
    Best Paper
    CONDA Workshop
    Rethinking LLM Memorization through the Lens of Adversarial Compression
    Avi Schwarzschild, Zhili Feng, Pratyush Maini, Zachary C. Lipton, and J. Zico Kolter
    In The Thirty-eighth Annual Conference on Neural Information Processing Systems, Aug 2024
  5. NeurIPS
    Understanding Hallucinations in Diffusion Models through Mode Interpolation
    Sumukh K. Aithal, Pratyush Maini, Zachary C. Lipton, and J. Zico Kolter
    In The Thirty-eighth Annual Conference on Neural Information Processing Systems, Aug 2024
  6. COLM
    Oral
    Set-LLM Workshop
    TOFU: A Task of Fictitious Unlearning for LLMs
    Pratyush Maini*, Zhili Feng*, Avi Schwarzschild*, Zachary C. Lipton, and J. Zico Kolter
    In Conference on Language Modeling, Aug 2024
  7. CVPR
    Best Paper
    DPFM Workshop
    Scaling Laws for Data Filtering—Data Curation cannot be Compute Agnostic
    Sachin Goyal, Pratyush Maini, Zachary C. Lipton, Aditi Raghunathan, and J. Zico Kolter
    In Conference on Computer Vision and Pattern Recognition, Aug 2024
  8. ICLR
    Contributed Oral
    Datacomp Workshop
    T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
    Pratyush Maini, Sachin Goyal, Zachary C. Lipton, J. Zico Kolter, and Aditi Raghunathan
    In International Conference on Learning Representations, Aug 2024

2023

  1. ICML
    Can Neural Network Memorization Be Localized?
    Pratyush Maini, Michael C. Mozer, Hanie Sedghi, Zachary C. Lipton, J. Zico Kolter, and Chiyuan Zhang
    In International Conference on Machine Learning, Aug 2023
  2. EMNLP
    Model-tuning via prompts makes NLP models adversarially robust
    Mrigank Raman, Pratyush Maini, J. Zico Kolter, Zachary C. Lipton, and Danish Pruthi
    In Empirical Methods in Natural Language Processing, Aug 2023

2022

  1. NeurIPS
    Best Paper Nominee
    Oral
    SCIS Workshop
    Characterizing Datapoints via Second-Split Forgetting
    Pratyush Maini, Saurabh Garg, Zachary C. Lipton, and J. Zico Kolter
    In Advances in Neural Information Processing Systems, Aug 2022
  2. UAI
    Perturbation Type Categorization for Multiple $\ell_p$ Bounded Adversarial Robustness
    Pratyush Maini, Xinyun Chen, Bo Li, and Dawn Song
    In Proceedings of The 38th Uncertainty in Artificial Intelligence Conference, Aug 2022

2021

  1. ICLR
    Spotlight
    Dataset Inference: Ownership Resolution in Machine Learning
    Pratyush Maini, Mohammad Yaghini, and Nicolas Papernot
    In International Conference on Learning Representations, Aug 2021
  2. CVPR
    Data-free model extraction
    Jean-Baptiste Truong*, Pratyush Maini*, Robert J Walls, and Nicolas Papernot
    In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Aug 2021

2020

  1. ICML
    Adversarial Robustness Against the Union of Multiple Perturbation Models
    Pratyush Maini, Eric Wong, and J. Zico Kolter
    In International Conference on Machine Learning, Aug 2020
  2. EMNLP
    Why and when should you pool? Analyzing Pooling in Recurrent Architectures
    Pratyush Maini, Keshav Kolluru, Danish Pruthi, and Mausam
    In Findings of the Association for Computational Linguistics: EMNLP, Aug 2020
    Also presented at BlackBoxNLP’20