- Pascal's Chatbot Q&As
Louis Hunt's LinkedIn post exposes a significant issue: the apparent leakage of test data from widely used benchmark datasets, such as MMLU & GSM8K, into the training datasets of large language models
If models have already been exposed to the test data during training, their performance metrics are inflated and unreliable, undermining the credibility of these benchmarks.
Question for ChatGPT-4o: Please analyze the LinkedIn post from Louis Hunt and explain what the issue is that he has laid bare. Why is this of concern and do you agree with the views of Vasily Morzhakov and others?
ChatGPT-4o’s answer:
Louis Hunt's LinkedIn post exposes a significant issue: the apparent leakage of test data from widely used benchmark datasets, such as MMLU and GSM8K, into the training datasets of large language models (LLMs). He provided links to thousands of pages of benchmark tests and the code used to regenerate them, suggesting that models like Mistral 7B and Llama 3 have been trained on, or at least influenced by, this test data.
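To make the idea of leakage concrete, one common way to screen for it is to measure how many word n-grams of a benchmark question also appear verbatim in a training corpus. The sketch below is a minimal illustration of that general technique, not the method used in Hunt's post; the function names, the n-gram length of 8, the 0.5 threshold, and the toy strings are all assumptions chosen for the example.

```python
import re


def ngrams(text, n):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def flag_contaminated(test_items, corpus_docs, n=8, threshold=0.5):
    """Return test items whose n-grams largely reappear in the corpus.

    n and threshold are hypothetical defaults, not values from the post.
    """
    # Collect every n-gram that occurs anywhere in the training corpus.
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)

    flagged = []
    for item in test_items:
        item_grams = ngrams(item, n)
        if not item_grams:
            continue  # item shorter than n words: nothing to compare
        overlap = len(item_grams & corpus_grams) / len(item_grams)
        if overlap >= threshold:
            flagged.append(item)
    return flagged


# Toy data for illustration only; these are not real benchmark items.
toy_test_items = [
    "If a train travels 60 miles in 2 hours, how many miles does it "
    "travel in 5 hours at the same speed?",
    "A baker has 12 trays with 8 rolls each and sells 30 rolls, "
    "how many rolls are left?",
]
toy_corpus = [
    "Forum post: If a train travels 60 miles in 2 hours, how many miles "
    "does it travel in 5 hours at the same speed? Answer: 150.",
]

print(flag_contaminated(toy_test_items, toy_corpus))
```

A check like this only catches verbatim or near-verbatim copies; paraphrased or translated test items would pass unnoticed, so a clean result does not prove that a model never saw the benchmark.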
Why is this concerning?
- Integrity of Benchmarking: Benchmarks like MMLU and GSM8K are meant to evaluate how well models perform on unseen tasks. If models have already been exposed to the test data during training, their performance metrics are inflated and unreliable, undermining the credibility of these benchmarks. 
- Cheating Allegations: Training on test data to improve scores is akin to getting exam questions in advance and then bragging about high grades. This practice creates a misleading representation of the model's capabilities, which could distort public perception and commercial decisions. 
- Erosion of Trust in AI Research: If benchmarks become proxies for how well companies can manipulate outcomes rather than assess genuine model performance, trust in the AI research ecosystem erodes. It also casts doubt on the validity of broader claims about the models' capabilities, including advancements toward AGI. 
- Data Privacy and Ethics: Some comments pointed to aggressive web scraping practices, raising further ethical concerns about how companies acquire training data. 
Views of Vasily Morzhakov and Others
Vasily Morzhakov emphasized that this is a "VERY BIG ISSUE," agreeing with the broader sentiment that using test data in pretraining is unethical and deceptive. Other commentators, such as Salomón Muriel and Lloyd Watts, suggested potential solutions, including:
- Creating benchmarks by third parties to prevent contamination. 
- Running evaluations in neutral sandboxes to ensure test data integrity. 
- Acknowledging systemic risks of test data leaks, either through adversarial actions or inherent infrastructure flaws. 
My Perspective
I agree with the concerns raised. This issue compromises the validity of AI benchmarking, a cornerstone for assessing innovation in machine learning. Rigorous standards are essential to ensure fair competition and genuine progress. Implementing secure, independent, and continuously evolving evaluation mechanisms seems necessary to address these challenges.

14 DECEMBER 2024

