
“Much remains to be demonstrated before LLMs can be considered fit for producing or assessing assurance arguments in engineering and certification.”— In other words: Not yet ready.

LLMs don’t care about truth. They generate text that sounds good, but that doesn’t mean it’s correct. Researchers call this “BS”: text produced without concern for factual accuracy.

Question for ChatGPT-4o: Please analyze the report “Examining Proposed Uses of LLMs to Produce or Assess Assurance Arguments” and tell me what it says in easy-to-understand language.

ChatGPT-4o’s answer:
The NASA report titled “Examining Proposed Uses of LLMs to Produce or Assess Assurance Arguments” (March 2025) explores the idea of using large language models (LLMs) like ChatGPT to help build or evaluate safety-related arguments for critical systems (e.g., aircraft or nuclear systems). Here's a simplified summary of what the report says:

🌟 What is an assurance argument?

It’s a structured (though not mathematically formal) explanation of why a system is safe, secure, or reliable. These arguments are important in industries like aviation, where demonstrating that a system is safe is a requirement for certification.
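To make that concrete, here is a minimal sketch, loosely in the spirit of Goal Structuring Notation (GSN, which the report's studies also use): a top-level claim, a strategy for breaking it down, sub-claims, and the evidence each one rests on. Every claim, strategy, and evidence item below is invented purely for illustration; none of it comes from the NASA report.

```python
# A toy assurance-argument fragment, loosely following Goal Structuring Notation.
# All names and claims are hypothetical and exist only to show the structure.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Evidence:
    """A 'solution' node: the artifact cited to support a claim."""
    description: str

@dataclass
class Claim:
    """A 'goal' node: a claim about the system, backed by sub-claims or evidence."""
    statement: str
    strategy: str = ""                       # how the claim is decomposed
    subclaims: List["Claim"] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)

argument = Claim(
    statement="The autopilot hands over control safely when a sensor fails",
    strategy="Argue over each identified sensor failure mode",
    subclaims=[
        Claim("Loss of airspeed data is detected within 100 ms",
              evidence=[Evidence("Unit test report (hypothetical)")]),
        Claim("The pilot receives a clear disengagement alert",
              evidence=[Evidence("Simulator trial results (hypothetical)")]),
    ],
)

def show(claim: Claim, depth: int = 0) -> None:
    """Print the argument tree as an indented outline."""
    pad = "  " * depth
    print(f"{pad}Goal: {claim.statement}")
    if claim.strategy:
        print(f"{pad}  Strategy: {claim.strategy}")
    for ev in claim.evidence:
        print(f"{pad}  Evidence: {ev.description}")
    for sub in claim.subclaims:
        show(sub, depth + 1)

show(argument)
```

The point of the structure is traceability: each leaf claim is tied to concrete evidence that a human reviewer can check.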

🤖 Why are people considering LLMs for this?

LLMs could:

  1. Generate parts of safety arguments to save time.

  2. Spot flaws (called “defeaters”) in existing arguments; a short sketch of this idea follows after this list.

  3. Translate arguments into formal formats for machines to check.

The idea is that AI might help make these complex tasks faster and less reliant on human labor.
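As a rough illustration of the second use (defeater-spotting), the sketch below shows the kind of prompt such a tool might send to a model. The argument fragment, the prompt wording, and the call_llm placeholder are all invented for illustration; they are not taken from the NASA report or from any particular vendor's API.

```python
# Illustrative only: asking a model to propose "defeaters" (reasons an
# assurance claim might not hold) for a small, made-up argument fragment.
def call_llm(prompt: str) -> str:
    """Stand-in for a real chat/completion API call. Returns canned text so
    the sketch runs offline; swap in a real client to try it with a model."""
    return ("1. The tests may not cover degraded brake-pad or tyre conditions.\n"
            "2. 'Approved operating conditions' is never defined in the evidence.")

ARGUMENT_FRAGMENT = """\
Goal: The braking subsystem stops the vehicle within 50 m from 100 km/h.
Strategy: Argue from road-test results across approved operating conditions.
Evidence: Test campaign covering dry and wet asphalt.
"""

PROMPT = (
    "You are reviewing a safety assurance argument.\n"
    "List possible defeaters: unstated assumptions, gaps in the evidence,\n"
    "or conditions under which the claim could be false.\n\n"
    f"Argument:\n{ARGUMENT_FRAGMENT}\n"
    "Respond as a numbered list, one defeater per line."
)

if __name__ == "__main__":
    # The report's central caution applies here: whatever the model returns will
    # sound plausible, so each suggested defeater still needs expert review.
    print(call_llm(PROMPT))
```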

🔬 What does the research say so far?

NASA reviewed 14 studies and found that while there are many ideas for using LLMs in this space, none of them prove that LLMs are actually good enough for high-stakes use like aircraft certification.

Some highlights:

  • LLMs can imitate safety argument formats reasonably well — but not reliably or consistently.

  • Some tools built with LLMs produced content that sounded right but was either wrong or misleading.

  • In many cases, LLMs repeated content they might’ve seen during training, so it wasn’t clear if they truly "understood" anything or were just parroting.

⚠️ What are the major concerns?

  1. LLMs don’t care about truth
    They generate text that sounds good, but that doesn’t mean it’s correct. Researchers call this “BS”: text produced without concern for factual accuracy.

  2. LLMs aren’t good at reasoning
    They can mimic arguments, but they struggle with applying logic, dealing with corner cases, or producing original, reliable insights.

  3. Trusting LLMs is risky
    If regulators or engineers rely on LLMs too much, it could lead to errors in safety-critical systems, which is unacceptable.

  4. Human oversight is still essential
    Even the best-performing LLMs only got things mostly right. That’s not enough when lives are at stake.

🧪 How was efficacy tested?

Most studies tested LLMs using:

  • Quizzes about safety logic notations (like Goal Structuring Notation).

  • Comparing LLM-generated arguments to human-written ones.

  • Trying different prompting styles to see what gets the best results.

But many of the tests were flawed because:

  • They used examples the LLM might’ve seen during training.

  • Human reviewers disagreed on how “good” the LLM’s answers were (see the agreement sketch after this list).

  • None of the studies evaluated the impact on real-world certification processes.
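On the reviewer-disagreement point, studies of this kind usually quantify how often raters agree. The snippet below computes Cohen's kappa for two hypothetical reviewers scoring the same LLM outputs; the labels are made up, and the NASA report does not prescribe this particular statistic. It is only meant to show what "disagreement" looks like in numbers.

```python
# Cohen's kappa: agreement between two raters, corrected for chance.
# The reviewer labels below are invented to illustrate the calculation.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n)
              for label in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical reviewers grading six LLM-generated argument fragments.
reviewer_1 = ["good", "good", "poor", "fair", "poor", "good"]
reviewer_2 = ["good", "fair", "poor", "poor", "poor", "good"]

print(round(cohens_kappa(reviewer_1, reviewer_2), 2))  # ~0.48: only moderate agreement
```

Low agreement matters because, if experts cannot agree on what counts as a good argument, a benchmark score for the LLM tells us very little.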

🧭 NASA’s overall conclusion:

“Much remains to be demonstrated before LLMs can be considered fit for producing or assessing assurance arguments in engineering and certification.”
— In other words: Not yet ready.

The report warns against overtrusting LLMs in high-risk scenarios. While they can assist as tools in some parts of the process, full automation is currently too risky.