
US DoD AI chief on LLMs: ‘I need hackers to tell us how this stuff breaks’




On the main stage at the DEF CON security conference in a Friday afternoon session (Aug. 11), Craig Martell, chief digital and AI officer at the U.S. Defense Department (DoD), came bearing a number of key messages. 

First off, he wants people to understand that large language models (LLMs) are not sentient and aren’t actually able to reason.

Martell and the DoD also want more rigor in model development to help limit the risks of hallucination — wherein AI chatbots generate false information. Martell, who is also an adjunct professor at Northeastern University teaching machine learning (ML), treated the mainstage DEF CON session like a lecture, repeatedly asking the audience for opinions and answers.

AI overall was a big topic at DEF CON, with the AI Village, a community of hackers and data scientists, hosting an LLM hacking competition. Whether it’s at a convention like DEF CON or as part of bug bounty efforts, Martell wants more research into LLMs’ potential vulnerabilities. He helps lead the DoD’s Task Force LIMA, an effort to understand the potential and the limitations of generative AI and LLMs in the DoD.


“I’m here today because I need hackers everywhere to tell us how this stuff breaks,” Martell said. “Because if we don’t know how it breaks, we can’t get clear on the acceptability conditions and if we can’t get clear on the acceptability conditions we can’t push industry towards building the right thing, so that we can deploy it and use it.”

LLMs are great but they don’t actually reason

Martell spent a lot of time during his session pointing out that LLMs don’t actually reason. In his view, the current hype cycle surrounding generative AI has led to misplaced expectations and misunderstanding about what an LLM can and cannot do.

“We evolved to treat things that speak fluently as reasoning beings,” Martell said.

He explained that at the most basic level a large language model is a model that predicts the next word, given the prior words. LLMs are trained on massive volumes of data with immense computing power, but he stressed that an LLM is just one big statistical model that relies on past context.

“They seem really fluent, because you can predict a whole sequence of next words based upon a massive context that makes it sound really complex,” he said.
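To make the "predict the next word" idea concrete, here is a toy sketch in Python. It is not how a production LLM works: it is a bigram counter that looks only at the single previous word, whereas the models Martell describes condition on a far larger context. The corpus string and the function names (`train_bigram`, `predict_next`, `generate`) are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, start, length):
    """Chain predictions: each predicted word becomes the context for the next one."""
    out = [start.lower()]
    for _ in range(length):
        nxt = predict_next(counts, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)


# Hypothetical training text, made up for this example.
corpus = "the model predicts the next word and the next word follows the prior words"
model = train_bigram(corpus)

print(predict_next(model, "the"))   # most common word seen after "the"
print(generate(model, "the", 2))    # a short chain of next-word predictions
```

The output reads as plausible English only because the counts came from English text. Nothing in the code checks whether the generated sentence is true, which is the gap between fluency and reasoning that Martell is pointing to.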

The lack of reasoning is tied to the phenomenon of hallucination in Martell’s view. He argued that a primary focus of LLMs is fluency and not reasoning, and that the pursuit of fluency leads to errors — specifically, hallucinations.

“We as humans, I believe, are duped by fluency,” he said.  

Identifying every hallucination is hard and that’s another key concern for Martell. For example, he asked rhetorically, if he were to generate 30 paragraphs of text, how easy would it be to decide what’s a hallucination and what’s not? Obviously, it would take some time.

“You also often want to use large language models in a context where you’re not an expert. That’s one of the real values of a large language model: … asking questions where you don’t have expertise,” Martell said. “My concern is that the thing that the model gets wrong [imposes] a high cognitive load [on a human trying] to determine whether it’s right or whether it’s wrong.”

Future LLMs need ‘five nines’ of reliability

What Martell wants to happen is more testing and the development of acceptability conditions for LLMs in different use cases.

The acceptability conditions will come with metrics that can demonstrate how accurate a model is and how often it generates hallucinations. As the person responsible for AI at the DoD, Martell said that if a soldier in the field is asking an LLM a question about how to set up a new technology, there needs to be a high degree of accuracy.

“I need five nines [99.999% accuracy] of correctness,” he said. “I cannot have a hallucination that says: ‘Oh yeah, put widget A connected to widget B’ — and it blows up.”
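As a rough illustration of what "five nines" implies, the arithmetic can be sketched in a few lines of Python. Martell did not describe how the DoD would measure correctness, so the functions below (`expected_errors`, `meets_target`) and the sample figures are assumptions made for this example, not a described evaluation method.

```python
def expected_errors(nines, n_answers):
    """Expected number of wrong answers at a reliability of `nines` nines.

    Five nines (99.999%) corresponds to an error rate of 1 in 10**5.
    """
    return n_answers / 10 ** nines


def meets_target(observed_errors, n_answers, nines):
    """Check whether an observed error count is within the target error rate.

    Integer arithmetic avoids floating-point rounding near the threshold.
    """
    return observed_errors * 10 ** nines <= n_answers


# At five nines, 100,000 answers may contain about one error on average.
print(expected_errors(5, 100_000))

# Hypothetical evaluation: 3 hallucinations found in 100,000 reviewed answers.
print(meets_target(3, 100_000, 5))
```

Put differently, a model that is right 99.9% of the time would still be expected to produce about 100 wrong answers per 100,000 questions, which is two orders of magnitude short of the bar Martell set.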
