Engineering·8 min read

Is Your Business Running on Hallucinations? The Case for Deterministic AI Guardrails

Patrick Hennis
AI/ML Engineer

Over the past three years or so, the use of language models in the workplace has risen sharply. Employees everywhere use LLMs to assist with their daily tasks. When OpenAI and University of Pennsylvania researchers examined LLM exposure across occupations in 2023, they found the reach was immense: around 80% of U.S. workers could have at least 10% of their work tasks affected, and roughly 19% could see at least half of their tasks affected [1].

With adoption accelerating across enterprise settings, LLMs are increasingly used to make business decisions. There is a lot of upside to using LLMs well, but outputs taken at face value, without validating the claims they contain, can lead to decisions based on incorrect information. In 2025, Zhang & Lee at the University of Texas at Austin ran a workshop designed as an interactive learning activity: it introduced participants to the types of workplace AI, identified gaps in their understanding, and clarified what they need to know to use it responsibly. Through the workshop and follow-up interviews, they discovered that workers “risk over-relying on AI and blindly trusting its outputs, which can lead to harmful outcomes and loss of critical thinking skills” [2].

Without proper guardrails, deterministic toolsets, and monitoring of generations, businesses relying on out-of-the-box LLMs risk making critical decisions based on hallucinations.

Shortcomings of LLMs

Language models are fundamentally probabilistic. The same prompt can produce different outputs across runs, which makes these models unreliable for tasks that require consistent precision. Cui & Alexander (2026) at the University of Toronto evaluated language models' ability to complete a data analysis task across different combinations of prompting frameworks, models, and settings, totaling 480 attempts. They found “considerable variation in the analytical results even for consistent configurations” [3]. One of the most effective levers for reducing nondeterministic responses is the temperature parameter: setting it to zero makes outputs more deterministic, but it is not foolproof. Even when Cui and Alexander set the temperature to zero, they still found that “some estimates lead to different conclusions about the same research question” [3].
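To make the role of temperature concrete, here is a minimal sketch of temperature-scaled softmax, the step that turns a model's raw token scores into sampling probabilities. The logits are made-up values and the function name is illustrative. The sketch only shows how lower temperatures concentrate probability on the top token; it does not model other sources of variation in a real serving stack, which may be part of why zero temperature does not guarantee identical outputs.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Temperature 0 is a limit (pure argmax), not a valid divisor.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token scores for three candidate tokens.
logits = [2.0, 1.5, 0.5]

for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

As the temperature falls, the first token's probability approaches 1, so sampling becomes close to always picking the highest-scoring token.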

Beyond inconsistency, LLMs have been observed routinely generating confident-sounding but factually incorrect responses in a sycophantic attempt to answer the user. This can happen even when they lack the information or tooling to answer correctly. These confident but incorrect outputs have been dubbed “hallucinations,” and according to Tonmoy et al., they are “arguably the biggest hindrance to safely deploying these powerful LLMs into real-world production systems that impact people's lives” [4]. For example, when a user provides a URL and asks for a summary of the article, LLMs have been known to fabricate a summary based on the URL's slug rather than tell the user that they cannot access the content. If the user is unaware of these limitations and the responses are plausible and confident, it is easy to see how such errors can compound.

Businesses also have limited visibility into how managed LLMs like ChatGPT and Claude behave within their specific workflows, creating a dangerous gap between expected and actual behavior. LLM providers can update their models without notice, through prompt changes or changes to the model weights themselves, meaning the model you tested is not necessarily the one running in production. A pipeline that works one day may break overnight, and without monitoring, you may have no idea until it is far too late. Chen, Zaharia & Zou monitored GPT-3.5 and GPT-4 over several months in 2023, periodically submitting the same task of identifying prime vs. composite numbers. They found that in March, GPT-4 was 84% accurate; by June, after several silent updates, it had dropped to just 51% on the same task [5]. This degradation in performance over time is known as model drift. Because incremental model updates are not always communicated, workflows can lose previously reliable behavior unless a system consistently checks how the model behaves.

The Core Risk
Probabilistic outputs, confident hallucinations, and silent model drift combine into a single business problem: critical decisions made on information that cannot be reproduced, verified, or trusted over time.

How We Address These Issues at Neural

Deterministic tools complete workflows that require deterministic results. We do not rely on LLMs to perform analytical calculations, only to coordinate them. Every measurement, score, and aggregate flows through deterministic infrastructure. Spatial joins, risk scoring, rule engines, and analytics are all executed by components that return identical outputs given the same inputs. Our LLM layer operates above the data plane and is responsible for intent classification, tool selection, and natural language presentation, not for calculating the numbers themselves.

This boundary is enforced through a curated registry of domain-specific tools that wrap the platform's deterministic capabilities. When a user expresses intent, the LLM's role is to select the appropriate tool and supply structured parameters. The tools then execute deterministic code on those inputs. This narrows the model's operating space from an unbounded range of possible outputs to a finite set of tools. For any geospatial or analytical computation, the LLM never returns the answer; it returns the call.
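The "returns the call" pattern can be sketched in a few lines. This is an illustration under stated assumptions, not Neural's actual implementation: the registry layout, the `mean_risk_score` tool, and the JSON shape of the model output are all hypothetical. The point is that the model's text is parsed and validated against a finite registry, and the number itself comes from ordinary deterministic code.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical registry: tool name -> (deterministic function, required parameter names).
TOOL_REGISTRY: Dict[str, Tuple[Callable, Tuple[str, ...]]] = {}

def register_tool(name: str, required: Tuple[str, ...]):
    """Decorator that adds a deterministic function to the registry."""
    def decorator(fn: Callable) -> Callable:
        TOOL_REGISTRY[name] = (fn, required)
        return fn
    return decorator

@register_tool("mean_risk_score", required=("scores",))
def mean_risk_score(scores):
    """Deterministic aggregate: identical inputs always give identical output."""
    return sum(scores) / len(scores)

def dispatch(llm_output: str):
    """Validate the model's structured call, then run the matching tool."""
    call = json.loads(llm_output)
    if not isinstance(call, dict):
        raise ValueError("expected a JSON object describing a tool call")
    name = call.get("tool")
    if name not in TOOL_REGISTRY:
        raise ValueError(f"unknown tool: {name!r}")
    fn, required = TOOL_REGISTRY[name]
    params = call.get("params", {})
    if set(params) != set(required):
        raise ValueError(f"{name} expects parameters {required}, got {sorted(params)}")
    return fn(**params)

# Simulated model output: the LLM picks a tool and parameters, not the answer.
llm_output = '{"tool": "mean_risk_score", "params": {"scores": [0.2, 0.4, 0.9]}}'
result = dispatch(llm_output)
print(result)
```

A call naming a tool outside the registry, or carrying unexpected parameters, is rejected before any computation runs, so a malformed or hallucinated call fails loudly instead of producing a plausible-looking number.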

Monitoring for model drift is important to avoid being caught out by a gradual degradation in accuracy. We maintain and continually update a suite of test cases that runs periodically to limit the impact of drift. We also run this suite against new models before any switch is made. This helps us catch regressions before deploying major changes, keeping your workflow functional. Logging and tracing of LLM interactions also help us detect changes in expected behavior or issues with the deterministic toolset.
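A drift check of this kind can be sketched as a fixed test suite scored against a recorded baseline. Everything here is an assumption for illustration: the test cases borrow the prime vs. composite task from [5], the two "models" are stand-in functions rather than real API calls, and the tolerance is an arbitrary example value.

```python
from typing import Callable, List, Tuple

# Hypothetical fixed test cases: (prompt, expected answer).
TEST_CASES: List[Tuple[str, str]] = [
    ("Is 17 prime? Answer yes or no.", "yes"),
    ("Is 21 prime? Answer yes or no.", "no"),
    ("Is 97 prime? Answer yes or no.", "yes"),
    ("Is 91 prime? Answer yes or no.", "no"),
]

def accuracy(model: Callable[[str], str], cases: List[Tuple[str, str]]) -> float:
    """Fraction of cases where the model's normalized answer matches the expected one."""
    correct = sum(
        1 for prompt, expected in cases
        if model(prompt).strip().lower() == expected
    )
    return correct / len(cases)

def has_drifted(model, cases, baseline: float, tolerance: float = 0.05) -> bool:
    """True when accuracy has fallen more than `tolerance` below the recorded baseline."""
    return accuracy(model, cases) < baseline - tolerance

def stable_model(prompt: str) -> str:
    """Stand-in for a well-behaved model: answers via a deterministic primality check."""
    number = int(prompt.split()[1])
    is_prime = number > 1 and all(number % d for d in range(2, int(number ** 0.5) + 1))
    return "yes" if is_prime else "no"

def drifted_model(prompt: str) -> str:
    """Stand-in for a degraded model that always answers yes."""
    return "yes"

baseline = accuracy(stable_model, TEST_CASES)  # recorded when the workflow was validated
print("stable drifted?", has_drifted(stable_model, TEST_CASES, baseline))
print("degraded drifted?", has_drifted(drifted_model, TEST_CASES, baseline))
```

The same function can gate a model switch: a candidate model is scored on the suite first and is only promoted if it does not fall below the baseline.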

These measures give users the confidence to use results from the Neural ecosystem in real business decisions, backed by data they can actually see and verify.

References

  • 1. GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models. Eloundou, Manning, Mishkin & Rock. OpenAI / University of Pennsylvania (2023)
  • 2. Knowledge Workers' Perspectives on AI Training for Responsible AI Use. Zhang & Lee. University of Texas at Austin (2025)
  • 3. Same Prompt, Different Outcomes: Evaluating the Reproducibility of Data Analysis by LLMs. Cui & Alexander. University of Toronto (2026)
  • 4. A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models. Tonmoy et al. Islamic University of Technology / University of South Carolina / Stanford / Amazon AI (2024)
  • 5. How Is ChatGPT's Behavior Changing over Time? Chen, Zaharia & Zou. Stanford / UC Berkeley (2023)
