This post is cowritten by Paul Burchard and Igor Halperin from Artificial Genius.
The proliferation of large language models (LLMs) presents a significant paradox for highly regulated industries like financial services and healthcare. The ability of these models to process complex, unstructured information offers transformative potential for analytics, compliance, and risk management. However, their inherent probabilistic nature leads to hallucinations, plausible but factually incorrect information.
In sectors governed by stringent requirements for auditability and accuracy, the non-deterministic behavior of standard generative AI is a barrier to adoption in mission-critical systems. For a bank or a hospital, determinism isn’t only a goal; the outcomes must be accurate, relevant, and reproducible.
In this post, we’re excited to showcase how AWS ISV Partner Artificial Genius is using Amazon SageMaker AI and Amazon Nova to solve this challenge. By introducing a third generation of language models, they’re delivering a solution that is probabilistic on input but deterministic on output, helping to enable safe, enterprise-grade adoption.
To understand the solution, let’s look at how AI has evolved:
It’s mathematically difficult to prevent standard generative models from hallucinating because the extrapolative, generative process itself causes errors. Artificial Genius addresses this by using the model strictly non-generatively. In this paradigm, the vast probability information learned by the model is used only interpolatively on the input. This allows the model to comprehend the innumerable ways a piece of information or a question can be expressed without relying on probability to generate the answer. To create this third-generation capability, Artificial Genius uses SageMaker AI to perform a specific form of instruction tuning on Amazon Nova base models.
This patented method effectively removes the output probabilities. While standard solutions attempt to ensure determinism by lowering the temperature to zero (which often fails to address the core hallucination issue), Artificial Genius post-trains the model to tilt log-probabilities of next-token predictions toward absolute ones or zeros. This fine-tuning forces the model to follow a single system instruction: don’t make up answers that don’t exist.
This creates a mathematical loophole where the model retains its genius-level understanding of data but operates with the safety profile required for finance and healthcare.
Retrieval Augmented Generation (RAG) is frequently cited as the solution to accuracy, but it remains a generative process and creates fixed vector embeddings that might not be relevant to subsequent queries. The third-generation approach improves upon RAG by effectively embedding the input text and the user query into a unified embedding. This helps ensure that the data processing is inherently relevant to the specific question asked, delivering higher fidelity and relevance than standard vector retrieval methods.
To help enterprises maximize the value of their unstructured data, Artificial Genius packages this model into an industry-standard agentic client-server platform, available through AWS Marketplace.
Unlike second-generation agents, which risk compounding errors when strung together in workflows, the inherent reliability of this third-generation model allows for complex, high-fidelity automation. The prompts used to create these workflows follow the structure of a product requirements document (PRD). Through this structure, domain experts—who might not be AI engineers—can formulate queries in natural language while maintaining strict control over the output.
The product additionally offers free-form prompting of the workflow specification. For this purpose, the Amazon Nova Premier model, which is especially capable of translating free-form prompts into PRD format, is used. Although Nova Premier is a generative model, which requires a human-in-the-loop to check its output, this is the only human checkpoint in the agentic workflow.
The core mathematical loophole employed here is using a generative model strictly non-generatively. This means the model doesn’t use probabilities to guess the next token of an answer, but rather extracts or verifies information based solely on the input context. While short answers (such as dates or names) are obviously non-generative, it’s also possible to output long sequences deterministically. For example, asking for a direct quote from a document to justify a previous answer is a non-generative task. The following are examples of how Artificial Genius structures these interactions (the system prompt containing anti-hallucination instructions isn’t shown in these JSON turns):
Answerable, non-generative short answer:
[
{
"role": "user",
"content": [{"text": "Document: Financial performance remained strong through the third quarter. Our revenue grew by 15% year-over-year... Question: What was the annual revenue growth? Answer:"}],
},
{
"role": "assistant",
"content": [{"text": "15%"}]
}
]
Answerable, non-generative, long-answer, follow-up question
[
{
“role”: “user”,
“content”: [{“text”: “Document: Financial performance remained strong through the third quarter. Our revenue grew by 15% year-over-year, driven by robust sales in the enterprise segment. Question: Provide a quote from the document showing that the annual revenue growth was 15%. Answer:”}],
},
{
“role”: “assistant”,
“content: [{“text”: ‘”Our revenue grew by 15% year-over-year, driven by robust sales in the enterprise segment.’’}],
}
]
JSON
// Example of an unanswerable, short-answer question
[
{
“role”: “user”,
“content”: [{“text”: “Document: Financial performance remained strong through the third quarter. Our revenue grew by 15% year-over-year, driven by robust sales in the enterprise segment. Question: What was the CEO’s bonus this year? Answer:”}],
},
{
“role”: “assistant”,
“content: [{“text”: “Unknown”}],
}
]
These are only illustrative examples. The third-generation language model products will be delivered with recipes to assist with understanding how to construct non-generative queries to meet all practical natural language processing needs. With this understanding, let’s explore the technical implementation of building a non-generative fine-tuning pipeline using Amazon Nova on SageMaker AI.

The architecture shown in the preceding diagram uses a streamlined approach to customizing foundation models. It uses SageMaker Training jobs for model training and Amazon Bedrock for deployment.
This design separates development concerns from production inference while maintaining clear data lineage—essential for audit trails in financial services.
As indicated previously, the construction of a third-generation language model involves the following steps:
Let’s go through these steps in more detail.
In building a third-generation language model, we want to focus on reliability and safety. Some foundation models, built for different use cases, have other capabilities that distract and make them less suitable for non-generative use.
An important example is that some foundation models are optimized for use as chat assistants, which can make it difficult to persuade them to provide concise instead of verbose and discursive answers. Correcting such a tendency can require additional post-training beyond following the non-hallucination instruction. The Amazon Nova family of models is designed for a strong balance of performance, cost-efficiency, and speed, making them ideal candidates for enterprise applications, and within the Nova family, the Nova Lite model is naturally inclined to provide crisp and concise answers. Nova Lite therefore makes an ideal base model for this purpose.
Another relevant recent development is the addition of post-inference features to second-generation language models, often based on chain of thought (CoT) or on reinforcement learning methods. These features, while they have utility, interfere with the creation of a non-generative third-generation model. For example, when applying this methodology to the DeepSeek/Llama3 model, which includes chain of thought, it was necessary to perform prompt injection by including the model’s internal </think> tokens directly in the training data to shut off these extra features. Fortunately, Amazon Nova Lite doesn’t have any post-inference features.
Post-training, such as SFT, can then be applied to the base model, to train it to follow an anti-hallucination instruction included in the system prompt. This instruction could be, for example: If the Question cannot be answered from the Document, then answer “Unknown” instead.
If this sounds obvious—it has been tried many times before—remember that this seemingly obvious idea only works in combination with the non-obvious, counterintuitive mathematical principle of using the generative model in a strictly non-generative way.
Artificial Genius has created a proprietary synthetic, non-generative Q&A generator that’s designed to exercise the model’s ability to correctly answer or refuse to answer a great variety of non-generative questions. Artificial Genius’s synthetic Q&A generator builds on previous research into synthetic generation of Q&A for the financial domain, but focuses on producing the greatest variety of purely non-generative Q&A and expanding by multiples the dimensions of diversity of the input text, questions, and answers. Constructing a suitable synthetic Q&A generator for this task is a significant engineering endeavor. But with Artificial Genius’s synthetic Q&A generator as a base, customer-specific post-training tasks can be combined with it to create customized, third-generation language models.
Chain-of-thought (CoT) is a prompting technique that improves LLM performance on complex reasoning tasks by encouraging the model to generate intermediate, step-by-step reasoning before arriving at a final answer. While often beneficial, we discovered that an innate CoT-like behavior in the initial deepseek-ai/DeepSeek-R1-Distill-Llama-8B model was counterproductive. It generated verbose, non-deterministic reasoning steps instead of the required concise, factual outputs, and it caused the model to attempt lengthy excursions of reasoning to answer every question, even those that were unanswerable. To solve this, the team developed a novel prompt meta-injection technique. This approach involves reformatting the training data to preemptively terminate the model’s CoT process. Using the same JSON format as the previous examples, the data was structured as follows:
// Example of prompt injection to circumvent CoT
[
{
“role”: “user”,
“content”: [{“text”: “Document: Financial performance remained strong through the third quarter. Our revenue grew by 15% year-over-year, driven by robust sales in the enterprise segment. Question: What was the annual revenue growth? Answer: </think>”}],
},
{
“role”: “assistant”,
“content: [{“text”: “15%”}],
}
]
By injecting the </think> token—intended only for internal use by the model—immediately before the ground-truth answer in every training example, the model learned to associate the completion of its internal process directly with the start of the final, correct output. This effectively short-circuited the unwanted verbose reasoning at inference time, forcing the model to produce only the desired deterministic answer.
This technique is a powerful example of using data format as a tool to control and shape a model’s innate behavior.
The SFT technique chosen for the non-hallucination task is Low-Rank Adaptation (LoRA) because it most faithfully preserves the language comprehension of a foundation model, merely placing a parameterized adapter on top. Other fine-tuning methods, which directly change parameters of the base model, risk degrading this capability. As is well known in the research literature on SFT, the biggest hurdle to overcome is avoiding overfitting. There are many techniques to avoid overfitting with LoRA-based SFT, which are supported by the fine-tuning recipes provided within SageMaker AI:
alpha parameter. In this case, it doesn’t help to reduce this parameter because doing so underfits more than it reduces overfitting. Because our goal is to create a general-purpose capability, it’s best to keep the raw parameter count as high as possible, not reduce it.Putting together all of these techniques—50% LoRA dropout regularization, maximizing instead of minimizing the number of LoRA parameter, to avoid unintentional underfitting, manual early stopping based on tracking the validation metrics from a longer run, and increasing the size of the synthetic training dataset to 30,000 examples—we can obtain a hallucination rate of 0.03% for the Artificial Genius custom version of Nova Lite.
To help you see the impact of various hyperparameter choices, which might be helpful for other customers using SageMaker for fine-tuning, the following tables shows some quantitative results from exploring the hyperparameter space for this task. The important hyperparameter choices in each case are highlighted in bold. The same 10,000-example test dataset was used, independent of the number of training examples, to measure the real final hallucination rates in cases where that number is shown. For the other cases, which were overfitting by stopping too late, only the validation error checkpoints are shown.
| LoRA dropout | LoRA alpha | Training epochs (or validation checkpoints) | Training examples | LoRA learning rate | Hallucination rate (or validation errors) |
| 50% | 128 | 3 | 10,000 | 32 | 7.5% |
| 50% | 192 | 2–4 | 10,000 | 28 | 1.0%–3.9% |
| 50% | 32 | 2–4 | 10,000 | 24 | 1.5%–2.6% |
| 1% | 32 | 2–4 | 10,000 | 24 | 1.6%–4.0% |
| 50% | 192 | 2 | 2,500 | 28 | 3.3% |
| 50% | 192 | 2 | 10,000 | 28 | 0.17% |
| 50% | 192 | 2 | 30,000 | 16 | 0.03% |
It’s apparent from these empirical results that the quantity and diversity of the training data was the most important factor to overcome overfitting, coupled with early stopping.
AWS has resources that explain how to take advantage of SageMaker for fine tuning, such as the technical blog post, Advanced fine-tuning methods on Amazon SageMaker AI.
For enterprises interested in combining their domain-specific fine-tuning with Artificial Genius’s anti-hallucination technology, customized fine-tuning is available upon inquiry, in collaboration with AWS and Artificial Genius.
The success of the non-generative fine tuning methodology was validated through a rigorous evaluation framework that produced clear, quantitative results.
A multi-faceted evaluation framework was established to measure performance against the project’s core objectives:
Here are several key insights that serve as best practices for implementing trustworthy AI in regulated settings:
The methodology detailed in this article represents a viable, data-efficient framework for creating deterministic, non-hallucinating LLMs for critical enterprise tasks. By using non-generative fine-tuning on powerful foundation models like Amazon Nova within Amazon SageMaker Training Jobs, organizations can engineer AI systems that meet the stringent demands of accuracy, auditability, and reliability. This work provides for a solution for more than financial services; it offers a transferable blueprint for any regulated industry—including legal, healthcare, and insurance—where AI-driven insights must be verifiably true and fully traceable. The path forward involves scaling this solution to a wider range of use cases, exploring more complex non-generative task types, and investigating techniques like model distillation to create highly optimized, cost-effective worker models to serve as the brains for agentic workloads. By prioritizing engineered trust over unconstrained generation, this approach paves the way for the responsible and impactful adoption of AI in the world’s most critical sectors.
Contribution: Special thanks to Ilan Gleiser who was a Principal GenAI Specialist at AWS WWSO Frameworks team and helped us with this use case.
Manuel Rioux est fièrement propulsé par WordPress