We asked AI to assess quotes. It confidently made things up.

May 25, 2026

Stratabid is a platform for strata jobs. A building has a problem, the committee or manager asks for quotes, tradies submit proposals, and then someone has to make sense of them. That last part is where things get messy. Because comparing quotes is not just about price.

One quote may be cheaper because it quietly excludes half the work. Another may be more expensive because it actually includes materials, warranty, emergency attendance, and a proper scope. Another may look polished but somehow avoid answering the actual problem. And sometimes the best quote is not the cheapest one, which everyone knows in theory and then forgets the second they see the final price.

So we thought: this is exactly the kind of thing an LLM should help with. Take the quotes, ask the AI to compare them, and produce an assessment. Simple and beautiful, but completely wrong.

The naive version

The first version was basically: send the quotes to the LLM and ask it to classify them and assess their quality.

Something like: Here are three quotes for this job. Tell me which one is better and why. That was the entire grand architecture.

And, of course, the results were terrible. Not “slightly rough around the edges” terrible. More like “this sounds intelligent until you actually read the quotes” terrible.

The model would make claims that were not in the documents. It would infer things that were not inferable. It would confidently say a quote included something that it did not include. Sometimes it would invent differences between quotes just because the prompt clearly expected differences to exist.

Basically, it was producing confident answers that were not grounded in the documents.

The problem was not that the model was useless. The problem was that we were asking it to make a judgement without giving it the structure needed to judge anything properly. That is where things went wrong.

The missing ingredient was context

The first lesson was painfully simple: the model cannot properly evaluate a quote if it does not understand what the quote is supposed to answer. A quote is not good or bad in isolation. It is good or bad relative to a job, a scope, a set of requirements, and a set of questions that matter for that type of work.

So we changed the approach. Instead of asking the model to “assess the quote”, we started giving it the job questions.

For example, for a strata maintenance job, the important questions might be things like:

Does the quote clearly describe the scope of work?
Does it include labour and materials?
Does it mention exclusions?
Does it provide warranty information?
Does it respond to the actual problem, or just provide a generic service description?
Are there any obvious risks, missing assumptions, or unclear conditions?

That improved the output immediately. Not perfectly, but noticeably.

The model finally had something to evaluate against. It was no longer producing generic AI opinions. It had questions to answer, criteria to follow, and a clearer job to do.
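
To make that concrete, here is a minimal sketch of how the job questions could be turned into an explicit input for the model. It is an illustration, not Stratabid's production code: the `AssessmentQuestion` class, the `key` identifiers, and `build_assessment_prompt` are hypothetical names, and the prompt wording is only an example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssessmentQuestion:
    key: str   # short identifier (hypothetical naming scheme)
    text: str  # the question the model must answer for the quote


# Questions taken from the list above; the keys are made up for illustration.
QUESTIONS = [
    AssessmentQuestion("scope", "Does the quote clearly describe the scope of work?"),
    AssessmentQuestion("labour_materials", "Does it include labour and materials?"),
    AssessmentQuestion("exclusions", "Does it mention exclusions?"),
    AssessmentQuestion("warranty", "Does it provide warranty information?"),
]


def build_assessment_prompt(
    job_description: str,
    quote_text: str,
    questions: list[AssessmentQuestion],
) -> str:
    """Combine the job context, the criteria, and one quote into a single prompt."""
    numbered = "\n".join(
        f"{i}. [{q.key}] {q.text}" for i, q in enumerate(questions, start=1)
    )
    return (
        "You are assessing one supplier quote for a strata job.\n\n"
        f"JOB:\n{job_description}\n\n"
        f"QUESTIONS:\n{numbered}\n\n"
        f"QUOTE:\n{quote_text}\n\n"
        "Answer each question using only the quote text above."
    )
```

The point of the sketch is the shape of the input: the model gets the job and the criteria alongside the quote, instead of the quote alone.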

Then came document parsing and summarisation

Quotes do not arrive in one tidy format. Sometimes they are PDFs. Sometimes they are scanned documents. Sometimes they are emails with attachments, tables, disclaimers, legal terms, payment conditions, warranty notes, and three paragraphs of text that look important but are mostly there because someone copied them from a template in 2014.

So we moved to a small document pipeline. First, the system reads the document. If the quote is a scanned PDF or an image-based file, it needs OCR before the text can be used properly. Then the extracted text is parsed and converted into a cleaner summary of the quote: provider name, price, included work, excluded work, assumptions, warranty, timing, payment terms, risks, unclear items, and any relevant notes from the original document.

That summary then becomes the thing the rest of the AI workflow uses. This matters for two reasons. First, the assessment is no longer based on a messy raw PDF. It is based on a cleaner intermediate version of the quote. Second, we do not need to keep sending the full PDF content into every subsequent LLM call. Once the quote has been parsed and summarised, later steps can work from that smaller representation, which saves tokens and makes the rest of the workflow easier to control.
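
The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: `RawDocument`, `QuoteSummary`, `extract_text`, and `process_quote` are hypothetical names, the summary holds only a subset of the fields mentioned above, and the OCR and summarisation steps are passed in as plain functions rather than tied to any particular library or model.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RawDocument:
    filename: str
    text_layer: Optional[str]  # None when the file is scanned or image-based
    image_bytes: bytes = b""


@dataclass
class QuoteSummary:
    provider: str
    price: Optional[str] = None  # None means the document did not state it
    included_work: list[str] = field(default_factory=list)
    excluded_work: list[str] = field(default_factory=list)
    warranty: Optional[str] = None
    unclear_items: list[str] = field(default_factory=list)


def extract_text(doc: RawDocument, ocr: Callable[[bytes], str]) -> str:
    """Use the embedded text when present; fall back to OCR otherwise."""
    if doc.text_layer and doc.text_layer.strip():
        return doc.text_layer
    return ocr(doc.image_bytes)


def process_quote(
    doc: RawDocument,
    ocr: Callable[[bytes], str],
    summarise: Callable[[str], QuoteSummary],
) -> QuoteSummary:
    """Read the document, then reduce it to the summary later steps will use."""
    text = extract_text(doc, ocr)
    return summarise(text)
```

Later steps would then take a `QuoteSummary` as input rather than the raw file, which is what keeps the full document out of every subsequent call.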

That sounds obvious now, but it was not obvious when we started. Like many AI features, the first instinct was to build one clever prompt. The better solution was to build a pipeline.

Designing the assessment framework became part of the product

Once we saw how much the output depended on the questions being asked, we realised those questions could not just live inside a hardcoded prompt somewhere. They were not random prompt instructions. They were the assessment framework.

For each quote, Stratabid needs to ask the same kind of questions as before: is the scope clearly defined? Are labour and materials included? Are exclusions mentioned? Is there warranty information? Does the quote actually respond to the job, or does it just describe a generic service? Those questions define what “good” looks like. And different job types need different definitions of good.

A plumbing quote should not be assessed the same way as a fire safety quote. A locksmith quote is different from waterproofing. Building management is different again. Strata management proposals have their own weird universe of details, fees, exclusions, and soft promises.

So we added an interface to administer the assessment questions. That changed the feature from “AI compares quotes” into something more useful: Stratabid can define what good looks like for a specific type of job, and then use AI to assess submitted quotes against that definition.
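
As a rough sketch of the idea, the questions can be treated as data keyed by job type instead of text buried in a prompt. Everything here is an assumption for illustration: the names `COMMON_QUESTIONS`, `JOB_TYPE_QUESTIONS`, and `questions_for` are hypothetical, the job-specific questions are invented examples, and a real admin interface would store this in a database rather than in a module-level dictionary.

```python
# Questions that apply to every job type.
COMMON_QUESTIONS = [
    "Does the quote clearly define the scope?",
    "Does it include labour and materials?",
    "Does it mention exclusions?",
    "Does it provide warranty information?",
]

# Hypothetical job-specific additions, invented for illustration only.
JOB_TYPE_QUESTIONS = {
    "plumbing": ["Does it state whether emergency attendance is included?"],
    "waterproofing": ["Does it state the warranty period for the membrane?"],
}


def questions_for(job_type: str) -> list[str]:
    """Return the common questions plus any registered for this job type."""
    extra = JOB_TYPE_QUESTIONS.get(job_type.strip().lower(), [])
    return COMMON_QUESTIONS + extra
```

Because the questions are plain data, changing what “good” means for a job type becomes an edit to a record, not a change to a prompt.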

That is a much better mental model. The AI is not there to magically know everything. It is there to apply a structured assessment framework to messy supplier documents. And that distinction matters.

Haiku 3.5 was okay. Sonnet 4.5 made it click.

At this stage, the system was finally becoming useful.

We were running the AI workflow through Amazon Bedrock, using Anthropic models for now. One practical advantage of that setup is that changing models is relatively easy. The pipeline does not need to be redesigned every time we want to test a different model. The same parsed documents, summaries, assessment questions, quote-specific evaluations, and final comparison can be run through another model and compared.
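
One way to picture that kind of model swap is a small harness that sends the identical prompt to several models and collects the answers side by side. This is a sketch, not the actual Bedrock integration: `run_across_models` and `InvokeFn` are hypothetical names, and the function that really calls the model is injected as a parameter, so the example makes no assumptions about any provider's API.

```python
from typing import Callable

# (model_id, prompt) -> response text. In a real deployment this callable
# would wrap the provider call; here it is injected so the harness stays
# independent of any specific SDK.
InvokeFn = Callable[[str, str], str]


def run_across_models(
    model_ids: list[str], prompt: str, invoke: InvokeFn
) -> dict[str, str]:
    """Send the identical prompt to each model and collect outputs by model id."""
    results: dict[str, str] = {}
    for model_id in model_ids:
        try:
            results[model_id] = invoke(model_id, prompt)
        except Exception as exc:
            # Record the failure instead of aborting the whole comparison.
            results[model_id] = f"ERROR: {exc}"
    return results
```

Since the prompt is built from the same parsed summaries and questions each time, any difference in the results can be attributed to the model rather than to the input.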

With Haiku 3.5, the results were okay. The model could follow the structure, answer the questions, and produce something that looked like a reasonable assessment. But it still felt a bit fragile. Sometimes it missed nuance. Sometimes it stayed too generic.

Then we tried Sonnet 4.5. That was the point where the feature started to feel like what we originally wanted.

The same general pipeline suddenly produced much better results. The answers were more grounded. The reasoning was sharper. The model was better at noticing missing details and explaining why those missing details mattered. It was also better at not over-claiming, which is probably one of the most underrated qualities in an AI system that is supposed to help with real decisions.

This was the moment where the feature crossed the line from “interesting demo” into “okay, this could actually help someone”. Not replace the human decision, but help the human understand the options faster and make a better decision because of it.

The real lesson

The interesting part of this feature was not simply “we used a better model”. The better model helped a lot, obviously. But the real lesson was that the model only became useful once the system around it became more intentional.

The naive version was: Here are the quotes. Tell me which one is best.

The current version is closer to: Here is the job context. Here are the questions that matter for this type of work. Here are the parsed and summarised quote documents. For each quote, assess how well it answers the important questions. Identify missing information, risks, exclusions, and strengths. Then compare the quotes without inventing facts that are not present in the source material.

That is a completely different task. And unsurprisingly, it produces a completely different result.

What I would design first next time

If I had to build this again, I would not start by asking the model for the final answer. I would start by designing the intermediate objects.

What does a parsed quote look like? What does a good assessment question look like? What does evidence look like? What should the model be allowed to say when the document does not provide enough information?

That last one is important. A lot of AI systems fail because they are designed to always produce an answer. But in this use case, “the quote does not say” is often the correct answer. And it is much more valuable than a confident answer that is not supported by the source material.
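
A minimal sketch of what that could mean in practice: make “not stated” a first-class answer, and require every yes or no to carry a snippet that can be found in the quote. The names `Verdict`, `Answer`, and `ground_answer` are hypothetical, and the verbatim substring check is a deliberately simple stand-in for whatever evidence check a real system would use.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    YES = "yes"
    NO = "no"
    NOT_STATED = "not_stated"  # "the quote does not say"


@dataclass
class Answer:
    question: str
    verdict: Verdict
    evidence: str = ""  # snippet from the quote that supports the verdict


def _normalise(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return " ".join(text.lower().split())


def ground_answer(answer: Answer, source_text: str) -> Answer:
    """Downgrade any yes/no verdict whose evidence is not found in the source."""
    if answer.verdict is Verdict.NOT_STATED:
        return answer
    evidence = _normalise(answer.evidence)
    if evidence and evidence in _normalise(source_text):
        return answer
    return Answer(answer.question, Verdict.NOT_STATED, "")
```

For example, a “yes” on warranty backed by the snippet “12 month warranty” survives only if that text appears in the quote; a “yes” with no matching snippet is turned into “not stated”.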

Where this is going

The feature is now much closer to what we wanted. A user can upload or receive quotes, Stratabid can parse and summarise them, apply job-specific assessment criteria, and produce a comparison that is more useful than “this one is cheaper”.

The output can highlight things like unclear exclusions, missing warranty details, weak scope, stronger coverage, or a quote that looks good only if the price is acceptable. That is the kind of information that can help a strata committee or manager have a better conversation.

And for me, that is where AI starts to become interesting in software. Not as a magic oracle. Not as a chatbot bolted onto the side of an app. But as a structured reasoning layer inside a workflow, where the product gives the model enough context, constraints, and source material to do something useful.

That is what finally made the quote assessment feature work. It was not one prompt. It was the system around the prompt.