Can Claude Do My R&D Tax Credit for Me?

By Firas Abuzaid
These days, it feels like Large Language Models (LLMs) are improving constantly. OpenAI released GPT 5.5 in late April; one month later, one of their internal models solved a problem in discrete geometry that had bamboozled mathematicians for the last 80 years. Not to be outdone, Anthropic started Project Glasswing to secure codebases all over the world by leveraging their private Mythos model. They also countered GPT 5.5 with Opus 4.8, which dropped a week ago, and it’s been receiving rave reviews.

More importantly, enterprises have begun to fully embrace Claude Code and OpenAI Codex in their day-to-day operations, and their widespread adoption portends a future where knowledge work will always be AI-assisted, if not fully automated. The appetite for AI has never been greater.

Of course, the accounting and tax world is not immune to these trends. Which is why lately we’ve been asked the same question over and over in our sales calls:

Why can’t I just ask Claude/Codex/Copilot/OpenClaw to file my R&D Tax Credit for me? If it can build and deploy an app from scratch, can it take care of that, too?

The short answer is No. The longer answer gets to the heart of what we’re building at Neo.Tax.

Tokens are Cheap[1]. Correctness is not.

First off, we should acknowledge: the question is not a crazy one. In fact, as a company that exists at the center of the AI + Tax revolution, we understand it better than anyone. LLMs can now automate lots of easy accounting responsibilities—and even some more complicated, long-running tasks, too. With the right prompts, you can coax Claude or Codex into handling your personal tax return; it’ll certainly do a good enough job, especially for basic W-2 returns.

But the problem is, “good enough” is simply not enough for business taxes. When it comes to maximizing credits while minimizing risk, customers demand perfection. An off-the-shelf LLM can get impressively close to an answer, but will it get it right every single time? And, for certain details that can’t always be found, will it hallucinate? Will it fill in the blanks with guesstimates or full-on fabrications to complete your R&D Tax Credit?

That obviously doesn’t fly for our customers; they need something much more reliable and—more importantly—auditable. And a log of every message and tool call from a session with Claude or Codex won’t suffice for the IRS. They need hard evidence that justifies every aspect of the credit filing, down to each employee who worked on R&D.

Mind the Context Window

LLMs are ingenious at solving problems within a narrow context. And the frontier labs are hard at work expanding the context windows of their models, so they can handle bigger and bigger problems.[2]

Unfortunately, though, there’s no such thing as a free lunch: as you feed the LLM more and more input tokens to take advantage of that larger context window, its performance starts to degrade, and the LLM fails to understand the connective tissue and patterns that are required to arrive at the correct answer. This “context rot” phenomenon has been well documented online, and it’s especially prevalent for the needle-in-a-haystack queries that often come up in the world of tax and accounting.

As we explained earlier, for the IRS, only perfect is good enough, and as the input data scales, simply dumping all that data into an LLM will take you further from perfection. That’s why context engineering is key.
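As one small illustration of context engineering, the sketch below packs only the most relevant tickets into a fixed token budget instead of sending everything to the model. It is a minimal example under stated assumptions, not a description of the Neo.Tax pipeline: the `Ticket` class, the `relevance` score (assumed to come from some upstream ranking step), and the four-characters-per-token estimate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    ticket_id: str
    text: str
    relevance: float  # hypothetical score from an upstream ranking step


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly four characters per token.
    return max(1, len(text) // 4)


def pack_context(tickets: list[Ticket], token_budget: int) -> list[Ticket]:
    """Greedily keep the most relevant tickets that fit within the budget."""
    selected: list[Ticket] = []
    used = 0
    # Highest relevance first; ticket_id breaks ties so the order is stable.
    for ticket in sorted(tickets, key=lambda t: (-t.relevance, t.ticket_id)):
        cost = estimate_tokens(ticket.text)
        if used + cost <= token_budget:
            selected.append(ticket)
            used += cost
    return selected


# Made-up tickets for illustration only.
tickets = [
    Ticket("ENG-1", "Prototype a new caching layer " * 10, relevance=0.9),
    Ticket("ENG-2", "Bump dependency versions " * 10, relevance=0.2),
    Ticket("ENG-3", "Benchmark three index designs " * 10, relevance=0.8),
]
packed = pack_context(tickets, token_budget=150)
print([t.ticket_id for t in packed])  # ['ENG-1', 'ENG-3']
```

A greedy packer like this is deliberately simple; the hard part in practice is producing a relevance score worth trusting, which is where most of the engineering effort would go.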

It’s not just the volume of data, however; it turns out that project management data poses a unique set of qualitative challenges at scale, too. Here’s a smattering of the problems we’ve faced at Neo.Tax when analyzing the ticketing data of our customers:

1. Finding Signal in the Noise: If you ask an LLM to summarize a large[3] set of Jira tickets into a title and description, it will do an excellent job of finding the “average” of those tickets, so to speak, and generating text that’s coherent and acceptable to the untrained eye. But that’s not what customers need. For example, if a bug fix ticket has a much longer description because it contains a lengthy stack trace, should that be given more weight? What if it took only one day to resolve? Should that matter in the summarization output?

There’s a basic but critical intuition at play here that is hard for any LLM to comprehend: some tickets are more important than others. And, to generate descriptions that are semantically meaningful to the customer, you have to discern the important from the insignificant. Otherwise, you’ll just end up with AI slop. The problem only gets worse as the number of tickets increases—which is precisely when the value of using AI ought to be the highest.

2. Handling Business Idiosyncrasies: The kind of information that’s relevant for determining what’s eligible for an R&D tax credit varies greatly from one company to another. This means that any system that wants to automate this process must be adaptive and flexible in fetching the relevant context. Even within the same company, different teams will use their project management systems in different ways and adopt different conventions. And, at Neo.Tax, we have a strict rule: we do not ask the customer to follow a new set of conventions to use our software. They shouldn’t adapt to us—we adapt to them.

To accomplish that, there has to be some level of intelligence around the LLM, so that we can tailor our outputs to take the customer’s unique usage patterns into account. That way, we can determine when to focus within teams and adjust accordingly, and when to look across teams to leverage broader patterns that should be captured for a whole organization.

3. Categorizing Requires Broader Context: It may not seem like this at first glance, but ticketing data is actually network data. There are all sorts of linkages between tickets: parent-child, blocking/blocked-by, and related-to, to name a few. And these relationships and hierarchies matter quite a bit to our customers; ignoring the edges in the network and treating all tickets as atomized inputs turns out to produce poor results.[4]

This may be the most challenging aspect of analyzing, categorizing, and grouping tickets, and it is one that off-the-shelf LLMs are not always well suited for. For each ticket, the amount of potentially relevant information can explode, which means you need a specialized strategy to identify which related tickets are relevant for determining R&D projects.
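To make the "ticketing data is network data" point concrete, here is a minimal sketch of one way to keep related context from exploding: a breadth-first search over ticket links, limited by hop count and by link type. The ticket IDs, the link-type names, and the `max_hops` and `allowed_types` parameters are all hypothetical, and this is only an illustration of the general idea, not the strategy Neo.Tax actually uses.

```python
from collections import deque

# Hypothetical link data: (source ticket, link type, target ticket).
LINKS = [
    ("ENG-10", "parent-of", "ENG-11"),
    ("ENG-10", "parent-of", "ENG-12"),
    ("ENG-11", "blocks", "ENG-13"),
    ("ENG-13", "related-to", "ENG-14"),
    ("ENG-20", "related-to", "ENG-21"),
]


def build_graph(links):
    """Build an undirected adjacency map, keeping the link type on each edge."""
    graph = {}
    for source, link_type, target in links:
        graph.setdefault(source, []).append((target, link_type))
        graph.setdefault(target, []).append((source, link_type))
    return graph


def related_context(graph, start, max_hops, allowed_types):
    """Return {ticket: hop distance} for tickets reachable within max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor, link_type in graph.get(current, []):
            if link_type in allowed_types and neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    del distances[start]  # the starting ticket is not its own context
    return distances


graph = build_graph(LINKS)
context = related_context(
    graph, "ENG-11", max_hops=2, allowed_types={"parent-of", "blocks"}
)
print(context)  # {'ENG-10': 1, 'ENG-13': 1, 'ENG-12': 2}
```

Here ENG-14 is excluded because it is reachable only through a `related-to` link, and ENG-20 and ENG-21 sit in a separate component. Which link types and hop limits make sense would vary by customer and team.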

A Complex Problem Calls for a Specialized Approach

As you start to understand the challenge of digesting and analyzing hundreds of thousands of tickets into relevant projects—which then must be judged as qualified or unqualified for the sake of an R&D tax credit—you realize why a specialized approach is not just preferred but necessary.

To that end, our engineers and data scientists have built a pipeline of AI models and algorithms to analyze all this project management data using a bottom-up approach. Our pipeline includes both proprietary models and off-the-shelf LLMs, and, from end to end, our harness guides the models to the correct results while also allowing for customization to the business’s needs. When a customer onboards to Neo.Tax, we learn how their project management systems are organized, and use that information to adapt our algorithms to their data. By taking the time to understand the customer’s specific concerns, Neo.Tax can solve R&D at scale reliably and accurately.

Our approach was built with four principles in mind:

1. Infinite scalability: Our system becomes more reliable as it ingests more data.

If that sounds like it contradicts the context rot problem we described above, that’s the point! A naive setup degrades with scale because a single model has to hold the entire haystack in its context window at once. Our pipeline never asks any single model to do that.

That means the broader dataset becomes a source of statistical grounding rather than noise. The more tickets we see from a given team, the more confidently we can establish their baseline—what routine maintenance looks like for them versus genuine technical experimentation—and the sharper our judgment about what actually qualifies. At scale, volume stops being a liability and becomes the very thing that makes us accurate.

2. Reproducibility and auditability: Your favorite LLM might produce a “good enough” filing. Prompt it again and you’ll get a different one, with no way to explain the difference.

Auditability isn’t a feature we bolt on; it’s an essential property of how the pipeline is built. Every project we surface traces back to the specific tickets, the specific employees, and the specific time allocations that justify it, so a tax team can defend any line of the filing down to its source. And because the pipeline is deterministic where it counts, running it twice yields the same result. When the inputs do change, you can diff two runs and see exactly which tickets moved the dollar figure and why. A chat transcript with Claude can’t give you that, but a structured, versioned audit trail can—and that’s the difference between a number you filed and a number you can defend in an examination.

3. Maximizing quality while minimizing cost: When there’s a hammer in your toolbelt (i.e., a frontier-level LLM), everything starts to look like a nail.

The reason dumping a year of data into a frontier model is so expensive isn’t just the token bill. It’s that you’re paying premium rates for the most powerful model in the world to do work a cheap, deterministic algorithm could handle perfectly. Our pipeline routes the bulk of the job through fast and inexpensive deterministic steps, and reserves expensive model calls for the genuinely ambiguous judgment calls where these powerful tools actually earn their keep. The result is that cost scales with the number of hard decisions, not the raw volume of data. You get frontier-model precision exactly where it matters, and nowhere it doesn’t.

4. Privacy and security first and foremost: We have to earn the customer’s right to share their proprietary data with us.

No business is going to hand over a year of payroll and project data to a general-purpose chatbot, and they shouldn’t have to. That’s why we never pass any PII to any of our models. Each Neo.Tax customer’s data is isolated, encrypted both in transit and at rest, and accessible only on a scoped, need-to-know basis, and we retain only what a defensible filing actually requires. We don’t ask you to take our word for it, either: Neo.Tax is SOC 2 Type II and ISO 27001 certified, so our controls are independently audited against the standards your security team already knows. You can read more about our security posture here or visit our Trust Center.
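Principles 2 and 3 above lend themselves to a toy illustration. The sketch below applies cheap deterministic rules first, flags only the tickets no rule covers for a more expensive step, records a reason for every decision, and diffs two runs. Everything in it is an assumption made for the example: the keyword lists, the labels, and the function names are hypothetical, and keyword matching is far too crude to decide real R&D qualification. It is not the Neo.Tax pipeline.

```python
from dataclasses import dataclass

# Hypothetical keyword rules; a real rule set would be tailored per customer.
ROUTINE_KEYWORDS = ("bump version", "fix typo", "update docs")
EXPERIMENT_KEYWORDS = ("prototype", "benchmark", "evaluate approach")


@dataclass(frozen=True)
class Decision:
    ticket_id: str
    label: str   # "qualified", "unqualified", or "needs_review"
    reason: str  # evidence recorded for the audit trail


def classify(ticket_id: str, title: str) -> Decision:
    """Apply deterministic rules first; escalate only ambiguous tickets."""
    lowered = title.lower()
    for keyword in ROUTINE_KEYWORDS:
        if keyword in lowered:
            return Decision(ticket_id, "unqualified", f"rule: matched '{keyword}'")
    for keyword in EXPERIMENT_KEYWORDS:
        if keyword in lowered:
            return Decision(ticket_id, "qualified", f"rule: matched '{keyword}'")
    # Placeholder: in this sketch, ambiguous tickets are only flagged here;
    # a model call or human review would happen in a later step.
    return Decision(ticket_id, "needs_review", "no rule matched")


def run_pipeline(tickets: dict[str, str]) -> dict[str, Decision]:
    # Sorted iteration keeps the output order stable across runs.
    return {tid: classify(tid, title) for tid, title in sorted(tickets.items())}


def diff_runs(before: dict[str, Decision], after: dict[str, Decision]):
    """List (ticket, old label, new label) for decisions that changed."""
    changes = []
    for tid in sorted(set(before) | set(after)):
        old, new = before.get(tid), after.get(tid)
        if old != new:
            changes.append(
                (tid, old.label if old else None, new.label if new else None)
            )
    return changes


run_a = run_pipeline({"ENG-1": "Prototype vector index", "ENG-2": "Fix typo in README"})
run_b = run_pipeline({"ENG-1": "Prototype vector index", "ENG-2": "Investigate flaky latency"})
print(diff_runs(run_a, run_b))  # [('ENG-2', 'unqualified', 'needs_review')]
```

Because the rule-based steps are pure functions of their input, running them twice on the same tickets gives identical decisions, and the only work left for an expensive model is the `needs_review` pile.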

The frontier labs have built genuinely astonishing tools, and where they make sense, we use them, too. But “general purpose” and “correct every single time, and defensible to an auditor” are two very different problems. A model that can write you a sonnet, plan your honeymoon, and ship a web app is a marvel; it is not, on its own, an R&D tax credit engine. Turning hundreds of thousands (or even millions) of messy tickets into a filing you can stake a number on is the problem we set out to solve—and the only one we’re trying to solve. (Except for software capitalization! We built a second dedicated solution just for that.) So when a customer asks whether Claude can file their R&D credit for them, our answer is the same one we’d give about building it in-house: you could get close. We will get it right.

And if problems like these—context engineering at scale, deterministic pipelines wrapped around frontier models, network-structured data that refuses to behave—sound like the kind of thing you’d want to spend your days on, we’re hiring! We’re looking for Senior Software Engineers and Data Scientists to help us build it. Come join us.

Footnotes

  1. Although if you read the news, you’ll notice that’s changing, too!

  2. Claude Opus 4.8 has a context window of 1M tokens, enough to digest entire books!

  3. How large? Think on the order of thousands, or even tens of thousands of tickets.

  4. We know because we tried!
