Public Benefits Bench v1

Partners in Evaluation

Key Takeaways

No general-purpose AI model performs well enough to be trusted with SNAP benefits guidance. The top-performing model, Claude Fable 5, provided correct answers to SNAP-related questions only 71.7% of the time, meaning beneficiaries using current AI systems still get incorrect answers nearly 30% of the time.
How models use their tools matters more than what they already know. Since SNAP rules shift constantly and vary by county, the real unlock for AI labs isn’t baking every rule into model weights: it’s improving how models ask follow-up questions and use search tools.
To reliably answer high-stakes benefits access questions, we need better grounding, not better models. General-purpose AI fails on SNAP in part because the knowledge it needs — internal manuals, policy notices, local procedures, and SNAP training for eligibility workers — sits with state agencies. Fixing this requires agencies and developers to collaborate on narrowly scoped chatbots that draw on state-held information through retrieval-augmented generation (RAG), rigorously tested against real questions from each state’s clients.

Background

The Supplemental Nutrition Assistance Program (SNAP) is the largest nutrition assistance program in the United States, distributing roughly $95 billion to over 37 million low-income families in fiscal year 2025 alone.

But accessing and keeping those benefits is complex. SNAP is federally-funded but state-administered, and in ten states county-administered, so income thresholds, work requirements, and recertification timelines all vary based on where you live. The same family can be eligible in one state and denied in another. The program is also currently undergoing its most significant policy changes since the 1990s, increasing confusion for clients and caseworkers alike.

That complexity falls on both caseworkers and clients. Call wait times at state offices routinely exceed an hour, and even well-staffed state agencies struggle to translate thousands of pages of statute into clear, plain-language answers for clients. And with LLMs becoming widely available to the general public (including the integration of Gemini into Google search results), the public will increasingly turn to general purpose LLMs to answer their benefits questions if they cannot reach a more trusted source. This led us to the question: are current general-purpose LLMs able to serve as a first point of contact for those seeking or receiving SNAP benefits?

This is exactly the kind of question Center for Civic Futures (CCF) exists to help answer. Through its Public Benefit Innovation Fund, CCF funds responsible, real-world experimentation to better understand where emerging technologies can improve public services, where they fall short, and what governments need to use them effectively and safely. With support from CCF, we were able to put this question to the test.

If LLMs prove reliable, the impact is tangible: fewer missed deadlines, fewer wrongful denials, and less pressure on stretched caseworkers. If not, the risk is equally tangible and quite serious: people making decisions about their benefits based on incorrect AI-generated guidance.

To study this, we built a benchmark of 459 generalized SNAP question scenarios, with expected response rubrics validated by SNAP policy experts based on applicable SNAP policy and practice considerations. We evaluated leading AI models across four conditions (the raw model itself, with web search capabilities only, with the ability to conduct multi-turn conversations only, and with both web search and multi-turn abilities) to measure not just how well AI can answer, but what kind of support it needs to answer well.

Results

Overall Model Performance

Model Performance Summary (Pass Rate %)

Pass rate across four evaluation conditions, sorted by "Both" score

	Neither	Web Search	Multi-Turn	Both
Claude Fable 5	63.4	67.7	65.2	71.7
Claude Opus 4.8	53.1	58.9	58.5	62.1
MiniMax-M3	40.2	53.6	48.2	60.7
Claude Sonnet 4.6	45.6	52.4	52.4	58.5
Gemini 3.5 Flash	50.3	54.1	54.0	58.0
GLM 5.1	36.2	50.7	47.6	57.9
DeepSeek V4	43.2	54.0	48.7	57.6
GPT 5.5	49.7	53.9	53.9	57.2
Kimi K2.6	34.4	43.6	46.3	53.4
Gemini 3.1 Pro Preview (02/26)	47.4	41.8	52.9	50.9
Grok 4.3	38.0	44.0	46.9	50.1
Claude Haiku 4.5 (Nonthinking)	22.6	36.1	33.9	49.5
Grok 4.1 Fast (Reasoning)	33.0	33.1	41.7	44.3

Claude Fable 5 leads the “both” condition at 71.7% accuracy, with Claude Opus 4.8 next at 62.1%. The frontier of SNAP performance has moved up, but it still isn’t particularly high. No model crosses 75%, meaning every model in the field falls short.

Models’ Performance

Performance by Lifecycle Phase

SNAP questions don’t all look the same. Someone asking “Am I eligible?” faces a very different challenge than someone asking “My hours got cut—do I need to report that?” Our SNAP policy experts categorized all 459 questions into seven phases of the SNAP lifecycle:

Pre-Application: Assessing eligibility questions and informational needs prior to initiating the application process.
Application/Interview: Completing the application, preparing for interviews, or following up on submitted documentation.
Denial/Termination: Contesting benefit denials or benefit terminations resulting from recertification.
Change Reporting: Reporting changes in circumstances and resolving resulting impacts on active benefits.
Recertification: Renewing benefits through the renewal application or recertification interview process.
Use of Benefits: Managing benefit delivery timing, EBT card usage, and account balance inquiries.
Overpayment/Fraud: Addressing overpayment collections, calculation disputes, fraud investigations, or fraud reports.

LLM performance varied meaningfully by phase, although in no phase did it perform with reasonable enough accuracy to be trusted.

Phase Difficulty Ranking

Pass rate by SNAP phase—Both condition

Overpayments and fraud was the most challenging phase across the board at 43.8%. It’s also the smallest subset in the sample (n=6), reflecting that these questions arise less frequently than other phases. This may partly explain the poor performance, as models have less training signal and fewer available sources to draw from on rarer topics.

Phase Profiles: Top-3 vs Bottom-3 Models

Pass rate by SNAP phase—Both condition

When comparing phase-wise performance between the top- and bottom-3 models, what stands out is how similar their shapes are. Outside of the overpayments/fraud phase—the percentage accuracies of which may have swung drastically due to the small sample size—LLMs across the board generally share similar strengths and weaknesses. This means that weaker models don’t fail on different phases, they fail on the same phases by wider margins. This consistency suggests that as models improve, the areas where AI can reliably assist beneficiaries, and where it can’t, are likely to become increasingly predictable.

Performance by Administration Structure

SNAP is state-administered in 43 jurisdictions and administered at the region, district, or county level in 10 states—including high-volume states like California, New York, and Ohio. We might expect AI to perform worse in county-administered states, where local offices have more discretion over procedures and operations, introducing variation that models struggle to capture.

Performance Distribution by Administration Type

Both condition · all models averaged

The data tells a more nuanced story. Median pass rates are similar between the two groups (55.2% for state-administered vs. 56.2% for county-administered), and their interquartile ranges nearly overlap. Administration structure alone does not predict AI performance — and though county-administered states add an extra layer of procedural variation that models have to navigate, those are generally high-volume states which models may have answers in their training data for.

Performance by Topic

When looking at a more granular breakdown of performance by topic, the dispersion grows. In other words, question topic predicts difficulty far better than lifecycle phase: phase-level pass rates span roughly 16 points (43.8%-60.1%), while topic-level pass rates span more than 32 — from periodic reports at 70.8% down to denial at 38.4%. This makes topic-level performance a more fine-grained lens for identifying where models underperform.

Topic Difficulty Ranking

Lower = harder—Both condition, all models averaged

Examining the topic-level breakdown, our SNAP experts identified three patterns about where models struggled:

1. Procedural differences break AI. Models consistently struggle when the answer depends on state, county, or local office procedures—the kind of operational detail that isn’t published in federal policy manuals or state websites. Topics like EBT card replacement and missed interview rescheduling require knowing what your specific office does, and that’s where every model falls short.

2. Recent policy changes trip models up. Questions about HR1 policy changes, like immigrant eligibility, scored lower on average. The exception was work requirement exceptions, which ranked in the upper half. We suspect this reflects the recent wealth of discussion around SNAP work requirements, which the latest models may have trained on. Immigrant eligibility changes were announced at the same time, but the apparent lack of discussion around them may explain the performance gap.

3. Complexity compounds errors. When a question requires synthesizing multiple interacting rules—household composition, income deductions, and state-specific thresholds all at once, or calculating self-employment income deductions—accuracy drops. Some of the highest-scoring topics may simply reflect that those particular questions were more straightforward, not that the topic itself is easy.

Models’ Use of Tools

What Helps More: Web Search vs. Conversation

Real caseworkers rely on two things general-purpose models don’t have by default: the ability to ask follow-up questions, and access to current policy references. We tested four conditions to isolate how much each capability—web search and multi-turn conversation—improves performance:

Average Pass Rate by Condition (all models)

Criteria-weighted—all models averaged

	No Web Search	Web Search
Single-turn	42.9% Neither	49.5% Web Search
Multi-turn	50.0% Multi-turn	56.3% Both

Multi-turn conversation—where the AI can ask follow-up questions, just like a caseworker would—is a slightly bigger lever. Across all models, switching from single-turn to multi-turn improved scores by roughly 7.6 percentage points, while adding web search alone improved scores by about 6.9 points.

This makes intuitive sense. In SNAP eligibility operations, it’s a known fact that the best way to improve SNAP quality control is a well-done interview. The more LLMs behave more like human caseworkers—the more questions they ask, and the better those questions are—the more accurate their response.

Web search helps for different reasons, addressing the problem of incomplete knowledge—in other words, filling gaps where a model’s training lacks coverage of a particular state or county’s policies (coverage) or where rules have changed since the model was last trained (recency). Both gaps are common, and pre-training alone can’t solve either problem; real-time access to current policy sources can.

The practical implication: any deployment of AI in a benefits context should prioritize conversational capability alongside access to searching the web for up-to-date policy information.

Building on these findings, an important area for future research is how well the models perform with tools when conditioned on trusted official documents.

Tool Use Matters More Than Pre-Training

A natural assumption is that the models with the most SNAP knowledge baked into their training data will perform best. The data tells a different story.

Performance Across Enhancement Conditions

Conditions are independent, not cumulative

GLM 5.1 starts near the bottom of the pack in the baseline (“neither”) condition at 36.2% — fourth-lowest in performance. But with the tools, it climbs towards the top of the pack, gaining 14.5 percentage points with web search capabilities, 11.4 percentage points with multi-turn conversations, and 21.7 percentage points with both overall. A model which began 13 points behind the top-performing “neither” model (Claude Opus 4.8) ends up only 4 percentage points behind.

With its overall gain of 21.7 percentage points over baseline, it had the second-largest absolute gain of any model in the evaluation. Claude Haiku 4.5 and Minimax M3 mirrored that same trajectory, also with uplifts of over 20 percentage points each between evaluation conditions.

In an intricate domain like SNAP, no training dataset can anticipate every beneficiary’s situation. The models that succeed are the ones that compensate for knowledge gaps by searching effectively and asking the right questions.

But not all models use their tools equally well. We measured how many web searches and conversation turns each model used per question:

Web Searches vs Accuracy

Average web searches per question against pass rate. Squares: web search only. Diamonds: both (multi-turn + web search).

Kimi K2.6 averaged 45.0 web searches per question in the web search condition but scored only 53.4% with both web search capabilities and the ability to conduct multi-turn conversations.
Opus 4.8 averaged 13.3 web searches per question and scored 62.1%. This performance indicated high accuracy but lower quantity of web searches.
Minimax-M3 was another example of efficient web searching, averaging 24.1 web searches per question and an accuracy of 60.7%.
Grok 4.3 used the fewest searches (8.5 per question) and scored the lowest (44.3%).

The relationship between search volume and performance is noisy. Heaviest searchers don’t necessarily perform better, and the ability to perform targeted searches matters far more than search volume.

Output Tokens vs Accuracy—by Condition

One dot per (model, condition); shape encodes the condition, color the model. X-axis is total output tokens for the run (log scale).

On output tokens, the pattern holds. Within any given condition, more tokens don’t reliably mean better answers. Minimax-M3 is a standout on both dimensions: it achieves ≈61% accuracy (right behind Opus 4.8 at 62.1%) in the “both” condition with ≈50% of the output tokens. When users are seeking actionable advice, concision is key.

The takeaway for AI developers is clear: optimizing how models use tools matters more than giving them more tools to use. The best models on this benchmark don’t search the most or reason the longest—they know what to look for, when to stop, and how to synthesize what they find.

Cost, Latency, and Who Gets Access

Efficient tool use doesn’t just produce better answers—it also produces cheaper, faster ones. Every unnecessary search and every excess token adds cost and latency, so the same optimization that separates top-performing models also determines whether those models can realistically reach the people who need them most.

Currently, many of the closed-source models that perform best on this benchmark are, for the most part, not freely available. Some of these sit behind premium API pricing or “Pro” subscription tiers—meaning the people who would benefit most from reliable AI-assisted SNAP guidance are often the least likely to have access to the tools that provide it.

Cost, Latency & Accuracy

Gemini 3.1 Pro Preview (02/26)

Grok 4.3

Claude Haiku 4.5 (Nonthinking)

Grok 4.1 Fast (Reasoning)

The cost data makes the gap between closed- and open-source models concrete. With accuracies within 2 percentage points of one another, Claude Opus 4.8 costs $1.32 per test while Minimax-M3 costs a fraction of that at $0.20 per test.

Latency tells a slightly different story, with Minimax-M3 taking about 200s longer on average than Claude Opus 4.8. Overall, though, the numbers are not promising-Claude Opus 4.8 takes around 10 minutes to answer a question on average. A model that takes that long isn’t practical, regardless of its accuracy.

As model capabilities continue to improve, the question isn’t just how good AI can be at navigating public benefits—it’s who gets to use it. Ensuring that frontier capabilities reach the populations they could help most through improving the efficiencies of models should be an imperative, not an afterthought.

What This Means for SNAP State Agency Leadership and Staff

The bottom line: general-purpose AI is not yet a reliable source of information about SNAP, and states are well-positioned to help address the misinformation these tools can produce. Under the most realistic conditions we tested, nothing suggests AI is to be trusted with benefits guidance. Clients almost certainly aren’t aware of these issues and are already turning to ChatGPT, Claude, and other publicly available LLMs for the same kinds of questions in our dataset. Based on our results, they are likely receiving misleading, incomplete, or outdated information. This is the reality that agencies and advocates need to contend with regardless of whether they ever deploy AI themselves.

In the near term, states can head off misinformation directly by steering residents away from LLMs and toward trusted sources:

Providing clear, plain-language information about frequently-asked topics could reassure clients and reduce the temptation to ask LLMs high-stakes questions instead of consulting trusted sources.
Publishing more authoritative policy and process documentation online which is a first step that lays the groundwork for the longer-term opportunities below.

Over the longer term, investments in customer support technology could make a real difference. What holds AI back on SNAP isn’t model capability alone; it’s that the most authoritative information about how the program works at the state and local levels isn’t publicly available in formats AI systems can reliably use. This isn’t a gap AI labs can close on their own—it will take effort from both sides, and states are uniquely placed to help. A couple of approaches could help:

Publishing an llms.txt file: a small markdown file telling LLMs which pages to cite as authoritative and where to route users to a hotline instead of guessing. Maryland, one of only two U.S. states and territories using one, explicitly tells models not to paraphrase emergency alerts and to refer users to designated channels.
Building benefits-focused chatbots grounded in agencies’ internal sources (internal process manuals, policy change notices, state option reports, local office procedures, and SNAP training for eligibility workers) through retrieval-augmented generation (RAG), narrowly scoped (e.g., to a single state or a specific phase of the benefits process) and rigorously validated against real beneficiary questions before use.

For AI Developers

Partner with state agencies to improve critical data access. General model improvements alone won’t improve models’ accuracies—the bottleneck is data, not model capability, and the stakes of being wrong are too high to let inaccuracies persist. AI developers should take the initiative: proactively engaging state agencies to obtain better training data and building publicly searchable repositories of authoritative documents that LLMs can draw on.

Invest in tool use quality, not knowledge coverage. With nuanced rules that vary by state, county, and household circumstance, no training corpus can anticipate every situation a beneficiary brings. What separates top models is how well they ask follow-up questions and search selectively for current policy. That same efficiency reduces cost and latency—democratizing access to reliable AI-assisted SNAP guidance for the populations who need it most.

Methodology

We developed an evaluation dataset starting in March 2026, with two crucial elements for each question: the “base” query, which is what someone might input into a chatbot interface, and demographic / contextual information about the user’s circumstances. For example, someone might start off with a “base” query of, “Does a one-time work bonus affect SNAP eligibility in Indiana?” and will follow-up with contextual information over the course of the conversation, such as the amount of the bonus, how often such a bonus might be given, etc.

The complete dataset is 459 questions, spanning all 50 U.S states, plus Guam and the Virgin Islands as territories. Each question has an average of seven evaluation criteria for a reliable response, and the accuracy metric for models is the number of criteria they pass out of the 2,931 total criteria in the dataset.

Dataset

We evaluated models on our test dataset (230 questions), a randomly-selected subset which is only used for published benchmarks. The dataset covers seven phases of the SNAP lifecycle and 25 topics, all validated and categorized by SNAP policy experts.

Topic × Phase Map

Which topics appear in each benefits process phase

Pre-application

household compositionincomecitizenship & alien statusdisabilitywork requirementsincome deductionsstudent eligibilityresources

Application / Interview

interviewincomehousehold compositionverificationincome deductionsresourcesdisabilityresidencystudent eligibility

Determination – Use of Benefits

ebtwork requirementsexpungementincomeincome deductionsresourcesissuancereplacement benefitsstudent eligibilityverification

Change Reporting

incomework requirementshousehold compositionchange reportingresourcesinterstate transferoverpaymentterminationverificationincome deductionsresidencyperiodic reportquality control review

Recertification

work requirementsstudent eligibilityinterviewissuanceincome deductionsterminationcertification periodsverificationresidencydenial

Overpayments / Fraud

fraudclaims

Denial / Termination

incomestudent eligibilitywork requirementsresourcesdenialverificationincome deductionsinterstate transferdisability

The benchmark questions are SNAP scenarios designed to reflect recurring, real-world benefit-navigation issues which SNAP clients face and are validated by SNAP policy experts. This means the dataset is weighted toward common pain points rather than uniformly distributed across all possible SNAP topics.

Evaluation Pipeline

Three-model system for SNAP question evaluation

Three model roles. The benchmark uses three distinct AI models in different roles to simulate realistic multi-turn conversations:

Target model—the model being evaluated. It receives the SNAP question and must provide a helpful response.
Auditor model—plays the role of the user with the query. It answers the target model’s follow-up questions with accurate context. GPT-5.5 was selected as auditor after achieving 99.89% fidelity to the given persona instructions.
Judge model—grades the target model’s response against the expert rubric. Claude Opus 4.7 was selected after all four candidate judges (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Grok 4.3) were assessed for their alignment with SNAP policy experts’ grading.

Judge Confusion Matrices vs SNAP Expert Ratings

Each judge's 124 graded criteria broken into agreements and disagreements with the expert

Outcome composition

Accuracy

Claude Opus 4.7

80.6%

GPT 5.5

79.8%

Gemini 3.1 Pro Preview (02/26)

79.0%

Grok 4.3

79.0%

True Pass— judge & expert both passedTrue Fail— judge & expert both failedFalse Fail— judge failed, expert passedFalse Pass— judge passed, expert failed

Four evaluation conditions. Models were tested in four independent conditions to simulate the actual web interfaces users interact with:

Baseline—a single prompt-response exchange with no tools.
Web Search—baseline with access to web search.
Multi-Turn—a multi-turn conversation where the auditor model plays the original poster.
Both—multi-turn with web search enabled.

This design lets us isolate the marginal contribution of each capability.

Selection pipeline. The judge and auditor models both affect the target models’ performance and scores, so we used a multi-stage process to systematically select each component:

Run a default target model (GPT 5.5) in the multi-turn condition with a default auditor model (GPT 5.5)
Run alignment tests between judge model candidates and human experts’ evaluations on 20 target model outputs to determine most-aligned judge model (Opus 4.7)
Run a default target model (Gemini 3.5 Flash) in the both condition with different auditor models to determine most-aligned auditor model (GPT 5.5)
Run all target models across all four conditions with top-performing auditor (GPT 5.5) and judge models (Opus 4.7)

We’ll continue using Opus 4.7 as our judge model and GPT-5.5 as our auditor model for this benchmark. A sample conversation with all the target models we assessed, the criteria they were graded on, and the instructions for the auditor model is displayed below.

Sample Auditor-Target Transcript

The auditor model used the following information for the query to generate its opening question and subsequent responses in the conversation:

Input questionHow long do remaining SNAP benefits stay on my EBT card after I'm no longer eligible in Missouri?

Input context

•Recently got a good paying job and no longer qualifies for SNAP
•Located in Missouri
•Has remaining balance on EBT card
•Heard conflicting information: 9 months, activity-dependent, one year from last deposit, never expires
•Unable to contact agency due to work schedule conflicts with their hours
•Has processed case discontinuation to avoid overpayment issues

How long do remaining SNAP benefits stay on my EBT card after I'm no longer eligible in Missouri?

According to USDA regulations that apply nationwide, including in Missouri, remaining SNAP benefits on your EBT card stay available for up to 9 months after you are no longer eligible.

Here's how it works:

Key Points:

SNAP benefits that are not used in the same month they are issued automatically roll over to the next month, even if you are no longer receiving new benefits
You can continue using your remaining balance for up to 9 months from the date each benefit was deposited
After 9 months of inactivity on a benefit deposit, that portion of your balance is permanently lost
Benefits are used on a "first-in-first-out" basis, meaning the oldest benefits are used first

Important Notes:

You can still use any SNAP benefits remaining on your EBT card after your case closes due to ineligibility
Your EBT card itself does not expire, even if your SNAP eligibility ends, so you can keep it and use it if you reapply for benefits in the future
The 9-month window only applies to unused benefits - if you spend your benefits but carry over a balance (continue making purchases), those benefits won't be lost after 9 months

For Missouri-Specific Information: You can contact the Missouri Department of Health and Senior Services, Section for Family Support, or call your local Family Support Division office for any state-specific details.

Sources:

Yahoo Finance - "Do SNAP EBT Cards Expire?" (Finance.yahoo.com)
Propel - "How Long Do SNAP Benefits Last?" (Propel.app)
NCOA - "How Long Do SNAP Benefits Last?"
Massachusetts Legal Help - Information about EBT cards and SNAP case closure
U.S. Department of Agriculture (USDA) SNAP regulations

Thanks, that helps. A couple things I’m still fuzzy on:

When you say 9 months, is that tied to no activity on the EBT account, or does each month’s deposit have its own 9‑month clock? I’ve heard both versions.
If I make a small grocery purchase every few months after my case is closed, does that keep the whole remaining balance from being removed? Or would older months still drop off regardless?
Does a balance check count as “activity,” or does it have to be an actual purchase/refund?
Any Missouri‑specific twist I should know about? Some folks told me it’s 12 months here.
And just to be sure, spending down the remaining balance after my case closed won’t cause an overpayment issue, right?

Also, do you know if Missouri has a 24/7 EBT customer service line I can call after work to confirm? My schedule makes it tough to reach FSD during business hours.

Great questions—you're touching on some real nuances here. Based on federal regulations and state-specific information for Missouri, here are the answers:

1. Is it 9 months per deposit or per entire balance?

Each monthly deposit gets its own 9-month clock. According to federal regulations (7 CFR 274.2(i)(1) and the 2018 Farm Bill), states can expunge "individual benefit allotments, or any remaining portion thereof, nine months after the allotment was issued." This means each month's benefits have a separate 9-month countdown. So if you had benefits in January, February, and March before your case closed, those January benefits expire 9 months after January's deposit, February's after February, etc.

However, there's a second option: States can also wait until the entire account has been inactive for 9 months and then expunge everything at once. Missouri's specific expungement method isn't detailed in my sources, so you should ask the FSD when you call which approach they use.

2. If you make small purchases every few months, does that keep everything?

Yes, but only partially. A purchase DOES count as account activity and resets the inactivity clock for the account as a whole. However:

If you use older benefits first (first-in-first-out), those expire on their individual 9-month anniversary from when they were deposited, regardless of whether you're making purchases
Making purchases keeps your account "active" and prevents the "nine months of complete inactivity" option from triggering
Bottom line: Regular purchases help but don't eliminate the risk entirely if your state uses the per-deposit method

3. Does a balance check count as activity?

No, a balance check does NOT count as activity. Only actual purchases or refunds/adjustments to the account count. Calling the EBT customer service line to check your balance does not reset any inactivity clock.

4. Missouri-specific: Is it 12 months instead of 9?

No, Missouri follows the federal 9-month rule, not 12 months. The sources you heard about may have been confused with Missouri's 12-month recertification period (which is different—that's about when you have to reapply for eligibility). The federal rule of 9 months (274 days) is what governs SNAP benefit expiration nationwide, including Missouri.

5. Will spending down remaining balance cause an overpayment issue?

No, spending your remaining balance legitimately will NOT cause an overpayment problem. An overpayment only occurs when you received MORE benefits than you were eligible for (typically due to unreported income changes or administrative error). If your case was properly closed and you spend down the balance that was legitimately given to you, that's normal use of your benefits. You're entitled to use those funds; they're yours.

6. Missouri EBT Customer Service 24/7 Line:

YES! Missouri has a 24/7 EBT customer service line:

Phone: 1-800-997-7777
Available: 24 hours, 7 days a week
This number is toll-free and available for card-related issues, PIN changes, reporting lost/stolen cards, and general questions

Summary Recommendation:

Call 1-800-997-7777 (available after work!) and ask these specific questions:

Which expungement method does Missouri use—per-deposit 9-month clock, or account-wide inactivity?
Confirm how long your remaining balance is guaranteed to stay on the card
Get written confirmation about the overpayment issue so you have it in writing

This will give you definitive answers specific to your situation.

Sources:

Federal Register – SNAP Benefit Expungement (7 CFR 274.2(i)(1))
Federal Register – 2018 Farm Bill SNAP Expungement Changes (2020)
USDA Food and Nutrition Service - Expungement Options
GetHuman/Benny App - Missouri EBT Customer Service (800-997-7777)
Low Income Relief - SNAP Benefit Expiration Explained
Yahoo Finance & Nasdaq - SNAP EBT Expiration Documentation

I learned that my remaining SNAP balance can still be spent after my case closed, but benefits can be expunged after 9 months under federal rules. States can either age out each month’s deposit 9 months from the date it was issued (FIFO use), or expunge the whole account after 9 months with no transactions—so I should confirm which method Missouri uses. Making a purchase or getting a refund counts as activity; checking my balance does not. Using what’s left won’t create an overpayment, and my EBT card itself doesn’t expire. Missouri has a 24/7 EBT customer service line at 1-800-997-7777 that I can call after work to verify Missouri’s specific expungement method.

web_searchsubmit_final_resultparse_html_pageretrieve_information

Rubric Checks

0/5 passed(0.0%)

Future Iterations

This benchmark represents a first pass, and we’re aware of its gaps. The dataset underrepresents certain critical topics — e.g., citizenship and immigration-related questions in particular. State coverage is also uneven: California and New York account for a disproportionate share of questions, while many states have only a handful. We’d also aim to classify questions according to “stakes” for a more fine-grained analysis — a user’s question about their local office’s hours being left unanswered has very different implications from a user being misled about their eligibility.

Most urgently, the One Big Beautiful Bill Act (OBBBA/H.R. 1), signed into law in July 2025, introduced sweeping changes to SNAP—expanded work requirements, new non-citizen eligibility restrictions, state cost-sharing tied to error rates, and administrative funding shifts—that are already generating new categories of beneficiary confusion. Many questions reflecting these policy changes (especially regarding work requirements) were reflected in our dataset, and we aim to expand that in future iterations of this benchmark. This represents exactly the kind of rapidly evolving, high-stakes territory where AI guidance needs to be tested most rigorously.

Citation

Citation (BibTeX)

@misc{snap2026,
title        = {Public Benefits Bench: Can AI Help People Navigate SNAP Benefits?},
author       = {Kotcherlakota, Meghana and Almatov, Omar and Krishnan, Rayan},
year         = {2026},
month        = june,
howpublished = {Vals AI},
url          = {http://vals.ai/benchmarks/public-benefits-bench},
}