Dallas ISD asked Texas to rescore thousands of STAAR tests; about one-third went up
Published 5:00 am Thursday, July 31, 2025
Thomas Jefferson High School in Dallas on Tuesday, July 29, 2025. (Juan Figueroa/Dallas Morning News)
Frustration over the way Texas uses computers to grade students’ essays on the STAAR test is building after Dallas ISD officials again saw more than one-third of the exams they submitted for review come back with higher scores.
When districts ask the Texas Education Agency to rescore an exam, the process is completed by a human rather than a computer. The new results don’t arrive until weeks after school ends, leaving some students in limbo, thinking they failed the state standardized test and might have to take a remedial class or miss out on athletics.
Of the 5,420 STAAR tests the Dallas Independent School District sent for rescoring, 35% showed improvement, according to data provided by the district to The Dallas Morning News. Last year, district officials requested the review of more than 4,600 answers and roughly 43% came back with additional points.
“It has an impact on trust in the system,” Dallas ISD Superintendent Stephanie Elizalde said. “One time may be an outlier. I consider two times a pattern.”
The state education agency began using computers to score the majority of students’ written responses in December 2023, prompting some district officials to question their accuracy. A group of districts from across the state filed a lawsuit that said scoring by artificial intelligence compromised the validity of the State of Texas Assessments of Academic Readiness.
Texas Education Agency officials said they remain confident in the way they grade STAAR, a process they say has multiple layers of quality control. They emphasize that exams awarded additional points during the rescore process represent a sliver of the overall tests administered.
Texas students took more than 3 million reading tests last school year. So far, roughly 21,600 of those have been submitted for rescoring and nearly 6,200 — 28% — saw changes. The agency is still reviewing requests.
The agency does not allow students’ scores to be lowered during a review. The number of points can either go up or remain unchanged.
Each test carries weight. STAAR results are used to grade campuses in the state’s A-F academic accountability system, and in some districts teacher raises are tied to how well students perform. For the children themselves, the effects are practical.
At Dallas’ Thomas Jefferson High School, Principal Ben Jones bought medals to celebrate the students who met their STAAR goals. He now knows there are teenagers who should’ve been rewarded but weren’t.
DISD submitted 59 Thomas Jefferson students’ tests for rescoring; 22 came back higher.
For seven students, the improvements boosted them from failing their English STAAR exam to passing it.
Some of those students will get out of taking remedial English next year. One will be able to stay in an athletics period with the basketball team. Another will be able to fit an additional AP class into her schedule. Yet another was put on a simpler track toward graduation.
“These are real, tangible effects for kids,” Jones said.
Computer scoring controversy
While the Texas Education Agency quietly rolled out automated scoring in 2023, using technology to analyze essays is not new. Other states have used this model for years, though not without some pushback.
State education agency officials describe their automated scoring engines as tools with narrow abilities that can improve scoring efficiency.
The engines are programmed to emulate how humans would score an essay. The computer determines how to assess written answers after analyzing thousands of students’ responses that were previously graded by people.
Roughly three-quarters of written responses are scored by computers.
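TEA has not published the internals of its scoring engines, but the general approach described above, training a model on responses that humans have already scored so that it can emulate those scores, is a standard supervised-learning setup. The following is a minimal sketch of that idea in Python; the toy data, the TF-IDF features and the ridge-regression model are illustrative assumptions, not the agency’s actual method.

```python
# Illustrative sketch only: learn from essays already scored by humans,
# then predict scores for new essays. The training data, features and
# model below are hypothetical; TEA's actual engine is not public.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Hypothetical training set: student responses with human-assigned scores.
human_scored_responses = [
    "The author uses vivid imagery to support the central claim...",
    "I think the story was good because it was good.",
    "Evidence from paragraph two shows the character's motivation...",
    "dont know",
]
human_scores = [8, 3, 7, 0]  # e.g., on a 10-point scale

# Convert each essay into numeric features (word and phrase frequencies).
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(human_scored_responses)

# Fit a model that maps those features to the human-assigned scores.
model = Ridge()
model.fit(X_train, human_scores)

# Score a new, unseen response by emulating the human graders.
new_response = ["The imagery in paragraph two supports the claim."]
predicted = model.predict(vectorizer.transform(new_response))
print(f"Predicted score: {predicted[0]:.1f}")
```

A real engine would sit inside the multiple layers of quality control agency officials describe; this sketch shows only the core step of emulating human scorers.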
“If you call it AI scoring, you have this vision that people are casually throwing student responses into ChatGPT and asking it what score it would give,” said Andrew Ho, a professor at the Harvard Graduate School of Education. “That’s absolutely not the case.”
Ho sits on a committee that advises Texas on its testing program, and he provided expert testimony for the state during the recent legal fight over STAAR.
State officials turned to automated scoring amid a broader STAAR redesign. The test structure now includes essays at every grade level, dramatically increasing the scoring load.
Agency officials estimated this new test format would require four to five times the number of human scorers, costing an extra $15 million-$20 million per year if they were to use people exclusively.
Some district leaders remained unconvinced. They zeroed in on the use of computer scoring in a 2024 lawsuit against Education Commissioner Mike Morath.
The court ruled against them earlier this month, saying the state brought forward experts who “supported the validity and reliability of automated scoring in great detail.”
Part of the district leaders’ concerns was related to declines in student achievement, including a large number of “zeroes” on essay questions.
Chief Justice Scott Brister’s July 3 opinion quoted Ho stating “all the evidence that we reviewed points to the fact that it’s not the scoring system” that is at fault but “a real decline in achievement that we should be concerned about and not try to sweep under the rug.”
Brister was not persuaded by district officials’ arguments related to the high rate of tests that came back with more points after being rescored by humans.
“Constructed response questions do not lend themselves to precise scores; without knowing the questions and the range of answers, one can conclude from rescoring whether the automated system and a human grader agreed, but not necessarily which one was ‘right,’” his opinion reads.
Lawmakers are back in Austin for a special session during which they will consider if and how to replace STAAR. One bill filed in the House would prohibit the use of AI to score written answers on the standardized test.
To Elizalde, the answer isn’t necessarily ending the use of automated scoring engines. She said she wants to see tweaks that give district officials and families confidence the system is working as intended.
Right now, she said, too many data points give her pause.
For example, DISD submitted roughly 450 third-grade tests for rescoring, and 85% of them came back higher. In nearly 40% of cases, the students’ scores went up by 4 points or more on a 10-point scale.
Statewide, about 70% of the third-grade tests submitted for rescoring saw increases.
“What happened in third grade?” Elizalde said.
Campus effect
The state education agency allows parents or district leaders to dispute STAAR scores — at a cost.
The education agency charges $50 per appeal, though districts have to pay only if the score remains the same.
Elizalde decided the financial risk was worth it, inspiring some other district leaders to follow suit.
In Alpine ISD, a West Texas district of about 900 students, roughly 37% of the tests submitted for rescoring came back with additional points, according to Superintendent Michelle Rinehart.
Rinehart sent the agency a sample of 79 tests spanning third grade through high school.
“It definitely undermines our trust in the initial scores that we received back,” she said. “It makes me wonder how many of our students and families got incorrect data in the past, handed to them by the state as if it were definitive and accurate.”
Rinehart emailed TEA about her concerns earlier this summer.
In a written response to Rinehart, which she provided to The News, an agency official said there’s variability in how written answers are scored, adding that districts typically submit appeals only for tests with a reasonable likelihood of being adjusted.
“Rescoring requests have occurred across both human- and machine-scored responses,” the agency official wrote. “These discrepancies are not a function of automated scoring itself, but rather a reflection of the scoring scale and the inherent variability in evaluating writing.”
Rinehart said she understands that point but remains frustrated. STAAR scores fuel the state’s academic accountability system, which assigns A-F grades to every campus and district. The ratings carry major implications for schools.
STAAR scores are “plugged into an accountability system that does not account for variability or error-margin at all,” Rinehart said. “They’re treated as definitive, absolute, fully representative numbers.”
Principal Jones, at Dallas’ Thomas Jefferson High, acknowledges human graders don’t get it right 100% of the time either — and said he’s grateful there’s a path to appeal.
It’s unlikely parents at his school, where the vast majority of teenagers come from low-income families, would put $50 on the line to ask for a second opinion, he said.
“I appreciate that the district takes that risk for families,” Jones said. “It is changing the outlook for individual kids.”
Staffers at Thomas Jefferson recently broke the news to students whose scores went up. One responded, “I knew I should have passed. I worked so hard.”