Statement from Les Perelman, Ph.D., to MA BESE on “Automated Test Scoring”

Dear Ms. Sullivan:

Below is my statement for the Board.  Thank you for the opportunity.
Regarding Item 5, “Preliminary Information on Automated Test Scoring”:
The language of this statement saddens me because it reveals a failure to examine the claims made in what reads like a verbatim statement from Pearson or other purveyors of essay scoring software.
  1. As The Boston Globe reports today, it is hardly “a number of states” that have adopted automated scoring systems but only two: Utah and, covertly, Ohio.
  2. To my knowledge, the statement that “automated scoring engines have become increasingly reliable and refined, particularly over the last several years” has no basis in fact.
  3. Machines do not understand meaning; they just count.  All automated essay scoring engines operate by counting “proxies” that substitute for higher-level traits.  For example, the frequency of infrequently used words is often used as a proxy for verbal felicity, and essay length as a proxy for overall development.  See my book chapter “Construct validity, length, score, and time in holistically graded writing assessments: The case against Automated Essay Scoring (AES).” <>
  4. Machines are extremely poor at identifying grammatical errors in English.  As I note in my 2016 article “Grammar checkers do not work” <>, when analyzing 5,000 words of an essay by Noam Chomsky originally published in The New York Review of Books, the grammar-checker modules of ETS’s e-rater falsely identified 62 grammatical or usage errors, including 15 article errors and 5 preposition errors.  The performance of another grammar checker was similarly flawed.
  5. AES engines also appear to privilege some linguistic and/or ethnic groups while unfairly penalizing others.  In two studies by researchers at the Educational Testing Service, essays written by native Mandarin speakers were scored significantly higher by ETS’s engine than by human readers, while essays by African-Americans were scored significantly lower by machines than by humans: Bridgeman, B., Trapani, C., & Attali, Y. (2012). “Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country.” Applied Measurement in Education, 25(1), 27–40; and Ramineni, C., Trapani, C. S., Williamson, D. M., Davey, T., & Bridgeman, B. (2012). “Evaluation of the e-rater® scoring engine for the GRE® issue and argument prompts.” ETS RR-12-02. <>
  6. Last year the Federal Education Minister of Australia proposed that the essay portion of NAPLAN, the Australian equivalent of MCAS, be scored by AES engines.
    1. I was commissioned by the New South Wales Teachers Federation to write a report on the proposal, “Automated Essay Scoring and NAPLAN: A summary report.” <>  All the arguments made here are elaborated in much greater detail in that document.
    2. The response in Australia was highly supportive of my position.  The editorial in the Sydney Morning Herald, “NAPLAN robo-marking plan does not compute” <>, is a particularly eloquent and perceptive summary of the defects of machine scoring.
    3. In December 2017, the Australian Education Council, a body consisting of the Education Ministers of all the Australian states and territories, unanimously overruled the Federal Education Minister and prohibited the use of AES machines in scoring the NAPLAN, rejecting even a proposal to have the essays read by both machines and human readers. <>
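To make point 3 concrete, the following is a deliberately crude toy sketch of proxy-based scoring, with hypothetical features and arbitrary weights of my own invention (it does not reproduce any vendor's actual engine): the "scorer" never interprets meaning; it only counts surface features such as length and the share of infrequently used words.

```python
# Toy illustration of proxy counting (hypothetical; not any vendor's algorithm).
# The scorer combines surface counts with fixed weights and never reads meaning.

COMMON_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def proxy_score(essay: str) -> float:
    words = essay.lower().split()
    length = len(words)                              # proxy for "development"
    rare = sum(1 for w in words if w not in COMMON_WORDS)
    rare_ratio = rare / length if length else 0.0    # proxy for "verbal felicity"
    # Illustrative weights; real engines fit weights statistically.
    return 0.01 * length + 2.0 * rare_ratio

# A longer essay padded with uncommon words outscores a short one,
# regardless of whether either says anything meaningful.
short = "The cat sat."
padded = "The cat sat. " * 10 + "Ostensibly perspicacious verbiage proliferates."
assert proxy_score(padded) > proxy_score(short)
```

The point of the sketch is that nothing in such a pipeline examines argument, evidence, or truth; a student who learns the proxies can inflate the score without improving the writing.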
I hope that this information is sufficient to convince the Board that Automated Essay Scoring engines are inappropriate for use in scoring MCAS essays.  If there is to be a continued exploration into the use of AES for the MCAS, I request that I be allowed to participate in the inquiry to add a skeptical but informed perspective.
With warm regards,
Les Perelman, Ph.D.