As other commenters (and the testers themselves) have noted, this was a unique test using an open-source model that is already outdated. General public comments also span a wide range of author literacy levels, reasoning complexity, and subject nuance. Unlike this real-life public-input testing, most models are evaluated on coherent academic-level or published material.
College admissions, by contrast, rest on standardized knowledge tests whose material LLMs have already ingested. In this test, summaries of unstructured language were evaluated by subject matter experts against undisclosed quality criteria, which, like most human domain-specific knowledge and experience, are probably not codified.
The latest models I've been testing are good at summarizing, but humans always need to be involved in verification, which means twice the work. So, for now, using LLMs to summarize unstructured human input is more labour-intensive than having humans do it themselves.
It's like the dream of AI code generation -- you still need a human to understand and solve complex problems, design and implement solutions, and make ethical decisions.
So, for now, LLMs are great human companions but hardly replacements. And knowing that is supposed to make us feel better... for now 😉