Rethinking AI Evaluation: Why Traditional Benchmarks Fall Short

2026-03-31 · MIT Tech Review AI · Original

For many years, the effectiveness of artificial intelligence has been assessed by whether it can surpass human capabilities. Whether in chess, complex mathematics, programming, or essay writing, AI systems have been measured against individual human performance on specific tasks. This perspective is appealing because it reduces evaluation to a simple head-to-head comparison. However, this method is increasingly seen as inadequate for capturing the true potential of AI. New benchmarks are needed that reflect the multifaceted nature of AI technologies and their applications in real-world situations. A more comprehensive approach would help gauge AI's impact more accurately and foster its development in ways that benefit society.