News

Code generation benchmarks such as HumanEval are widely adopted to evaluate LLMs’ capabilities. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, ...