Code generation benchmarks such as HumanEval are widely adopted to evaluate LLMs’ capabilities. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, ...