We introduce PECC, an extensive benchmark centered on code generation from narrative-embedded problem descriptions. Unlike prior benchmarks that evaluate code generation from explicit instructions, our dataset requires models to comprehend the narrative, extract the requirements, and produce the code needed to solve the problem. This demands not only syntactically correct programs but also the reading-comprehension skills to derive the intended solution.

So far, we have evaluated 10 competitive large language models (proprietary and open source). Depending on a model's instruction or chat fine-tuning, a full evaluation on the AoC subset was not always possible. We report the accuracy averaged over all four subsets.
| Model | Num Params | PECC (Pass@3) |
|---|---|---|
| Claude Haiku | - | 27.67% |
| GPT-3.5-Turbo | - | 23.75% |
| Codechat-Bison | - | 11.39% |
| Chat-Bison | - | 8.48% |
| Mixtral-8x7B-Instruct-v0.1 | 56B | 8.35% |
| Phi-3-mini-128k-instruct | 3.8B | 7.18% |
| WizardLM-2-7b | 7B | 3.72% |
| Llama-3-8B-Instruct | 8B | 3.10% |
| WizardCoder-Python-34B-V1.0 | 34B | 12.90%* |
| Mistral-7B-Instruct-v0.1 | 7B | 1.62%* |
*Excludes evaluation on Part 2 of the AoC subsets.
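For orientation, the sketch below shows one way the averaged score could be computed: Pass@3 is treated as "solved within three attempts" and macro-averaged over the four subsets. The subset names and data layout are assumptions for illustration, not the repository's actual API.

```python
# Hypothetical sketch of the reported metric: Pass@3 per subset,
# macro-averaged over the four PECC subsets. Subset names and the
# per-problem result layout are assumed for illustration only.

SUBSETS = ["aoc", "aoc-story", "euler", "euler-story"]  # assumed names

def pass_at_3(problems: list[list[bool]]) -> float:
    """Fraction of problems solved in any of the first three attempts."""
    solved = sum(1 for attempts in problems if any(attempts[:3]))
    return solved / len(problems)

def averaged_pecc_score(results: dict[str, list[list[bool]]]) -> float:
    """Macro-average of Pass@3 across the subsets."""
    return sum(pass_at_3(results[name]) for name in SUBSETS) / len(SUBSETS)
```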
We compare PECC scores with commonly used LLM benchmarks. The table below contrasts each model's performance on the PECC dataset with its average score across ARC, MMLU, and HellaSwag.
Recent advancements in large language models (LLMs) have showcased their exceptional abilities across various tasks, such as code generation, problem-solving, and reasoning. Existing benchmarks evaluate such tasks in isolation, yet the extent to which LLMs can understand prose-style tasks, identify the underlying problems, and then generate appropriate code solutions is still unexplored. Addressing this gap, we introduce PECC, a novel benchmark derived from Advent of Code (AoC) challenges and Project Euler, comprising 2396 problems. Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate executable code. A key feature of our dataset is the complexity added by natural language prompting in chat-based evaluations, mirroring real-world instruction ambiguities. Results show varying model performance between narrative and neutral problems, with particular difficulty on the math-based Euler subset: GPT-3.5-Turbo passes 50% of the AoC challenges but only 8% of the Euler problems. By probing the limits of LLMs' capabilities, our benchmark provides a framework to monitor and assess the subsequent progress of LLMs as universal problem solvers.
Clone the repository from GitHub and set up the environment:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Follow the instructions on how to download the original AoC subset here.
Run the following command to evaluate a model on PECC (here: GPT-3.5-Turbo on the Euler subset):
python main.py --subset euler \
--output-file gpt3.5-euler-current_results.csv \
--venv-path venv \
--model "gpt-3.5-turbo-16k"
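The run writes per-problem results to the CSV passed via `--output-file`. As a minimal post-processing sketch (the `passed` column name is an assumption, not the documented schema), the overall pass rate could be summarized like this:

```python
# Hypothetical sketch: summarize an evaluation output CSV.
# The column name "passed" is an assumption; check the actual file
# produced by main.py for the real schema.
import csv

def pass_rate(path: str) -> float:
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    passed = sum(1 for row in rows
                 if str(row.get("passed", "")).lower() in {"true", "1"})
    return passed / len(rows) if rows else 0.0

if __name__ == "__main__":
    print(f"Pass rate: {pass_rate('gpt3.5-euler-current_results.csv'):.2%}")
```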
For more details, visit the documentation.
When using the dataset or library, please cite the following paper:
@misc{haller2024pecc,
title={PECC: Problem Extraction and Coding Challenges},
author={Patrick Haller and Jonas Golde and Alan Akbik},
year={2024},
eprint={2404.18766},
archivePrefix={arXiv},
primaryClass={cs.AI}
}