Introducing: Tested-22k-Python-Alpaca

The Tested-22k-Python-Alpaca dataset is a curated collection of tested, working Python code examples designed to improve the code generation capabilities of large language models by demonstrating correct and up-to-date code usage. It is intended for fine-tuning models that have already been pretrained on relevant documentation and other resources.
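
As the name suggests, the records presumably follow the standard Alpaca instruction format. A minimal sketch of what an entry might look like (the field names and content below are illustrative assumptions, not actual dataset entries):

```python
# Illustrative sketch of an Alpaca-style record (assumed format;
# not an actual entry from Tested-22k-Python-Alpaca).
example = {
    "instruction": "Write a function that returns the n-th Fibonacci number.",
    "input": "",  # optional extra context; often empty in Alpaca-style data
    "output": (
        "def fibonacci(n: int) -> int:\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a"
    ),
}
```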

Motivation:

The dataset was created to address the limitations of foundational language models (like those underlying Open Chat 3.5) in generating accurate and up-to-date Python code. These models often lack access to current Python and API documentation, leading to code with outdated calls and methods.

Building a Strong Python Code Model

The authors recommend a three-step approach:

1. Pretraining: Pretrain a model (such as Mistral 7B) on up-to-date Python and API documentation. This is crucial for ensuring the model uses current API calls and functions.
2. Incorporating Textbooks: Include programming textbooks in the training data.
3. Fine-tuning: Fine-tune the model on the Tested-22k-Python-Alpaca dataset via Supervised Fine-Tuning (SFT), as sketched below.
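
For step 3, a minimal SFT sketch using the Hugging Face TRL library might look like the following. The dataset path, field names, model checkpoint, and hyperparameters are assumptions for illustration; adapt them to wherever the dataset is actually hosted and to your hardware.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed Hub-style path; adjust to the dataset's actual location.
dataset = load_dataset("tmoenicke/Tested-22k-python-alpaca", split="train")

def to_text(example):
    # Assumes Alpaca-style instruction/input/output fields (see sketch above).
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example.get("input"):
        prompt += f"### Input:\n{example['input']}\n\n"
    prompt += f"### Response:\n{example['output']}"
    return {"text": prompt}

# SFTTrainer picks up the "text" column by default.
dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",  # base model suggested in step 1
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="tested-22k-python-alpaca-sft",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
    ),
)
trainer.train()
```

Flattening each record into a single "text" string keeps the mapping framework-agnostic; the same function works with other SFT toolkits.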

You can find the dataset here: opendocstring.com/datasets/tmoenicke/Tested-22k-python-alpaca

Thomas