Introducing: Tested-22k-Python-Alpaca
The Tested-22k-Python-Alpaca dataset is a curated collection of working Python code examples designed to improve the code generation capabilities of large language models by demonstrating correct, up-to-date code usage. It is intended for fine-tuning models that have already been pretrained on relevant documentation and other resources.
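If the dataset is exposed through the Hugging Face datasets library (an assumption on my part; the identifier below is hypothetical, so substitute the real location from the link at the end of this post), a quick look at the records might be as simple as:

```python
from datasets import load_dataset

# Hypothetical identifier -- replace with the dataset's actual location.
ds = load_dataset("tmoenicke/Tested-22k-python-alpaca", split="train")

print(ds)     # row count and column names
print(ds[0])  # one record, presumably Alpaca-style given the dataset name
```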
Motivation:
The dataset was created to address the limitations of foundational language models (like those underlying Open Chat 3.5) in generating accurate and up-to-date Python code. These models often lack access to current Python and API documentation, leading to code with outdated calls and methods.
Building a Strong Python Code Model
The authors recommend a three-step approach:
1. Pretraining: Pretrain a model (like Mistral 7B) on up-to-date Python and API documentation. This is crucial for ensuring the model uses current API calls and functions.
2. Incorporating textbooks: Include programming textbooks in the training data.
3. Fine-tuning: Fine-tune the model using the Tested-22k-Python-Alpaca dataset via Supervised Fine-Tuning (SFT); a minimal sketch follows this list.
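As a rough illustration of the fine-tuning step, here is a minimal SFT sketch using Hugging Face's trl library. This is not the authors' exact recipe: the dataset identifier is hypothetical, the Alpaca-style field names (instruction, input, output) are assumed from the dataset name, and SFTTrainer's API has shifted between trl releases, so check the version you have installed.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical identifier -- replace with the dataset's actual location.
dataset = load_dataset("tmoenicke/Tested-22k-python-alpaca", split="train")

# Alpaca-style records typically carry "instruction", "input", and "output"
# fields (an assumption here); flatten each record into one training string.
def to_text(example):
    prompt = example["instruction"]
    if example.get("input"):
        prompt += "\n" + example["input"]
    return {"text": prompt + "\n" + example["output"]}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",  # the pretrained base from step 1
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="tested-22k-python-alpaca-sft",
        dataset_text_field="text",  # column produced by to_text above
    ),
)
trainer.train()
```

In practice you would also apply the model's chat or instruction template when flattening records, rather than the bare newline-joined prompt used here for brevity.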
You can find the dataset here: opendocstring.com/datasets/tmoenicke/Tested-22k-python-alpaca
Thomas