Automating DataScience Using LLMs (pypi.org)
Paper: https://arxiv.org/abs/2305.03403
Demo: https://colab.research.google.com/drive/1mCA8xOAJZ4MaB_alZvy...
CAAFE uses GPT-4 and a textual description of the dataset to iteratively generate Python code that creates new features and explanations for their utility. It thereby harnesses the creative power of LLMs in combination with a systematic verification process that interacts with the LLM in an iterative fashion.
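To make the generate-and-verify idea concrete, here is a minimal sketch of such a loop. It is not CAAFE's actual code: the function names (`feature_loop`, `make_stub_proposer`, `best_abs_corr`) are hypothetical, the LLM is replaced by a stub that replays a fixed list of candidates, and the score is a toy stand-in (best single-feature correlation with the target) for a real validation metric. The acceptance rule, keeping a feature only if the score improves, is an assumption made for illustration.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, float]
Candidate = Tuple[str, Callable[[Row], float]]  # (feature name, function computing it)


def abs_corr(xs: List[float], ys: List[float]) -> float:
    """Absolute Pearson correlation; 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / math.sqrt(vx * vy)


def best_abs_corr(rows: List[Row], target: List[float]) -> float:
    """Toy score (assumption): the best single-feature correlation with the target."""
    return max(abs_corr([r[k] for r in rows], target) for k in rows[0])


def make_stub_proposer(candidates: List[Candidate]):
    """Stand-in for the LLM: replays fixed candidates and ignores the feedback."""
    it = iter(candidates)

    def propose(feedback: List[str]) -> Optional[Candidate]:
        return next(it, None)

    return propose


def feature_loop(rows, target, propose, score, n_iterations=5):
    """Ask for a feature, verify it, keep it only if the score improves."""
    best = score(rows, target)
    kept, feedback = [], []
    for _ in range(n_iterations):
        candidate = propose(feedback)  # a real proposer would put feedback in the prompt
        if candidate is None:
            break
        name, fn = candidate
        trial = [dict(r, **{name: fn(r)}) for r in rows]
        new_score = score(trial, target)
        if new_score > best:
            rows, best = trial, new_score
            kept.append(name)
            feedback.append(f"{name}: kept (score {new_score:.3f})")
        else:
            feedback.append(f"{name}: rejected (score {new_score:.3f})")
    return rows, kept, best, feedback


# Toy data: the target is the product of the two raw columns.
rows = [{"a": float(a), "b": float(b)} for a in (1, 2, 3) for b in (1, 2, 3)]
target = [r["a"] * r["b"] for r in rows]
proposer = make_stub_proposer([
    ("a_copy", lambda r: r["a"]),               # redundant, should not improve the score
    ("a_times_b", lambda r: r["a"] * r["b"]),   # matches the target exactly
])
_, kept, best, feedback = feature_loop(rows, target, proposer, best_abs_corr)
print(kept, feedback)
```

The verification step is what separates this from accepting whatever the model proposes: a redundant feature is rejected, and the accept/reject messages are what would be fed back to the model on the next iteration.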
While hyperparameter optimization can benefit from domain knowledge as well, we believe that steps of the pipeline that are closer to the data and the user, such as feature engineering, can benefit much more from additional semantic information. This may open exciting possibilities for broadening the range of applications of AutoML to help practitioners with more of the data science pipeline.
Executing AI-generated code requires careful consideration. We've implemented a whitelist of safe Python commands, but risks remain. AI can also replicate or even exacerbate biases present in the training data, and much more work is needed to avoid this. Please use CAAFE cautiously and examine its generated features critically, especially with an eye on principles from algorithmic fairness.
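For readers wondering what a whitelist check can look like, here is a minimal sketch built on Python's standard `ast` module (assuming Python 3.9+). It is not CAAFE's actual whitelist: the allowed node types, the `ALLOWED_NAMES` set, and the `is_safe` function are hypothetical choices for illustration. It accepts only simple arithmetic on columns of a variable named `df` and rejects imports, function calls, and attribute access.

```python
import ast

# Hypothetical whitelist: only these syntax node types may appear in generated code.
ALLOWED_NODES = (
    ast.Module, ast.Assign, ast.Expr,
    ast.Subscript, ast.Name, ast.Constant,
    ast.BinOp, ast.UnaryOp, ast.USub,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Pow,
    ast.Load, ast.Store,
)

# Hypothetical: the only variable the generated code may read or write.
ALLOWED_NAMES = {"df"}


def is_safe(code: str) -> bool:
    """Return True only if every syntax node in `code` is on the whitelist."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return False  # e.g. Import, Call, Attribute, Lambda
        if isinstance(node, ast.Name) and node.id not in ALLOWED_NAMES:
            return False  # e.g. open, __import__, os
    return True


print(is_safe('df["bmi"] = df["weight"] / df["height"] ** 2'))
print(is_safe('import os'))
print(is_safe('__import__("os").system("ls")'))
```

A static check like this only narrows what the code can express. It is not a sandbox, which is one reason the risks mentioned above remain.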
Why not let GPT generate your features directly or simply use OpenAI's Code Interpreter? While GPT-4 is a powerful model, it's not specifically designed for ML. CAAFE steps in with a systematic verification process to ensure the generated features are useful for the task at hand, and it feeds the results back to the LLM.