Automating DataScience Using LLMs

Automating DataScience Using LLMs (pypi.org)

4 points by noahho 2 years ago | 1 comment

noahho 2 years ago

LLMs meet AutoML: in an effort to integrate user knowledge into AutoML, our new tool CAAFE uses LLMs to generate semantically meaningful features for tabular data and to explain them. It is a step towards an AI assistant for human data scientists.

Paper: https://arxiv.org/abs/2305.03403
Demo: https://colab.research.google.com/drive/1mCA8xOAJZ4MaB_alZvy...

CAAFE uses GPT-4 and a textual description of the dataset to iteratively generate Python code that creates new features and explanations for their utility. It thereby harnesses the creative power of LLMs in combination with a systematic verification process that interacts with the LLM in an iterative fashion.
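To make the generate-then-verify idea concrete, here is a minimal sketch, not CAAFE's actual code. The candidate features are hand-written lambdas standing in for LLM-generated code, and a crude decision stump with a single fixed split stands in for whatever model and validation scheme a real tool would use. All names (`fit_stump`, `validation_score`, `select_features`, the toy dataset) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Stump = Tuple[str, float, bool]  # (column, threshold, positive_above)


def fit_stump(rows: List[Row], labels: List[int]) -> Stump:
    """Pick the single-column threshold rule with the best training accuracy."""
    best, best_acc = None, -1.0
    for col in rows[0]:
        for t in sorted({r[col] for r in rows}):
            for positive_above in (True, False):
                acc = stump_accuracy((col, t, positive_above), rows, labels)
                if acc > best_acc:
                    best, best_acc = (col, t, positive_above), acc
    return best


def stump_accuracy(stump: Stump, rows: List[Row], labels: List[int]) -> float:
    col, t, positive_above = stump
    hits = sum(((r[col] > t) == positive_above) == bool(y)
               for r, y in zip(rows, labels))
    return hits / len(rows)


def validation_score(rows: List[Row], labels: List[int]) -> float:
    """Fit on even-indexed rows, score on odd-indexed rows (one fixed split)."""
    stump = fit_stump(rows[0::2], labels[0::2])
    return stump_accuracy(stump, rows[1::2], labels[1::2])


def select_features(rows, labels, candidates):
    """Keep a candidate feature only if it strictly improves the validation score."""
    rows = [dict(r) for r in rows]  # do not mutate the caller's data
    best = validation_score(rows, labels)
    kept = []
    for name, fn in candidates:
        trial = [{**r, name: fn(r)} for r in rows]
        score = validation_score(trial, labels)
        if score > best:  # verification step: reject features that do not help
            rows, best = trial, score
            kept.append(name)
    return kept, best


# Toy data: the label depends on weight / height**2, which no raw column captures.
ROWS = [{"height": h, "weight": w}
        for h in (1.5, 1.6, 1.7, 1.8, 1.9)
        for w in (50.0, 60.0, 70.0, 80.0, 90.0, 100.0)]
LABELS = [int(r["weight"] / r["height"] ** 2 > 25) for r in ROWS]

# Stand-ins for code an LLM might propose from a dataset description.
CANDIDATES: List[Tuple[str, Callable[[Row], float]]] = [
    ("height_plus_one", lambda r: r["height"] + 1),      # adds no information
    ("bmi", lambda r: r["weight"] / r["height"] ** 2),   # semantically meaningful
]

kept, score = select_features(ROWS, LABELS, CANDIDATES)
print(kept, score)
```

In a real loop, the accepted or rejected result for each candidate would also be reported back to the LLM before it proposes the next feature.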

While hyperparameter optimization can benefit from domain knowledge as well, we believe that steps of the pipeline that are closer to the data and the user, such as feature engineering, can benefit much more from additional semantic information. This may open exciting possibilities for broadening the range of applications of AutoML to help practitioners with more of the data science pipeline.

Executing AI-generated code requires careful consideration. We've implemented a whitelist of safe Python commands, but risks remain. Also, AI can replicate or even exacerbate biases present in the training data. Much more work is needed to avoid this. Please use CAAFE cautiously and examine its generated features critically, especially with an eye on principles from algorithmic fairness.
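For readers wondering what a whitelist of safe commands can look like, here is one possible sketch using the standard-library `ast` module. It is an illustration, not CAAFE's implementation: the allowed node types, the `ALLOWED_CALLS` set, and the `is_allowed` helper are all assumptions chosen for the example, and it assumes Python 3.9 or later. It only checks code statically; it does not sandbox execution.

```python
import ast

# Hypothetical allowlist: plain arithmetic, comparisons, subscripting, assignment.
ALLOWED_NODES = (
    ast.Module, ast.Expr, ast.Assign,
    ast.Name, ast.Load, ast.Store, ast.Constant, ast.Subscript,
    ast.BinOp, ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Pow,
    ast.UnaryOp, ast.USub,
    ast.Compare, ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq, ast.NotEq,
    ast.Call,
)

# Hypothetical set of builtins that generated code may call by name.
ALLOWED_CALLS = {"abs", "min", "max", "round"}


def is_allowed(source: str) -> bool:
    """Return True only if every AST node in `source` is on the allowlist."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return False  # e.g. imports, attribute access, lambdas, loops
        if isinstance(node, ast.Call):
            # Only direct calls to allowlisted names; no method calls.
            if not (isinstance(node.func, ast.Name)
                    and node.func.id in ALLOWED_CALLS):
                return False
    return True


print(is_allowed('df["bmi"] = df["weight"] / df["height"] ** 2'))
print(is_allowed('import os'))
```

This version is deliberately strict: it rejects all attribute access, so even harmless method calls on a DataFrame would fail. A practical allowlist has to trade that strictness against usefulness, which is one reason risks remain.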

Why not let GPT generate your features directly or simply use OpenAI's Code Interpreter? While GPT-4 is a powerful model, it's not specifically designed for ML. CAAFE steps in with a systematic verification process that checks whether the generated features are useful for the task at hand and feeds the results back to the LLM.