We wanted to test whether a smaller model like GPT-4.1-mini could beat its larger sibling, GPT-4.1, at Tic-Tac-Toe using only context engineering. We put the two models in a 100-game tournament. Right before the smaller model made each move, we gave it a few examples of winning moves from past games. The results were clear. Without the examples, the smaller model struggled against GPT-4.1. With the examples, its effectiveness increased by nearly 200%, and it consistently won. It's a simple demonstration, but it shows that a smaller, faster model given good, timely examples can outperform a more capable base model. The full write-up and code are in the repo.
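To make the idea concrete, here is a minimal sketch of how a few past winning moves could be placed in the prompt right before the model moves. This is not the code from the repo. The `WinningExample` type, the board encoding, the similarity-based selection, and the prompt wording are all assumptions made for illustration, and the actual model call is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WinningExample:
    """A past position and the move that won it (hypothetical record format)."""
    board: str   # 9 characters read row by row: 'X', 'O', or '.' for empty
    player: str  # 'X' or 'O', the side that made the winning move
    move: int    # cell index 0-8 of the winning move


def render(board: str) -> str:
    """Turn the 9-character board string into three text rows."""
    rows = [board[i:i + 3] for i in range(0, 9, 3)]
    return "\n".join(" ".join(row) for row in rows)


def similarity(a: str, b: str) -> int:
    """Count the cells on which two boards agree."""
    return sum(x == y for x, y in zip(a, b))


def select_examples(board: str, player: str, pool: list[WinningExample],
                    k: int = 3) -> list[WinningExample]:
    """Pick up to k past wins for the same side, most similar boards first.

    Ranking by similarity is an assumption; the original may choose
    examples differently.
    """
    candidates = [ex for ex in pool if ex.player == player]
    candidates.sort(key=lambda ex: similarity(board, ex.board), reverse=True)
    return candidates[:k]


def build_prompt(board: str, player: str, pool: list[WinningExample],
                 k: int = 3) -> str:
    """Assemble the prompt: instructions, then examples, then the live board."""
    parts = [
        f"You are playing Tic-Tac-Toe as {player}. "
        "Cells are numbered 0-8, row by row."
    ]
    for ex in select_examples(board, player, pool, k):
        parts.append(
            "Example of a winning move from a past game:\n"
            f"{render(ex.board)}\nMove: {ex.move}"
        )
    parts.append(
        f"Current board:\n{render(board)}\n"
        "Reply with the cell number of your move."
    )
    return "\n\n".join(parts)


# Illustrative data only, not results from the tournament.
pool = [
    WinningExample("XX.OO....", "X", 2),
    WinningExample("X..OX.O..", "X", 8),
    WinningExample("OO.XX....", "O", 2),
]
prompt = build_prompt("XX.O.O...", "X", pool, k=2)
```

The string returned by `build_prompt` would be sent to the smaller model as its input for that turn. The baseline condition corresponds to calling it with an empty pool, so the prompt contains only the instructions and the current board.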