As I’ve discussed, the architecture and implementation of text-based AI agents (Agentic Applications) are converging on similar core principles. The next chapter for AI agents is now unfolding: agents that can navigate mobile or browser screens, typically by using bounding boxes to identify and interact with on-screen elements. Some frameworks propose a solution in which the agent can open browser tabs, navigate to URLs, and perform tasks by interacting with a website.
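To make the bounding-box idea concrete, here is a minimal sketch of how an agent might turn a detected element into a click target. The `BoundingBox` and `ScreenElement` types, the `(left, top, right, bottom)` pixel format, and the `click_point_for` helper are all assumptions made for illustration. They are not taken from any particular framework, and real systems will differ in how elements are detected and labelled.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in screen pixels.

    Hypothetical format: (left, top, right, bottom).
    """
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> tuple[int, int]:
        # The midpoint of the box is a common, simple choice of click target.
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


@dataclass(frozen=True)
class ScreenElement:
    """A labelled element, e.g. as produced by a vision model (assumed)."""
    label: str
    box: BoundingBox


def click_point_for(
    elements: list[ScreenElement], label: str
) -> Optional[tuple[int, int]]:
    """Return the click point of the first element whose label matches.

    Matching is case-insensitive; returns None if nothing matches.
    """
    for element in elements:
        if element.label.lower() == label.lower():
            return element.box.center()
    return None


# Example screen with two detected elements (made-up coordinates).
elements = [
    ScreenElement("Search", BoundingBox(100, 40, 500, 80)),
    ScreenElement("Submit", BoundingBox(520, 40, 600, 80)),
]

print(click_point_for(elements, "submit"))  # (560, 60)
```

In a real agent, the resulting coordinates would be handed to whatever drives the browser or device, and the element list would come from a screenshot-parsing step rather than being hard-coded.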