The emergence of LLMs has opened up an avenue for tackling problems that were previously thought impossible. The plethora of LLM-based applications is proof of this. But one question remains a mystery: how do we effectively evaluate LLM-based applications?
In this article, we will try to answer that question by looking at the methods used to benchmark LLMs and discussing state-of-the-art (SOTA) evaluation methods, the available frameworks, and the challenges in evaluating LLM-based applications.