[0]: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
It's like asking how evaluating 5 years of someone's performance at work could be better at predicting their competency than their SAT scores.
https://huggingface.co/papers/2306.05685
This paper makes the argument that...
"Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain."
So, the Arena could theoretically be automated and achieve similar outcomes. Or at least, it could quickly determine a predicted-ELO for every model, which would be interesting to compare against the human-rated outcomes.
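To make the predicted-Elo idea concrete, here is a minimal sketch of turning a stream of judge verdicts into ratings with the standard Elo update. The K-factor of 32, the starting rating of 1000, and the `(model_a, model_b, score_a)` verdict format are my own assumptions, not whatever the Arena actually uses.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def predicted_elo(verdicts, k=32.0, initial=1000.0):
    """Replay pairwise verdicts in order and return a rating per model.

    verdicts: iterable of (model_a, model_b, score_a), where score_a is
    1.0 if the judge preferred A, 0.0 if it preferred B, 0.5 for a tie.
    k and initial are assumed values for illustration only.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, score_a in verdicts:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        delta = k * (score_a - e_a)
        # Elo is zero-sum: whatever A gains, B loses.
        ratings[model_a] += delta
        ratings[model_b] -= delta
    return dict(ratings)
```

Feed it the LLM judge's verdicts and the human votes separately, and you can compare the two resulting rankings side by side.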
Phi-2 isn't fine tuned for instruction following yet.
For example, consider my analysis [0] based on observing the progression of Large Language Models (LLMs) in a single text adventure.
[0] https://github.com/s-macke/AdventureAI#evaluation-of-other-m...
-Ask any question to two anonymous models (e.g., ChatGPT, Claude, Llama) and vote for the better one!
-You can continue chatting until you identify a winner.
-Vote won’t be counted if model identity is revealed during conversation.
Do you really need more than this to know which one you’re going to pick? https://i.imgur.com/En37EJD.png
Avatar doesn’t have humans? Seriously?
I only ask technical (PyTorch) questions, though.
The Glicko rating system is very similar to Elo, but it also models the variance of a given rating. It can directly tell you a "rating deviation."
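For anyone curious how the rating deviation falls out, here is a rough sketch of one rating-period update in the original Glicko (Glicko-1) system as I understand it from Glickman's description. The function names and the `(opp_rating, opp_rd, score)` input format are my own, and it skips the step that inflates RD between periods.

```python
import math

Q = math.log(10) / 400.0


def g(rd):
    """Discount factor: the less certain the opponent's rating, the less a result counts."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / (math.pi ** 2))


def glicko_update(rating, rd, results):
    """Return (new_rating, new_rd) after one rating period.

    results: list of (opp_rating, opp_rd, score), with score 1.0 for a win,
    0.5 for a draw, 0.0 for a loss.
    """
    if not results:
        return rating, rd  # no games played, nothing to update in this sketch
    d2_inv = 0.0  # accumulates 1 / d^2
    total = 0.0   # accumulates g * (score - expected)
    for opp_rating, opp_rd, score in results:
        g_j = g(opp_rd)
        e_j = 1.0 / (1.0 + 10 ** (-g_j * (rating - opp_rating) / 400.0))
        d2_inv += Q * Q * g_j * g_j * e_j * (1.0 - e_j)
        total += g_j * (score - e_j)
    denom = 1.0 / (rd * rd) + d2_inv
    new_rating = rating + (Q / denom) * total
    new_rd = math.sqrt(1.0 / denom)
    return new_rating, new_rd
```

The useful property for a leaderboard is that `new_rd` shrinks as a model collects more votes, so a model with few battles shows up with a visibly wide uncertainty instead of a falsely precise number.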
https://www.reddit.com/r/LocalLLaMA/comments/17jrj82/new_mic...
Also very interesting that Mixtral 8x7B ranks in the same neighborhood as Gemini Pro/GPT 3.5 Turbo/Claude 2.1 while being fully open source and Apache 2.0 licensed.
If I were already using GCP and they reduced their price (>10%) and offered tight integration with the rest of GCP services, it would still be appealing.
Starling:
```
USER list 5 largest capitals sorted by largest population first
AI Here are some countries and their capitals in alphabetical order
Australia: Canberra
Belgium: Brussels
Canada: Ottawa
Finland: Helsinki
Italy: Rome
France: Paris
Germany: Berlin
Greece: Athens
Ireland: Dublin
…(lists 50 others)

Gpt3.5
list 5 largest capitals sorted by largest population first
ChatGPT Certainly! To list the 5 largest capital cities by population, I'll provide you with the information as of my knowledge cutoff date in 2022. Please note that population figures may change over time, and it's a good idea to verify the latest data from reliable sources if needed. Here are the 5 largest capital cities sorted by population, starting with the largest:
Beijing, China: Beijing is the capital of China and has a population of over 21 million people as of 2022.
New Delhi, India: New Delhi is the capital of India and is part of the larger National Capital Territory of Delhi. It has a population of approximately 16.8 million people as of 2022.
Tokyo, Japan: Tokyo is the capital of Japan, and while the city itself is not officially designated as the capital, it serves as the de facto capital. The Tokyo metropolitan area has a population of around 14 million people.
Moscow, Russia: Moscow is the capital of Russia and has a population of over 12 million people.
Dhaka, Bangladesh: Dhaka is the capital of Bangladesh and has a population of approximately 8.9 million people.
```
(Submitted title was "Gemini Pro achieves accuracy slightly inferior to GPT 3.5 Turbo".)
If you want to say what you think is important about an article, that's fine, but do it by adding a comment to the thread. Then your view will be on a level playing field with everyone else's: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...
Someone described LLMs as “blurry JPEGs of the Internet”.
In that sense, maybe GPT 4 is as smart as the hive mind of the Internet gets, and newer models just take sharper pictures but of the same subject. Perhaps GPT 4 trained on one of the best subsets available and everything else is going to be worse or the same…
It’s curious that Sam Altman has publicly stated that OpenAI isn’t working on GPT 5. Why not? Is it because they know it’s a pointless exercise with the current training approaches?
Let's not forget where this breakthrough came from. I wouldn't count Google out.
Why do you say they will 'probably' do that? Do you have any information to back that up or is this your speculation?
So one of these is true: Google lied, this eval has bugs, or they borked the deployment.
The second approach, with RL, is based on immediate feedback and could make a model smarter than us. Just think of AlphaZero or AlphaTensor. But this requires deploying a wide search over possible solutions and using a mechanism to rank or filter the bad ideas out (code execution, running a simulation or a game, optimizing some metric)
So models need both past experience and new experience to advance. They can use organic text initially, but later need to develop their own training examples. The feedback they get will be on topic, relevant both to the human user and to the model's own mistakes. That's very valuable. Feedback learning is what could make LLMs finally graduate from mediocre results.
DeepMind is saying they are using both, and feedback learning is dialed up.
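To illustrate the search-then-filter loop described above, here is a toy sketch where "code execution" is the ranking mechanism: generate many candidate solutions, run each against test cases, and keep the ones that pass the most. The function name and the `(name, callable)` candidate format are made up for illustration, and a real system would sandbox the execution rather than call candidates directly.

```python
def rank_candidates(candidates, test_cases):
    """Score each candidate by how many test cases it passes, best first.

    candidates: list of (name, callable) pairs, e.g. sampled from a model.
    test_cases: list of (args_tuple, expected_output) pairs.
    Returns a list of (name, passed_count) sorted by passed_count descending.
    """
    scored = []
    for name, fn in candidates:
        passed = 0
        for args, expected in test_cases:
            try:
                if fn(*args) == expected:
                    passed += 1
            except Exception:
                # A crashing candidate simply fails that test case.
                pass
        scored.append((name, passed))
    # sorted() is stable, so ties keep their original generation order.
    return sorted(scored, key=lambda item: -item[1])
```

The survivors at the top of the list are the kind of self-generated training examples the feedback approach relies on.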
Your test isn't checking for instructions, consistency, logic, just one fact which the model you chose may have gotten right by chance. It's fine assuming you only expect the model to fact check and you don't plan to have a conversation, but if you want more than that, it doesn't work very well.
I'm hoping there are votes in there which can reflect those qualities and filtering by conversation length seems like the easiest way to improve the vote quality a bit.
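Filtering by conversation length would be a one-liner if the vote data carries a turn count. A minimal sketch, assuming a hypothetical `Vote` record with a `num_turns` field (I don't know what the Arena's actual export schema looks like):

```python
from dataclasses import dataclass


@dataclass
class Vote:
    model_a: str
    model_b: str
    winner: str     # "model_a", "model_b", or "tie"
    num_turns: int  # hypothetical field: user turns before the vote was cast


def filter_by_turns(votes, min_turns=2):
    """Keep only votes cast after at least min_turns turns of conversation."""
    return [v for v in votes if v.num_turns >= min_turns]
```

The threshold is a trade-off: a higher `min_turns` should favor votes that reflect consistency over a conversation, but it also throws away data, so ratings get noisier.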
> Contains inappropriately sourced conjecture of OpenAI's ChatGPT parameter count from this http URL, a citation which was omitted. The authors do not have direct knowledge or verification of this information, and relied solely on this article, which may lead to public confusion
The URL in question: https://www.forbes.com/sites/forbestechcouncil/2023/02/17/is...
This article was written by Aleks Farseev, the CEO of SoMonitor.ai, who makes the claim with no source or explanation:
> ChatGPT is not just smaller (20 billion vs. 175 billion parameters) and therefore faster than GPT-3
For new technologies, what matters most might be the universities where people are from, rather than the companies. The founders of Google graduated from Stanford. The founders of Mistral AI graduated from École Polytechnique and École Normale Supérieure, which are renowned in France, notably for their scientific training.
Edit: I missed the third rule. I wonder how smart their detection is.
Given the possibility of bias, it would make sense to have the judge “recuse” itself from comparisons involving its own output. Between GPT-4, Claude, and soon Gemini Ultra, there should be several strong LLMs to choose from.
I don’t think it would be a replacement for human rating, but it would be interesting to see.
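The recusal rule is easy to sketch: drop any judge that is also one of the two contestants, then take a majority over the remaining judges' verdicts. Everything here (function names, the "A"/"B"/"tie" verdict labels, matching judges to contestants by exact name) is an assumption for illustration.

```python
from collections import Counter


def eligible_judges(judges, model_a, model_b):
    """Return the judges that are not themselves part of the comparison."""
    contestants = {model_a, model_b}
    return [j for j in judges if j not in contestants]


def majority_verdict(verdicts):
    """Combine per-judge verdicts ("A", "B", or "tie") into one result.

    Returns None when there are no verdicts, and "tie" when the top two
    options received the same number of votes.
    """
    counts = Counter(verdicts).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]
```

Exact-name matching would miss the case where a judge and a contestant are different versions of the same model family, so a real setup would probably recuse by family instead.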
Where are you seeing that it is "further behind Gemini Pro than Gemini Pro is behind GPT 3.5"?
On the topic of “hardly conclusive” things, Gemini Pro literally told me just a few minutes ago[1] that the Avatar movies did not have humans in them. There was no funny business in the prompting. At least Mixtral knows that Avatar has humans in it. Most of Gemini Pro’s responses have been fine, but not exceptional.
[0]: one random article talking about these issues: https://www.surgehq.ai//blog/hellaswag-or-hellabad-36-of-thi...
Surely you could make a comparison of two unreleased models, but it wouldn't be interesting because you don't have any real data (and benchmarks don't really mean anything).
Gemini Ultra is self-evidently not ready for production. What are the issues? Who knows, but in a game that as of right now is mostly about reducing the amount of brute force required, something as "simple" as not being efficient enough is actually not something to gloss over. If your engine's entire shtick is having the greatest graphics but you can't make it run at acceptable fps, well, then it's not actually a usable product.
An LLM that is not actually released could very well be in a comparably dire state, and fixing it while also delivering on the promised performance might be entirely non-trivial.