OpenAI GPT-4 vs. Groq Mistral-8x7B

OpenAI GPT-4 vs. Groq Mistral-8x7B (serpapi.com)

105 points by tanyongsheng 2 years ago | 133 comments

wruza 2 years ago |

The prompt, for those interested. I find it pretty underspecified, but maybe that's the point. For example, "Business operating hours" could be expanded a little, because "Closed - Opens at XX" is still non-processable in both cases.

  You are an expert in Web Scraping, so you are capable to find the information in HTML and label them accordingly. Please return the final result in JSON.

  Data to scrape: 
  title: Name of the business
  type: The business nature like Cafe, Coffee Shop, many others
  phone: The phone number of the business
  address: Address of the business, can be a state, country or a full address
  years_in_business: Number of years since the business started
  hours: Business operating hours
  rating: Rating of the business
  reviews: Number of reviews on the business
  price: Typical spending on the business
  description: Extra information that is not mentioned yet in any of the data
  service_options: Array of shopping options from the business, for example, in store shopping, delivery and many others. It should be in format -> option_name: true
  is_operating: Whether the business is operating
  
  HTML: 
  {html}

infecto 2 years ago | |

This should be higher up. This whole blog post is mostly worthless because the way they are extracting data is less than optimal.

Lower-end models do not have the attention to complete tasks like this; GPT-4 Turbo generally will. But to have an optimal pipeline you should really be splitting these tasks up into individual units. You extract each attribute you want independently and then combine them back together however you want. Asking for JSON upfront is also suboptimal in the whole process.

I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.

Edit: I am not suggesting that an LLM is more optimal than whatever traditional parsing methods they may use, simply that the way they are doing it is wrong from an LLM flow perspective.
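
For illustration only, here is a minimal sketch of the "one attribute per request, then combine" flow described above. It is not the commenter's actual pipeline. The field descriptions come from the prompt quoted earlier in the thread; `build_prompt`, `extract_all`, the `NOT_FOUND` sentinel and the `fake_llm` stand-in are all hypothetical names, and a real setup would replace `fake_llm` with a call to whichever model is being tested.

```python
import json
from typing import Callable, Dict

# Field descriptions taken from the prompt quoted earlier in the thread (subset).
FIELDS: Dict[str, str] = {
    "title": "Name of the business",
    "phone": "The phone number of the business",
    "rating": "Rating of the business",
}


def build_prompt(description: str, html: str) -> str:
    """Ask one narrow question per attribute instead of one big JSON request."""
    return (
        f"From the HTML below, extract only this value: {description}.\n"
        "Reply with the value alone, or NOT_FOUND if it is absent.\n\n"
        f"HTML:\n{html}"
    )


def extract_all(html: str, ask_llm: Callable[[str], str]) -> Dict[str, object]:
    """Query each attribute independently, then combine the answers in code."""
    result: Dict[str, object] = {}
    for field, description in FIELDS.items():
        answer = ask_llm(build_prompt(description, html)).strip()
        result[field] = None if answer == "NOT_FOUND" else answer
    return result


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, so the sketch runs offline."""
    if "Name of the business" in prompt:
        return "Example Cafe"
    if "Rating" in prompt:
        return "4.5"
    return "NOT_FOUND"


if __name__ == "__main__":
    print(json.dumps(extract_all("<div>Example Cafe 4.5</div>", fake_llm), indent=2))
```

The JSON is assembled by ordinary code at the end, so the model is never asked to produce it, which is one way to read the "asking for JSON upfront" remark.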

ilyazub 2 years ago | | |

> I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.

Cool, cool. I'm super interested. Please share the process and the results.

wruza 2 years ago | | |

Also, my (limited) experience with prompts tells me that you want to invest more into the “You are” part. I’ll share my understanding, corrections are appreciated.

LLMs aren’t people even in a chat-roleplaying sense. They complete a “document” that can be a plot, a book, a protocol of conversation. The “AI” side in the chat isn’t an LLM itself, it’s a character (and so are you, it completes your “You: …” replies too - that’s where the driver app stops it and allows you to interfere). So everything you put in that header is very important. There are two places where you can do that: right in the chat, as in TFA, or in the “character card” (idk if GPTs have it, no GPT access for me). I found out that properly crafting a character card makes a huge difference and can resolve whole classes of issues.

Idk what will work best in this case, but I’d start with describing which sort of a bot it is, how it deals with unclear or incomplete information, how amazing it is (yes, really), its soft/tech skills and problem solving abilities, what other people think of it, their experience and so on. Maybe I’d add a few examples of interactions in a free form. Then in the task message I’d give it more specific details about that json.

One more note - at least for 8x7B, the “You are” in the chat is a much weaker instruction than a character card, even if the context is still empty. I low-key believe that’s because it’s a second-class prompt, i.e. the chat document starts with “This is a conversation with a helpful AI bot which yada yada” in… mind, and then in that chat that AI character gets asked to turn into something else, which poisons the setting.

Simply asking the default AI card represents 0.1% of what’s possible and doesn’t give the best results. Prompt Engineering is real.

> I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.

Same. I think that no matter how good a model is, this prompt just isn’t a professional task statement and leaves too much to decide. It’s a task that you, as a regular human, would hate to receive.

mhuffman 2 years ago | | |

Do you have an example of a more optimal prompt to share?

feintruled 2 years ago |

Brave new world, where our machines are sometimes wrong but by gum they are quick about it.

RUnconcerned 2 years ago | |

I too am a big fan of having my computer hallucinate incorrect information.

darthrupert 2 years ago | | |

Yesterday I asked my locally running gpt4all "What model are you running on?"

Answer: "I'm running on Toyota Corolla"

Which was perhaps the funniest thing I heard that day.

harryf 2 years ago | | |

  >> print("Hello, world!".ai_reverse())
  world, Hello!

RUnconcerned 2 years ago |

Finally, something more offensive than parsing HTML with regular expressions: parsing HTML with LLMs.

AlphaAndOmega0 2 years ago | |

I for one am glad I can offload all the regex to LLMs. Powerful? Yes. Human readable for beginners? No.

cornedor 2 years ago | | |

Why though? To me, it seems more prone to issues (hallucinations, prompt injections etc). It is also slower and more expensive at the same time. I also think it is harder to implement properly, and you need to add way more tests in order to be confident it works.

RUnconcerned 2 years ago | | |

Personally when I am parsing structured data I prefer to use parsers that won't hallucinate data but that's just me.

Also, don't parse HTML with regular expressions.

okamiueru 2 years ago | | |

Deterministic? No.

retrac98 2 years ago |

There are so many applications for LLMs where having a perfect score is much more important than speed, because getting it wrong is so expensive, damaging, or time consuming to resolve for an organisation.

infecto 2 years ago |

This test is interesting as a general high-level metric/test, but overall the way they are extracting data using an LLM is suboptimal, so I don't think the takeaway means much. You could extract this type of data using a low-end model like 8x7B with a high degree of accuracy.

samus 2 years ago | |

The better way would be to ask it to generate a program that uses CSS selectors to parse the HTML.
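
As a rough sketch of what such a generated parser might look like: the class names below (`biz-title`, `biz-rating`, `biz-phone`) are invented for the example and would have to come from the real page. To stay self-contained it matches on class names with the standard library's `html.parser` rather than using true CSS selectors; a selector library would be the more natural choice in practice.

```python
from html.parser import HTMLParser
from typing import Dict, Optional

# Hypothetical class names; a real page's markup would differ.
CLASS_TO_FIELD = {"biz-title": "title", "biz-rating": "rating", "biz-phone": "phone"}


class BusinessParser(HTMLParser):
    """Collects the first text found inside elements with a known class."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: Dict[str, str] = {}
        self._current: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or valueless, hence the "or".
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in CLASS_TO_FIELD:
                self._current = CLASS_TO_FIELD[cls]

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        # Stop waiting for text once any element closes.
        self._current = None


def parse_business(html: str) -> Dict[str, str]:
    parser = BusinessParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Once reviewed, code like this runs deterministically and in well under a millisecond per fragment, with the LLM involved only at authoring time.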

emporas 2 years ago |

Mixtral works very well with JSON output in my personal experience. The GPT family is excellent of course, and I would bet Claude and Gemini are pretty good. Mixtral however is the smallest of the models and the most efficient.

Especially running on Groq's infrastructure it's blazing fast. In some examples I ran on Groq's API, the query was completed in 70ms. Groq has released API libraries for Python and JavaScript, and I wrote a simple Rust example of how to use the API here [1].

Groq's API reports how long it takes to generate the tokens for each request. 70ms for a page-long document is well over 100 times faster than GPT, and faster than every other capable model. Accounting for internet latency and any queue that might exist, the user receives the response within a second, but how fast would this model run locally? Fast enough to generate natural language tokens, generate a synthetic voice, listen again and decode the next request the user might speak to it, all in real time.

With a technology like that, why not talk to internet services with just APIs and no web interface at all? Just functions exposed on the internet, that take JSON as an input, validate it, and send JSON back to the user? Or every other interface and button around. Why press buttons for every electric appliance, and not just talk to the machine using a JSON schema? Why should users on an internet forum, every time a comment is added, have to press the add comment button, instead of just talking and saying "post it"? Pretty annoying actually.

[1] https://github.com/pramatias/groq_test

imaurer 2 years ago |

Groq will soon support function calling. At that point, you would want to describe your data specification and use function calling to do extraction. Tools such as Pydantic and Instructor are good starting points.

I am collecting these approaches and tools here: https://github.com/imaurer/awesome-llm-json
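
A minimal sketch of the "describe your data specification" step, using only the standard library so it runs anywhere. The tool definition follows the OpenAI-style `tools` shape; whether a given Groq endpoint accepts exactly this is an assumption to check against its documentation. `save_business` is a made-up function name, and the hand-rolled `validate_arguments` stands in for what Pydantic or Instructor would do far more thoroughly.

```python
import json
from typing import Any, Dict

# JSON Schema for a few fields from the prompt quoted earlier in the thread.
BUSINESS_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "rating": {"type": ["number", "null"]},
        "reviews": {"type": ["integer", "null"]},
        "is_operating": {"type": "boolean"},
    },
    "required": ["title", "is_operating"],
}

# Tool definition in the OpenAI-style shape; "save_business" is a made-up name.
TOOL = {
    "type": "function",
    "function": {
        "name": "save_business",
        "description": "Record the business details extracted from the HTML.",
        "parameters": BUSINESS_SCHEMA,
    },
}

PY_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "null": type(None),
}


def _matches(value: Any, type_name: str) -> bool:
    # bool is a subclass of int in Python, so exclude it from numeric checks.
    if type_name in ("number", "integer") and isinstance(value, bool):
        return False
    return isinstance(value, PY_TYPES[type_name])


def validate_arguments(raw: str) -> Dict[str, Any]:
    """Parse the model's tool-call arguments and check them against the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("arguments must be a JSON object")
    for name in BUSINESS_SCHEMA["required"]:
        if name not in data:
            raise ValueError(f"missing required field: {name}")
    for name, value in data.items():
        spec = BUSINESS_SCHEMA["properties"].get(name)
        if spec is None:
            raise ValueError(f"unexpected field: {name}")
        allowed = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        if not any(_matches(value, t) for t in allowed):
            raise ValueError(f"{name} has wrong type: {value!r}")
    return data
```

With a typed schema, a string such as "N/A" in `rating` is rejected at validation time instead of silently ending up in the output.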

bambax 2 years ago |

Interesting post, but the prompt is missing? How do the LLMs generate the keys? It's likely the mistakes could be corrected with a better prompt or a post check?

Also, the Google SERP page is deterministic (it always has the same structure for the same kind of queries), so it would probably be much more effective to use AI to write a parser, and then refine it and use that?

tosh 2 years ago |

I initially thought the blog post is about scraping using screenshots and multi-modal llms.

Scraping is quite complex by now (front-end JS, deep and irregular nesting, obfuscated html, …).

crowdyriver 2 years ago |

There are lots of comments here about how stupid it is to parse HTML using LLMs.

Have you ever had to scrape multiple sites with varying HTML?

samus 2 years ago | |

The example here has HTML with a somewhat fixed format. It would indeed have been better to have samples with different formats and to aim for a low error rate.

If you are scraping a limited number of sites, you could for each site ask the LLM for parsing code from some samples, review that, and move on.

malux85 2 years ago |

Sorry to be nit-picky but that's the essence of these benchmarks - Mistral putting "N/A" for not available is weird - N/A is not applicable, in every use I have ever seen, and they DON'T mean the same thing. I would expect null for not available and N/A for not applicable.

Impressive inference speed difference though
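
One way to sidestep the ambiguity is a small post-processing pass that maps placeholder strings to null before the record is stored. This is a sketch only: the `MISSING_MARKERS` list is an assumption about what a model might emit, not something measured in the benchmark.

```python
from typing import Any, Dict

# Strings that models sometimes emit in place of a missing value (assumed list).
MISSING_MARKERS = {"n/a", "na", "none", "null", "not available", "unknown", ""}


def normalize_missing(record: Dict[str, Any]) -> Dict[str, Any]:
    """Replace placeholder strings with None so missing values are unambiguous."""
    cleaned: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
            cleaned[key] = None
        else:
            cleaned[key] = value
    return cleaned
```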

mewpmewp2 2 years ago | |

I have always known N/A as not available.

malux85 2 years ago | | |

Curious, where are you from? If I Google N/A every single hit on the first page is explaining it means "Not applicable"

Are you from a non-English-speaking country? Maybe it's cultural?

throwaway11460 2 years ago | |

It means all of these.

huqedato 2 years ago |

Can somebody explain why Groq is more performant than Microsoft's infrastructure? Is an LPU better than a TPU/GPU?

kkielhofner 2 years ago | |

LLM performance is about parallelism but also memory bandwidth.

Groq delivers this kind of speed by networking many, many chips together with a high-bandwidth interconnect. Each chip has only 230MB of SRAM[0].

From the linked reference:

"In the case of the Mixtral model, Groq had to connect 8 racks of 9 servers each with 8 chips per server. That’s a total of 576 chips to build up the inference unit and serve the Mixtral model."

That's eight racks with ~132GB of memory for the model. A single H100 has 80GB and can serve Mixtral without issue (albeit at lower performance).
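
The arithmetic behind the ~132GB figure, spelled out using only the numbers quoted above (decimal units, 1GB = 1000MB):

```python
RACKS = 8
SERVERS_PER_RACK = 9
CHIPS_PER_SERVER = 8
SRAM_MB_PER_CHIP = 230
H100_MEMORY_GB = 80

chips = RACKS * SERVERS_PER_RACK * CHIPS_PER_SERVER  # 8 * 9 * 8 = 576
total_sram_gb = chips * SRAM_MB_PER_CHIP / 1000      # 132480 MB = 132.48 GB

# How many H100s' worth of memory those eight racks add up to.
h100_equivalents = total_sram_gb / H100_MEMORY_GB

print(f"{chips} chips, {total_sram_gb:.2f} GB SRAM, {h100_equivalents:.2f}x one H100")
```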

If you consider the requirements for actual real-world inference serving workloads, where you need to serve multiple models, multiple versions of models, LoRA adapters, sentence embedding models (for RAG), etc., the economics and physical footprint alone get very challenging.

It's an interesting approach and clearly very, very fast but I'm curious to see how they do in the market:

1) This analysis uses cloud GPU costs for Nvidia pricing. Cloud providers make significant margin on their GPU instances. If you look at qty 1 retail Nvidia DGX, Lambda Hyperplane, etc. and compare it to cloud GPU pricing (inference needs to run 24x7), break-even on hardware vs cloud is less than seven months, depending on what your costs are for hosting the hardware.

2) Nvidia has incredibly high margins.

3) CUDA.

There are some special cases where tokens per second and time to first token are incredibly important (as the article states - real time agents, etc) but overall I think actual real-world production use or deployment of Groq is a pretty challenging proposition.

[0] - https://www.semianalysis.com/p/groq-inference-tokenomics-spe...

tosh 2 years ago | |

The Mixtral mixture-of-experts model has far fewer parameters active during inference, and Groq has special-purpose hardware (and probably less concurrent demand).

kkielhofner 2 years ago | | |

> probably less concurrent demand

This is a significant understatement. ChatGPT has an estimated 100m monthly active users.

Groq gets featured on HN from time to time but is otherwise almost completely unknown. According to their stats they have done something like 15m requests total since launch. ChatGPT likely does this in hours (or less).

naiv 2 years ago | |

It's a totally different approach to inference.

In short:

Groq - AI chip

Microsoft etc. - Nvidia GPU

ttrrooppeerr 2 years ago |

A bit off-topic but maybe not? Any words on GPT-5? Is that coming? Or is OpenAI just focusing on the Sora model?

YetAnotherNick 2 years ago | |

There's no reason for OpenAI to release the model. They have close to 100% of the market anyway, and releasing GPT-5 likely won't increase the total market as it is an incremental leap. And it's an open secret that most other models used GPT-4 synthetic data for training to come close to it.

They would likely wait till some other model performs better than GPT-4 for the same price.

whiplash451 2 years ago | | |

The same reasoning would have applied to GPT-3.5. In hindsight, you can say that it was obviously a good idea to build and ship GPT-4. But hindsight is 20/20.

chilmers 2 years ago | | |

By any chance did you used to work in leadership at Nokia or Research in Motion? :-D

lewhoo 2 years ago | | |

There is a reason to release new models if said models would be capable of grabbing a significant portion of the job market currently occupied by humans.

tosh 2 years ago | | |

100%?

Claude 3 Opus is in the capability ballpark of GPT-4, GPT-3.5 has alternatives that are cheaper (Claude 3 Haiku) or cheaper and work offline (Qwen 1.5, Mixtral, …).

burrish 2 years ago | |

I hear it should be dropped this summer

cornedor 2 years ago | | |

According to Sam Altman in a podcast with Lex Fridman this week, there is no real indication that it will be dropped this year. They will release a new model, but it might not be GPT-5

DalasNoin 2 years ago | | |

My understanding from the lex podcast: they will release a lot of new models this year, but they will release intermediate models first before gpt-5

dns_snek 2 years ago |

For all the posturing and crypto hate on HN, we're entering a world where it's socially acceptable to use 1000W of computing power and 5 seconds of inference time to parse a tiny HTML fragment which would take microseconds with traditional methods - and people are cheering about it. Time for some self-reflection? That's not very green.

  Generating (1 / 512 tokens) [(pu 4.43%) (The 66.62%) (po 11.96%) (p 4.99%)]
  Generating (2 / 512 tokens) [(o 89.90%) (op 10.10%)]
  Generating (3 / 512 tokens) [(P 100.00%)]
  Generating (4 / 512 tokens) [( 100.00%)]