OpenAI GPT-4 vs. Groq Mistral-8x7B (serpapi.com)
You are an expert in Web Scraping, so you are capable to find the information in HTML and label them accordingly. Please return the final result in JSON.
Data to scrape:
title: Name of the business
type: The business nature like Cafe, Coffee Shop, many others
phone: The phone number of the business
address: Address of the business, can be a state, country or a full address
years_in_business: Number of years since the business started
hours: Business operating hours
rating: Rating of the business
reviews: Number of reviews on the business
price: Typical spending on the business
description: Extra information that is not mentioned yet in any of the data
service_options: Array of shopping options from the business, for example, in store shopping, delivery and many others. It should be in format -> option_name: true
is_operating: Whether the business is operating
HTML:
{html}
Lower-end models do not have the attention to complete tasks like this; GPT-4 Turbo generally will. But for an optimal pipeline you should really be splitting these tasks into individual units: extract each attribute you want independently, then combine the results however you want. Asking for JSON up front is also suboptimal.
I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.
Edit: I am not suggesting that an LLM is more optimal than whatever traditional parsing methods they may use, simply that the way they are doing it is wrong from an LLM workflow perspective.
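A minimal sketch of the per-attribute flow described above, assuming a hypothetical `ask` callable that wraps whatever model you use (it is not a real library API, and `fake_ask` is only a stand-in so the example runs offline). Each field gets its own small prompt, and the answers are combined in ordinary code instead of asking the model for JSON up front.

```python
import json
from typing import Callable, Dict

# Field descriptions taken from the prompt in the article (a subset).
FIELDS: Dict[str, str] = {
    "title": "Name of the business",
    "phone": "The phone number of the business",
    "rating": "Rating of the business",
}

def extract_fields(html: str, ask: Callable[[str], str],
                   fields: Dict[str, str] = FIELDS) -> Dict[str, str]:
    """Ask for one attribute per call, then combine the answers in code."""
    result = {}
    for name, description in fields.items():
        prompt = (
            f"From the HTML below, return only the {name} "
            f"({description}). Reply with N/A if it is missing.\n\n{html}"
        )
        result[name] = ask(prompt).strip()
    return result

def fake_ask(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return " Example Cafe " if "only the title" in prompt else "N/A"

# JSON is produced by the program at the end, not by the model.
print(json.dumps(extract_fields("<div>Example Cafe</div>", fake_ask)))
```

Whether this beats a single large prompt on accuracy or cost would have to be measured; the sketch only shows the shape of the pipeline.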
Cool, cool. I'm super interested. Please share the process and the results.
LLMs aren’t people even in a chat-roleplaying sense. They complete a “document” that can be a plot, a book, a protocol of conversation. The “AI” side in the chat isn’t an LLM itself, it’s a character (and so are you, it completes your “You: …” replies too - that’s where the driver app stops it and allows you to interfere). So everything you put in that header is very important. There are two places where you can do that: right in the chat, as in TFA, or in the “character card” (idk if GPTs have it, no GPT access for me). I found out that properly crafting a character card makes a huge difference and can resolve whole classes of issues.
Idk what will work best in this case, but I’d start with describing what sort of bot it is, how it deals with unclear or incomplete information, how amazing it is (yes, really), its soft/tech skills and problem-solving abilities, what other people think of it, their experience and so on. Maybe I’d add a few examples of interactions in free form. Then in the task message I’d give it more specific details about that JSON.
One more note - at least for 8x7B, the “You are” in the chat is a much weaker instruction than a character card, even if the context is still empty. I low-key believe that’s because it’s a second-class prompt, i.e. the chat document starts with “This is a conversation with a helpful AI bot which yada yada” in… mind, and then in that chat that AI character gets asked to turn into something else, which poisons the setting.
Simply asking the default AI card represents 0.1% of what’s possible and doesn’t give the best results. Prompt Engineering is real.
> I have high confidence that I could accomplish this task using a lower end model with a high degree of accuracy.
Same. I think that no matter how good a model is, this prompt just isn’t a professional task statement and leaves too much to decide. It’s a task that you, as a regular human, would hate to receive.
Answer: "I'm running on Toyota Corolla"
Which was perhaps the funniest thing I heard that day.
Also, don't parse HTML with regular expressions.
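As an illustration of the alternative to regular expressions, here is a small sketch using Python's standard-library `html.parser`. The class names in the usage example (`name`, `phone`) are made up for the example and are not taken from any real page.

```python
from html.parser import HTMLParser

class ClassTextParser(HTMLParser):
    """Collect the text inside elements carrying a given class name.

    Limitation of this sketch: unclosed void tags such as <br> inside a
    matching element would throw off the depth counter.
    """

    def __init__(self, wanted_class: str):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.wanted_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def text_by_class(html: str, wanted_class: str) -> str:
    parser = ClassTextParser(wanted_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample = ('<div class="biz"><span class="name">Example <b>Cafe</b></span>'
          '<span class="phone">555-0100</span></div>')
print(text_by_class(sample, "name"))
```

A real scraper would more likely use a full HTML library with selectors; the point is only that a tokenizing parser handles nesting that a regex cannot.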
Especially running on Groq's infrastructure it's blazing fast. In some examples I ran on Groq's API, the query completed in 70ms. Groq has released API libraries for Python and JavaScript; I wrote a simple Rust example of how to use the API here [1].
Groq's API reports how long it takes to generate the tokens for each request. 70ms for a page of text is well over 100 times faster than GPT, and faster than every other capable model. Accounting for internet latency and whatever queue might exist, the user receives the response within a second, but how fast would this model run locally? Fast enough to generate natural language tokens, synthesize a voice, listen again and decode the next request the user speaks to it, all in real time.
With a technology like that, why not talk to internet services with just APIs and no web interface at all? Just functions exposed on the internet that take JSON as input, validate it, and send JSON back to the user? Or replace every other interface and button around. Why press buttons on every electric appliance instead of just talking to the machine using a JSON schema? Why should users on an internet forum, every time a comment is added, have to press the add comment button, instead of just talking and saying "post it"? Pretty annoying actually.
I am collecting these approaches and tools here: https://github.com/imaurer/awesome-llm-json
Also, Google SERP page is deterministic (always has the same structure for the same kind of queries), so it would probably be much more effective to use AI to write a parser, and then refine it and use that?
Scraping is quite complex by now (front-end JS, deep and irregular nesting, obfuscated html, …).
Have you ever had to scrape multiple sites with variadic html?
If you are scraping a limited amount of sites, you could for each site ask the LLM for parsing code from some samples, review that, and move on.
Impressive inference speed difference though
Are you from a non-English-speaking country? Maybe it's cultural?
Groq delivers this kind of speed by networking many, many chips together with high-bandwidth interconnect. Each chip has only 230 MB of SRAM[0].
From the linked reference:
"In the case of the Mixtral model, Groq had to connect 8 racks of 9 servers each with 8 chips per server. That’s a total of 576 chips to build up the inference unit and serve the Mixtral model."
That's eight racks with ~132GB of memory for the model. A single H100 has 80GB and can serve Mixtral without issue (albeit at lower performance).
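The ~132GB figure follows directly from the numbers quoted above (8 racks, 9 servers per rack, 8 chips per server, 230 MB of SRAM per chip). A quick check, using decimal gigabytes:

```python
racks, servers_per_rack, chips_per_server = 8, 9, 8
sram_mb_per_chip = 230

chips = racks * servers_per_rack * chips_per_server
total_sram_gb = chips * sram_mb_per_chip / 1000  # decimal GB

print(chips)          # 576
print(total_sram_gb)  # 132.48
```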
Actual real-world inference serving workloads need to serve multiple models, multiple versions of models, LoRA adapters, sentence embedding models (for RAG), etc. Given those requirements, the economics and physical footprint alone get very challenging.
It's an interesting approach and clearly very, very fast but I'm curious to see how they do in the market:
1) This analysis uses cloud GPU costs for Nvidia pricing. Cloud providers make significant margin on their GPU instances. If you look at qty 1 retail Nvidia DGX, Lambda Hyperplane, etc and compare it to cloud GPU pricing (inference needs to run 24x7) break even on hardware vs cloud is less than seven months depending on what your costs are for hosting the hardware.
2) Nvidia has incredibly high margins.
3) CUDA.
There are some special cases where tokens per second and time to first token are incredibly important (as the article states - real time agents, etc) but overall I think actual real-world production use or deployment of Groq is a pretty challenging proposition.
[0] - https://www.semianalysis.com/p/groq-inference-tokenomics-spe...
This is a significant understatement. ChatGPT has an estimated 100m monthly active users.
Groq gets featured on HN from time to time but is otherwise almost completely unknown. According to their stats they have done something like 15m requests total since launch. ChatGPT likely does this in hours (or less).
In short:
Groq - AI chip
Microsoft etc. - Nvidia GPU
They would likely wait till any model performs better than GPT 4 for the same price
Claude 3 Opus is in the capability ballpark of GPT-4, GPT-3.5 has alternatives that are cheaper (Claude 3 Haiku) or cheaper and work offline (Qwen 1.5, Mixtral, …).
The problem I'm finding is that the time I wanted to save maintaining selectors and the like is time that I'm now spending writing wrapper code and dealing with the mistakes it makes. Some are OK and I can deal with them; others are pretty annoying because it's difficult to handle them in a deterministic manner.
I've also tried with GPT-4 but it's way more expensive, and despite what this guy got, it also makes mistakes.
I don't really care about inference speed, but I do care about price and correctness.
In fact, what about a hybrid of what you're doing now? Initially, you use an LLM to generate examples. And then from those examples, you use that same LLM to write deterministic code?
That's understandable. The real problem is when the AI lies/hallucinates another answer with confidence instead of saying "I don't know".
TFA shows that Groq is many times faster than GPT-4. Up to 18x, Groq claims. Faster means less energy. So I think it's just a matter of time until these things become ridiculously power efficient (e.g. run on phones in sub-second times).
At least AI & LLMs have large scale practical applications as opposed to crypto (IMO).
Crypto is nearly pure waste.
I don't understand this. This adds bureaucracy and I don't see why different uses need to be charged differently if they all use energy the same.
In other words, if energy costs X per unit, and an inefficient (AI) software takes 30 units and an efficient (traditional) software takes 10 units, then it is already cheaper to run the efficient software, and thus people are already incentivised to do so. There's no need to charge differently. If one day AI turns out to only need 5 units, turning more efficient, then just charge them for 5X. People will gravitate towards the new, efficient AI software naturally then.
It no longer requires an expert human
This was just a blog to generate traffic on the site. Not to showcase some new use case for an llm.
>For all the posturing and forest fire hate on HN, it’s now socially acceptable to run a toy steam engine to power a model car? Not very green of you.
To be fair to GP, they did compare it to alternatives (dumb HTML parsing), but failed to consider versatile HTML parsing or other uses for Groq LLM.
If one cares about the environment, a carbon cap/tax is what you should campaign for. Then carbon-based energy sources will be curtailed, energy costs will go up, and AI like this will be encouraged to become more energy efficient or other methods used instead.
There is a lot of business value happening in the AI space and it's only going to get better.
Unless you live in a dictatorship it's definitely up to us to decide... Otherwise you leave your voice to the top 0.0001% business owners and expect them to work for your good and not for their own interests
Also read about the rebound effect. Planes are twice as efficient as they were 100 years ago yet they pollute infinitely more as a whole.
There is nothing ridiculous about the comment you're replying to
There are measurable differences in how different models perform against the same raw prompt, but generally the workflow is what matters more. The raw text prompt will depend on which model you are using because of those differences, but I don't think it's at the level of "prompt engineering" we had a year ago.
Used to be easy, when it was ASCII.
Reverse the bytes of UTF-8 and it won't always be valid UTF-8.
Reverse the code-points, and the Canadian flag gets replaced with the Ascension Island flag.
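Both failure modes are easy to reproduce. The sketch below relies on the fact that flag emoji are pairs of regional indicator symbols: Canada is C + A, and reversing the pair gives A + C, which is the code for Ascension Island.

```python
canada = "\U0001F1E8\U0001F1E6"      # regional indicators C + A
ascension = "\U0001F1E6\U0001F1E8"   # regional indicators A + C

# Reversing code points swaps the two indicators and changes the flag.
reversed_code_points = canada[::-1]
print(reversed_code_points == ascension)  # True

def reversed_bytes_are_valid_utf8(text: str) -> bool:
    """Reverse the UTF-8 bytes and report whether they still decode."""
    try:
        text.encode("utf-8")[::-1].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(reversed_bytes_are_valid_utf8("abc"))  # True: ASCII is one byte each
print(reversed_bytes_are_valid_utf8("é"))    # False: continuation byte comes first
```

Reversing user-visible characters correctly needs grapheme cluster segmentation, which the Python standard library does not provide.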
A competitor will likely need to be 10x better than ChatGPT in order to get significant market share, not just marginally better in certain scenarios.
Secondly, GPT-4 increased the overall AI market. According to all the sources, interviews and leaks, GPT-5 won't be a big leap over GPT-4 as the model size and training data won't be significantly larger. I doubt GPT-5 would do that. (I could be wrong in my assumption though that GPT-5 would just be an incremental gain).
One way to combat corruption is to ask an international panel of experts to assess how many extra emissions came from non-official sources in each country and reduce next year's cap by that amount. Then countries have an incentive to stamp out corruption.
Basically, carbon tax is the accountant's solution, innovation is the engineer's.
As soon as demand for oil starts to drop, so will oil prices, and I suspect they could go down by a factor of 10 or more and oil-rich nations would still think it worthwhile to exploit at least some reserves.
Whether or not this benefit outweighs the significant problems (cost, speed, accuracy and determinism) is up to the use case. For most use cases I can think of, the speed and accuracy of an actual parser would be preferable.
However, in situations where one is parsing highly dynamic HTML (eg if each business type had slightly different output, or you are scraping a site which updates the structure frequently and breaks your hand written parser) then this could be worth the accuracy loss.
https://web.archive.org/web/20240319224624/https://www.busin...
The other odd thing from Altman was saying that GPT-4 sucks.
I think the context for both announcements is the recent release of Anthropic's Claude-3, which in its largest "Opus" form beats GPT-4 across the board in benchmarks.
I personally think OpenAI/Altman is a bit scared that any moat/lead they had has disappeared and they are now being out-competed by Anthropic (Claude). Remember that Anthropic as a company was only formed (by core members of the OpenAI LLM team) at the same time as GPT-3 was released, so in the same time it took OpenAI to go from GPT-3 to GPT-4, Anthropic have gone from nothing -> Claude-1 -> Claude-2 -> Claude-3 which beats GPT-4 !!
Anthropic have also had quite a bit of success attracting corporate business, quite a bit of which is more long-term in nature (sharing details of expected future model capabilities so that partners can target those).
So, I think OpenAI is running a bit scared, and I'd interpret this non-announcement of some model (4.5 or 5) "coming soonish" to be them just waving the flag and saying "we'll be back on top soon", which they presumably will be, briefly, when their next release(s) do come out. Altman's odd "GPT-4 sucks" statement might be meant to downplay Claude-3 "Opus" which beats it.
> N/A (or sometimes n/a or N.A.) is a common abbreviation in tables and lists for the phrase not applicable, not available, not assessed, or no answer.
Meaning of n/a in English written abbreviation for not applicable: used on a form to show that you are not giving the information asked for because the question is not intended for you or your situation: If a question does not apply to you, please put N/A in the box provided. COMMERCE.
TIL
If I was to code something and for whatever reason some data wasn't available I would use N/A.
"Not applicable" doesn't feel right to me about N/A.
For instance if there is a table of comparison and for whatever reason there is data missing for some entity, while there should be, I would use N/A. So not applicable feels wrong for me for that reason alone.
This all is coming from intuition though.
We will need an LLM as a front end; it will then generate a query to fetch the facts from the internet or a database, then maybe format the facts for your consumption.
From what I've tested, all of the current models will see a prompt like "are you sure that's correct" and respond "no, I was incorrect [here's some other answer]", irrespective of the accuracy of the original statement.
Because LLMs don't work in a way for that to be possible if you operate them on their own.
Here is the debug output of my local instance of Mistral-Instruct 8x7B. The prompt from me was 'What is poop spelled backwards?'. It answered 'puoP'. Let's see how it got there starting with it processing my prompt into tokens:
'What (3195)', ' is (349)', ' po (1627)', 'op (410)', ' sp (668)', 'elled (6099)', ' backwards (24324)', '? (28804)', '\n (13)', '### (27332)', ' Response (12107)', ': (28747)', '\n (13)',
It tokenized 'poop' as two tokens: 'po', number 1627, and 'op', number 410.
Next it comes up with its response:
Generating (1 / 512 tokens) [(pu 4.43%) (The 66.62%) (po 11.96%) (p 4.99%)]
Generating (2 / 512 tokens) [(o 89.90%) (op 10.10%)]
Generating (3 / 512 tokens) [(P 100.00%)]
Generating (4 / 512 tokens) [( 100.00%)]
It picked 'pu' even though it was only a ~4% chance of being correct, then instead of picking 'op' it picked 'o'. The last token was a 100% probability of being 'P'. Output: puoP
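To illustrate how a ~4% token can win, here is a sketch of plain weighted sampling over the step-1 candidates from the debug output. The sampler settings actually used (temperature, top-k and so on) are not given, so this is an assumption about the mechanism, not a reproduction of it. Note that the listed probabilities sum to only 88%; the rest was not shown in the log.

```python
import random

# Candidates and probabilities copied from step 1 of the debug output.
candidates = {"pu": 4.43, "The": 66.62, "po": 11.96, "p": 4.99}

def greedy(probs):
    """Always take the most likely token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Draw one token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample(candidates, rng) for _ in range(10_000)]
share_pu = draws.count("pu") / len(draws)

print(greedy(candidates))  # The
print(share_pu)            # roughly 0.05 among the four listed candidates
```

Greedy decoding would have started the answer with 'The'; with sampling, 'pu' still comes up about one time in twenty.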
At no time did it write 'puoP' as a complete word, nor does it know what 'puoP' is. It has no way of evaluating whether that is the right answer or not. You would need a different process to do that.
People have a really hard time catching such bullshitting from humans, which is why free-form interviews don't work.
Good prompting and certain adjustments to the text generation parameters might help prevent hallucinations, but it's not an exact science since it depends on how the model was trained. Also, an LLM's training data, frankly, contains a lot of bulls*t.
Think the commenter meant use another model/LLM which could give a different answer, then let them vote on the result. Like "old fashioned AI" did with ensemble learning.
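A minimal sketch of that voting idea, assuming the answers have already been collected from several models (the strings in the usage lines are made up). The function name and the `min_share` threshold are choices made for this example.

```python
from collections import Counter
from typing import Optional, Sequence

def majority_vote(answers: Sequence[str], min_share: float = 0.5) -> Optional[str]:
    """Return the most common answer if it holds more than min_share of votes.

    Answers are compared after trimming whitespace and lowercasing.
    Returns None when there is no clear winner.
    """
    if not answers:
        return None
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner if count / len(normalized) > min_share else None

print(majority_vote(["Paris", "paris ", "Lyon"]))  # paris
print(majority_vote(["a", "b"]))                   # None
```

Voting only helps when the models' errors are reasonably independent; models that share training data may agree on the same wrong answer.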
Humans running multishot with mixture of experts are close to perfect. You can't compare a multishot mixture-of-experts AI to a single human; humans don't work in isolation.
We haven't even gotten there yet, have we?
My professor (Sir Michael Brady) at university 14 years ago set up a company to do this very thing, and he already had reliable models back before 2010. I believe their company was called Oxford Imaging or something similar.
Yes and no. Countless teams have solved exactly this problem at universities and research groups across the world. Technically it's pretty much a solved problem. The hard part is getting the systems out of the labs and certified as an actual product and convincing hospitals and doctors to actually use them.
A lot of people seem to be using GPT-4 for tasks like text classification and NER, and they’d be much better off fine-tuning a BERT model instead. In vision, too, transformers are great but a lot of times, a CNN is all you really need.