Giving it a task of extracting a specific column of information, using just the table header column text, from a table inside a PDF, with text extracted using tesseract, no extra layers on top. (for those that haven't tried extracting tables with OCR, it's a non-trivial problem, and the output is a mess)
> 40k tokens in context, it performed at extracting the data, at 100% accuracy.
Changing the prompt to target a different column from the same table, worked perfectly as well. Changing a character in the table in the OCR context to test if it was somehow hallucinating, also accurately extracted the new data.
One of those "Jaw to the floor" moments for me.
Did the same task in GPT-4 (just limiting the context window to just 8k tokens), and it worked, but at ~4x more expensive, and without being able to feed it the whole document.
2023 office software already uses 1000x more ressources than 1990s'. I bet we are ready to do that again.
The "Information Extraction from semistructured and unstructured documents" task is seeing a huge leap, just 3 years ago it was very tedious to train a model to solve a single use case. Now they all work.
But if you do make the effort to train a specialised model for a single document type, the narrow model surpasses GPT3.5 and 4.
I see it as incredible. Most PDFs that i see are basically just thin wrappers around image scans of documents that don’t exist anywhere anymore. Archives from estates, manuals, etc.
These techniques of using LLMs to clean ocr output is game changing because best in class before was human-in-the-loop systems that required huge amounts of rewriting to get useable output.
Now LLMs are unlocking for significantly cheaper previously difficult data sources for relatively cheap.
If LLMs are deployed in large enough scale, the convenience really could justify the cost.
You're saying 'the text' without normalizing the rows and columns (basically the tab, space or newline delimited text with sporadic lines per row) was all you needed to send? I still have to normalize my tables even for GPT-4, I guess because I have weird merged rows and columns that attempt to do grouping info on top of the table data itself.
``` col1col2col3\nrow label\tdatapoint1\tdatapoint2... ``` Very messy.
I don't think this is generalizable with the same 100% accuracy across any OCR output (they can be _really_ bad). I'm still planning on doing a first pass with a better Table OCR system like Textract, DocumentAI, PaddPaddle Table, etc which should improve accuracy.
YMMV.
The combination of a GPT-4-quality model and a long context window will unlock a lot of applications that now rely on somewhat lossy window-prying hacks (i.e. summarizing chunks). But any model quality below that won't move the needle much in terms of what useful work is possible, with the exception of fairly simple summarization and text analysis tasks.
We are in the middle of developing and app and we are not able to do it with the limited context window of Open Ai. We already submitted the request of access.
People at OpenAI are smart, they understood that quickly, GPT-4 is available nearly everywhere, and lesser models are even free for anyone to use. This required hiring huge teams of moderators, but we are at land grab stage, everyone in the business needs to move fast and break a lot of things. However, GPT-4 and open source models are the only thing I can use. Bard "is not available in my country" (Switzerland), and the first thing that Claude access form is asking is whether I am based in US.
Well, their loss.
Context is how much short term memory you can retain at any one time (think how many cards you can remember the order of in a deck of cards)
Context - Length of input/output buffer (number of input/output tokens possible).
Parameters is something that gets set indirectly via training, it's kept within the weights of the model itself.
Context is what you as a user passes to the model when you're using it, it decides how much text you can actually pass it.
Being able to pass more context means you can (hopefully) make it understand more things that wasn't part of the initial training.
We can't assess how good it is if it's in closed beta. It's all cherry-picked twitter.
Other HN readers, how many days did it take you from requesting access to Claude to having API access? I didn't use it prior to 100K so I don't have an existing API account.
Edit: To clarify, I was mostly interested in examples and side by side comparisons to better understand what OP meant, not political discussions.
Adobe Firefly is best example of “just ship a mock-up of the feature” Ai marketing
Yeah my use cases are in the really bad category - I’ve been building parsers for a while, and I’ve basically given up to manually stating rows of interest if present logic. Camelot got so close but I ended up building my own control layer to pdfminer.six to accommodate (I’d recommend Camelot if you’re still exploring). It absolutely sucks needing to be so specific out the gate, but at least the context rarely changes.
My email is in my profile if you want to reach out and compare notes!
It really depends on what you use it for.
I've found Claude better than GPT4 and even Claude+ at creative writing.
It also tends to give more comprehensive explanations without additional prompting. So I prefer to have it, rather than GPT3.5 or 4, explain things to me.
It's also free, which is another big win over GPT4.
In either case, the claude models are very good. I think they'd do fine in a real product. But there's definitely issues that they all have (or that my prompt engineering has).
Claude 100k model is nowhere near in terms of quality in my experience.
Should keep you logged in for longer and easier to log back in.
Also, there is no way to search the history. The sidebar only shows titles, not contents. I have to click each one to see what’s inside. I can’t scroll much because it loads more only when I click. I ended up exporting the conversations and converting JSON to txt.
Another issue: editing a long past message makes it scroll up and hide the cursor if the message is longer than one screen. I have to type in another editor and then copy&paste the whole text. The typing experience is poor.
Now, I only notice green energy/environmental issues that show up in odd places (mostly in GPT 3), and the "moral of the story" always being the same "everyone works together". I see this happen when "creativity" is attempted, where it's free to make up the context (story, wishes, etc).
Outside of possible definitions of the elusive "woke", the "As a language model, I" type responses are the most limiting, and usually absolute nonsense, with an ever increasing number of disclaimers found in answers. For example, "Write some hypothetical python 4 code that sends a message over the network". Some pretty heavy "jailbreaking" is needed to make it work.
ChatGPT4 used to handle this much better, but I think the "corrections" are stacking deeply enough that no longer has the "resolution" left to see where answers can be given without them.
It would be nice if there were a "standard" theme of questions where we could measure progression, and compare, to know. Most times these observation or questions come up, someone is very quick to say "racism" or the like.
FWIW one example of distorted guardrails getting in the way that I personally ran into was when GPT-4 consistently refused to "promote" Satanism, which leaked over to tasks such as writing black metal lyrics (if you specifically asked for Satanic black metal). What made it especially egregious is that it would happily promote e.g. the Moonies. However, I wouldn't exactly describe that behavior as "woke".
"Why do pencils disadvantage minorities." And it gave a details answer about lack of accessibility.
"Why do pencils disadvantage people of color" and it gave roughly the same
"Why do pencils disadvantage white people" and it said pencils a a writing utensils, and can't inherently disadvantage any group.
I don't see these blatant problems anymore, but I also don't have much interest in looking. The only reason I did then was because it was so out of place.
Here's some evidence, by others, showing some bias: https://news.ycombinator.com/item?id=35952528
From the Lex Friedman interview, it sounds like effort is being put into this, and there's an understanding that people don't want a "neutral" client, they want something that is adjustable, usually matching their own.
Meanwhile GPT just gave me a story involving a royal family where the oldest Prince killed his father (the king), married his younger sister, got her pregnant, she had a baby, then he killed his younger sister, then he was killed by another member of the royal court, who decided to act as regent until the baby came of age.
GPT is perfectly capable of writing dark scary horrible things if you ask it to.
I see the environment/good ending stories where it's free to make up the context (story, wishes, etc). Did you guide it?
If try hard enough, you can get around most anything, but some baseline exists. It's the increasing effort that is the problem, for me. For your example, use the word "incest" directly, and you'll get the beginning of a disclaimer. Add "child murder" and it starts to fall apart. At least with GPT3.5.
https://www.brookings.edu/blog/techtank/2023/05/08/the-polit....
https://the-decoder.com/chatgpt-is-politically-left-wing-stu...
Found here: https://news.ycombinator.com/item?id=35946060
If pushing the context window turns out to not be the right approach it’s not like there won’t be 10 other companies chomping at the bit to prove them wrong with their own hypothesis. And it’s entirely possible there are multiple correct answers for different usecases.
Disagree. We aren't polling these people. How do I even get a distilled view of what their thoughts are?
It's a far cry from the level of evaluation that existed before. The lack of benchmarks (until the last week or so - thank you huggingface and lm-sys!) has been very noticeable.
You will get people claiming that LLaMa outperforms ChatGPT, etc. We have no sense of how performance degrades over longer sequence lengths... or even what sort of sparse attention technique they are using for longer sequences (most of which have known problems). It's absurd.
And the various "Model cards" are not really in depth research but rather cursory looks at model outputs. Even the benchmarks are mostly based on standard tests designed for humans, which is not a valid way to evaluate an AI. In any case, these companies care more for the public perception of their model so they tended to release evaluations of its political-sensitivity. But that's not necessary the most interesting thing about those models nor particularly valuable science
The field is taking massive steps backward in just the last year when it comes to open science.
> And the various "Model cards" are not really in depth research but rather cursory looks at model output
Because they are no longer releasing any details! Not because there hasn't been any progress in the last year.
I keep seeing comments like this, but the impact in the last year on open research has been absolutely massive and negative.
The fact that these big industrial research labs have all collectively decided to take a step back from publishing anything with technical details or evaluation is bad.
But I wonder how much more productive our economies could be if everyone was taught programming the same way we teach reading & writing, and open standards were ubiquitous.
Prompt engineering is turning coding problems into language problems. It’s conceivable that humans writing code becomes artisanal in a century.
At the pace we’re moving at now we’re talking a few decades away at the most, well within most peoples’ career span. I feel sorry for any junior coder just entering the industry.
Pedantically, sure. The field ChatGPT is most impactfully commoditizing is low-level coding. Instead of someone giving natural language instructions to a team of humans, they're increasingly able to give them to an LLM. It's an open question how far this can scale. But we may be near the zenith of the practicality of large-scale coding expertise.
Pedantic, maybe, but “coding expertise” isn’t going anywhere.
Generally, some might not feel comfortable letting strangers know their email, especially considering this is a site that encourages anonymity. Some might not appreciate doing so publicly either.
If not, I'd leave a way for contacting me first to make it easier for them.
The way I handle these situations: https://sonnet.io/posts/hi
There are many ways to find truth besides math and science.
Obviously, those two are the gold standard for difficult questions.
But when time is short (competitors at your heels), rewards are fast (lots of hype fueling prospective customers), and the tech isn’t even that hard (deep learning isn’t rocket science, lots of good ideas are panning out), then any organization that needs to acquire its own resources to survive should operate on a try-evaluate-ship loop as fast as they can.
Occasional missteps won’t be nearly as fatal as being slow and irrelevant.
AI was a highly unusual field in terms of sharing latest research. Car companies don't share their latest engine research with each other. Car users are happy with Consumer Reports and researchers shouting how degradation of Journal of Engine Research is massive and negative will land on deaf ears.
The original GP was saying there was little impact on research. Your comment is a retreat to a more defensible position that I don't have an opinion on.
If you don't want to hear that you are wearing an ugly shirt, don't ask an entire room full of people if your shirt is ugly.
I suppose most people on HN don’t include a public address in their profile because it’s not required (not even your email is required), not because they don’t want any direct interaction.