Improving recommendation systems and search in the age of LLMs

Improving recommendation systems and search in the age of LLMs (eugeneyan.com)

408 points by 7d7n 1 year ago | 93 comments

x1xx 1 year ago |

> Spotify saw a 9% increase in exploratory intent queries, a 30% rise in maximum query length per user, and a 10% increase in average query length—this suggests the query recommendation updates helped users express more complex intents

To me it's not clear that it should be interpreted as an improvement: what I read in this summary is that users had to search more and to enter longer queries to get to what they needed.

rorytbyrne 1 year ago | |

We would need to normalise query length by the success rate to draw any informative conclusions here. The rate of immediate follow-up queries could be a decent proxy for this.
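A rough sketch of how the follow-up proxy could be computed alongside query length. The log rows, the 30-second cutoff for "immediate", and the function names are all assumptions made up for illustration, not anything Spotify has described.

```python
from collections import defaultdict

# Hypothetical log rows: (user_id, timestamp_seconds, query). Not real data.
LOG = [
    ("u1", 0, "jazz"),
    ("u1", 12, "jazz piano trio"),
    ("u1", 400, "workout mix"),
    ("u2", 5, "lofi beats to study"),
]

FOLLOW_UP_WINDOW_S = 30  # assumed cutoff for an "immediate" follow-up


def follow_up_rate(log, window=FOLLOW_UP_WINDOW_S):
    """Fraction of queries followed by another query from the same user within `window` seconds."""
    by_user = defaultdict(list)
    for user, ts, _query in log:
        by_user[user].append(ts)

    total = followed = 0
    for times in by_user.values():
        times.sort()
        for i, ts in enumerate(times):
            total += 1
            if i + 1 < len(times) and times[i + 1] - ts <= window:
                followed += 1
    return followed / total if total else 0.0


def mean_query_length(log):
    """Average query length in words."""
    lengths = [len(query.split()) for _user, _ts, query in log]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

Reporting `mean_query_length` next to `follow_up_rate` for control and treatment would at least show whether longer queries came with fewer or more immediate retries.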

Traubenfuchs 1 year ago | |

> a 9% increase in exploratory intent queries

Users struggle to find the right stuff or stuff that's so good they don't need to do more queries.

> a 30% rise in maximum query length per user, and a 10% increase in average query length

Users need to execute more complex queries to find what they are looking for.

RicoElectrico 1 year ago | |

I can understand tracking metrics for performance (as in speed, server load) or revenue. But I don't see how anyone could make such conclusions as they did with a straight face, apart from achieving some OKR for promotion reasons. There's no substitute for user research, focused mindset and good taste.

I can imagine that's why today's apps suck so much as most of the pain points won't be easily caught by user behavior metrics.

One thing Alex from Organic Maps taught me is how important it is to just listen to your users. Many of the UX improvements were driven by addressing complaints from e-mail feedback.

RamblingCTO 1 year ago | |

100%. I've switched over to Apple Music because you really feel that they are pushing public playlists. Search terms maximize their playlists vs mine. I now had to go to my library to find my playlists because they wouldn't even show up.

singron 1 year ago | |

This is a hard problem. We had similar issues evaluating success with real users. In the literature, there is "abandonment" (i.e. I couldn't find what I wanted and gave up) and "positive abandonment" (I got what I wanted from the SERP and didn't click on anything). A flurry of requests might be a series of positive abandonment, a natural fruitful process of refining the request, or rage querying where the user repeatedly fails to correct a model that is incapable of understanding the query. It's especially devious if they rage query for a while before switching to an easier task and succeeding (e.g. clicking a result) since you might count that whole interaction as positive when it was really quite negative.

barrenko 1 year ago | |

People are just more and more used to interacting with an LLM / GPT, I think that's the why of the long questions + yes, people are not finding what they need.

1oooqooq 1 year ago | |

that's what you get when you have a "search pm".

braiamp 1 year ago | |

Yeah, this should be evaluated in a multivariate/bivariate model: among the successful queries, how did the length change before and after the intervention?

wildrhythms 1 year ago | |

No you don't understand, more queries = more engagement!

MostlyStable 1 year ago | | |

It's relatively easy to construct a scenario where more search is in fact indicative of better search. To stick with Spotify: let's imagine they have an amazing search tool that consistently finds new, interesting music that the user genuinely likes. I can imagine that in that situation, users are going to search more, because doing so consistently gets them new, enjoyable music.

But the opposite is equally possible: a terrible search tool could regularly fail to find what the user is looking for or produce music that they enjoy. In this situation, I can also imagine users searching more, because it takes more search effort to find something they like.

The key is why users are searching. In Spotify's case I imagine that you could try to connect the number of searches per listen, or how often a search results in a listen and how often those listens result in a positive rating. There are probably more options, but there needs to be some way of connecting the amount of search with how the user feels about those search results.

And yeah, using nothing other than search volume is probably a bad way to go about it.
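As a minimal sketch of the metrics suggested above (searches per listen, search-to-listen rate, positive ratings per listen): the event schema and field names here are hypothetical, and real logs would need a rule for attributing a listen to a search.

```python
# Hypothetical events, one per search: did it lead to a listen,
# and was that listen rated positively? Not real data.
SEARCHES = [
    {"user": "u1", "listened": True, "liked": True},
    {"user": "u1", "listened": False, "liked": False},
    {"user": "u2", "listened": True, "liked": False},
    {"user": "u2", "listened": True, "liked": True},
]


def search_quality(searches):
    """Tie search volume to outcomes instead of reporting volume alone."""
    n = len(searches)
    listens = sum(1 for s in searches if s["listened"])
    likes = sum(1 for s in searches if s["listened"] and s["liked"])
    return {
        "searches_per_listen": n / listens if listens else float("inf"),
        "listen_rate": listens / n if n else 0.0,
        "like_rate_given_listen": likes / listens if listens else 0.0,
    }


metrics = search_quality(SEARCHES)
```

If search volume goes up while `searches_per_listen` rises and `like_rate_given_listen` falls, that looks more like the "terrible search tool" scenario than the good one.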

porridgeraisin 1 year ago | |

A conclusion true to the concerned executive's MBA.

novia 1 year ago |

I started listening to this article (using a text to speech model) shortly after waking up.

I thought it was very heavy on jargon. Like, it was written in a way that makes the author appear very intelligent without necessarily effectively conveying information to the audience. This is something that I've often seen authors do in academic papers, and my one published research paper (not first author) is no exception.

I'm by no means an expert in the field of ML, so perhaps I am just not the intended audience. I'm curious if other people here felt the same way when reading though.

Hopefully this observation / opinion isn't too negative.

curious_cat_163 1 year ago | |

To me, it reads like a survey paper intended for (and maybe written by) a researcher about to start a new project. I am not a researcher in this space but I have dabbled elsewhere, so it is somewhat accessible. The degree to which one leverages existing jargon in their writing is a choice, of course.

I am curious -- what would have made it more effective at conveying information to you? Different people learn differently but I wonder how people get beyond the hurdles of jargon.

novia 1 year ago | | |

Yeah I'm not sure if it's just me and my learning style or if researchers purposefully use terminology that's obstructive to understanding to maintain walled gardens. I don't think my reading comprehension level is particularly low!

Usually the best way to learn about things like this for me is to see some actual code or to write things myself, but the lack of coding examples in the text isn't the thing that I find troubling. I don't know, it's just.. like, excessively pointer heavy?

Maybe if you've been in the field long enough, reading a particular term will instantly conjure up an idea of a corresponding algorithm or code block or something and that's what I'm missing.

7d7n 1 year ago | |

Thank you for the feedback! I'm sorry you found it jargony/less accessible than you'd like.

The intended audience was my team and fellow practitioners; assuming some understanding of the jargon allowed me to skip the basics and write more concisely.

LZ_Khan 1 year ago | |

I work in the field. The amount of jargon is indeed large but it's not out of the ordinary. It's simply how things are referred to. If the author explained what everything is the content would span a textbook.

That being said I do find the content difficult to understand, and I think reading the actual papers would be much more enlightening. But it's a great survey of all the things people have done.

softwaredoug 1 year ago |

A lot of teams can do a lot with search with just LLMs in the loop on query and index side doing enrichment that used to be months-long projects. Even with smaller, self hosted models and fairly naive prompts you can turn a search string into a more structured query - and cache the hell out of it. Or classify documents into a taxonomy. All backed by boring old lexical or vector search engine. In fact I’d say if you’re NOT doing this you’re making a mistake.
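A minimal sketch of the "structure the query, then cache it" idea. `call_llm` is a hypothetical stand-in that returns canned JSON so the example runs offline; the prompt text, the `color`/`terms` schema, and the function names are assumptions, not any particular product's API.

```python
import json
from functools import lru_cache

CALLS = {"n": 0}  # counts how often the (stubbed) model is invoked

PROMPT = 'Return JSON with keys "color" and "terms" for this product search.\nQuery: '


def call_llm(prompt):
    """Hypothetical stand-in for a real model call; swap in your own client."""
    CALLS["n"] += 1
    query = prompt.rsplit("Query: ", 1)[-1].strip()
    words = query.split()
    color = next((w for w in words if w in {"red", "blue", "black"}), None)
    terms = [w for w in words if w != color]
    return json.dumps({"color": color, "terms": terms})


@lru_cache(maxsize=10_000)
def structure_query(normalized):
    """Ask the model once per distinct query; return a hashable result."""
    parsed = json.loads(call_llm(PROMPT + normalized))
    return (parsed["color"], tuple(parsed["terms"]))


def search_filters(raw):
    # Normalize before the cache lookup so trivially different strings share an entry.
    return structure_query(" ".join(raw.lower().split()))
```

The structured result would then be passed as filters and terms to whatever lexical or vector engine sits behind it; the cache means repeated head queries never reach the model.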

syndacks 1 year ago | |

Can you share more, or at least point me in the right direction?

ntonozzi 1 year ago | | |

One place to explore more would be Doc2Query: https://arxiv.org/abs/1904.08375.

It’s not the latest and hottest but super simple to do with LLMs these days and can improve a lexical search engine quite a lot.
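For a feel of the idea, here is a toy sketch of Doc2Query-style document expansion: predicted queries are appended to each document before it goes into a lexical index. `generate_queries` is a hypothetical stub with canned output (a real setup would call a model there), and the tiny inverted index is only for illustration.

```python
from collections import defaultdict


def generate_queries(doc):
    """Hypothetical stand-in for a model that predicts queries a document answers."""
    canned = {
        "Reset the router by holding the button for ten seconds.":
            ["how to reboot wifi", "fix internet connection"],
    }
    return canned.get(doc, [])


def tokenize(text):
    tokens = (t.strip(".,?!").lower() for t in text.split())
    return [t for t in tokens if t]


def build_index(docs, expand=True):
    """Inverted index: token -> set of doc ids, optionally over expanded text."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        text = doc
        if expand:
            text += " " + " ".join(generate_queries(doc))
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of docs containing every query token."""
    sets = [index.get(t, set()) for t in tokenize(query)]
    return set.intersection(*sets) if sets else set()


DOCS = [
    "Reset the router by holding the button for ten seconds.",
    "Our office is closed on public holidays.",
]
```

Without expansion, "reboot wifi" matches nothing because neither word appears in the document; with expansion, the router document is found.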

jamesblonde 1 year ago |

It is very interesting that Eugene does this work and publishes it so soon after conferences. Traditionally this would be a literature survey by a PhD student and would take 12 months to come out as some obscure journal behind a walled garden. I wonder if it is an outlier (Eugene is good!) or a sign of things to come?

drodgers 1 year ago | |

> a sign of things to come

Isn't this, like, a sign of what's been happening for the last 20+ years (arxiv, blogs etc.)?

jamesblonde 1 year ago | | |

To some extent. But it's hard to find quality. Eugene's stuff is quality. For example, I'm in distributed systems, databases, and MLOps. Murat Demirbas (Uni Buffalo) has been the best in dist systems. Andy Pavlo (CMU) for databases. Stanford (Matei) have been doing the best summarizing in MLOps.

tullie 1 year ago |

The other direction that isn’t explicitly mentioned in this post is the variants of SASRec and BERT4Rec that are still trained on ID tokens but show scaling laws much like LLMs. E.g. Meta’s approach https://arxiv.org/abs/2402.17152 (paper write up here: https://www.shaped.ai/blog/is-this-the-chatgpt-moment-for-re...)

anon8764352 1 year ago |

@7d7n Eugene / others experienced in recommendation systems: for someone who is new to recommendation systems and uses variants of collaborative filtering for recommendations, what non-LLM approach would you suggest to start looking into? The cheaper the compute (ideally without using GPUs in the first place) the better, while also maximizing the performance of the system :)

mhuffman 1 year ago | |

IMHO it depends on the types of things you are recommending. If you have a good way of accurately and specifically textually classifying items, it is hard to beat the performance of good old-fashioned embeddings and vector search/ANN. There are plenty of embeddings that do not need the GPUs that the newer LLM-based ones all crave. Word2Vec, GloVe, and FastText are all high-performance and you wouldn't need GPUs. There are plenty of vector-search libraries that are high-performance and predate the vector-db popularity of late, so they also would not depend on GPUs to be high-performance. Most are memory-hungry however, so that is something to keep in mind. That performance, especially with the embeddings, will come at the cost of some loss of context. No free lunch.
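A CPU-only sketch of that approach, assuming items have short text descriptions. The 3-d word vectors are made-up toy values (in practice you would load pretrained Word2Vec/GloVe/FastText vectors), and the brute-force scan stands in for a proper ANN library.

```python
import math

# Toy 3-d word vectors with hypothetical values, for illustration only.
WORD_VECS = {
    "guitar": (0.9, 0.1, 0.0),
    "rock": (0.8, 0.2, 0.1),
    "piano": (0.1, 0.9, 0.0),
    "jazz": (0.2, 0.8, 0.1),
    "news": (0.0, 0.1, 0.9),
}


def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vecs = [WORD_VECS[w] for w in text.lower().split() if w in WORD_VECS]
    if not vecs:
        return (0.0, 0.0, 0.0)
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(seed, items, k=1):
    """Rank items by cosine similarity to the seed text (exact scan, no ANN)."""
    query_vec = embed(seed)
    ranked = sorted(items, key=lambda it: cosine(query_vec, embed(it)), reverse=True)
    return [it for it in ranked if it != seed][:k]


ITEMS = ["rock guitar", "jazz piano", "news"]
```

Averaging word vectors is where the loss of context mentioned above comes from: word order and phrasing are discarded, which is the price of the cheap compute.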

thaumiel 1 year ago |

ah this explains why my spotify experience has gotten worse over time.

UrineSqueegee 1 year ago | |

I have the exact opposite experience. Recently, when a playlist of mine ends, I love every recommended track that plays afterwards so much that I end up putting it in my playlist.

thaumiel 1 year ago | | |

My taste in music is apparently so varied that if I want to keep the "daily" Spotify lists as I want them, I have to limit the variation in what I listen to, otherwise they will get too mixed up and I will not enjoy them anymore. So I use other people's recommendations or music review sites instead to find new music/bands/artists. I tried the Spotify AI DJ service a couple of times, but it has not been a good experience; when it tries to push in a new direction it has never really gotten it right for me.

appleorchard46 1 year ago | | |

I liked when you could make a playlist radio and do that manually. That's been removed now of course.

whatever1 1 year ago |

Why don't we have an LLM-based search tool for our PCs / smartphones?

Especially for smartphones, all of your data is in the cloud anyway; instead of just scraping it for advertising and the FBI they could also do something useful for the user?

anthk 1 year ago |

Use 'Recoll' and learn to use search strings. For Windows users, older Recoll releases are standalone and have all the dependencies bundled, so you can search inside PDFs, ODT/DOCX and tons more.

stuaxo 1 year ago |

Off topic - but I think joining recommendation systems and forums (aka all the social media that isn't bsky or fedi) has been a complete disaster for society.

anonymousDan 1 year ago |

It's interesting that none of these papers seem to be coming out of academic labs....

pizza 1 year ago | |

Checking if a recommendation system is actually good in practice is kind of tough to do without owning a whole internet media platform as well. At best, you'll get the table scraps from these corporations (in the form of toy datasets/models made available), and you will still struggle to make your dev loop productive enough without throwing around amounts of compute similar to what the ~FAANGs use to validate whether that 0.2% improvement you got really meant anything or not. Oh, and also, the nature of recommendations is that they get very stale very quickly, so be prepared to check that your method still works when you do yet another huge training run on a weekly/daily cadence.

memhole 1 year ago |

It looks like a great overview of recommendation systems. I think my main takeaways are:

1. Latency is a major issue.

2. Fine-tuning can lead to major improvements and, if I didn't misread, reduce latency.

3. There’s some threshold or problems where prompting or fine tuning should be used.

a_bonobo 1 year ago |

Elicit has a nice new feature where given a research question, it seems to give the question to an LLM with the prompt to improve the question. It's a neat trick.

As an example, I gave it 'What is the impact of LLMs on search engines?' and it suggested three alternative searches, each under a keyword; the keyword 'Specificity' has the suggested question 'How do large language models (LLMs) impact the accuracy and relevance of search engine results compared to traditional search algorithms?'

It's a really cool trick that doesn't take much to implement.

bookofjoe 1 year ago |

Perplexity Pro suggested several portable car battery chargers, which led me to search online reviews, whose consensus (five or so review sites) highest-rated chargers were the first two on Perplexity's recommendation list. In other words, the AI was a helpful guide to focused deeper search.

thorum 1 year ago |

In the age of local LLMs I’d like to see a personal recommendation system that doesn’t care about being scalable and efficient. Why can’t I write a prompt that describes exactly what I’m looking for in detail and then let my GPU run for a week until it finds something that matches?

onel 1 year ago |

Another amazing post from Eugene

anon373839 1 year ago |

Terrific post. Just about everything Eugene writes about AI/ML is pure gold.

hackernewds 1 year ago | |

haha this is some solid astroturfing Eugene :)

7d7n 1 year ago | | |

haha that wasn't me ;)