Improving Text Embeddings with Large Language Models

Improving Text Embeddings with Large Language Models (arxiv.org)

48 points by cmcollier 2 years ago | 6 comments

binarymax 2 years ago |

Interesting, but this aspect makes me double-take: "We demonstrate that Mistral-7B, when fine-tuned solely on synthetic data, attains competitive performance on the BEIR [40] and MTEB [27] benchmarks."

E5/BGE large are an order of magnitude smaller than Mistral-7B. So is this just "bigger model wins" in disguise?

I need to read the whole paper carefully, but this jumped out at me.

huac 2 years ago | |

Agreed. This is a nice example of generating synthetic data, and I believe the synthetic data helps produce useful embeddings for RAG. But not including an ablation with a fine-tuned E5 or another commonly used embedding model (to control for the 'bigger model wins' effect) is a glaring omission. This paper shares many authors with the E5 paper, so why did they not compare on a fair basis?

pama 2 years ago | | |

I thought the main point was that this is a very fast way (in terms of wall-clock time) to beat the state of the art, not a fair comparison by size; if one made E5 bigger, then E5 would be even slower to train.

nalzok 2 years ago |

> Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

I'm surprised they didn't put `Machine Learning (cs.LG)` and `Machine Learning (stat.ML)`.

3abiton 2 years ago |

I am confused: aren't LLMs already embeddings of text?

jerpint 2 years ago | |

Yes, but they are not explicitly trained to give semantically similar texts similar embeddings, only to do next-token prediction. Embedding models are trained with a contrastive loss that minimizes the distance between pairs of semantically similar content and maximizes the distance to all other embeddings.
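
To make the contrastive objective concrete, here is a minimal sketch in plain Python. It assumes an InfoNCE-style loss with in-batch negatives and cosine similarity, which is one common choice and not necessarily the exact setup in the paper; the function names, the toy vectors, and the temperature of 0.05 are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    passages[i] is the positive for queries[i]; every other passage in the
    batch acts as a negative. The temperature is an assumed hyperparameter.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching passage.
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy 2-d "embeddings": in `aligned`, each query is close to its own passage;
# in `swapped`, each query is close to the other query's passage.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each query sits closest to its own passage and large when it sits closer to someone else's, so gradient descent on it pulls matching pairs together and pushes the rest apart. Next-token prediction never applies that pressure to a pooled sentence vector directly.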