Ask HN: Why aren't there Open source embedding models with context length > 512?

3 points by rawsh 2 years ago | 2 comments

There are, as mentioned, but additionally, for many models you can split the content into several chunks (say, one per sentence or paragraph, depending on how the model was trained), embed each chunk separately, and pool the resulting vectors into a single representation that covers the whole content reasonably well.

Since models trained on single sentences (like Mini-V2, the SBERT default) work worse on longer inputs, pooling sentence-level representations is typically more useful than embedding the whole text in one pass.
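
To make the pooling idea concrete, here's a minimal sketch in plain Python. It assumes mean pooling followed by L2 normalization, which is only one reasonable choice (max pooling or length-weighted averaging are others). The names `split_sentences`, `mean_pool`, `embed_long_text` and `toy_embed` are made up for illustration, and `toy_embed` is a deterministic placeholder, not a real model; you'd swap in your actual sentence encoder.

```python
import math
import re
from typing import Callable, List, Sequence

Vector = List[float]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average equal-length vectors element-wise, then L2-normalize."""
    if not vectors:
        raise ValueError("need at least one vector to pool")
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [x / norm for x in mean] if norm > 0 else mean


def embed_long_text(text: str, embed: Callable[[str], Vector]) -> Vector:
    """Embed each sentence separately and pool into one document vector."""
    return mean_pool([embed(s) for s in split_sentences(text)])


def toy_embed(sentence: str, dim: int = 8) -> Vector:
    """Placeholder encoder (NOT a real model): counts words per bucket."""
    vec = [0.0] * dim
    for word in re.findall(r"\w+", sentence.lower()):
        vec[sum(ord(c) for c in word) % dim] += 1.0
    return vec


doc = "Short models truncate long inputs. Pooling works around that! Does it help?"
doc_vector = embed_long_text(doc, toy_embed)
print(len(doc_vector))
```

With the sentence-transformers package you could, as far as I know, pass something like `lambda s: model.encode(s).tolist()` as `embed`, where `model` is a loaded `SentenceTransformer`. The regex splitter is deliberately crude; a proper sentence tokenizer would handle abbreviations and the like better.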

If you deliberately want a representation of longer text, embeddings from a generative model or dedicated document embeddings are sometimes the right answer.

caprock 2 years ago

There are some:

https://huggingface.co/spaces/mteb/leaderboard