BM25opt is a score-compatible, optimized rewrite of the popular rank_bm25 library ( https://github.com/dorianbrown/rank_bm25 ), which is used by LangChain and LlamaIndex, among others. It is much faster and also tries to fix some issues.
The goals of the rewrite are to:
- make it easier to use with an untokenized corpus and untokenized questions
- fix issues with tokenization ( e.g. https://github.com/dorianbrown/rank_bm25/issues/38 ); rank_bm25 also provides no default tokenizer, and a naive split on whitespace is the wrong choice
- considerably simplify the code (far fewer SLOC)
- point out the similarities between the algorithms, for educational purposes and further development
In practice, the differences from rank_bm25 are minimal (see Example 3: comparison with rank_bm25).
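
To illustrate the tokenization point above, here is a minimal sketch of why splitting on whitespace alone can miss matches. The helper names `whitespace_tokenize` and `simple_tokenize` are hypothetical and exist only for this illustration; they are not part of BM25opt or rank_bm25, and the regex tokenizer shown is just one simple alternative, not necessarily what BM25opt uses.

```python
import re


def whitespace_tokenize(text):
    # Naive approach: punctuation stays glued to the words.
    return text.split()


def simple_tokenize(text):
    # Illustrative alternative: lowercase, then keep runs of word characters.
    return re.findall(r"\w+", text.lower())


doc = "Hello, world! BM25 ranks documents."
query = "world"

print(whitespace_tokenize(doc))
print(simple_tokenize(doc))

# With whitespace splitting the token is "world!", so the query term is not found.
print(query in whitespace_tokenize(doc))
# With the regex tokenizer the punctuation is dropped and the term matches.
print(query in simple_tokenize(doc))
```

Because BM25 scores depend on exact token matches, a term that never matches contributes nothing to the score, which is why the choice of tokenizer matters.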