Is My Approach to Vectorizing and Storing 1.5 Trillion Tokens Reasonable?

I'm planning to index and store 1.5 trillion tokens using Faiss and would love some feedback on my approach:

1. Partitioning: I'm thinking of using distributed k-means and inverted multi-index quantizers for efficient data partitioning.
2. On-Disk Storage: Due to the scale, I'm storing everything on disk using a Compressed Sparse Row format.
3. Distributed Search: I plan to implement a client-server model with multiple servers to handle search operations.

Does this approach sound feasible, or am I overlooking something crucial? Any advice or suggestions?

I'm mostly working off of this article: Indexing 1T Vectors (https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors). I think the data is too big for AutoFaiss, but I can use that for experiments.
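To make items 1 and 2 concrete, here is a minimal pure-Python sketch of how partition assignments could be packed into a Compressed Sparse Row style layout (one offsets array plus one flat ids array), along with a rough size estimate. This is not Faiss code: `build_csr`, `list_ids`, and `estimate_index_bytes` are hypothetical helpers, and the tokens-per-vector and bytes-per-code figures are placeholder assumptions, not measured values.

```python
from array import array


def build_csr(assignments, n_lists):
    """Group vector ids by partition into CSR-style (offsets, ids) arrays.

    assignments[i] is the partition (inverted list) chosen for vector i,
    e.g. the nearest k-means centroid.
    """
    counts = [0] * n_lists
    for part in assignments:
        counts[part] += 1

    # offsets[p] .. offsets[p + 1] delimits partition p's slice of `ids`.
    offsets = array("q", [0] * (n_lists + 1))
    for p in range(n_lists):
        offsets[p + 1] = offsets[p] + counts[p]

    # Fill each partition's slice in order of increasing vector id.
    ids = array("q", [0] * len(assignments))
    cursor = list(offsets[:n_lists])
    for vec_id, part in enumerate(assignments):
        ids[cursor[part]] = vec_id
        cursor[part] += 1
    return offsets, ids


def list_ids(offsets, ids, part):
    """Return the vector ids stored in one partition."""
    return ids[offsets[part]:offsets[part + 1]]


def estimate_index_bytes(n_tokens, tokens_per_vector, code_bytes, id_bytes=8):
    """Rough size of compressed codes plus ids, ignoring centroids and overhead."""
    n_vectors = n_tokens // tokens_per_vector
    return n_vectors * (code_bytes + id_bytes)


# Toy example: 5 vectors assigned to 3 partitions.
offsets, ids = build_csr([2, 0, 2, 1, 0], n_lists=3)
print(list(offsets))                  # [0, 2, 3, 5]
print(list(list_ids(offsets, ids, 2)))  # [0, 2]

# Assumed numbers: one vector per 512 tokens, 64-byte codes, 8-byte ids.
print(estimate_index_bytes(1_500_000_000_000, 512, 64))  # 210937500000
```

Under those assumed numbers the codes and ids come to roughly 211 GB. With one vector per token instead, the same formula gives about 108 TB, so the number of tokens per vector is probably the first thing to pin down before choosing a partitioning scheme.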