There's also Toshi: https://github.com/toshi-search/Toshi which is built on top of Tantivy: https://github.com/tantivy-search/tantivy
And for C++, there's Xapiland: https://github.com/Kronuz/Xapiand
And for Go, there's Blast: https://github.com/mosuka/blast built on Bleve: https://github.com/blevesearch/bleve
https://github.com/valeriansaliou/sonic/issues/52#issuecomme...
> Sonic only keeps the N most recently pushed results for a given word
This index discards old entries. This is fine for messages, in which aging items lose relevance. Yet the developer uses it for a help desk, which I think should give equal importance to all items.
In this area I would say the main comperitor is Groonga. It can be integrated into PostgreSQL with the PGroonga extension, and it indexes all of the data. However it consumes way more ram.
Glancing over this it looks like a nearly perfect fit for my use case I just wish I had seen this a couple of weeks earlier!
And if you don't know what you want exactly, you can't find anything.
I was looking for something just like this for a project in my team. We had been using this setup where a huge chunk of the data was being stored in triplicate: some of it in ES, some more of it in another database and finally the whole dataset in our data warehouse.
Hopefully I can use this to only provide the index + full text capability and just use the warehouse itself as the main db because the query performance is similar enough and the warehouse is criminally underused for what we pay for it.
Perhaps the upstream default is 1 GB but this was not the default, and there is much confusion on setting these correctly
https://stackoverflow.com/a/40333263
Not everyone has time to dig into it; of course if you did that’s good for you
If at any moment you feel lost, look through related issues and merge requests, as well as Git history. Perhaps you'll see how things get changed in the project, patterns intrinsic to it.
Also, keep in mind that you can always try to contact the community/author or invite someone to try figuring out your goal with you - once you collaborate, you'll get a solution tailored to the way you think. It does engage other people, but it also makes coding social and (at least to me) more satisfying.
Let me know what you think of this approach!
https://github.com/valeriansaliou/sonic/blob/master/PROTOCOL...
The recommendation is btw: at most half of the systems memory, no more than 31g (due to how the jvm compresses heap pointers), and keep at least as much memory available for disk buffers/caches as you allocate to the heap.
Not large by absolute standards sure, but large enough to cause issues.
I’m sure there’s some kind of solution that involves re-architecting the ES cluster and indices and re-architecting the data flows and stuff. But if our options are go through all that, or seriously slim down our architecture and costs by just running Sonic + our data warehouse, I’m definitely going to give it a go. After all, worst comes to worst we can go down the re-architecting ES route if Sonic doesn’t work out.
¯\_(ツ)_/¯
You can absolutely run 2 data/master-eligible nodes plus a single master-eligible tie-breaker node as a safe configuration. The only constraint is that you should have an uneven number of master-eligibile node to avoid a split brain. You also need to understand what the resilience guarantees are for any given number of replica (roughly: each replica allows for the loss of a single random node) and how many replica you can allocate on a given cluster (at most one per data node). That would allow you to run a 2-datanode cluster in a configuration that survives the loss of one node.
I pretty much specifically avoid 4 node clusters. You’d have to either designate 3 of the four nodes as master-eligibile with a quorum of 2 or have all of them master-eligible with a quorum of 3. Both options allow for failure of a single node before the cluster becomes unavailable. Any other configuration would either fail immediately on a node loss (quorum 4) or be unsafe (quorum of 2, allows for split-brain)
I’d much rather opt for 4 data/master eligible nodes plus a dedicated master eligible node with a quorum of 3.
You also need to pick the number of replica suitably: each replica allows for the loss of a single random(!) data node while retaining all your data. Note that if losses are not random but you want to safeguard against loss of a rack or an availability zone or such, configurations are possible that distribute primary and replica suitably (“keep a full copy on either side”)
I've been doing ES consulting and running clusters since 0.14 and I see very few clusters that run more than a single replica. Most 3 node clusters I see are run with three nodes because you can then have three identical configurations at the cost of throwing more hardware at the problem.
> but do you really want all your traffic to hit this one instance when the other one goes down
Whether that's a problem really really depends on whether your cluster is write-heavy or read heavy. Basically all ELK clusters are write heavy and in that case, loosing one of two nodes would also cut writes in half (due to the write amplification that replica cause). Other clusters just have replica for resilience and can survive the read load with half of the nodes available. Whether that is the case for your cluster(s) depends - that's why I explicitly asked what constraints the OP had.