Elasticsearch node crashes can cause data loss

Elasticsearch node crashes can cause data loss (github.com)

112 points by felipehummel 11 years ago | 50 comments

rdtsc 11 years ago |

Mandatory reading -- Last year's Call Me Maybe: Elasticsearch

https://aphyr.com/posts/317-call-me-maybe-elasticsearch

I've been hearing a lot of people talk about Elasticsearch lately. I get the same gut feeling I was getting about MongoDB back during the "Webscale" days.

bkeroack 11 years ago | |

In my experience, Elasticsearch is the single most common source of infrastructure downtime and service failure. It's basically my arch nemesis.

willejs 11 years ago | | |

I am interested to hear a bit more about this, as I find it hard to believe. I have only run it at a pretty small scale: 8 servers, around 300 million documents indexed a day, peak index rate 30k docs/sec. I found that you have to monitor it correctly, tune the JVM slightly (mostly GC), give it fast disks, lots of RAM, and the correct architecture (search, index & data nodes) to get the most out of it. Once I did that it was one of the most reliable components of my infrastructure, and still is. I would recommend chatting to people on the Elasticsearch IRC channel or mailing list; everyone was a great help to me there.

riceo100 11 years ago | | |

Same here. A single node failure has led to the whole cluster crashing down around me on more than one occasion.

AnkhMorporkian 11 years ago | | |

Really? Perhaps I was never running it at a large enough scale, but even pre-v1.0 I've basically never had any trouble with it (outside of operational concerns like occasionally confusing query syntax). Then again, I never had more than 11 servers in the cluster, so I may just have never run into problems at scale.

flippyhead 11 years ago | | |

While I don't necessarily disagree, I do find that this depends entirely on how ES is used. All too often people dive headfirst into using Elasticsearch in ways it really should not be used.

lobster_johnson 11 years ago | | |

It can't be worse than RabbitMQ... can it?

thejosh 11 years ago | |

I use ES only for search (indexes from a DB), so losing data isn't a massive drama. It's great for my use case.

rdtsc 11 years ago | | |

That sounds like the intended use. I should qualify my comment: I heard it advocated as a primary data store.

digitalzombie 11 years ago | |

Elasticsearch is just a text search engine based on Lucene. You use ES, Solr, or the Lucene library directly if you want fuzzy search and such.

You really want to use it in tandem with a storage DB such as PostgreSQL, Cassandra, or MongoDB, where ES or any Lucene-based indexer/DB would be used for text searching.

I personally like PostgreSQL and Cassandra, and would use either in tandem with ES. Solr, last I checked, was a bit complicated to cluster.

threeseed 11 years ago | | |

Agreed. Cassandra is especially nice if you have the DataStax Enterprise version which allows for seamless integration between the two.

m-i-l 11 years ago | | |

> Solr, last I checked, was a bit complicated to cluster

SolrCloud, with ZooKeeper, is relatively new and not too difficult to set up.

PhilipA 11 years ago | | |

What about storing data for analytics? Wouldn't it be better to use ES than Postgres for that?

tedchs 11 years ago |

The advice I've heard from serious people using Elasticsearch for serious things is that you should definitely not use Elasticsearch as a primary data store (i.e. it should be treated as a cache).

lobster_johnson 11 years ago | |

This is true. On the other hand, even a secondary data store that's considered "lossy" poses a challenge — how do you know if its integrity has been compromised?

In other words, if you're firehosing your primary data store into Elasticsearch, you'll want to know whether it's got all the data you pushed to it at any given time.

I suppose you could use some kind of heuristic to detect this, like posting a "checksum" document occasionally that contains the indexing state and thus acts as a canary that lets you detect loss. On the other hand, this document would be sharded, so you'd want one such document per shard. Is this a solved problem?

fizx 11 years ago | | |

A logical "SELECT COUNT(*) WHERE updated_at < now()" is probably reasonably fast on both your primary store and Elasticsearch.
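
A minimal sketch of that count comparison, under loud assumptions: the primary store is stood in for by an in-memory SQLite table named `docs`, the search index by a plain dict, and every name here (`build_primary`, `index_count`, `missing_from_index`, the `updated_at` column) is hypothetical. A real setup would replace `index_count` with a count query against the actual index. Both sides are compared against the same fixed cutoff rather than each evaluating now() separately.

```python
import sqlite3


def build_primary(rows):
    """Create a stand-in primary store: (id, updated_at) rows in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, updated_at INTEGER)")
    conn.executemany("INSERT INTO docs VALUES (?, ?)", rows)
    return conn


def primary_count(conn, cutoff):
    """Count primary-store rows last updated before the cutoff."""
    row = conn.execute(
        "SELECT COUNT(*) FROM docs WHERE updated_at < ?", (cutoff,)
    ).fetchone()
    return row[0]


def index_count(index, cutoff):
    """Stand-in for a count query against the search index (here: a dict)."""
    return sum(1 for doc in index.values() if doc["updated_at"] < cutoff)


def missing_from_index(conn, index, cutoff):
    """Positive result: the index holds fewer documents than the primary store."""
    return primary_count(conn, cutoff) - index_count(index, cutoff)
```

Matching counts only show that nothing is obviously missing; they would not catch a document that is present on both sides but stale in the index.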

po 11 years ago | |

It is often advocated as a datastore for logging data... which means (in that case) it's usually the primary datastore but perhaps not mission-critical.

alrs 11 years ago | | |

It's a great index for log data.

Spew your log data into a standard syslog server, while also pumping it into Logstash.

Using Elasticsearch as your canonical log storage would be ridiculous.

rodgerd 11 years ago | | |

Once you start relying on it to understand the state of whatever it's logging, it's mission-critical.

tomjen3 11 years ago | |

It would probably be good enough as a store for A/B testing information - losing data here isn't critical but writing speed is.

eclark 11 years ago |

A program crash will not affect data being written to disk if that data has already reached the FS cache (that is, it was not sitting in std::ostream::write or some other user-space buffer). Dirty pages will eventually be written to disk even if the process dies uncleanly. Only something that keeps the kernel from flushing to disk can stop the page from eventually being written out (driver bug, kernel bug, hardware failure, power failure).

From reading the code in Jepsen it looks like kill -9 is all that's being used to start failures. So there's a real bug here: https://github.com/aphyr/jepsen/blob/master/elasticsearch/sr...
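
The distinction above between a user-space buffer and the FS cache can be sketched roughly as follows. This is an illustration, not the Jepsen test: it assumes a POSIX system, and the helper name `surviving_bytes` and the child script are made up for the example. A child process writes five bytes either through Python's buffered file object (user space) or through `os.write` (straight to the kernel), then kills itself with SIGKILL so no cleanup or flush can run.

```python
import os
import subprocess
import sys
import tempfile

# Child script: write 5 bytes, then die the way `kill -9` would kill it.
CHILD = r"""
import os, signal, sys
path, mode = sys.argv[1], sys.argv[2]
if mode == "buffered":
    f = open(path, "wb")       # buffered writer: data sits in process memory
    f.write(b"hello")          # 5 bytes, well below the buffer size, not flushed
else:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.write(fd, b"hello")     # write(2): data is handed to the kernel
os.kill(os.getpid(), signal.SIGKILL)  # no atexit handlers, no flush
"""


def surviving_bytes(mode):
    """Run the child in the given mode and return what ended up in the file."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "out.bin")
        subprocess.run([sys.executable, "-c", CHILD, path, mode])
        with open(path, "rb") as f:
            return f.read()
```

In the buffered case the bytes die with the process; in the other case they are visible to a later reader even though the process never closed the file. Neither case says anything about surviving a power failure, which would need an fsync.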

rdtsc 11 years ago | |

I think Kyle was just going by the documentation. And that is often what he tests -- how does the reality compare to the claims in the documentation and marketing.

So given these claims:

> Per-Operation Persistence. Elasticsearch puts your data safety first. Document changes are recorded in transaction logs on multiple nodes in the cluster to minimize the chance of any data loss.

One would hope they at least flushed the user space buffers.

klapinat0r 11 years ago |