Elasticsearch 1.0.0 released (elasticsearch.org)
The Rails support for it is amazing too. The guy creating the Rails integration lib is really talented and active.
I can't recommend qbox.io enough! Point-and-click scaling of managed elasticsearch clusters + Kibana == bliss.
The one drawback ES had in the bad old days was that backup and restore was a nightmare... ESPECIALLY on AWS. The new system they introduced was so simple I was concerned about updating to it because I was SURE something would go south.
But it all just worked.
I still have the Couch to ES replication running because I'm anal like that... but really... yeah... you can do without Couchbase, Mongo et al... ES will probably do everything you need PLUS everything you can't do in the others.
Also, I have the exact opposite nitpick. People want to use it to do everything: mail indexers, file system indexers. What's the matter with web developer folks? Why is it that when the next database comes around they want to use it for everything?
I help run a large ES cluster (with canonical data in MySQL), and I consider this cautious attitude by the ES developers to be a good thing.
Elasticsearch is brilliant as a NoSQL store - and if you were already using Elasticsearch as a search system, you don't need to introduce yet another component into your stack.
Other than that (which is just performance tuning, really), ES matches mongodb feature for feature, and obviously has a lot of extra power from its search heritage such as facets and percolate.
So I can't actually think of any limitations, and it's why I said ES makes a better MongoDB than MongoDB.
You need to have a good understanding of how tokenizers and analyzers work to be able to create good results for your data. I have difficulties matching documents with the exact title being searched for. On MongoDB that just works, on ElasticSearch you need to configure it.
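For what it's worth, here is a rough sketch of the kind of configuration that exact-title matching usually needs, assuming ES 1.0-style mappings: the title is indexed twice, once analyzed for full-text search and once as a `not_analyzed` sub-field for exact lookups. The type and field names (`article`, `title`, `raw`) are made up for illustration, and the snippet only builds the JSON bodies; it does not talk to a cluster.

```python
import json

# Hypothetical type/field names; an ES 1.0-style multi-field mapping.
mapping = {
    "article": {
        "properties": {
            "title": {
                "type": "string",  # analyzed: tokenized for full-text search
                "fields": {
                    "raw": {
                        "type": "string",
                        "index": "not_analyzed",  # kept as one exact term
                    }
                },
            }
        }
    }
}


def exact_title_query(title):
    """Build a term query against the un-analyzed sub-field."""
    return {"query": {"term": {"title.raw": title}}}


print(json.dumps(mapping, indent=2))
print(json.dumps(exact_title_query("Elasticsearch 1.0.0 released")))
```

A term query on the raw sub-field is case- and whitespace-sensitive, which is usually what "exact title" means; anything fuzzier goes against the analyzed `title` field.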
ElasticSearch has some advantages and MongoDB others. I think they are great together. One for storage and the other for searching.
http://docs.mongodb.org/manual/applications/geospatial-index...
You create a fixed number of shards for each index (database), and you can't expand that number later.
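A toy illustration of why the shard count is fixed, assuming documents are placed by hashing an id modulo the number of primary shards. The `crc32` hash here is a stand-in, not the function Elasticsearch actually uses, and the document ids are invented.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Toy routing: stable hash of the id, modulo the primary shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


doc_ids = ["doc-%d" % i for i in range(1000)]

# Where each document lives with 5 shards vs. where it would be looked up with 6.
before = {d: shard_for(d, 5) for d in doc_ids}
after = {d: shard_for(d, 6) for d in doc_ids}

moved = sum(1 for d in doc_ids if before[d] != after[d])
print("%d of %d documents would land on a different shard" % (moved, len(doc_ids)))
```

With modulo placement, changing the divisor sends most documents to a different shard, so growing the shard count in place would mean moving or reindexing nearly everything.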
Its search capabilities and scalability are fantastic - we're throwing GBs of data into it weekly and it just soaks it up.
That said, it's definitely worth looking into both, depending on what your needs are.
(IMHO) Unfortunately, for most people, old habits die hard. Still, a nice project and a great release.
A few days before launch, things were not looking good. As admins manipulated articles in preparation for the launch, the servers kept crashing.
In a time-constrained major launch like this, a lot of nasty little hacks build up in the codebase. Our search system for admins was a complete mess. It was a custom solution that worked fine when admins managed a handful of database records, but now that they were managing thousands of articles, it was not scaling at all.
At the 11th hour, we dropped elasticsearch into our infrastructure. It worked like a charm. The servers stopped crapping out, and we launched on time.
Elasticsearch mostly "just works", and we didn't have to worry about complex schema definitions, working with giant complex XML files (hello Solr), or build anything on top to interface between the index and the queries themselves (Lucene). Thanks elasticsearch, you saved us!
1) Dremel clones [2] like Impala & Presto (for near real-time, ad hoc analytic queries over large datasets)
2) Lambda Architecture [3] systems (where queries are known up front, but need to run against a large dataset)
Does anyone here have experience with ES in such use cases, beyond the free-text searching ES is well known for?
[1]: https://groups.google.com/forum/#!topic/elasticsearch/iTy9IY...
[2]: http://static.googleusercontent.com/media/research.google.co...
[3]: http://jameskinley.tumblr.com/post/37398560534/the-lambda-ar...
Pick your favourite users group here: http://elasticsearch.meetup.com/
Full disclosure: I started and run the Berlin UG. We set ourselves apart by always providing a small introduction into ES for those that are completely new and would have a hard time following the main talk.
I don't see many tutorials covering usage of ES here: http://www.elasticsearch.org/tutorials/
Could you maybe provide a link to yours?
Yep, tutorials are a huge problem, but there are people working on that.
Otherwise, we love ES. The other comment about it being a better Mongo than Mongo rings true. With the backup/restore API and some of the circuit breakers, I'm hopeful that my fears will be abated.
Another example is disk-based doc values [2], which are essentially pre-computed field data structures that are stored on disk. This moves field data off heap and allows the OS to manage memory evictions, to help minimize GCs and OOM blowouts.
[1] http://www.elasticsearch.org/guide/en/elasticsearch/referenc...
[2] http://www.elasticsearch.org/blog/disk-based-field-data-a-k-...
Having supported Solr/ES/Lucene in production for 4+ years now (websolr.com / bonsai.io) I would be pretty hesitant to trust Lucene in general as a primary data store. Beautiful for secondary indexing, but otherwise, Why Not Postgres?™ ;)
EDIT: I don't think MongoDB is there yet either. There are definite benefits and drawbacks between Postgres and ES, tipping heavily towards Postgres for structured heavy write data. But for ES and MongoDB? I think MongoDB falls a bit short there.
“Geo queries used to use miles as the default unit. And we all know what happened at NASA because of that decision. The new default unit is meters.”
I like this release already. It's these little details I love, when a project actually cares about operations and not just "well here's the API"
I've been using ElasticSearch only for Logstash, but I've been blown away so far at how easy it is to deal with.
Congrats to the ElasticSearch team, and all the supporters around it. Once I get back into more of a coding role, I'll definitely be contributing back to the ES project.
There are other reasons, but that is like 90% of GC issues. To solve it, you need to make sure your faceted fields are configured well (usually not_analyzed) and assess how much memory is available. You may be able to index and even full-text search ten billion docs on a single machine, but faceting it may just be too much to ask for a single node.
Omitting norms, disabling bloom filters on old indices and enabling doc values are other ways to help alleviate field-data pressure.
Other GC culprits can be: bulk requests that are too large, unbounded threadpool queues, or something like parent/child/scripts/filter cache keys eating all your memory. Also don't go above 30 GB heaps, the JVM becomes unhappy :)
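As a hedged sketch of the "configure your faceted fields well" advice: a field that is only used for counting values is mapped as `not_analyzed`, and the counting itself is a terms aggregation. The type and field names (`log_event`, `host`) are invented, and this only builds request bodies rather than sending them.

```python
import json

# Hypothetical mapping: a field used only for faceting is kept as a single term.
mapping = {
    "log_event": {
        "properties": {
            "host": {"type": "string", "index": "not_analyzed"},
            "message": {"type": "string"},  # analyzed, for full-text search
        }
    }
}


def top_values(field, size=10):
    """Build a terms aggregation that counts the most common values of a field."""
    return {
        "size": 0,  # only the counts are wanted, not the hits
        "aggs": {"top_values": {"terms": {"field": field, "size": size}}},
    }


print(json.dumps(mapping, indent=2))
print(json.dumps(top_values("host"), indent=2))
```

Faceting on an analyzed field loads every token of every document into field data, which is typically where the memory goes; a `not_analyzed` field contributes one term per value.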
What I'm doing is slamming the full text output of OCRed PDFs into a MyISAM table, the entire document in a text field.
What I'm afraid I'm not doing right is creating the web interface to search elasticsearch. I'm using filters with the query string syntax [1] in the search box, pointing directly at that fulltext column. I'm also using the highlight functionality so that I can specify how many highlight blurbs to return with the result. The query string syntax works great with the OCR'd text, because most of it is near-garbage (as most OCR is), so you can search for something like "net sales"~50 to find those two terms within 50 words of each other. I think the results were something like:
net sales: 15,000 results
"net sales": 120 results
"net sales"~50: 550 results
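For comparison, a minimal sketch of what such a search box might send: the user's text goes into a `query_string` query against one field, with highlighting limited to a set number of fragments. The field name `fulltext` and the fragment count are assumptions based on the description above, and it is written as a plain query rather than a filter.

```python
import json


def build_search(user_query, field="fulltext", fragments=3):
    """Build a search body: query_string on one field plus highlight blurbs."""
    return {
        "query": {
            "query_string": {
                "default_field": field,
                "query": user_query,  # passed through, so "..."~50 works
            }
        },
        "highlight": {
            "fields": {field: {"number_of_fragments": fragments}}
        },
    }


body = build_search('"net sales"~50')
print(json.dumps(body, indent=2))
```

Passing raw user input to `query_string` means a syntax error in the box becomes a failed request, so the web layer probably wants to catch that and show a friendly message.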
Can anyone point me at a good web based search implementation using elasticsearch that explains how they're doing it?
What I have works pretty well, I just want to... check my work, I guess.
[1]: http://www.elasticsearch.org/guide/en/elasticsearch/referenc...
The main thing for good stability and performance is to be very good at batching your updates. You don't want to sling a ton of highly-parallel single-document updates at Lucene, lest you thrash the JVM and start garbage collecting like crazy.
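A small sketch of the batching idea, assuming the newline-delimited body format of the bulk API (one action line followed by one source line per document). The index and type names and the batch size of 1000 are arbitrary; nothing is sent over the network here.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_body(docs, index, doc_type):
    """Serialize (id, source) pairs into a newline-delimited bulk body."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline


docs = [(str(i), {"title": "doc %d" % i}) for i in range(2500)]
payloads = [bulk_body(batch, "articles", "article") for batch in chunked(docs, 1000)]
print(len(payloads), "bulk requests instead of", len(docs), "single-document requests")
```

The right batch size depends on document size and cluster capacity; the point is only that a few large requests are gentler than thousands of tiny parallel ones.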
From there, on the query side, you'll want to get a good working knowledge of the different tokenization and analysis options. There are a lot of subtle and interesting combinations to be had in there that influence performance and relevance of your search results.
ES adds distribution (multimaster-replicated cluster of nodes connected via a gossip protocol), sharding, defines a document model and schema (the mapping of arbitrary JSON documents to index structures), faceting, aggregation (ie., roll-up-type calculations), various types of scoring (eg., geographic distance), ETL ("rivers"), backup/restore, performance metrics, a plugin system (eg., for indexing different file formats) and a bunch of other things -- and of course a REST-based API on top of the whole thing.
The github lays it out well.
What makes it so special to have hundreds of votes and tweets all around within 2 hours?
I don't understand. A DB engine engineer.
1. It handles human-written language. Any language. The same technology that lets it handle strings written in human language provides a lot of flexibility in handling strings in other applications, particularly logs.
2. It also handles non-string data (numbers, dates, geo) quickly and cleanly.
3. Lucene has an inverted index that has been optimized over many years. ES scales that pretty seamlessly across many servers. All decisions in the project seem to be made around whether a feature can scale to 100s of nodes.
The devs have also been really smart to focus on the "out of box experience". Very well thought out defaults.
More on our experience with ES at scale: http://gibrown.wordpress.com/2014/01/09/scaling-elasticsearc...
https://lucene.apache.org/core/
"index size roughly 20-30% the size of text indexed"
That seems excessive for an index.
and many other things
"Sales data search: Writing a query parser / AST using pyparsing + elasticsearch"
Part 1: http://blog.close.io/sales-data-search-writing-a-query-parse...
Part 2: http://blog.close.io/sales-data-search-writing-a-query-parse...
Lucene is one of those projects which hardly has any real competition. That's surprising given how many real world software projects have a search requirement. While Lucene is excellent, it's not without flaws and competition is always great.
Solr, ElasticSearch, etc. are mostly concerned about the index/search features, and they do quite a good job there. But this still leaves a huge amount of space for commercial offerings, as core search is only a part of the problem. I'm thinking about connectivity with complex enterprise systems, support for the specific security models of those systems, integration in other systems, etc. Believe me, those problems are not easy to solve.
So, even if we have an index that can most probably match Lucene feature for feature, and quite a lot of things besides, we typically won't go after deals where simple search is the only requirement. Instead we focus on larger deals with more complex requirements. And we're doing quite well, thank you :)
https://groups.google.com/d/msg/elasticsearch/Rb7Lei4gaaE/7I...
Disclaimer: I’m the founder of a hosted Search As A Service and we use ES in a few critical parts of our infrastructure.
At my last place of work, ES was beautiful and required little work to get a very fast, workable search in place.
For our current project we went with ElasticSearch and we're quite happy. One of the contributing factors was that one of our most experienced guys was unable to get the damn thing installed, even with the help of one Endeca consultant.
If you were using Solr, there are a few operational modes to run in: config file based or SolrCloud [0]. The latter is more akin to ES in terms of cluster management.
I agree, though: from a simplicity-of-deployment perspective at scale, ES has a much gentler learning curve.
[0] https://cwiki.apache.org/confluence/display/solr/SolrCloud
`java -jar elasticsearch.jar` does a better job and that's basically all it takes. I'm planning to switch as soon as https://github.com/elasticsearch/elasticsearch/issues/256 lands.
Do you mean that you use ES to index your own documents on the backend and make them available on the web? Or do you mean that you use ES to index documents available on the web and let people search for them?
I fetch a steady stream of FOIA documents, close to the maximum possible each week, and PDF/OCR them. I expose a web interface to the analysts I work with, to help them gather up documents for further analysis.
The second guess would probably be more interesting to most people.
Also, Lucene at its core is an Index. Changing the query strategy might require reindexing. It is perfectly valid to throw data at it, build the index and throw away the source. You will just never get it back again.
While ES can be used and tuned as a store just fine, it is not necessarily its raison d'etre.
Personally, I think of disk space as cheap, and am far more concerned with having options to improve speed and quality of search results.
- Pause indexing
- Issue a flush request
- Rsync data directories somewhere
- Resume indexing
This is technically a very naive approach, since a simple rsync of the data dirs will include replicas too. If you were more diligent you could check the state files in each shard directory and only copy out the primaries.
You can just google "elasticsearch rsync" to get information, and even scripts, that will do this for you. The thing is... you REALLY need to know what you're doing when you go this route.
Also, you can try the gateway feature. Gateway is actually pretty straightforward. Restore WILL be slow though. And for many scenarios ... it is not ideal. (You don't want to take a day, or even a few, to restore after a failure.)
I think the best advice is...
Update to 1.0.
Just go to 1.0 and do snapshots... you will save yourself A LOT of headaches.
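For anyone wondering what "do snapshots" looks like, here is a rough sketch of the three HTTP calls involved: register a shared-filesystem repository, take a snapshot, restore it. The repository name, path and snapshot name are placeholders, and the function only returns the planned requests; sending them (and making the path reachable from every node) is left out.

```python
import json


def snapshot_plan(repo, location, snapshot):
    """Return (method, path, body) tuples for register, snapshot and restore."""
    return [
        # 1. Register a filesystem repository.
        ("PUT", "/_snapshot/%s" % repo,
         {"type": "fs", "settings": {"location": location}}),
        # 2. Take a snapshot and block until it finishes.
        ("PUT", "/_snapshot/%s/%s?wait_for_completion=true" % (repo, snapshot), None),
        # 3. Restore from that snapshot.
        ("POST", "/_snapshot/%s/%s/_restore" % (repo, snapshot), None),
    ]


plan = snapshot_plan("my_backup", "/mount/backups/my_backup", "snapshot_1")
for method, path, body in plan:
    print(method, path, json.dumps(body) if body is not None else "")
```

Compared with the pause/flush/rsync routine, there is no need to stop indexing or to work out which shard copies are primaries.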
Side note: Happy Found customer here...you guys have made it much easier to run our ES index!
The point of that section is exactly that "NoSQL" (or, to make things even more confusing, "NOSQL" (Not only SQL)) doesn't have a very specific meaning. Some think it rules out ACID, others don't. Thus, you'll need to know what you need.
And database marketing tends not to be very good at pointing out what a product is not good at, or at actually delivering what it promises. See also: http://aphyr.com/tags/jepsen
NoSQL was in large part about precisely what the name implies - giving up relational (SQL) data in exchange for better performance and the ability to have a distributed store. Yes, part of this is also about being willing to trade off consistency for availability. But Elasticsearch is an example of a NoSQL store which does focus on consistency (in this case at the expense of availability and, to some extent, partition tolerance).
Because they like a simple web stack. KISS means a faster time to market. Faster time to iterate. Faster time to fix bugs because there are fewer places those bugs can be. All of that doesn't even factor in the productivity benefits gained by not having to switch technologies from project to project.
But to be fair, ES is not some brand new database... ES has been around for a LONG time.
For example, Elasticsearch has poor availability characteristics - both because it is master-slave and because it focuses on ensuring consistency - relative to, for example, something like Riak.
There's no holy grail of data storage... ElasticSearch is really nice, and if it fits your needs, more power to you.
I have my doubts mongodb would scale up that well to 20+ servers without some maintenance as well. So I'm not sure how that's really a limitation anyone should use for choosing mongodb or ES. If you're expecting that kind of data, just make a large number of shards in your index creation as it will work fine on fewer servers too?
A larger number of shards = slower searching (unless you distribute the shards to multiple nodes).
The benefit of this is that as your app scales, you'll search only the shards needed. So if you have just 1 shard with the data, you can tell ElasticSearch to just search in that 1 shard.
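A short sketch of how "search only the shard needed" is usually expressed: the same `routing` value is passed when indexing and when searching, so the search can be sent to the one shard that value maps to. The index, type and routing values here are invented, and the helpers only build request paths.

```python
from urllib.parse import urlencode


def index_path(index, doc_type, doc_id, routing):
    """Path for indexing a document with an explicit routing value."""
    return "/%s/%s/%s?%s" % (index, doc_type, doc_id, urlencode({"routing": routing}))


def search_path(index, routing):
    """Path for a search restricted to the shard(s) for one routing value."""
    return "/%s/_search?%s" % (index, urlencode({"routing": routing}))


print(index_path("mail", "message", "1", "user42"))
print(search_path("mail", "user42"))
```

Routing only narrows which shards are consulted; the query itself still needs a filter on the user (or whatever the routing key is), since other keys can share the same shard.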
It's also possible to download and install ES locally and run any number of front-end interfaces, some of which include query builders. ElasticHQ seems like a decent option for that. The venerable Elasticsearch-head is another.
I think now that ES 1.0 has shipped, more experimental tools will start to emerge that help people learn and interact with ES itself. (If anyone out there is a front-end whiz and wants to help me build something like that, please email nz@bonsai.io!)
1. http://www.elasticsearch.org/guide/en/elasticsearch/referenc...
Internally it is still reindexing the entire document, but from your application's perspective, the Update API is a lot friendlier.
http://www.elasticsearch.org/guide/en/elasticsearch/referenc...
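To make the "friendlier, but still a full reindex" point concrete, here is a sketch with two parts: the partial-document request an application would send, and a toy model of what conceptually happens on the server (merge into the stored source, then index the whole merged document again). The merge function is an illustration, not Elasticsearch's actual code, and the names are made up.

```python
import copy


def update_request(index, doc_type, doc_id, partial):
    """Build a partial-document update: only the changed fields are sent."""
    return ("POST", "/%s/%s/%s/_update" % (index, doc_type, doc_id), {"doc": partial})


def apply_partial(stored, partial):
    """Toy server-side view: merge recursively, yielding the full doc to reindex."""
    merged = copy.deepcopy(stored)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_partial(merged[key], value)
        else:
            merged[key] = value
    return merged


stored = {"title": "ES 1.0", "stats": {"views": 10, "votes": 3}}
print(update_request("articles", "article", "1", {"stats": {"views": 11}}))
print(apply_partial(stored, {"stats": {"views": 11}}))
```

The application sends a few bytes, but the cost on the node is still that of indexing the complete document, so frequent small updates to large documents are not free.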
This is really important. Creating a proper search experience with auto-complete which works "just like you want" can be a very painful experience with ES, especially if you are new to ES. It bit me some time ago when I was trying to achieve just that.
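One common approach to auto-complete is edge n-grams: at index time each word is expanded into its prefixes, and at search time the typed text is looked up as-is. Elasticsearch offers an edge n-gram token filter for this; the sketch below is a pure-Python toy that shows the idea only, with invented titles, and is not how ES is configured.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the prefixes of a lowercased token, from min_gram to max_gram chars."""
    token = token.lower()
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def build_index(titles):
    """Index time: map every word prefix to the titles containing that word."""
    index = {}
    for title in titles:
        for word in title.split():
            for gram in edge_ngrams(word):
                index.setdefault(gram, set()).add(title)
    return index


def suggest(index, typed):
    """Search time: look up the typed text directly, without n-gramming it."""
    return sorted(index.get(typed.lower(), set()))


index = build_index(["Elasticsearch released", "Elastic clusters", "Kibana dashboards"])
print(suggest(index, "ela"))
```

The usual stumbling block is applying the n-gram analysis to the query as well, which makes a search for "ela" also match everything starting with "e" or "el"; the index side and the search side need different analysis.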
For example, Postgres lets you reason about integrity, atomicity and transactional boundaries, and whether things are really safely stored with synchronous replication. If Postgres returns after a commit, I trust it. However, that requires me to have two servers working, which is harder to keep highly available.
ZooKeeper, on the other hand, I can rely on being available. But that's not really something you want to be putting lots of load on, nor try to do anything but trivial "queries". And the more servers you add, the slower writes get.
I don't trust Elasticsearch enough for those tasks, yet I wouldn't want to do searches in Postgres (Yep, I'm familiar with tsearch) even though it can. Elasticsearch is simple to scale out and awesome for searching.
Logs and metrics we shove straight into Elasticsearch, however. Other things go from ZooKeeper to Postgres and then to Elasticsearch, or from just Postgres to Elasticsearch.
Separate tools for separate jobs. I'm one of the co-founders of www.found.no, one of the hosted Elasticsearch providers. We absolutely love Elasticsearch and find new use cases for it all the time, but it's not going to be the one store to rule them all, at least not very soon.
Indeed!
That said, it's great that more people are picking up Elasticsearch for new exciting things.
Elasticsearch has really pushed what constitutes a "search problem", and deserves lots of kudos for that! :)
But I personally suspect Lucene won't ever get away from the dreaded "just reindex." And to the larger point, I think the recent resurgence of interest in data stores and distributed systems has shown pretty clearly that there is no holy grail. No single data store can provide all the semantics necessary for all use cases. Maybe not even for most use cases. There are just too many tradeoffs to consider.
Believe me, I earn a living hosting Elasticsearch, so I'd love to see it become a robust primary data store. There are some use cases where it actually does make sense—just look at the amazing traction ES is experiencing for storing and indexing time-series data.
But as a general-purpose primary store, I'm not really holding my breath. Maybe I'm just becoming battle-worn and bitter. I would love to be proven otherwise over the next few years!
ES is suitable for full-text document indexing for enterprises or any website with a reasonable amount of data to be indexed in a given timeframe, where a complete re-indexing won't take a couple of days.
So the basic idea behind the NoSQL database is to dump the data into the database quickly and return, so you can see very fast response for insert and delete. Then it will load the data into the memory to process for real-time retrieval which also produces fast response from select. I'm not sure about update.
If the data volume grows, they quickly add shards or make the number of pre-allocated shards big enough to allocate enough memory resources to handle the queries, or let the OS swap memory by adding more server nodes.
So if you want to use a NoSQL database, you must accept its system requirements, make your application fit them, and take the most advantage of them. Otherwise, if you are storing highly structured data, it is better to use a relational database.
Another point: if the documents are collected from the web, as with a search engine, NoSQL will not fit the large volume of data, and a relational database is also used to store the indexed data for fast retrieval. I guess this is what you meant by "general-purpose primary store".
Correct me if I'm wrong.
Geohash Grid: http://www.elasticsearch.org/guide/en/elasticsearch/referenc...
Geodistance: http://www.elasticsearch.org/guide/en/elasticsearch/referenc...
http://www.elasticsearch.org/guide/en/elasticsearch/referenc...
I'm more interested in the second case, but I don't think ES fits due to the huge volume of data to be indexed.
[1]: http://www.christopherbiscardi.com/2014/02/07/geospatial-ind...
Edit: Apparently my Riak knowledge is dated now anyway. It looks like I have some research to do myself, but it's pretty exciting stuff.
http://www.slideshare.net/rklophaus/riak-search-erlang-facto...
> Format in [lon, lat], note, the order of lon/lat here in order to conform with GeoJSON.
... the data example below is not actually GeoJSON. See the spec:
Elasticsearch can map any kind of JSON, so you can, without problems, write a mapping for proper GeoJSON points (map "type" as an unanalyzed string, map "coordinates" as a geo_point). Arrays of values are generally supported in ES.
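A minimal sketch of that mapping, assuming ES 1.0-style syntax: the GeoJSON `type` member becomes a `not_analyzed` string and `coordinates` becomes a `geo_point`, which accepts the `[lon, lat]` array order. The type name `place`, the field name `location` and the sample coordinates are invented; the distance filter spells out its unit instead of relying on the default.

```python
import json

# Hypothetical mapping for a field holding a GeoJSON Point object.
mapping = {
    "place": {
        "properties": {
            "location": {
                "properties": {
                    "type": {"type": "string", "index": "not_analyzed"},
                    "coordinates": {"type": "geo_point"},
                }
            }
        }
    }
}

# A document whose "location" is a valid GeoJSON Point: [lon, lat] order.
doc = {"name": "Berlin UG", "location": {"type": "Point", "coordinates": [13.4, 52.5]}}


def within(distance, lon, lat):
    """Build a geo_distance filter around a point, with an explicit unit."""
    return {
        "query": {
            "filtered": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,  # e.g. "5km"
                        "location.coordinates": [lon, lat],
                    }
                }
            }
        }
    }


print(json.dumps(mapping, indent=2))
print(json.dumps(within("5km", 13.4, 52.5), indent=2))
```

This covers points only; polygons and other GeoJSON geometries would need a different field type and are where the query limitations start to matter.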
The biggest problem is that Elasticsearch probably does not provide all kinds of queries you'd like if you are working with complex shapes. Basically, only distance and simple location queries with polygons are supported.