My laptop is faster than your Elastic cluster (haybatov.com)
I remember a case where we needed to pull all incoming IP addresses for a given 24-hour period for certain common user-facing queries, and the specialized FPGA-driven scalable database appliance was taking minutes to return those queries. So instead we ran the query once a day, flipped bits in a big bit array (the full IPv4 space is 2^32 bits, i.e. 512 MiB), and let each array act as a standing query. A query for a certain day just mapped to a specific file, and some simple math would answer the query: 0 for IP never seen, 1 for IP seen. It was nearly instant.
It transformed a certain team's workflow from asking a few times a day for yesterday's IP visits to asking many times a day, and then they started asking for the data over a period of months to look for patterns. Queries against this structure for the same thing over a year took less than a second. It all ran on a 4 GB VM with a few GB of spinning-rust storage. Pretty soon this was built into a bunch of custom tools across a bunch of teams and was serving the data tens of thousands of times per day; then we started enriching it with geodata, reputation data, and other things. It also reduced the query load on the main DB substantially.
Very simple engineering, transformative, and all it required was trying to find a way to scale deeper instead of building a bigger multimillion-dollar DB.
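For anyone who wants to try the bit-array idea described above, here is a minimal Python sketch. It is not the original implementation: the function names, the one-file-per-day layout, and the bit ordering within a byte are all assumptions made for illustration. It relies on the file being sparse, which most Linux filesystems support.

```python
import ipaddress
import mmap

BITMAP_BYTES = 2**32 // 8  # one bit per IPv4 address = 512 MiB


def bit_position(ip):
    """Return (byte offset, bit mask) for an IPv4 address string."""
    n = int(ipaddress.IPv4Address(ip))
    return n >> 3, 1 << (n & 7)


def build_day_file(path, ips):
    """Create the bitmap file for one day and set one bit per seen IP."""
    with open(path, "wb") as f:
        f.truncate(BITMAP_BYTES)  # hypothetical layout: one 512 MiB file per day
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), BITMAP_BYTES) as m:
        for ip in ips:
            offset, mask = bit_position(ip)
            m[offset] |= mask


def seen(path, ip):
    """Answer 'was this IP seen that day?' with a single one-byte read."""
    offset, mask = bit_position(ip)
    with open(path, "rb") as f:
        f.seek(offset)
        return bool(f.read(1)[0] & mask)
```

A query over a longer period is then just a loop over the daily files, something like `any(seen(p, ip) for p in day_paths)`, which stays cheap because each lookup touches one byte per file.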
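Elasticsearch's guidance is to give the JVM heap about 50% of system memory, and the heap should stay under roughly 32 GiB so the JVM can keep using compressed object pointers. So realistically the sweet spot for VMs is 64 GiB of memory per node.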
I wrote a lot of these best practices as notes years ago: https://dijit.svbtle.com/elasticsearch-notes
Then I found projects like Meilisearch and Typesense (which are in the same class of search software as Algolia) and they're so much faster than a standalone Elastic instance.
Again, different space from Elastic, but I'm not sure if most people truly need Elastic's scale.
Of course, if you need advanced search features, APIs for indexing and searching (Kibana), or some fault tolerance, don't reinvent the wheel.
So, I wouldn't use this in any production (i.e. non-single-instance) setup, but it is a useful reminder that modern hardware and software are powerful enough to do things that would intuitively seem resource-intensive.
Scalability! But at what COST?
https://www.usenix.org/system/files/conference/hotos15/hotos...
We offer a new metric for big data platforms, COST, or the Configuration that Outperforms a Single Thread. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation.
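To make the definition concrete, here is a rough Python sketch of how one might compute COST from benchmark results. It simplifies "hardware configuration" to a core count, and the function name and all timings are made up for illustration; none of them come from the paper.

```python
def cost(single_thread_seconds, platform_runs):
    """Smallest core count at which the platform beats the single-threaded
    baseline, or None if it never does (an unbounded COST).

    platform_runs maps core count -> measured runtime in seconds.
    """
    winners = [
        cores
        for cores, seconds in platform_runs.items()
        if seconds < single_thread_seconds
    ]
    return min(winners) if winners else None


# Hypothetical numbers, for illustration only.
baseline = 300.0
scalable_platform = {1: 2000.0, 16: 700.0, 64: 350.0, 128: 250.0}
slow_platform = {16: 900.0, 64: 600.0, 128: 450.0}

print(cost(baseline, scalable_platform))  # 128
print(cost(baseline, slow_platform))      # None
```

With these invented numbers the first platform scales well (8x faster going from 1 to 128 cores) and still needs 128 cores to beat one thread, which is the point of the metric.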