Analytics 101: Choosing the right database(reflect.io) |
Analytics 101: Choosing the right database(reflect.io) |
I was excited when I saw 'choosing the right data model' as one of the rules, but they are talking about the data models the DB uses internally. The important data model is choosing how you model the data you have to analyze. I have my biases, but I'd argue that a dimensional model would be a good starting if we're really at a 101 level and extensions to the model are for future classes/development.
Starting with the end in mind is very important. When I look at this, I think in terms of data culture. Who needs to be able to do what in your organization? What types of question will you need to answer most often? What types of questions will you need to support in an ad-hoc manner?
To many organizations "analytics" means arithmetic, but with complex filtering logic and business logic, traditional BI, essentially. To others, "analytics" means R code monkeys. To others it may mean specifically visualizations and the presentation layer. There are many interpretations of the word. Regardless of the interpretation, process and culture are more important to understand before the technology.
For a rough analogy, it's like saying "Software development 101: Choosing the right programming language". Sure it matters, but knowing what your software needs to support and what the primary use cases are are more important to understand.
Ninja edit: Grammar.
What this has taught me over time is that the best systems, whether for data modeling and data storage, or for software, are systems that go to great lengths to ensure it is extremely cheap and easy to reconfigure, redesign, scrap everything and try something new, and adapt your designs to the real life pain points you didn't anticipate.
I've been very frustrated in the last several years because you see so many people trying to shoehorn this sort of idea into so-called "Agile" methods, but those methods put the emphasis in the wrong place. They depict agility as a property of the team of humans beings, and the particular tasks and schedules of the humans involved. They don't do anything to improve the agility of the underlying software systems, and you can have the most Agile team ever but still get burnt by committing to a bad and hard-to-change design, even if you're generating gobs of story points or other nonsense.
One of the most important and powerful planning tools is to prototype your architecture. Begin doing actual work with it. "Beta test" it with a limited set of actual business users. Collect data, like you're profiling, and make these decisions with evidence about your actual use case, not trite analogies or generalizations of it.
And place a premium on tools that demonstrably make it easy to scrap an underperforming design and replace it with a different design.
It used to be the standard in data warehouses but now the trend is to leave the data unstructured and use query tools e.g Drill or do multiple ETL into structured versions. But even the structured versions would not be relationally modelled to any great extent. Data scientists typically want access to data right now not in a few months when your database guy has finishing modelling it.
It remains the most understandable model to the largest population of end users. Analytics is a very broad term, as I mentioned in my original post, and the audience is huge. If "analytics" to you implies an audience of primarily data savvy end users, then dimensional modeling may not hold as much value.
I tend to find that the data scientists at our clients still do a lot more data wrangling than data science when they don't have clean models to work with as a baseline.
Additionally, there's a big difference between exploratory analysis of new data sources where access and low latency are key, and well-known domains that have fairly predictable needs. The former have a habit of transforming into the latter. Dimensional models remain one of the most efficient physical structures of data for a read-dominant workload.
Long story short, I think it's worthwhile to both of us to look outside our bubbles. I work for a BI and data science consultancy. My focus specifically is in core BI workloads, and so I'm definitely overexposed to more traditional modeling techniques. I can guarantee you, though, that the cycle time for a usable pilot that includes a fully realized dimensional model is more on the order of a handful of weeks than months in a typical delivery.
What's your bubble?
(I'm not affiliated with them, just think they get proportionally little coverage given their interesting product)
You will also need to think about the schema, partitioning, compression and other parameters, and those are not trivial decisions.
But the query engines are far more important in terms of performance. Just spend any time with SparkSQL and then Hive and you'll know what I mean.
Github: https://github.com/citusdata/citus
HN: https://news.ycombinator.com/item?id=11353322 "Citus Unforks from PostgreSQL, Goes Open Source (citusdata.com)" ( 24th March, 2016 )
"What is Citus?
- Open-source PostgreSQL extension (not a fork)
- Scalable across multiple hosts through sharding and replication
- Distributed engine for query parallelization
- Highly available in the face of host failures "
"Citus provides users real-time responsiveness over large datasets, most commonly seen in rapidly growing event systems or with time series data . Common uses include powering real-time analytic dashboards, exploratory queries on events as they happen, session analytics, and large data set archival and reporting." https://www.citusdata.com/blog/17-ozgun-erdogan/403-citus-un...
Unlike several of the databases mentioned, it has a data model particularly appropriate for analytics: While only apparently schemaless, its schema is extensible (no need to pre-declare it), and by default every column is indexed. Which means that there's no extra work on the client to assert the existence of indexes for new fields.
More importantly, it does complex, nested, distributed aggregations (top-K, date histograms, etc.) out of the box, and is incredibly fast at it, owing to the columnar-store-like Lucene index model. You can do complicated aggregations across millions of values over several dimensions in milliseconds.
Elasticsearch has consistency issues, though, and even with 2.x and the recent translog support you should probably never use it as a primary data store.
Some of the other databases mentioned (Cassandra, Riak and so on) are useful mostly as primary datastores that get processed into something that can do aggregations. For example, Cassandra -> Elasticsearch is probably a great combo.
The only issue is that its write performance isn't great and there have historically been questions about how to get official support. It's definitely got a bright future ahead of it.
I can represent almost everything as an int. Like time, cpu usage, line number, etc. Even just a single byte is enough for most things like which server number or custom error was thrown.
I don't want to have to edit your CSS just so I can read the text.
Otherwise, I'd just assume the designers disrespect all of us with poor vision and take it for granted the writing is done with equal care. Rather than ranting about it, I'd rather not think about it.
With that said, for some reason, blacken doesn't automatically convert this site to black, meaning I have to manually convert the text to black. Even after the author fixed the color, it is still impossible for me to read. Please don't override user-assisted plugins. Being on the fort page of HN, I'm assuming this is a decent article, but I have no desire to read it now. thanks.
Try using #4c4c4c
You can use this tool to test. http://webaim.org/resources/contrastchecker/
A bad contrast ratio will always look very bad on older screen, making text hard to read. It is also hard to read by people with eyes issues.
"WCAG 2.0 level AA requires a contrast ratio of 4.5:1 for normal text and 3:1 for large text (14 point and bold or larger, or 18 point or larger). Level AAA requires a contrast ratio of 7:1 for normal text and 4.5:1 for large text."
Some simple "group by" selects that would take ~3 seconds with Postgres or Elasticsearch would take several minutes with RethinkDB, unless it died of RAM starvage first. It looks to me like RethinkDB is not optimized for sequential read access, nor is its caching algorithm tuned to such workloads.
I believe it also lacks many of the aggregations that you'll want to use, like multi-level bucketing on different dimensions.
This was ~1 year ago, though, so it may have massive improved since then, who knows.
We're constantly improving performance and a lot has happened within the past year. I think that at this point RethinkDB is as good a database for analytics as many of the other general-purpose databases when it comes to features and performance.
From what I can tell, there are still two main limitations that apply in some, but not all scenarios:
* Grouping big results without an associated aggregation requires the full result to fit into RAM. I believe this was the limitation that you ran into a year ago, which lead to RAM exhaustion. This limitation is still there ( https://github.com/rethinkdb/rethinkdb/issues/2719 in our issue tracker). However we're shipping a new command `fold` with the upcoming 2.3 release of RethinkDB, which can be used in the vast majority of cases to perform streaming grouped operations (in conjunction with a matching index). See https://github.com/rethinkdb/rethinkdb/issues/3736 for details.
* Scanning data sets that don't fit into memory on rotational disks is still inefficient. Most SQL databases deploy sophisticated optimizations to structure their disk layout in order to minimize the effects of high seek times. RethinkDB's disk layout it built with a stronger focus on SSDs. This limitation hence doesn't apply if the data is stored on SSDs.
Rotational disks are on their way out. For example, Samsung recently introduced a 15TB SSD[1], able to compete even with the largest rotational disks.
[1] https://news.samsung.com/global/samsung-now-introducing-worl...