Analytics 101: Choosing the right database

Analytics 101: Choosing the right database (reflect.io)

96 points by turoczy 10 years ago | 29 comments

greggyb 10 years ago |

Analytics 101: Choosing the right database is the wrong first step.

I was excited when I saw 'choosing the right data model' as one of the rules, but they are talking about the data models the DB uses internally. The data model that matters is how you choose to model the data you have to analyze. I have my biases, but I'd argue that a dimensional model would be a good starting point if we're really at a 101 level, and extensions to the model are for future classes/development.
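
For anyone unfamiliar with the term, here is a minimal sketch of a dimensional (star-schema) model, assuming a made-up sales domain. The table names, columns, and rows are all hypothetical, and SQLite is used only because it ships with Python, not because it is a recommended analytics store.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# One fact table of measurements, surrounded by descriptive dimension tables.
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER,
    revenue_cents INTEGER
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [
    (20160301, "2016-03-01", "2016-03"),
    (20160302, "2016-03-02", "2016-03"),
    (20160401, "2016-04-01", "2016-04"),
])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (1, "Widget", "Hardware"),
    (2, "Gadget", "Hardware"),
    (3, "Manual", "Books"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (20160301, 1, 2, 2000),
    (20160301, 3, 1, 1500),
    (20160302, 2, 1, 3000),
    (20160401, 1, 5, 5000),
])


def revenue_by_month_and_category(conn):
    """Typical BI question: total revenue sliced by two dimensions."""
    return conn.execute("""
        SELECT d.month, p.category, SUM(f.revenue_cents)
        FROM fact_sales f
        JOIN dim_date d ON d.date_key = f.date_key
        JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY d.month, p.category
        ORDER BY d.month, p.category
    """).fetchall()


for row in revenue_by_month_and_category(conn):
    print(row)
```

The point of the shape is that most business questions become "sum a fact column, grouped and filtered by dimension attributes", which is easy to explain to end users.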

Starting with the end in mind is very important. When I look at this, I think in terms of data culture. Who needs to be able to do what in your organization? What types of questions will you need to answer most often? What types of questions will you need to support in an ad-hoc manner?

To many organizations "analytics" means arithmetic, but with complex filtering logic and business logic, traditional BI, essentially. To others, "analytics" means R code monkeys. To others it may mean specifically visualizations and the presentation layer. There are many interpretations of the word. Regardless of the interpretation, process and culture are more important to understand before the technology.

For a rough analogy, it's like saying "Software development 101: Choosing the right programming language". Sure it matters, but understanding what your software needs to support and what its primary use cases are is more important.

Ninja edit: Grammar.

p4wnc6 10 years ago | |

I agree with all of this and I think it carries over to software design too: People tend to think about which design pattern, from some preconceived menu of design patterns, they will need, instead of applying common sense to how the business workflow will look with the desired outputs, and then working backwards from there, usually with some Occam's Razor sprinkled in to make sure you don't overdo it with design mumbo jumbo.

What this has taught me over time is that the best systems, whether for data modeling and data storage, or for software, are systems that go to great lengths to ensure it is extremely cheap and easy to reconfigure, redesign, scrap everything and try something new, and adapt your designs to the real life pain points you didn't anticipate.

I've been very frustrated in the last several years because you see so many people trying to shoehorn this sort of idea into so-called "Agile" methods, but those methods put the emphasis in the wrong place. They depict agility as a property of the team of human beings, and the particular tasks and schedules of the humans involved. They don't do anything to improve the agility of the underlying software systems, and you can have the most Agile team ever but still get burnt by committing to a bad and hard-to-change design, even if you're generating gobs of story points or other nonsense.

One of the most important and powerful planning tools is to prototype your architecture. Begin doing actual work with it. "Beta test" it with a limited set of actual business users. Collect data, like you're profiling, and make these decisions with evidence about your actual use case, not trite analogies or generalizations of it.

And place a premium on tools that demonstrably make it easy to scrap an underperforming design and replace it with a different design.

threeseed 10 years ago | |

Modelling your data dimensionally is becoming less and less popular in the analytics space.

It used to be the standard in data warehouses, but now the trend is to leave the data unstructured and use query tools, e.g. Drill, or do multiple ETL into structured versions. But even the structured versions would not be relationally modelled to any great extent. Data scientists typically want access to data right now, not in a few months when your database guy has finished modelling it.

greggyb 10 years ago | | |

Dimensional modeling, though often associated with a traditional waterfall methodology, is not tied to this delivery methodology.

It remains the most understandable model to the largest population of end users. Analytics is a very broad term, as I mentioned in my original post, and the audience is huge. If "analytics" to you implies an audience of primarily data savvy end users, then dimensional modeling may not hold as much value.

I tend to find that the data scientists at our clients still do a lot more data wrangling than data science when they don't have clean models to work with as a baseline.

Additionally, there's a big difference between exploratory analysis of new data sources where access and low latency are key, and well-known domains that have fairly predictable needs. The former have a habit of transforming into the latter. Dimensional models remain one of the most efficient physical structures of data for a read-dominant workload.

Long story short, I think it's worthwhile to both of us to look outside our bubbles. I work for a BI and data science consultancy. My focus specifically is in core BI workloads, and so I'm definitely overexposed to more traditional modeling techniques. I can guarantee you, though, that the cycle time for a usable pilot that includes a fully realized dimensional model is more on the order of a handful of weeks than months in a typical delivery.

What's your bubble?

paxcoder 10 years ago | |

Any recommendations? Maybe start a flowchart for collaborative editing? Please link me.

sandstrom 10 years ago |

Surprised that RethinkDB[1] isn't mentioned. It has support for replication and sharding, plus a query language well suited for analytics.

(I'm not affiliated with them, just think they get proportionally little coverage given their interesting product)

[1] http://rethinkdb.com/

gtrubetskoy 10 years ago |

Actually, you might want to not choose any database at all, but instead focus on deciding on the data format, such as Parquet (http://parquet.io) or Avro (https://avro.apache.org/), etc. Many of the tools such as Hive, Impala, Spark, etc. support these formats natively.

You will also need to think about the schema, partitioning, compression and other parameters, and those are not trivial decisions.
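
To make the partitioning and compression decisions concrete, here is a rough sketch of a Hive-style `key=value` directory layout. It deliberately uses gzip-compressed JSON Lines from the Python standard library as a stand-in for Parquet or Avro, so it only illustrates the layout, not a columnar format. The function names, file names, and event fields are all hypothetical.

```python
import gzip
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(events, root, partition_key="date"):
    """Write events as gzip-compressed JSON Lines, one directory per partition value."""
    by_partition = defaultdict(list)
    for event in events:
        by_partition[event[partition_key]].append(event)

    paths = []
    for value, rows in sorted(by_partition.items()):
        part_dir = Path(root) / f"{partition_key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part-0000.jsonl.gz"
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            for row in rows:
                # The partition value is encoded in the directory name,
                # so it is not repeated inside each record.
                record = {k: v for k, v in row.items() if k != partition_key}
                fh.write(json.dumps(record) + "\n")
        paths.append(path)
    return paths


def read_partition(root, partition_key, value):
    """Read a single partition without touching any other directory."""
    path = Path(root) / f"{partition_key}={value}" / "part-0000.jsonl.gz"
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh]


# Hypothetical events, partitioned by day.
events = [
    {"date": "2016-03-24", "user": "a", "ms": 120},
    {"date": "2016-03-24", "user": "b", "ms": 95},
    {"date": "2016-03-25", "user": "a", "ms": 210},
]
root = tempfile.mkdtemp()
paths = write_partitioned(events, root)
print([p.parent.name for p in paths])
```

Which column you partition on decides which queries can skip data entirely, which is part of why these choices are hard to change later.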

threeseed 10 years ago | |

The data format is important, with ORC/Parquet being substantially faster than Text or Sequence files.

But the query engines are far more important in terms of performance. Just spend any time with SparkSQL and then Hive and you'll know what I mean.

pookeh 10 years ago |

Surprised http://druid.io/ wasn't mentioned. This db was made specifically for both real time analytics and batch analytics. It even has a nice front end http://imply.io/

pella 10 years ago |

NEW: Scalable & Open Source PostgreSQL extension https://www.citusdata.com ( based on PG9.4 / PG9.5 )

Github: https://github.com/citusdata/citus

HN: https://news.ycombinator.com/item?id=11353322 "Citus Unforks from PostgreSQL, Goes Open Source (citusdata.com)" ( 24th March, 2016 )

"What is Citus?

- Open-source PostgreSQL extension (not a fork)

- Scalable across multiple hosts through sharding and replication

- Distributed engine for query parallelization

- Highly available in the face of host failures "

"Citus provides users real-time responsiveness over large datasets, most commonly seen in rapidly growing event systems or with time series data . Common uses include powering real-time analytic dashboards, exploratory queries on events as they happen, session analytics, and large data set archival and reporting." https://www.citusdata.com/blog/17-ozgun-erdogan/403-citus-un...

hbcondo714 10 years ago |

This is by no means a comprehensive list of databases and that's not the intent of this article. The real intent is that it's a simple read for many companies still running solely on 'general purpose databases' and showing where newer database technologies can fit in based on their data needs. Upvote.

cdeshpande 10 years ago |

What about ElasticSearch? Even though it's a search engine, it's growing in popularity as a schemaless JSON data store.

lobster_johnson 10 years ago | |

Surprised Elasticsearch isn't mentioned in the article.

Unlike several of the databases mentioned, it has a data model particularly appropriate for analytics: While only apparently schemaless, its schema is extensible (no need to pre-declare it), and by default every column is indexed. Which means that there's no extra work on the client to assert the existence of indexes for new fields.

More importantly, it does complex, nested, distributed aggregations (top-K, date histograms, etc.) out of the box, and is incredibly fast at it, owing to the columnar-store-like Lucene index model. You can do complicated aggregations across millions of values over several dimensions in milliseconds.
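
As an illustration of the nested aggregations described above, here is a sketch of a request body for a date histogram with a top-K terms aggregation inside each bucket. The field names (`timestamp`, `page`) and the helper function are hypothetical. The `calendar_interval` parameter is from newer Elasticsearch releases; older ones, including the 2.x line, used `interval` instead, so check the docs for your version.

```python
import json


def daily_top_terms_query(time_field, term_field, top_k=5):
    """Build a search body: one bucket per day, top_k most frequent terms per bucket."""
    return {
        "size": 0,  # return aggregation results only, no individual hits
        "aggs": {
            "per_day": {
                "date_histogram": {
                    "field": time_field,
                    "calendar_interval": "day",  # "interval" on older releases
                },
                "aggs": {
                    "top_terms": {
                        "terms": {"field": term_field, "size": top_k},
                    },
                },
            },
        },
    }


# Hypothetical field names; the body would be sent to an index's _search endpoint.
body = daily_top_terms_query("timestamp", "page", top_k=3)
print(json.dumps(body, indent=2))
```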

Elasticsearch has consistency issues, though, and even with 2.x and the recent translog support you should probably never use it as a primary data store.

Some of the other databases mentioned (Cassandra, Riak and so on) are useful mostly as primary datastores that get processed into something that can do aggregations. For example, Cassandra -> Elasticsearch is probably a great combo.

threeseed 10 years ago | |

ElasticSearch is very nice as a JSON data store and has great integration with Spark/Hadoop.

The only issue is that its write performance isn't great and there have historically been questions about how to get official support. It's definitely got a bright future ahead of it.

Xeoncross 10 years ago |

I've always wanted a super-compact database for storing integers on smaller setups where I don't have the resources to run a dedicated logging server.

I can represent almost everything as an int. Like time, cpu usage, line number, etc. Even just a single byte is enough for most things like which server number or custom error was thrown.
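
A rough sketch of the idea, assuming a hypothetical fixed-width record of four integer fields packed with Python's `struct` module. The field choice and widths are only an example.

```python
import struct

# Little-endian, no padding:
#   I = uint32 Unix timestamp (seconds)
#   B = uint8  server number
#   B = uint8  CPU usage in percent
#   H = uint16 line number
RECORD = struct.Struct("<IBBH")


def pack_event(timestamp, server_id, cpu_percent, line_number):
    """Encode one event as a fixed-size byte string."""
    return RECORD.pack(timestamp, server_id, cpu_percent, line_number)


def unpack_event(blob):
    """Decode a byte string produced by pack_event back into a tuple of ints."""
    return RECORD.unpack(blob)


# 1458777600 is 2016-03-24 00:00:00 UTC.
blob = pack_event(1458777600, 3, 87, 412)
print(RECORD.size, unpack_event(blob))
```

Each event costs 8 bytes here, so appending records to a flat file and reading them back in fixed-size chunks is about as small as a log gets without adding compression.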

fweespee_ch 10 years ago |

Please, please don't set your text color to #888 or #999 on a white background.

I don't want to have to edit your CSS just so I can read the text.

dizzystar 10 years ago | |

I use the blacken plugin on my browser, which automatically converts all text to black. I know this isn't your point, but it is much better than altering CSS to fix the web.

Otherwise, I'd just assume the designers disrespect all of us with poor vision and take it for granted the writing is done with equal care. Rather than ranting about it, I'd rather not think about it.

With that said, for some reason, blacken doesn't automatically convert this site to black, meaning I have to manually convert the text to black. Even after the author fixed the color, it is still impossible for me to read. Please don't override user-assisted plugins. Being on the front page of HN, I'm assuming this is a decent article, but I have no desire to read it now. Thanks.

bilmeswe 10 years ago | | |

We're going to darken the text on our next push. Thanks for pointing it out.

fweespee_ch 10 years ago | | |

The problem with blacken is on certain color schemes you don't want black text.

bilmeswe 10 years ago | |

We'll make an adjustment. Thanks for the feedback.

Raphmedia 10 years ago | | |

FYI, your Contrast Ratio is 2.8:1. Aim for 4.5:1

Try using #4c4c4c

You can use this tool to test. http://webaim.org/resources/contrastchecker/

A bad contrast ratio will always look very bad on older screens, making text hard to read. It is also hard to read for people with eye issues.
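
If you'd rather check this in code, here is a small sketch of the contrast ratio calculation as I understand the WCAG 2.0 definition (relative luminance of each color, then the ratio of the lighter to the darker with 0.05 added to both). The function names are my own.

```python
def _linear(channel_8bit):
    """Convert an 8-bit sRGB channel to its linear-light value."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a '#rgb' or '#rrggbb' color."""
    h = hex_color.lstrip("#")
    if len(h) == 3:
        h = "".join(ch * 2 for ch in h)  # expand shorthand, e.g. 999 -> 999999
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground, background):
    """Contrast ratio between two colors, from 1 (identical) to 21 (black on white)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


for color in ("#888", "#999", "#4c4c4c"):
    print(color, round(contrast_ratio(color, "#fff"), 2))
```

On a white background both #888 and #999 come out below 4.5:1, with #999 a little under 3:1, while #4c4c4c clears 4.5:1 comfortably.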

"WCAG 2.0 level AA requires a contrast ratio of 4.5:1 for normal text and 3:1 for large text (14 point and bold or larger, or 18 point or larger). Level AAA requires a contrast ratio of 7:1 for normal text and 4.5:1 for large text."

paoloiam 10 years ago |

Turoczy, thank you for sharing! Great insights.