Ask HN: Learning NoSQL, papers and books
In your opinion, which papers and books are mandatory to really understand the NoSQL subject?
After this it is up to you. The papers reference a lot of distributed systems literature; if you are interested, you can go through the resources here [4]. If you want to go a more hands-on way, I would also recommend reading the AWS DynamoDB best practices documentation [5] (you can also read up on Cassandra or CouchDB) to see the practical considerations when using these systems. Then try to use it or any other NoSQL database in a side project and see whether it is a good fit. The data modelling involves thinking hard about use cases and will also help you compare this to relational systems.
[1] https://static.googleusercontent.com/media/research.google.c...
[2] http://www.aosabook.org/en/nosql.html
[3] http://www.allthingsdistributed.com/files/amazon-dynamo-sosp...
[4] https://github.com/aphyr/distsys-class
[5] http://docs.aws.amazon.com/amazondynamodb/latest/developergu...
(There must be something appealing to developers about using a JSON-style syntax rather than a Structured Query Language.)
There should be a solid reason to pick NoSQL in general, and when such a reason appears, picking the right one among the available NoSQL platforms is another job.
This is ranting.
I am a Postgres proponent, but saying that PostgreSQL/MySQL/SQLite is the better choice in the vast majority of cases the parent has come across is reckless. The words were well chosen, which makes the rant less obvious.
There aren't good or bad DBs. Every DB has its strengths and respective trade-offs. As much as I like Postgres, there are many use cases for other DBs, including NoSQL ones. I am not going to feed the troll by arguing why NoSQL can be terrific or why SQL can be a big struggle. I am on both sides: both SQL and NoSQL have their place.
It's sad that a thread about learning NoSQL gets hijacked by an unrelated top comment opposing NoSQL.
Sorry to latch on, I’m very eager to learn. Our stacks of choice are Django and Flask respectively, if that helps.
“Trains are usually a better choice. Most people don’t need planes”
A: Not trolling, but X is vastly usually better than noX.
IDK what trolling is.
And it's never about JSON, it's about latency and resilience, about being able to simply add and replace nodes, about just working in a modern distributed environment.
It will not only help you understand what "SQL" and "NoSQL" data stores are, it also covers the differences between them, what problems they are designed to solve, how they try to solve them, and whether they'll help with your problems as well.
Students seem to find the Dynamo paper to be the single most enlightening resource. It does a great job of explaining Amazon's use case and how the solution fits the problem. I also reference the relevant Red Book chapter and some students value that context.
It's worth noting that students are very comfortable with relational DBMSs by this point, both in theory and in practice. It quickly becomes clear to them that NoSQL is better called "no transactions", as they know the costs and benefits of various isolation levels in a traditional RDBMS. If you don't yet have an undergraduate-level background in database systems I'd encourage you to seek that out either first or at least along the way to understanding NoSQL systems. My recommendations for how to do this as a self-learner are up on https://teachyourselfcs.com.
[0] https://en.wikipedia.org/wiki/Consensus_(computer_science)
[1] https://en.wikipedia.org/wiki/PACELC_theorem
[2] https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...
I'm still learning how to determine when I should use NoSQL instead of SQL. My best advice is to carefully consider how you plan on querying your data. If you plan on making complex queries that link multiple relationships, NoSQL is not for you.
After I've optimized my query/indexes to get from 60s to about 4s, running through the usual stuff and trying not to do anything too stupid, how do I get it to <200ms? Maybe the better question is how to structure the data so you don't need the complex query?
Designing Data-Intensive Applications: http://dataintensive.net/
It's slightly dated, but it still gives a strong overview of the different paradigms. The truth is that what you want to learn probably differs greatly depending on the paradigm that fits your application. NoSQL databases can broadly be categorized into document-oriented, key-value store, columnar, and graph. This video will help you understand what at least three of those are. Then you can focus on books/articles about the paradigm that makes the most sense for you.
Tutorial from Felix Gessert about NoSQL: https://medium.baqend.com/nosql-databases-a-survey-and-decis...
and slides: https://www.slideshare.net/felixgessert/nosql-data-stores-in...
[1] See http://dataintensive.net
Their tips are here, and I think this applies to most/all NoSQL (someone correct me if I'm wrong): https://firebase.google.com/docs/database/web/structure-data
The tl;dr is:
- Avoid complex queries. Structure data so that you can make simple queries that execute fast.
- Avoid nesting & flatten data as much as is reasonable.
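A minimal sketch of what flattening buys you, using plain Python dicts as a stand-in for a JSON tree. The chat/message layout, the key names, and the `bytes_read` helper are all hypothetical, and the sketch assumes a store where reading a node also returns everything nested beneath it.

```python
import json

# Nested layout (hypothetical): each chat embeds its full message history.
nested = {
    "chats": {
        "one": {
            "title": "Weekend plans",
            "messages": {
                "m1": {"sender": "ann", "text": "Hiking on Saturday?"},
                "m2": {"sender": "bob", "text": "Sure, what time?"},
            },
        },
    },
}

# Flattened layout: chat metadata and messages sit under separate top-level keys.
flat = {
    "chats": {"one": {"title": "Weekend plans"}},
    "messages": {
        "one": {
            "m1": {"sender": "ann", "text": "Hiking on Saturday?"},
            "m2": {"sender": "bob", "text": "Sure, what time?"},
        },
    },
}


def chat_titles(db):
    """The simple query: list chat titles by reading only the "chats" subtree."""
    return sorted(chat["title"] for chat in db["chats"].values())


def bytes_read(db):
    """Rough proxy for how much data the title query has to fetch."""
    return len(json.dumps(db["chats"]))


print(chat_titles(nested), bytes_read(nested))
print(chat_titles(flat), bytes_read(flat))
```

Both layouts answer the same query, but in the nested one the title listing drags every message along with it, and that cost grows with the message history.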
NoSQL is easier to learn & use than SQL and there's a lower barrier to entry, but the trade-off is that it's less powerful than SQL, so you have to keep your data simple too.
Isn't this contradictory?
This is referring more to schema than data. In part what that means is to avoid nested indexes... subtle but different than avoiding any nesting at all. In other words, if you can treat the nested data as a blob, it's probably okay, but if it's being used for a query, it's adding complexity that can cause trouble.
Some of the reasons for that are Firebase-specific; they have to do with security rules and how security can get too complicated if you're not careful with nesting.
But I'd guess it still applies to other NoSQL data... nesting data as part of the schema is like making another table, and all the complexity that comes with it. Except it's a new table you can only get to by going through the first table.
A common problem with nesting is thinking you got the order right for your use case and later finding out you sometimes want to index by the inner data rather than the outer data. If you only have A/B (B nested in A) and you need to query for As, then you're fine. When you find out you need to query for Bs, you have a problem.
Firebase even recommends duplicating data, if necessary, to have two indexes A/B and B/A, rather than trying to query for nested data.
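The A/B and B/A duplication can be sketched in a few lines of Python. This is only an illustration with in-memory dicts, not Firebase's API: the `DualIndex` class and the user/group names are hypothetical. The point is that every write updates both directions, so either side can be looked up directly without searching inside nested data.

```python
from collections import defaultdict


class DualIndex:
    """Keeps A -> B and B -> A lookups in sync by writing both on every change."""

    def __init__(self):
        self.b_by_a = defaultdict(set)  # e.g. user -> groups
        self.a_by_b = defaultdict(set)  # e.g. group -> users

    def link(self, a, b):
        # The duplication: one logical fact is stored under both keys.
        self.b_by_a[a].add(b)
        self.a_by_b[b].add(a)

    def unlink(self, a, b):
        # Removal must also touch both copies, or the indexes drift apart.
        self.b_by_a[a].discard(b)
        self.a_by_b[b].discard(a)


idx = DualIndex()
idx.link("alice", "hikers")
idx.link("alice", "readers")
idx.link("bob", "hikers")
idx.unlink("alice", "readers")

print(sorted(idx.b_by_a["alice"]))   # groups for a user
print(sorted(idx.a_by_b["hikers"]))  # users for a group
```

The cost of this design is that the application, not the database, is responsible for keeping the two copies consistent on every write.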
Then read this book for in-depth details - Designing Data-Intensive Applications : https://dataintensive.net/
and of course the original papers from Amazon and Google.
If you have more questions - contact me at HN AT NoSql dot Com
It's just a small set of problems that really requires a nosql database.
Most (if not all) nosql databases are perceived as less complicated since they hand-wave away all complicated things to the users of the database, while focusing on being fast and simple to use and run in a cloud or cluster.
Anyone running a database system in a fault-tolerant configuration immediately hits the CAP theorem, and SQL and nosql databases sacrifice or ignore different aspects of both CAP and ACID in order to scale.
As you write, you really have to know what you are sacrificing before making that choice. Perceived complexity is probably not a good selector.
One problem is that SQL databases are normally installed in "pet-mode", where you have two or three servers that you really have to take care of. This feels less satisfactory when developing for the cloud, and typically also doesn't scale very well horizontally. Instead of running your own distributed database in the cloud (and failing), there are also PaaS databases, but SQL tends to come in vendor-specific flavours, making it hard to change the infrastructure.
Maybe another problem is the model mismatch: relational databases impose restrictions on how data is represented and how it's retrieved that make no sense from a "rest-interface based" view, as there's a mismatch between the relation-entity view (objects and lists) and relational algebra.
There are graph databases, and I personally think that they might be the future. Building strong models within a bounded context is still probably the best way to model complex data and processes that operate on that data.
Unfortunately the future isn't here yet and most graph databases are still slower than my laptop.
The best compromise is probably to use CQRS (Command Query Responsibility Segregation), meaning that queries and commands (modifications) are handled by separate stacks, where read-only data might be distributed and updated ("cached") for use, but actual processing is done against a single consistent database running on a few "pet" servers.
This only makes sense for systems that mostly read things and update their data relatively seldom.
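A minimal in-memory sketch of the CQRS split, assuming a toy order system. The `CommandSide` and `ReadModel` classes are hypothetical names for illustration; in a real deployment the command side would be the consistent database and the read model a distributed cache or replica that is updated asynchronously.

```python
from collections import defaultdict


class ReadModel:
    """Denormalized, query-only view; could be replicated to many nodes."""

    def __init__(self):
        self.total_by_customer = defaultdict(int)

    def apply(self, event):
        # Events are (customer, amount) tuples emitted by the command side.
        customer, amount = event
        self.total_by_customer[customer] += amount

    def total_for(self, customer):
        # Queries never touch the command side.
        return self.total_by_customer.get(customer, 0)


class CommandSide:
    """Single consistent store that validates and records every modification."""

    def __init__(self, read_models):
        self.orders = []
        self.read_models = read_models

    def place_order(self, customer, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.orders.append((customer, amount))
        # Propagate the change; done synchronously here to keep the sketch short.
        for model in self.read_models:
            model.apply((customer, amount))


view = ReadModel()
commands = CommandSide([view])
commands.place_order("alice", 30)
commands.place_order("alice", 12)
commands.place_order("bob", 5)

print(view.total_for("alice"), view.total_for("bob"))
```

Because every write fans out to the read models, the approach pays off when reads dominate, which matches the caveat above.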
It looks like that might be specific to Firebase's implementation, because this can be achieved with MongoDB [2].
1. https://stackoverflow.com/questions/27207059/firebase-query-...
2. https://stackoverflow.com/questions/15654228/sort-by-embedde...
The bigger issue remains that schema nesting causes a type of complexity that SQL dbs avoid by always being flat. Even in that answer you linked to, the very last sentence is: the most important one for people new to NoSQL/hierarchical databases seems to be "avoid building nests".
Schema nesting in MongoDB is also best avoided, if you can, e.g.:
https://stackoverflow.com/questions/5108790/mongodb-best-pra...
And some NoSQL databases speak SQL as well - without being relational.
I like the JSON support in PostgreSQL a lot. It makes it very easy to deal with unstructured JSON data while still keeping common attributes in a relational format. But there are more cases than one might think - as a relational guy - that benefit from graph databases, document stores or optimized time-series databases.