Your data fits in RAM(yourdatafitsinram.com) |
Your data fits in RAM(yourdatafitsinram.com) |
It's not as easy as just buying more RAM. You'll have to pay more attention to how you make use of the various caches in between your CPU and RAM.
If you're expecting growth in the size of your dataset (beyond growth in RAM size availability), then, well, maybe don't just use a single machine. Same goes for a whole bunch of similar "it's too large for a single machine" considerations.
Storing data should probably still be persisted to disk, and backed up.
Generally though, these posts are geared towards machine learning people that don't really have "live" data as frequently.
Well, well, well.
If you don't have money, you can't. Very few people can afford it.
You can program R in Spark you can now program in R http://blog.revolutionanalytics.com/2015/01/a-first-look-at-...
Now you can work directly with SQL Server as announced this week by MS. http://www.computerworld.com/article/2923214/big-data/sql-se...
I have had a ton of arguments about R's "biggest weakness" being that it uses RAM. I haven't once in the almost 3 years of working in R that I ran into this road block, but I am sure others have. Which there are several good distributed choices that will keep getting better and better.
Using RAM instead of Distributed is better in R as well as really any other language in terms of complexity and flexibility.
My favourite realization of this: Frank McSherry shows how simplicity and a few optimisations can win out on graph analysis in his COST work. In his first post[1], he shows how to beat a cluster of machines with a laptop. In his second post[2], he applies even more optimizations, both space and speed, to process the largest publicly available graph dataset - terabyte sized with over a hundred billion edges - all on his laptop's SSD.
[1]: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...
[2]: http://www.frankmcsherry.org/graph/scalability/cost/2015/02/...
There's no a single simple answer, but sure, whenever less computers are enough, less should be used.
The recent problem is, some people love "clouds" so much today that they push there the work that could really be done locally.
PS: Not that most systems are built around these kinds of edge cases, but 'just use a cluster' is often not a good option unless each node is sufficiently beefy.
[1]: Your mileage may vary.
To guy below me: Ah, thanks. I thought the guy above was trying to say it's slower than paging to disk. : )
Most businesses doing big data (like ours) often have multiple disparate data sources that at the start of the pipelines are ETLing into some EDW. Trying to consolidate them into a single integrated view is very difficult and time/resource intensive. Having billions of disconnected nodes in the graph would be very hard to reason with.
And that's a single, standalone, non-RAIDed SSD. When you get a 6-SSD RAID10, magic starts to happen. And if you RAID enough SSDs (10-20?), you can theoretically start to get more bandwidth than you do with RAM.
Is there some off-the-shelf solution to this problem? And, if so, why isn't it talked about more? Every CMS ever, for example, would be very well-served by something like this. My entire website's database, all ~100k comments and pages and issues and all 60k users, is only 1.4GB, and performance is always a problem. I don't care if I lose a couple minutes worth of comments in the event of a system reboot or crash. So, why can't I just turn that feature (in-memory with eventual on-disk consistency, or whatever you'd want to call it) on and forget about it?
The recent addition of SparkR in 1.4 means that now data scientists can leverage in memory data in the cluster that has been put there by output from either Scala or DW developers.
Combine it with Tachyon (http://tachyon-project.org) and it's not hard to imagine petabytes of data all processed in memory.
I haven't used either Spark or Tachyon. I thought the Spark solution was to just put my dataset in memory. But the Tachyon page seems to say the same thing
If it fits in memory, it's going to be magnitudes faster to work with than on any other infrastructure you can build.
So the trick is, you take their "big data problem" and hand them a server where everything can be hot in memory and their problem no longer exists.
When a problem becomes big enough, moving to a cluster is absolutely the right decision. Meanwhile, RAM is cheap and follows Moore's Law.
Edit: And after those first three months he'll know more about the use and performance demands of the project and will be able to make far more accurate decisions about storage categories.
FB has a nice paper that talks about this problem. https://research.facebook.com/publications/300734513398948/x...
Maybe some bonus category:
0. Spreadsheet is all you need.
1. Python script is good enough.
2. Java/Scala is way to go.
3. Need to manage memory (gc doesn't cut), some custom organization.
4. Actually needs a cluster.
I get who this is aimed at, and why, but just saying that it fits in RAM isn't as useful as it could be. This is an opportunity to teach, not just snark.
I work on something like that.
[1] http://ark.intel.com/products/84688/Intel-Xeon-Processor-E7-...
The SGI one with up to 2048 cores are larger in their single system images than most people have in their clusters.
The benefit of these systems is not really the ease of programming but the speed of interconnect.
List price of the Oracle one was 3 million a few years ago. But most of that is actually in the high density dimms. These days I think the price must be lower, but I won't waste my Oracle sales contact time in figuring out what it is today. Of course it will still be expensive, it is an Oracle product after all.
However, an equivalent dell list price cluster of simple 1U boxes (512 6C/64GB ones!) will go for 1.5 million. The fact that to house 512 boxes i.e. 25 racks or so plus networking. Of course you do get 1/3rd more cores than the SGI one.
For many of us that are between the just use a single normal server and yet too small for the google solutions. These big memory solutions from Oracle and SGI can make sense even if they are not the first thing that comes to mind!
Replacing a large number of nodes with a single machine with a lot of RAM is usually a cost savings measure rather than a larger expense (and it saves power too!), and due to a lack of communications overhead and exploitation of the fact that you now have access to all the data in one go you may very well find that your algorithms run much faster.
A distributed solution should be a means of last resort.
var MAX_SENSIBLE = 6 * TB; function doesMyDataFitInRam(dataSize) { return dataSize <= MAX_SENSIBLE; }
While one can purchase servers with larger memory most likely you will run into limitation on number of cores. Also note that there is at least some overhead in processing data, so you would need at least 2X the size of raw data.
Finally while its a good thing to tweet, joke about and make fun of buzzword while trying to appear smart. The reality is that purchasing such servers (> 255 Gb RAM) is costly process. Further you would ideally need two of them to remove single point of failure. it is likely that the job is batch and while it might take a terabyte of RAM you only need to run it once a week, in all these cases you are much better off relying on a distributed system where each node has very large memory, and the task can easily split. Just because you have cluster does not mean that each node has to be a small instance (4 processors ~16 Gb RAM).
That's assuming that everything needs to be 'high availability' and buying two of everything is a must. This is definitely not always the case. In plenty of situations buying a single item and simply repairing it when it breaks is a perfectly good strategy.
RAM is the new disk: now for some, later for others.
Like right now the L1 cache will have latencies of 1 or 2 cycles, and the L2 cache 15; this is due to the overheads of cache coherency protocols, moving the data around the chip; it's not that the memory's slower, it's all SRAM.
They are probably referring to enterprise workloads. Here you have large working sets (so caches are less useful) and you want maximum throughput. Clever multithreading (finegrained) can reduce effective latency by scheduling many (32?) processes at the same time, executing an instruction from each in round-robin fashion (see Sun Niagara). In that case, you can sometimes dump the L1 cache, and you would be able to get rid of the memory hierarchy.
There's also probably a benefit wrt hard drives/secondary storage; you can obviously make system storage very fast, which might improve random access times considerably. BUT this is probably not going to be transformative; it'll improve certain types of accesses, but current algorithms are already very highly tuned to spatial and temporal locality of reference. Furthermore, you'll still see these structures win out, because they can take advantage of hardware prefetching more easily.
Given a halfway competent I/O scheduler and some cheap SSDs, you can continuously write new data to disk at network wire speed even at 10 GbE while operating on the data in RAM and saturating outbound network. There is no slowdown at all. Even for databases that do not implement a good I/O scheduler (like PostgreSQL unfortunately) your workload is sufficiently trivial that backing it with SSD should have no performance impact. If you are having a performance problem with 1.4GB CMS, it is an architecture problem, not a database problem.
For the thing you described, you probably also want synchronous_commit=off. That means you might lose some commits in the case of a crash, but you won't get data corruption from it, and writes will be much faster.
It simply is not your overall disk IO capacity that is the performance problem with a dataset that small. At least not for a CMS.
"It simply is not your overall disk IO capacity that is the performance problem with a dataset that small."
But, it clearly, and measurably is; there's nothing to argue about there. That which comes from RAM (reads) is fast, that which waits on disk (writes) is slow. Writes take several seconds to complete, thus users wait several seconds for their comments and posts to save before being able to continue reading. That sucks, and is stupid and pointless, especially since it's not even all that important that we avoid data loss. A minute of data loss averages out to close enough to zero actual data loss, since crashes are so rare.
MySQL and PostgreSQL will totally take advantage of all of the RAM you give them.
> including writes?
This is harder, and you might not want it? It's worth noting that this argument is almost certainly directed against things like Hadoop, which claim to trade off performance for low management and easy scalability.
There's also a bunch of databases aimed at this use case (http://en.wikipedia.org/wiki/List_of_in-memory_databases), but I don't have any experience with them.
> I don't really care if I lose X amount of time worth of data (say, five minutes),
MySQL has generally got your back in the 'less safety for more performance' arena:
http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.htm...
So, this was never entirely clear to me, but now that I've read a bit more about it, this might actually be exactly what I want (which is to not have the system wait to return when posting new content, and just assume it'll end up on disk eventually). The talk of not being ACID made me nervous and maybe switched off my brain. I guess it just means I don't need or want ACID in this case, all I want is a consistent database on reboot.
So, I guess maybe this does what I want, but just to be clear: In the event of data loss, the database will still be consistent, correct? i.e. we'll lose one or more comments or written pieces of data, but the transaction it was wrapped up in won't be half finished or something in the database? (I recall MySQL had issues with this kind of thing in the very distant past, but I imagine that's just bad memories at this point.)
For example, it is very easy to write badly performing code using ORMs. And yet, ORM is often chosen for good reasons initially to give development speed (e.g. Django forms) The problem with quickly prototyped ORM based apps is that initially the performance is good enough but when the data grows the amount of queries goes through the roof. It is not the amount of data per se, but number of queries. Fixing these performance problems afterwards for small customer projects is often too expensive for the customer, but if there were a plug'n'play in-memory SQL cache/replica with disk-based writes, it would easily handle the problem for many sites.
Configuring PostgreSQL to do something like reading data from in-memory replica is likely possible, but I see that there would be value in plug'n'play configuration script/solution.
Have you proven that your performance issues are SQL related? If you configure mysql correctly and give it enough RAM, a lot of those queries are happily waiting for you in RAM, so you have a defacto RAM disk. Finding your bottleneck in a LAMP based CMS system is fairly non-trivial. Think of all the php and such that runs for every function. Its incredible how complex WP and Drupal are. Lots and lots of code runs for even the most trivial of things.
This is why we just move up one abstration layer and dump everything in Varnish, which also puts its cache in RAM. Drupal and WP will never be fast, even if mysql is. Might as well just use a transparent reverse proxy and call it a day.
It seems like, from another comment, that setting innodb_flush_log_at_trx_commit to a non-1 value is roughly what I want, though it still flushes every second, which is probably more often than I need, but may resolve the problem of the application waiting for the commit to disk, which is probably enough.
But I suspect that it's all already in the disk cache. Is the bottleneck reads or writes?
There is something pathological about our disk subsystem on that particular system, which is another issue, but it has often struck me as annoying that I can't just tell MySQL or MariaDB, "I don't care if you lose a few minutes of data. Just be consistent after a crash."
http://ehcache.org/documentation/2.6/configuration/cache-siz...
effective_cache_size should be set to a reasonable value of course, but it does not affect the allocated cache size, it's just used by the query optimizer.
And many people don't want to deal with physical hardware. Dealing with physical hardware increases operational complexity too. They want to rent a virtual/cloud server. Which provider allows you to rent a virtual server with 1 TB RAM?
A 1TB RAM server is more expensive than 10x 100GB RAM servers, but the hardware cost is often small compared to the business and technical cost of getting a solution to scale across a cluster.
Of course, generalizations are always dangerous—the take-home point here is perhaps that before going to a cluster because “that's the way big data is handled,” it's a good idea to do a proper cost-benefit analysis.
It is true that while the jump from 256GB to 3TB is "just" ~2x -- I could get a server for 1/10 of the price of the original configuration -- but only with 4GB of RAM, and nowhere near even 18 hardware threads.
If you are CPU limited (even at 72 hw threads) you might need more, smaller servers.
But such a monster should scale "pretty far" I'd say. Does cost about half as much as a small apartment, or one developer/year.
Sure, but in the latter case you'd also have to pay for the manpower to build a cluster solution out of a formerly simple application. And people are usually more expensive than servers.
If you work outside the order form, you can get 768 GB, too. 1 TB is possible with their haswell servers, but availability seems limited.
I HATE when people use Spreadsheets to do anything besides simple math.
http://lemire.me/blog/archives/2014/05/23/you-shouldnt-use-a...
TL:DR your work is not reproducible and we can't see what you did to get to your numbers. A million examples of why this is bad.
Also
> 1. Python script is good enough
You mean Python with pandas and numpy?
I use R which is also a great choice
> 2. Java/Scala is way to go.
For you but the vast majority of Data Scientist don't use either and their choice for people is not universal. Julia looks like a great new comer. I again mainly use R.
> 3 & 4 are good points.
Ray Panko, University of Hawaii.
http://panko.shidler.hawaii.edu/SSR/
http://panko.shidler.hawaii.edu/SSR/
Goes back to 1993:
Though in many non-techies things, like daily sales transactions it is a way to go.
Ad 1. pandas/numpy would put it on par with 2.
Ad 2. Would disagree. I know data scientist using Spark. Mostly they like Scala API.
In general, everyone got their favorite weapon of choice and what they feel comfortable. The point is that simpler solutions sometimes are just enough do their job.
Renting r3.4xlarge on AWS for an hour and play with your favorite tool may be an orders of magnitude easier/cheaper/faster than using big data solution.
Actually, I would bet that some 50% of time people are importing numpy or pandas they really don't need it
Like for calculating the square root of a number. Or the average of a short list
Useful knowledge on latency: https://gist.github.com/jboner/2841832
But something still seems amiss to me.
How many new comments are you seeing per second? How many MB/s can the hardware actually write? How any MB/s of new comments are you actually getting?
Given the tiny dataset and small number of users, you really need to be running on abysmally slow (as in servers slow by late 80's standards) hardware or have a really bizarrely high user engagement or ludicrous levels of inefficiencies for this to make any sense with the numbers you've given.
Lets say an average comment is 8KB. Lets double that for indexes. At one comment per second you would be writing 86400 comments a day, replacing almost all old comments with 1.4GB of new comments in one day. Given you mention 100k comments total, presumably your actual rate of comments is much slower.
But even 16KB/second is at least an order of magnitude less than than my old 8-bit RLL harddrive for my 25 year old Amiga 500, and within reach of even the floppy drive on it...
Assuming your commenting rate is a tenth of that, the IO that should be required would in theory be within the reach of the floppy drive on a Commodore 64 home computer. Heck, it wouldn't take much for it to be within reach of the C64 tape drive.
Now, there will be some write amplification due to fsyncs etc. but nothing that could even remotely explain what you're describing unless there's something else wrong.
Maybe you have a degraded RAID array? Or lots of read/write errors on the drive?
Re-do the same thing using an optimal algorithm operating on a compact datastructure and you make it look easy, fast and cheap. Of course you're not going to make nearly as much money.
I've got to remember this one
Expensive in a relative to a low-end server or month of cloud usage, but that's an absurd amount of computational power.
I originally saw it on HN, but almost two years ago. Old comments: https://news.ycombinator.com/item?id=6398650
Basically, Tachyon acts as a distributed, reliable, in memory file system.
To generalise enormously, programs have problems sharing data in RAM. Tachyon lets you share data between (say) your Spark jobs and your Hadoop Map/Reduce jobs at RAM speed, even across machines (it understands data-locality, so will attempt to keep data close to where it is being used).
[1] http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2014-10-16...
I've never used Tachyon, but based on the wonderful "getting started" experience Spark gives I'd be confident it would be similarly well thought out.
Spark started off as the "in memory map reduce" but has now become a platform for Scala, Java, Python, HiveQL, SQL and R code to run. It is the most active Apache project and is getting more and more powerful by the day.
Given how easy it is to get running it wouldn't surprise me to see it in the years to come being using as the primary front end for all data needs.
10x100GB will have 10x the computing power of 1x1TB server.
Although, that means the 10x setup must cost much more. I think the idea in the comments above was taking 10 cheaper, weaker servers and somehow coming out with roughly the same price...
Well, in any case, things just got too complicated :)
If it really waits on the disk your problems would be solved by using the SSD. Change your hosting plan. Fitting 1.4GB database isn't expensive. If it remains slow that means you use some very badly programmed solution.
Which, it seems like there is in the innodb_flush_log_at_trx_commit option, which is awesome, and which I'm trying out, now.
If your working set is small, say 1TB -- and so fits in RAM -- for what kind of problems would using SQL be so much of an overhead that you need a different approach? And what would that approach be?
I suppose you could have a massive set of linear equations that you might be able to fit into 1TB of RAM, but would be difficult to work with as tables in Postgres?
We have data.tables and dplyr which data.tables is maybe on average 50% faster and on some points multiple faster than Python [http://datascience.la/dplyr-and-a-very-basic-benchmark/]
> mm <- matrix(rnorm(1000000), 1000, 1000)
> system.time(eigen(mm))
user system elapsed
5.26 0.00 5.25
IPy [1] >>> xx = np.random.rand(1000000).reshape(1000, 1000)
IPy [2] >>> %timeit(np.linalg.eig(xx))
1 loops, best of 3: 1.28 s per loop
But where R really stinks is memory access: > system.time(for(x in 1:1000) for(y in 1:1000) mm[x, y] <- 1)
user system elapsed
1.09 0.00 1.11
IPy [7] >>> def do():
...: for x in range(1000):
...: for y in range(1000):
...: xx[x, y] = 1
...:
IPy [10] >>> %timeit do()
10 loops, best of 3: 134 ms per loop
Growing lists in R is even worse with all the append nonsense. Exponential time slower.Surely you just rent an instance of the computing power you need for an hour or two then upload the data, a script and wait.
"If the value of innodb_flush_log_at_trx_commit is 0, the log buffer is written out to the log file once per second and the flush to disk operation is performed on the log file, but nothing is done at a transaction commit. When the value is 1, the log buffer is written out to the log file at each transaction commit and the flush to disk operation is performed on the log file. When the value is 2, the log buffer is written out to the file at each commit, but the flush to disk operation is not performed on it. However, the flushing on the log file takes place once per second also when the value is 2. Note that the once-per-second flushing is not 100% guaranteed to happen every second, due to process scheduling issues."
From https://dev.mysql.com/doc/refman/4.1/en/innodb-parameters.ht...
The "nice" option is to tune "fsync", "synchronous_commit" and "wal_sync_method" in postgresql.conf
If your overall write load is low enough that you will catch up over reasonable amounts of time, the really dirty method is to set up replication (internally on a single server if you like) and put the master on a ram disk. If your server crashes, just copy the slaves data directory and remove the recovery.conf file, and you'll lose only whatever hadn't been replicated.
But in terms of time investment in solving this, it's likely going to be cheaper to just stick an SSD in there.
> xx <- rep(0, 100000000)
> system.time(xx[] <- 1)
user system elapsed
4.890 1.080 5.977
In [1]: import numpy as np
In [2]: xx = np.zeros(100000000)
In [3]: %timeit xx[:] = 1
1 loops, best of 3: 535 ms per loop
If the very basics, namely changing stuff in memory, is so much slower, then the entire edifice built on it will be slower too, no matter how much you mess around with do.call. And to address the issue of (slow, but quickly expandable) Python lists, recall that all of data science in Python is built on Numpy so the above comparisons are fair.But yes, it is expensive if you measure per GB. If your dataset is small, though, and you are more interested in cost per IO operation, they can be very cheap.
You don't get a large amount of space (I believe the cards I installed were only 64 or 128GB, but this was ~4 years ago), but for small datasets that you need extremely fast access to, they get the job done.
* Nimble and other storage appliance manufacturers use a bank of SSDs to cache frequently-read data, and have layers of fast- and slow-HDDs to store data relative to its access frequency.
* Hybrid HDDs have been around for years.
There's no a single simple answer, except: don't decide about the implementation first, instead competently and without prejudice evaluate your actual possibilities.
Also don't decide the language like "Python" or "Ruby" first. If you expect that the calculations are going to take a month, then code them in Ruby and wait the month for the result, you can miss the fact that you could have had the results in one day by just using another language, most probably without sacrifying the readability of the code much. Only if the task is really CPU-bound, of course.
On another side, if you have a ready solution in Python, and you'd need a month to develop the solution for other language, the first thing you have to consider is how often you plan to repeat the calculations afterwards. Etc.
There is a lot to be said for confirming that you have the right logic and then looking into something like PyPy/Cython/etc. or porting to a lower-level language.
In a stateless microservice architecture, disk I/O is only an issue on your database servers. Which is why database servers are often still run on bare metal as it gives you better control over your disk I/O - which you can usually saturate anyway on a database server.
In most advanced organizations, those database servers are often managed by a specialized team. Application servers are CPU/memory bound and can be located pretty much anywhere and managed with a DevOps model. DBAs have to worry about many more things, and there is a deeper reliance on hardware as well. And it doesn't matter which database you use; NoSQL is equally as finicky as a few of my developers recently learned when they tried to deploy a large Couchbase cluster on SAN-backed virtual machines.
But that's not really what the discussion is about. In the web services case, your dataset often/generally fits in memory because your data set is tiny. You don't need large servers for that, most of the time. Even most databases people have to work with are relatively small or easily sharded.
In the context of this discussion, consider that what matters is the size of the dataset for an individual "job". If you are processing many small jobs, then the memory size to consider is the memory size of an individual job, not the total memory required for all jobs you'd like to run in parallel. In that case many small servers is often cost effective.
If you are processing large jobs, on the other hand, you should seriously consider if there are data dependencies between different parts of your problem, in which case you very easily become I/O bound.
It has two flaws: One, it has nVidia Optimus (that just suck, whoever implemented it should be shot).
Two, the I/O is not that good, even with a 7200 RPM disk, and Windows 8.1 make it much worse (windows for some reason keep running his anti-virus, superfetch, and other disk intensive stuff ALL THE TIME).
This is noticeable when playing emulated games: games that use emulator, even cartridge ones, need to read from the disc/cartridge on the real hardware, on the computer they need to read from disc quite frequently, the frequent slowndowns, specially during cutscenes or map changes is very noticeable.
The funny thing is: I had old computers, with poorer CPU (this one has a i7) that could run those games much better. It seems even hardware manufacturers forget I/O
This - if you're using this website as your sole calculation then something has gone seriously wrong.
It was more intended to provoke discussion and thinking around overengineering things which can easily be done with, say, awk or a few lines of <scripting language>.
If you have large CPU requirements then sure, use Hadoop/Spark/some other distributed architecture. If you have a >6TB dataset but you only need 1GB of it at a time, then, well, maybe you still only need one computer.
But seriously: the price of RAM for servers is now ~10$ / G.
It would be nice to see an article comparing all the high RAM machines side by side with specs and prices.
The largest machine I have right now will hold 512G and was a run-of-the-mill machine, it was about $5K, I'd expect the more exotic ones to be substantially more expensive but probably not as expensive as the machines linked here.
5 seconds of googling gets you:
http://www.allhdd.com/index.php?target=products&mode=search&...
No idea if that will work in that particular server but technically there is no reason why it should not. You can establish a relationship with a distributor to get better prices than those that you can find listed. Those 'call for quotes' things can have two meanings: you're about to be screwed or 'we will give you a better price if you promise to not divulge that we did that'.
Much easier to request a client provisions 20 of their standard machines, or get them from AWS. People don't like custom hardware, and for good reason.
I just speced out a 6TB Dell server. Price? It is already at $600K, and I haven't fully speced it out yet (just processor, memory, drive). Maybe that memory requirement is high (though it is about what I would need); 1TB is somewhat over $200K.
For the right situation that sort of thing maybe makes sense, though I'm SOL if I need high availability (power out, internet flakey, RAM chip goes bad, etc leaves me dead in the water).
So I would need to stand up a million or more in equipment in several places, or just use AWS and suffer the scorn of someone saying 'you could have put that in RAM'. Yes. Yes I could have.
Well, I keep arguing against that, because you still get 90%+ of the maintenance work, plus some new maintenance work you didn't have before, to avoid some some relatively minor hardware maintenance. And you can get most of the benefits of non-cloud deployment with managed hosting where you never have to touch the hardware yourself.
I work both on "cloud only" setups and on physical hardware sitting in racks I manage, and you know what? The operational effort for the cloud setup is far higher even considering it costs me 1.5 hours in just travel time (combined both ways) every time I need to visit the data centre.
For starters, while servers fail and require manual maintenance, those failures are rare compared to the litany of issues I have to protect against in cloud setups because they happen often enough to be a problem. (The majority of the servers I deal with have uptimes in the multi-year range; average server failure rate is low enough that maintenance cost per server is in the single digit percentage of server and hosting costs). Secondly I have to fight against all kind of design issues with the specific cloud providers that are often sub-optimal and require extra effort (e.g. I lose flexibility to pick the best hardware configurations).
Cloud services have their place, but far too many people just assumes they're going to be cheaper, and proceed to spend three times as much what it'd cost them to just buy or lease some hardware, or rent managed hosting services.
Even if you don't want to maintain your own hardware, AWS is almost never cost effective if you keep instances alive more than 6-8 hours of the day in general. Your mileage may wary, of course.
This assumes that a) You're in the valley where average developer salary is $10k a month or more, b) You're a large company paying developers that salary.
There are lots of other places where a) Developers are cheaper, or b) You're a cash strapped startup whose developers are the founder(s) working for free.
If you could do the same job on a fleet of 8-16GB servers.. you can get a lot more CPU for a lot less dollars. Depends if you really need everything on 1 machine or not (as of course nothing will beat same machine in memory locality)
I am reasonably happy configuring an old school Linux box. Heroku is much more of a pain in the arse to deploy to in my experience, despite much of the work being done for you already. Debugging deployment issues is particularly painful.
Which language do you suggest which matches this?
I don't know many people who use the language "numeric error handling characteristic" but I know the benefit of already having good written high-level primitives (or libraries).
The other area I've seen this a lot is caused by the way floating point math works not matching non-specialists’ understanding – unexpectedly-failing equality checks, precision loss making the results dependent on the order of operations, etc. That's a harder problem: switching to a Decimal type can avoid the problem at the expense of performance but otherwise it mostly comes down to diligent testing and, hopefully, some sort of linting tool for the language you're using.
(For contrast, that $1K if you'd spend it on HP branded RAM would not even get you four 16G dual rank units...).
softlayer, dual Xeon 2000 Series: 512GB, $1,823.00/mo (3.56 $/gb/mo)
these are on-demand prices. pre-pay, or use a term discount, and its cheaper.
Build it yourself: You can build a Dell or similar on a 2-Xeon-proc (E5 series), your main limit is getting good prices on 16x 32GB DIMMS. But lets say you can buy the RAM for ~$6500, then its just dependent on the rest of your kit, lets say $10,000 flat for the whole server. $277.77/mo over 36 months, but you still need network infrastructure, and you might want a new one in 12 months, but you get the general idea.
I appreciate you taking the time to answer -- and I get that there's a reason for why we have graph databases. But I really meant something more concrete, as in here's a real-world example that isn't feasible to do on machine X with postgresql, but easy(ish) with a proper graph structure/db -- rather than "not all data structures are easy to map to database tables in a space-efficient manner".
It's very much dependent on how frequently you update the data and whether or not (re)loading the data or updating your structure in memory can be done efficient or not to determine whether or not such an approach is useful or not but going from 'too long to wait for' to 'near instant' for the result of a query is a nice gain.
In the end 'programmer efficiency' versus 'program efficiency' is one trade-off and cost of the hardware to operate the solution on is another. Making those trade-offs and determining the optimum can be hard.
But a rule of thumb is that a solution built up out of generic building blocks will usually be slower, easier to set up, will use more power and will be more expensive to operate but cheaper to build initially than a custom solution that is more optimal over the longer term.
So for a one-off analysis such a custom solution would never fly, but if you need to run your queries many 100's of times per second and the power bill is something that worries you then a more optimal solution might be worth investing in.
Right. But you've still not made a compelling argument that the "smart encoding of the data" and "clever search strategy" couldn't be done with/in SQL. I'm quite sure you could probably get a magnitude or two of improvement by diving down outside of SQL -- but there's a big difference between using the schema you have, and setting up a dedicated system.
I was perhaps not clear, but I was wondering if it's not often the case with real-world data, that you could create a new, tailored schema in SQL, and do the analysis on one system -- especially if you allow for something like advanced stored procedures (perhaps in C) and alternate storage engines.
I suppose one could argue when something stops being SQL, and start becomming SQL-like. But the initial statement was that "For some problems SQL would already be way too much overhead.". The implication being (as I understood it, at least) -- that you not only have to change how the date is stored and queried, but that SQL databases are unsuitable to that that task -- and that by going away from your dbs, you can suddenly do something that you couldn't do on a system with X resources.
I'm not saying that's wrong -- I'm just asking if a) that's what you mean, and b) if you can quantify this overhead a bit more. Are using bitfields or whatnot with postgres going to be a 20% overhead, or a 20000% overhead in your scenario? Because if it's 20%, that sounds like it can be "upgraded away".