1,000,000 daily users with no cache (slideshare.net)
One thing that struck me as odd from a software design point of view is the "tile table". The name and the absurdly high number of I/Os per second suggest that each tile was stored in a separate database record? The game looks like a Farmville clone, so wouldn't it have been more economical to store the entire farm of a player in one blob?
Is there someone from Wooga here to shed some light on the architectural decisions that went into the game?
Suppose your players are active for an average of, say, 20 minutes - that's a damn generous upper bound for a Facebook game. That would mean that 1M / (24*3) = 13,888 users are active (on average) at any time. Which means that they must each be generating db events at a rate of about 3.6 events / sec / user, which is ludicrous (at the very least, these should probably be somehow combined so that each client only hits the db once in a while, preferably in response to a user action).
And in reality, the average play session is probably quite a bit shorter than 20 minutes, which means the event rate is even higher per active user. What the hell are they doing that is causing so much database traffic?
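The back-of-the-envelope estimate above can be reproduced as a short calculation. This is only a sketch: it assumes one session per user per day, spread evenly over 24 hours, and it takes the 50,000 updates per second figure quoted elsewhere in this thread as the write load. The function and parameter names are made up for illustration.

```python
def concurrent_users(daily_users: int, session_minutes: float) -> float:
    """Average users online at once, assuming one evenly spread session per user per day."""
    sessions_per_slot_per_day = (24 * 60) / session_minutes
    return daily_users / sessions_per_slot_per_day


def updates_per_user(updates_per_sec: float, daily_users: int, session_minutes: float) -> float:
    """Database updates per second attributable to each concurrently active user."""
    return updates_per_sec / concurrent_users(daily_users, session_minutes)


if __name__ == "__main__":
    # Shorter sessions mean fewer concurrent users, so a higher rate per user.
    for minutes in (20, 10, 5):
        active = concurrent_users(1_000_000, minutes)
        rate = updates_per_user(50_000, 1_000_000, minutes)
        print(f"{minutes:>2} min sessions: {active:,.0f} active, {rate:.1f} updates/sec/user")
```

Halving the assumed session length doubles the per-user rate, which is the point made about shorter sessions.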
Is the game actually doing something a lot more sophisticated than I would assume it is based on the looks of it (to be fair, I've never played it), or is the client-side design just really that messed up? I mean, it's cool that they can handle that much traffic, and all, and the server guys should be proud, but IMO they should really be able to do better on the client-side to prevent this level of scaling from being necessary...
You are correct in your assumption that each tile was a record in a MySQL table (now it is a value in a Redis hash).
Actually we considered using a "blob" approach. But in the client we cannot batch requests as we cannot foresee when the user will simply kill the Flash client to go to some other site. So when a user request (i.e. a game event) arrives in the server there is no way to know if another request will follow. So we have to persist that change right away.
This is using a stateless server. In a later game called Magic Land we are going for a stateful Erlang server. There we keep the whole user state in RAM while the user plays and persist state changes every minute or so. Here we can do without any database and just use S3 for persistence. Works just great. At the upcoming Erlang User Conference we will give an update on that project, and the slides will be available on SlideShare next week, too. In the meantime, please have a look at this old slide set, which explains the concept in detail: http://www.slideshare.net/wooga/erlang-the-big-switch-in-soc...
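The stateful approach described above (state in RAM, a snapshot persisted every minute or so) can be sketched roughly as follows. This is not Wooga's code: their server is written in Erlang, and Python is used here only for illustration. The `UserSession` class, the `persist` callback (standing in for something like an S3 upload), and the injectable `clock` are all hypothetical.

```python
import json
import time
from typing import Callable, Dict


class UserSession:
    """Holds one player's state in RAM and persists a snapshot periodically."""

    def __init__(self, user_id: str, persist: Callable[[str, bytes], None],
                 flush_interval: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.user_id = user_id
        self.tiles: Dict[str, str] = {}
        self._persist = persist              # hypothetical storage hook
        self._flush_interval = flush_interval
        self._clock = clock
        self._dirty = False
        self._last_flush = clock()

    def apply_event(self, tile: str, value: str) -> None:
        """Apply a game event in memory; storage is touched only when a flush is due."""
        self.tiles[tile] = value
        self._dirty = True
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Persist a snapshot if the state changed and the interval has elapsed."""
        if self._dirty and self._clock() - self._last_flush >= self._flush_interval:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        """Write the whole user state as a single blob."""
        blob = json.dumps(self.tiles, sort_keys=True).encode("utf-8")
        self._persist(self.user_id, blob)
        self._dirty = False
        self._last_flush = self._clock()
```

A real server would presumably also flush from a timer and when the session ends, so that the last changes of an idle or departing player are not lost; this sketch only flushes when an event arrives.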
> Actually we considered using a "blob" approach. But in the client we cannot batch requests as we cannot foresee when the user will simply kill the Flash client to go to some other site. So when a user request (i.e. a game event) arrives in the server there is no way to know if another request will follow. So we have to persist that change right away. <
That's an understandable dilemma. I love thinking about stuff like this and seeing how other people deal with these challenges, so please forgive my Sunday morning quarterbacking ;-) Wouldn't the issue have been solvable by creating a relatively simple persistent software layer between the app code manipulating the tiles and the backend storage? I understand that you moved to this model with your Erlang game, but I'd like to know if a persistence/caching layer was considered for the farming game?
More generally, somewhere in here is an idea for a great Node.js server project that takes coarse grained datasets from a contentious database and serves as an interface for finer grained portions of that data.
Consider their scenario: 100,000 db operations and 50,000 updates per sec. In this case, a simple cache costs more than having none, because eviction is expensive.
It's also no surprise that replication doesn't scale, because every update gets propagated.
I'm not sure of the details, but they succeeded in relaxing the tight requirements of ACID transactions. So this is a good example of a case where an RDBMS (or traditional database) fails.
I guess their design is more like an MMO; I hope to hear from the guy.
Also, replication has nothing to do with whether or not "without a cache" is a meaningful statement. The point is that by holding their entire data set in RAM, they've nullified the need for a cache. Effectively, their database is their cache.
And considering the data isn't even written to disk for about 15 minutes, it's really more cache than database anyway.
http://www.slideshare.net/wooga/games-for-the-masses-scaling...
Would you have had the (roughly) same database issues if you didn't have that many writes to the db?
I'm curious because I'm considering recoding a large part of a website and am trying to avoid scaling issues.
Thank you for the answer.
So, why not?
Since Hetzner recently upgraded its hardware but also limited the available options, it would be very interesting to see what Wooga makes of this...
Switching from MySQL to Postgres in the above scenario wouldn't do much for performance. Some operational tasks might change and complex queries might change in speed, but the baseline of simple updates/selects per second won't really change.
To get orders of magnitude more performance, you need to use a different model. Redis is a NoSQL key-value store which keeps the entire dataset in memory and occasionally flushes changes to disk. Of course that's going to be faster than a system which flushes all changes to disk individually.
Even that is tunable these days. You can have unlogged tables, turn off fsync or have a high commit delay etc.
Regarding your idea: Wouldn't then the Node.js server have to keep the whole user state in memory?
I can imagine that.
> Regarding your idea: Wouldn't then the Node.js server have to keep the whole user state in memory? <
Yes, but I think it would have several advantages:
(1) The Node.js server code could decide which working sets it keeps in memory based on very simple rules. The details of this would be abstracted away from the application code itself, because the app just issues read and write requests on a user's dataset. So in essence, by splitting up the problem in two, it becomes relatively easy to handle (and optimize) on each end.
(2) You just have to keep the active datasets wired in RAM and it wouldn't be necessary for the Node server to know whether a user has disconnected recently or not. All it knows is when the data was last accessed and it can then vacate RAM slots that have become stale. Compare this to Redis, which I believe just keeps everything in memory no matter what. So overall RAM usage would probably be considerably less than what you're doing now.
(3) The idea beats "dumb blob caching" such as memcache, because it makes small operations economical. It seems to me that Node is well suited for this kind of task since it makes it very easy to build small server scripts that handle a huge number of small transactions. This would probably mean you need fewer machines for the same number of users.
(4) I believe it's relatively easy to implement replication and scaling.
Anyway, just an idea. I have no clue whether this works in practice ;-)
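Points (1) and (2) above, serving fine-grained reads and writes on a per-user dataset and vacating entries that have gone stale, could look roughly like this. It is a sketch in Python rather than Node.js, and everything in it is an assumption: the `WorkingSetCache` class, the `load`/`save` callbacks for the backing store, and the ten-minute idle limit.

```python
import time
from typing import Callable, Dict, Optional


class WorkingSetCache:
    """Keeps recently used per-user datasets in RAM; evicts by last access time."""

    def __init__(self, load: Callable[[str], dict], save: Callable[[str, dict], None],
                 max_idle: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load            # hypothetical: fetch a user's blob from the backend
        self._save = save            # hypothetical: write a user's blob to the backend
        self._max_idle = max_idle
        self._clock = clock
        self._entries: Dict[str, dict] = {}
        self._last_access: Dict[str, float] = {}

    def _touch(self, user_id: str) -> dict:
        """Load the dataset on first use and record the access time."""
        if user_id not in self._entries:
            self._entries[user_id] = self._load(user_id)
        self._last_access[user_id] = self._clock()
        return self._entries[user_id]

    def read(self, user_id: str, key: str) -> Optional[str]:
        return self._touch(user_id).get(key)

    def write(self, user_id: str, key: str, value: str) -> None:
        """Small write against the in-memory copy; the backend is not touched here."""
        self._touch(user_id)[key] = value

    def evict_stale(self) -> int:
        """Save and drop every dataset idle for longer than max_idle; returns the count."""
        now = self._clock()
        stale = [u for u, t in self._last_access.items() if now - t > self._max_idle]
        for user_id in stale:
            self._save(user_id, self._entries.pop(user_id))
            del self._last_access[user_id]
        return len(stale)
```

The sketch only writes back on eviction, so a crash would lose any changes made since a dataset was loaded; a background flush or a replica would be needed to bound that loss.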
Getting the cache out of the app server layer has a great benefit in the cloud.
Having a cache in the app server layer requires synchronization to keep the data consistent. Scale-out designs were made before the cloud era, when we had our own dedicated systems on site. A LAN can afford expensive sync communication, but can a cloud?
It still makes sense if the scenario is read-intensive. But what if it is update-intensive?
That's why I find this interesting. Adding memory is a cheap option in the cloud, so it's great if the system scales by letting the database use more memory.
Replication is basically just a provision for instant failover. Let's say that by policy the background data store (e.g. MySQL) always has a copy that is at most 10 minutes old. In practice it could probably be much more recent. So in general user data is safe but you want something very simple to prevent data loss and service disruption in the most common failure scenarios.
I believe the best paradigm is a replication buddy system between two given Nodes. Should a Node instance fail, the app can always issue the same request to its "replication buddy" and expect to get the same data. Implementing a replication buddy relationship between two instances should be relatively easy using a persistent connection between them, since Node is all non-blocking but still guaranteed sequentially executed code (=there will be no real consistency problem). Nodes could just notify each other when data changes in the background and they'd both always have the same data state. Granted, there would have to be some code to take care of what happens in different failure modes (probably the most complex aspect of the whole thing), but overall still very doable.
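The buddy idea can be illustrated with a toy in-process model. This is a sketch under heavy assumptions: direct method calls stand in for the persistent connection between the two instances, the `ReplicaNode` class and `read_with_failover` helper are invented for illustration, and the hard part mentioned above (resynchronizing after a failure) is deliberately left out.

```python
from typing import Dict, Optional


class ReplicaNode:
    """One in-memory store that mirrors every write to its replication buddy."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, str] = {}
        self.buddy: Optional["ReplicaNode"] = None
        self.alive = True

    def pair_with(self, other: "ReplicaNode") -> None:
        """Set up a symmetric buddy relationship."""
        self.buddy, other.buddy = other, self

    def write(self, key: str, value: str, _from_buddy: bool = False) -> None:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value
        # Forward the change once; the flag stops the buddy from echoing it back.
        if not _from_buddy and self.buddy is not None and self.buddy.alive:
            self.buddy.write(key, value, _from_buddy=True)

    def read(self, key: str) -> Optional[str]:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data.get(key)


def read_with_failover(primary: ReplicaNode, key: str) -> Optional[str]:
    """App-side read: try the primary, then issue the same request to its buddy."""
    try:
        return primary.read(key)
    except ConnectionError:
        if primary.buddy is None:
            raise
        return primary.buddy.read(key)
```

Writes made while one node is down are not replayed when it comes back, so a real implementation would need a catch-up step before the recovered node serves reads again.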
Scaling would be even easier: just put user IDs into different buckets, each bucket is a replicated instance. If this is even necessary.
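Putting user IDs into buckets could be as simple as hashing the ID and taking it modulo the number of buckets. A minimal sketch, assuming string user IDs and a fixed bucket count; `bucket_for` is a made-up helper name.

```python
import hashlib


def bucket_for(user_id: str, num_buckets: int) -> int:
    """Map a user ID to a bucket index in [0, num_buckets) using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer; the same ID always lands in the same bucket.
    return int.from_bytes(digest[:8], "big") % num_buckets


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, "->", bucket_for(uid, 4))
```

With plain modulo hashing, changing the number of buckets moves most users to a different bucket, so growing the cluster would need a migration step or a scheme such as consistent hashing.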
And the beauty of it is that you have to implement this just once, no matter how many different server-side apps and languages you use. It would be a common piece of infrastructure.