Streaming Databases in Real-Time with Kafka, Debezium, and MySQL

Streaming Databases in Real-Time with Kafka, Debezium, and MySQL(wecode.wepay.com)

110 points by mirceasoaica 9 years ago | 30 comments

zzzeek 9 years ago |

You know I just can't shake the 90's from my architectural perspective. Back then, we tried making domain models for everything; using a relational database is great but you have it as a backing store for the domain model, and that model is your business case that's exposed to the world. The domain model is what you publish; that's what your web services see, that's what your GUIs see, and if you're publishing out replication events to go to totally other kinds of systems that aren't relational databases, that's the model you publish there as well. Back then, one of the dragons we were slaying was the "stored procedure wall on top of SQL Server" API model, where relational databases had half-broken APIs exposed as crappy SPs and views that the whole world would consume directly from an ODBC connection over many network hops. It was a great way to kill any database, and schema changes were enormous beasts since the whole world hardcoded to your schema.

The article here as well as the yelp one seem to be interested in streaming out the MySQL database schema itself, not just data changes but schema changes as well. If you're doing this as part of bigger MySQL database replication scheme, e.g. that the stream is just consumed by other MySQL databases that agree to share the identical schema, that's great. But the context here seems to be more along the lines of a swarm of microservices, which we would expect are each dealing with their own private set of MySQL data, are not sending INSERT/UPDATE/DELETE statements to each other and instead are speaking more in terms of the domain model. My 90's-esque sense is that if you've built up on lots of services, the messages you want flying around the network should be in terms of the domain models of those services, not your raw SQL / DDL.

Disclaimer: this is an "I'm an idiot" comment because I think I'm missing something

Disclaimer: I'm the author of SQLAlchemy so yes, I get the SQL/DDL/ORM thing

SmellTheGlove 9 years ago | |

I was reading your comment and getting ready to tell you that I agreed, and that SQLAlchemy helps do pretty much what you're suggesting. Then I got to the bottom. Thanks for what you do!

But yeah, I might be missing something too, but by and large I've lived through a lot of code that depends on a specific database schema, rather than a domain model or some other abstraction. Even when database changes don't break things, there is so. much. regression. testing. before anyone is confident.

barrkel 9 years ago | |

I think asserting the primacy of a domain model over a relational model - in particular, an OO domain model - is a recipe for pain, especially in a polyglot world. OO is for modelling communicating agents where coordinated behaviour is the desired outcome. Databases are for storing facts, not serialized objects; the ideal relational model would be pure enough that rows are almost Horn clauses.

I think OO is particularly poor at modeling the kinds of problems I've been involved with - all our perspectives are probably path dependent. My domains have involved lots of data transfer, manipulation, ETL, and transmitting tuples from one language to another; the interface between different modules is at the data layer, not the API layer, because the business of our business is working with the customer's data. In so far as there are core objects, they're things like schemas and transformation rules - configuration, metadata, information about the data and its transformations.

I particularly like the way it's possible to express global constraints over sets of data with relational algebra in a declarative style. Databases aren't (usually) built to evaluate these kinds of constraints, and I don't want to build stored procedure APIs or triggers that emulate them, but the conceptual model is powerful, much more powerful and expressive to me than anything I see in OO.

Wirth once wrote: Algorithms + Data = Programs. I'd put Data first; and in fact when I want to understand a big program, it's almost always the data structures and data flow which reveal more, than the control flow.

balfirevic 9 years ago | | |

> I think asserting the primacy of a domain model over a relational model - in particular, an OO domain model - is a recipe for pain, especially in a polyglot world.

It's unfortunate that, in common usage, "domain model" basically means "OO domain model". Relational model, one which properly reflects your domain, is most certainly as worthy of that name as an OO one.

We can, of course, debate if simply sharing relational database among multiple applications is good architecture. It very well might not be, due to missing intermediate layer that can absorb schema changes, provide caching, etc. Or due to crappiness of languages used to write stored procedures and databases generally not being great application servers (although PostgreSQL seems to be getting there...)

But it is not because relational model is not a "domain model".

haddr 9 years ago | |

Totally agree!

Once we were very close to implement a very similar scenario (stream of events, some lightweight processing, kafka for distribution, hbase/solr for storage). If you go for such thing, better be sure that your business analysts are on the same page as you. Otherwise, instead of a clean event sourcing model, you may end up with some sort of message protocol that spans across the schema/data format/business logic/weird requirements. It is just that this thing will become your new "model".

ewbourget 9 years ago | |

Exactly this. This wepay post is well written and interesting, and DBZ sounds pretty neat. However, I think that the use case they present is an antipattern that incents/greases the wheels of leaky abstraction. Hell is having ETLs break because one refactored OLTP...

cookiecaper 9 years ago |

While I respect the process involved in putting so many moving parts together, sometimes I wonder why we have to complicate these things so much.

From my reading of the flow chart, it appears that data jumps through 5 different data layers before it gets written in its final destination. This doesn't appear to be a transformation pipeline, that's just how much it takes to go from user input -> saved in indexed storage.

What unique featureset does WePay provide that justifies this type of complication?

As an industry, there's been an explosion of this massive overcomplication the last few years. We're contorting ourselves pretty bad here, and it seems primarily to be in service of the engineer's desire to, well, engineer. Can you imagine if we let road engineers get away with this kind of stuff?

barrkel 9 years ago |

I recently had cause to work with Kafka Connect; we needed to get data from MySQL into Hadoop. It was not a positive experience, and IMO Kafka Connect is pretty immature, and Kafka as it is currently constituted isn't well featured enough for this purpose.

Kafka Connect is architected around the idea of producers and consumers that either add messages to Kafka, or read messages from Kafka.

The MySQL producer isn't suited for anything other than the most basic of table replication, though; if you need any ETL, you'll be gluing more stuff into your pipeline downstream. And when the producer falls over, the first you'll know about it is when you read the logs or poll status indicators. It didn't give me warm fuzzies about reliability nor visibility nor flexibility. It was very basic stuff.

The Hadoop consumer had an unpleasant surprise: you have zero choice over table name in Hive metastore; your Kafka topic will be the name of your Hive table, no ifs, no buts, no choices. And since Kafka doesn't have any namespaces, either you're going to be running multiple Kafka clusters, or you need global agreement on topics vs Hive metadata (which does have namespaces). We have a multi-tenancy architecture and use namespaces. A non-starter.

Why do I think Kafka doesn't have the right feature set? Because Kafka message expiry has only two policies, as far as I could tell: time or space. Either your message is too old and gets discarded (en bloc, IIRC); or Kafka hits a space limit and starts clearing out old messages. The natural question that arises when you're using Kafka to buffer a consumer vs a producer, then, is flow control / backpressure: how do you get the producer to slow down when the consumer can't catch up? And vice versa? Well, there's knobs you can manually control to throttle the producer, but it's in your own hands. You're dancing at the edge of a cliff if a consumer has died and messages start expiring; there's nothing stopping data loss.

The only way you can start to turn this situation into a win is if (a) you have such a big firehose of data that nothing else can cope, or (b) you can take advantage of network effects and use Kafka as a data bus, not just as a pipe with exactly two ends. But it has to overcome the downsides.

mattbillenstein 9 years ago |

I've been building some bits of this although with I think much simpler components -- our event pipeline is nginx+lua -> nsq -> python consumer (transform, etc) -> bigquery I also have nsq_to_file hanging off of here to archive both event formats - raw+bigquery It is surprisingly little code and I'm happy with its performance although we are at a smaller scale. I've never been happy with the hadoop ecosystem or java tooling around it -- it's all designed to be scalable and fault tolerant, but it seems like it's always broken if you actually run one of these things.

Regarding the ETL, the thing I haven't figured out yet is what to do with the data that's not events. We do daily/hourly exports of those tables in-bulk, but it's not real-time in the data warehouse. This is mostly ok, but I'd love a magic bullet that let me stream these updates into BigQuery as well.

In any case -- nice blog post -- nice to see how others are doing it.

mslate 9 years ago |

- https://github.com/zendesk/maxwell

- https://github.com/confluentinc/bottledwater-pg

luhn 9 years ago |

The "traditional" Kafka infrastructure (i.e. LinkedIn) uses Kafka as the source of truth. Everything gets written directly into Kafka and makes its way from there. This is a powerful concept, but involves giving up a lot of advantages of a RDBMS.

This is the first time I've heard of the approach WePay uses and I really like it. Now MySQL is the source of truth and Debezium pipes it into Kafka, allowing you to leverage Kafka while still using MySQL as your primary data store. (Or Postgres, which Debezium includes support for.)

swah 9 years ago | |

tl;dr but so the simple operations like "update my birthdate" happen with sync MySQL operations?

luhn 9 years ago | | |

Yep. All their applications always write directly to MySQL.