CDC Introduction

Before we go over Maxwell what we need to understand the necessity of softwares like Maxwell. CDC is labeled for change Data Capture which is mostly needed by organizations for applying data analytics/science on their data to get insights of their data, customers and it’s patterns to extract useful information for all their business needs. 

Its an enterprise out of the box solution/feature available in some of the enterprise softwares like Oracle,MSSQL etc. Oracle MySQL doesn’t have a direct solution for this need yet. 

What are MySQL CDC options available ? 

  1. Traditional triggers based CDC – Expensive operation in terms of load on database but works
  2. Maxwell Daemon – Works with both gtid and non-gtid based replication but it needs to run on production read write node or on a slave node where the slave is not in read-only mode and it supports only MySQL.
  3. Debezium – However debezium connector supports wide variety of DBMS softwares like mysql/postgres/mssql/oracle/mongo etc

What is Maxwell ? 

Maxwell’s daemon is a software application that reads MySQL binlogs just like another mysql replica and writes row updates as JSON to Kafka, Kinesis, or other streaming platforms. Maxwell has very low operational overhead as compared to triggers or any other ETL based process however it requires access to maintain its own database so it can track the coordinates it has processes through so far. Just like any other replica we can point Maxwell to any working node in a cluster upon failure. Transition is easy if its gtid based replication if not you might have to bootstrap it from an older position just from the beginning so you don’t miss any data.

How to setup Maxwell ? 

The setup process in very simplified and well documented in the Maxwell documentation, however we will record some key information here. 

Components needed for Maxwell 

  • mysql ( data source )
  • Kafka Connect source ( handled by Maxwell application ) 
  • Kafka ( data streaming pipeline is also part of Maxwell application )
  • Kafka Connect sink ( eg. Maxwell sink – Only for MySQL to MySQL ) [ NOTE: confluent kafka connect connectors are preferred and better solutions ] 

INFO: If you don’t want to keep things simple is handling the CDC environment then just a docker on an existing slave or additional node is enough however you shouldn’t build anything business critical on top of such single point of failure nodes. 

Possible Architecture

MySQL Settings

If you have a standalone instance where you want to enable CDC then you need to have basic replication settings like enabling bin logs with row based replication on that standalone node to act as a primary without any real slaves to it.

[mysqld]
server_id=1
log-bin=master-binlog
binlog_format=row
binlog_row_image=FULL

Similar to replication Maxwell also needs a mysql user to connect to the mysql source database along with write permissions to the maxwell database.

mysql> CREATE USER 'maxwell'@'%' IDENTIFIED BY 'XXXXXX';
mysql> GRANT ALL ON maxwell.* TO 'maxwell'@'%';
mysql> GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'maxwell'@'%';

# or for running maxwell locally:

mysql> CREATE USER 'maxwell'@'localhost' IDENTIFIED BY 'XXXXXX';
mysql> GRANT ALL ON maxwell.* TO 'maxwell'@'localhost';
mysql> GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'maxwell'@'localhost';

Sample Maxwell Config

below are the mandatory config parameters Maxwell needs to know when you do an initial bootstrap. 

user=maxwell_db_user
password=maxwell_db_user_password
host=localhost/hostname
schema_database=maxwell
port=3306
producer=kafka
clientid=maxwell_app
kafka.bootstrap.servers=localhost:9092
log_level=info
kafka_topic=maxwell_kafka_topic
bootstrapper=sync

Start Maxwell Service

Command line

bin/maxwell --user='maxwell' --password='XXXXXX' --host='127.0.0.1' --producer=stdout

Docker

docker run -it --rm zendesk/maxwell bin/maxwell --user=$MYSQL_USERNAME \
    --password=$MYSQL_PASSWORD --host=$MYSQL_HOST --producer=stdout

Docker using Config

docker run -d -e MAXWELLCONF=maxwell.config --restart=always --name maxwell_docker zendesk/maxwell bin/maxwell

Tables in Maxwell Database

mysql> show tables;
+-------------------+
| Tables_in_maxwell |
+-------------------+
| bootstrap         |
| columns           | - Has the list of columns from all schema and tables
| databases         | - has the list of databases for which CDC is enabled
| heartbeats        | 
| positions         | - Latest position till which maxwell has processed
| schemas           |
| tables            | - records all the tables and whether they have primary key or not. 
+-------------------+
7 rows in set (0.01 sec)

Positions

mysql> select * from positions;
+-----------+------------------+-----------------+----------+------------+--------------+---------------------+
| server_id | binlog_file      | binlog_position | gtid_set | client_id  | heartbeat_at | last_heartbeat_read |
+-----------+------------------+-----------------+----------+------------+--------------+---------------------+
|  xxxxxxxx | mysql-bin.xxxxxx |       459956100 | NULL     | maxwell    |         NULL |       1629661069972 |
+-----------+------------------+-----------------+----------+------------+--------------+---------------------+
1 row in set (0.00 sec)

The above position table’s data is from a non-gtid based Maxwell however a GTID based Maxwell would output the gtid set till which Maxwell has caught up. 

One of the most important requirement for Maxwell is all the tables should have primary key. If you have tables without primary key Maxwell wouldn’t work. Maxwell provides json formatted data 

Sample Json data

> db.mysqlcdc.findOne()
{
	"_id" : ObjectId("59c1v783af5c3f0c2bfcd31d"),
	"database" : "employee",
	"xid" : NumberLong(1825940741),
	"data" : {
		"IsPrimaryAccount" : NumberLong(0),
		"Status" : NumberLong(3),
		"CreateByUserID" : NumberLong(1),
		"CountryID" : null,
		"country" : "IND",
		"username" : "Dinesh",
		"FirstName" : "Kumar",
        "LastName" : "Weis",
		"CreateDate" : "2016-08-15 19:52:09",
		"CurrencyID" : null,
		"ModifyDate" : "2017-06-06 10:38:51",
	},
	"commit" : true,
	"type" : "insert",
	"table" : "employee",
	"ts" : NumberLong(1505896234)
}

Conclusion

CDC is a very important feature in today’s world to capture all data changes. Since mysql doesn’t have an out of box CDC solution, Maxwell and Debezium serves the purpose. Which one should be used ? Well this depends on how comfortable are you to work with Kafka components.Under any circumstances Debezium is a better solutions which is the most popular choices it supports more data sources than maxwell. However on the other Maxwell plays a simple CDC solution. Maxwell acts as a source connector which connects to Source and sends the data to Kafka but after that you would need some custom sink connector to send the data to some persistent storage data store.