Should you run Kafka Connect in Distributed or Standalone mode?

#kafka #kafkaconnect

Kafka Connect can be deployed in two modes: Standalone or Distributed.

I usually recommend Distributed for several reasons:

You can run it as just a single node if you want
It can scale out across multiple workers
It is fault-tolerant
It runs equally well on a single-node sandbox or in a multi-node production environment
The configuration method is the same however you run it

I usually find that Standalone is appropriate when:

You need to guarantee locality of task execution, such as picking up a log file from a folder on a specific machine
You don’t care about scale or fault-tolerance ;-)
You like re-learning how to configure something when you realise that you do care about scale or fault-tolerance X-D

My last snarky point on the list is the reason I recommend Distributed mode even if you’re just playing around with Kafka Connect on a laptop: learn it in Distributed mode and you learn it once, and then you’re all set. If you start with Standalone, where you pass .properties configuration files to the worker at startup, and later move to Distributed, you have to re-learn how to configure connectors through the REST interface instead.
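To make that difference concrete, here is a minimal sketch of what the same connector configuration might look like in each mode. It assumes a Distributed worker with its REST interface on the default port 8083 on localhost. The connector name, file path and topic are made up for illustration, and the helper functions are hypothetical, not part of Kafka Connect.

```python
import json
import urllib.request

# Assumption: a Distributed-mode worker listening on the default REST port.
CONNECT_URL = "http://localhost:8083"

# What you would hand to a Standalone worker at startup as a .properties file.
# Hypothetical connector: the name, file path and topic are illustrative only.
STANDALONE_PROPERTIES = """\
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/test.txt
topic=connect-test
"""


def parse_properties(text):
    """Turn simple key=value lines into a dict, skipping blanks and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def build_connector_request(config, base_url=CONNECT_URL):
    """Build a PUT /connectors/<name>/config request (create or update)."""
    return urllib.request.Request(
        url=f"{base_url}/connectors/{config['name']}/config",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request):
    """Send the request; this needs a running Distributed worker."""
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")


# In Distributed mode the same settings travel as JSON over REST.
request = build_connector_request(parse_properties(STANDALONE_PROPERTIES))
```

Calling `send(request)` against a running worker would submit the connector. The settings are the same in both modes; what changes is how they reach the worker, and that is the part you would otherwise have to re-learn.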

🗒️ If you want to learn more about deploying Kafka Connect in Distributed mode, take a look at my list of common mistakes made when configuring multiple Kafka Connect workers.
🎥 To learn more about Kafka Connect in general, check out my Kafka Summit London 2019 talk.

Some follow-ups to this:

Gunnar Morling 🌍

@gunnarmorling

@rmoff @apachekafka Yes, recommending the same. On K8s I see many Debezium users work with single node clusters (leaving scheduling to the orchestrator, no rebalancing) and even single connector (healthcheck friendly).

15:55 PM - 11 Dec 2019