Kafka recommendations and a high-level understanding of Kafka


This document covers Kafka best practices and recommendations for optimum performance.
It also touches on configuration basics and core Kafka concepts.

Kafka services disk recommendations


We recommend using multiple drives to get good throughput, and not sharing the drives used for Kafka data with application logs or other OS filesystem activity, to ensure good latency. With the available disks, either RAID the drives together into a single volume, or format and mount each drive as its own directory. Since Kafka has replication, the redundancy provided by RAID can also be provided at the application level. This choice has several trade-offs.

a)   Multiple disks and mount points


With this option, if you configure multiple data directories, partitions will be assigned round-robin to the data directories. Each partition will live entirely in one of the data directories. If data is not well balanced among partitions, this can lead to load imbalance between disks.
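As a sketch, multiple data directories are declared as a comma-separated list in the broker's server.properties; the paths below are illustrative:

```properties
# server.properties -- each path should be on a separate physical disk/mount point
# Partitions are assigned round-robin across these directories
log.dirs=/disk1/kafka-data,/disk2/kafka-data,/disk3/kafka-data
```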

b)   RAID 10 (1+0) with a single mount point


This is a good option as it performs well. RAID can potentially do a better job of balancing load between disks (although it doesn't always seem to) because it balances load at a lower level. The primary downside of RAID is that it reduces the available disk space. Another potential benefit of RAID is the ability to tolerate disk failures.

RAID 10 is the recommended way to configure the Kafka filesystem, since it removes overheads such as balancing data across disks and provides tolerance of disk failures.

Kafka Service Filesystem recommendation


The recommendation is to run Kafka on XFS or ext4. XFS typically performs well with little tuning compared to ext4, and it has become the default filesystem for many Linux distributions.
The recommended filesystem is XFS.
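As a sketch of the two recommendations above, assuming four dedicated drives (the device names /dev/sdb through /dev/sde and the mount point are placeholders; run as root and adjust to your hardware):

```shell
# Build a single RAID 10 volume from four drives (device names are illustrative)
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Format the volume as XFS and mount it for Kafka data only
mkfs.xfs /dev/md0
mkdir -p /data/kafka
mount -o noatime /dev/md0 /data/kafka
```

The noatime option avoids an unnecessary metadata write on every read; add the matching entry to /etc/fstab so the mount persists across reboots.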

Kafka considerations and recommended configurations


The settings listed below are the main ones an administrator will want to tune.

a)   Java


The recommended JDK version is 1.8 with the G1 collector (older freely available JDK versions have disclosed security vulnerabilities).
Below are the GC tuning parameters:

-Xms6g -Xmx6g -XX:MetaspaceSize=96m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80
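These flags are typically supplied through the environment variables that the Kafka start scripts (bin/kafka-server-start.sh via kafka-run-class.sh) read; as a sketch:

```shell
# Heap flags go in KAFKA_HEAP_OPTS; GC and metaspace flags go in
# KAFKA_JVM_PERFORMANCE_OPTS. Both are picked up when the broker is launched.
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:MetaspaceSize=96m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80"
```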

    b) OS-level configuration
1) File descriptor limit > 100K
2) Low swappiness
3) Heap of 6 to 8 GB
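The swappiness setting can be sketched as a kernel parameter (a value of 1 is a common choice to keep swapping to a bare minimum; heap sizing is covered under the Java section, and the file descriptor limit under its own section below):

```properties
# /etc/sysctl.conf -- discourage the kernel from swapping out the broker's heap
# (apply with "sysctl -p")
vm.swappiness=1
```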

c)   Configuration considerations


broker.id:-
An integer id that identifies a broker. No two brokers in the same Kafka cluster can have the same id.

zookeeper.connect:-
The list of ZooKeeper hosts that the broker registers with. It is recommended that you configure this with all the hosts in your ZooKeeper ensemble.

log.dirs:-
The directories in which the Kafka log data is located; the default is under /tmp.
We recommend relocating this data from the OS directory to a dedicated data disk. The Kafka service logs themselves can share a location with other Hadoop service logs.

num.partitions:-
The default number of log partitions for auto-created topics (the default value is 1). We recommend increasing this, as it is better to over-partition a topic. Over-partitioning leads to better data balancing and aids consumer parallelism. For keyed data in particular, you want to avoid changing the number of partitions in a topic.
The recommended value is 3.
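Pulled together, a minimal server.properties sketch for one broker (the id, hostnames, and path are illustrative):

```properties
# server.properties -- one broker's identity and defaults (values illustrative)
broker.id=0
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
log.dirs=/data/kafka
num.partitions=3
```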

d)   Replication configs and redundancy


These settings can be searched for by name in the Kafka configuration, as below.

default.replication.factor:-
The default replication factor that applies to auto-created topics. We recommend setting this to at least 2.
The recommended value is 3.

unclean.leader.election.enable:-

Indicates whether replicas that are not in the ISR can be elected leader as a last resort, even though doing so may result in data loss.

*** The recommended value is false, since unclean election can lead to data loss. The setting should be enabled only by an administrator, for recovery and maintenance, and only under certain conditions.
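A sketch of the two replication settings in server.properties:

```properties
# server.properties -- replication defaults for auto-created topics
default.replication.factor=3
# leave false; flip on only temporarily for administrator-driven recovery
unclean.leader.election.enable=false
```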

Kafka File descriptors


Kafka uses a very large number of files. At the same time, Kafka uses a large number of sockets to communicate with the clients. All of this requires a relatively high number of available file descriptors.

Sadly, many modern Linux distributions ship with a paltry 1,024 file descriptors allowed per process. This is far too low for even a small Kafka node, let alone one that hosts hundreds of partitions.

You should increase your file descriptor count to something very large, such as 100,000. This process is irritatingly difficult and highly dependent on your particular OS and distribution. Consult the documentation for your OS to determine how best to change the allowed file descriptor count.
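As a sketch, the limit can be raised persistently in /etc/security/limits.conf on most distributions (this assumes the broker runs as a user named kafka):

```properties
# /etc/security/limits.conf -- raise the per-process file-descriptor limit
# well past 100K for the broker's user (assumed here to be "kafka")
kafka  soft  nofile  128000
kafka  hard  nofile  128000
```

After re-login, verify with `ulimit -n` as the kafka user.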

Kafka consumer and producer





Kafka concept





Note:-  All consumers should use the latest API in order to avoid writing offsets to ZooKeeper.
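With the new consumer API, the client talks to the brokers directly and commits offsets to Kafka's internal __consumer_offsets topic instead of ZooKeeper. A sketch of the relevant consumer configuration (hostnames and group id are illustrative):

```properties
# consumer.properties -- new-API consumers use bootstrap.servers, not zookeeper.connect
bootstrap.servers=broker1:9092,broker2:9092
group.id=my-consumer-group
```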

Apache Kafka’s MirrorMaker as a DR solution for Kafka

MirrorMaker is a stand-alone tool for copying data between two Apache Kafka clusters. It is little more than a Kafka consumer and producer hooked together.
Data is read from topics in the origin cluster and written to a topic with the same name in the destination cluster. You can run many such mirroring processes to increase throughput and for fault tolerance (if one process dies, the others will take over the additional load).
The origin and destination clusters are completely independent entities: they can have different numbers of partitions and the offsets will not be the same. For this reason the mirror cluster is not really intended as a fault-tolerance mechanism (as the consumer position will be different). The mirror maker process will, however, retain and use the message key for partitioning so order is preserved on a per-key basis.
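A minimal sketch of running MirrorMaker from a Kafka installation directory; the two properties file names are assumptions (the consumer config points at the origin cluster, the producer config at the destination):

```shell
# Mirror every topic (--whitelist takes a regex) from origin to destination.
# source-cluster.properties: bootstrap.servers of the origin cluster + a group.id
# dest-cluster.properties:   bootstrap.servers of the destination cluster
bin/kafka-mirror-maker.sh \
    --consumer.config source-cluster.properties \
    --producer.config dest-cluster.properties \
    --whitelist ".*"
```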
Download and configure Kafka from https://kafka.apache.org
******* Upcoming blog: Apache Kafka admin tasks
