Kafka Recommendations and a High-Level Understanding of Kafka
This document covers Kafka best practices and recommendations for optimum performance.
It also touches on configuration basics and core Kafka concepts.
Kafka service disk recommendations
We recommend using multiple drives to get good throughput, and not sharing
the drives used for Kafka data with application logs or other OS
filesystem activity, to ensure good latency. With the available disks, either RAID
the drives together into a single volume, or format and mount each drive as its own
directory. Since Kafka has replication, the redundancy provided by RAID can also
be provided at the application level. This choice has several tradeoffs.
a) Multiple disks and mount points
With this option, if you configure multiple data
directories, partitions will be assigned round-robin to the data directories. Each
partition resides entirely in one of the data directories. If data is not well
balanced among partitions, this can lead to load imbalance between disks.
b) RAID 10 (1+0), single mount
This is a good option as it has good performance. RAID can potentially do better at balancing
load between disks (although it doesn't always seem to) because it balances
load at a lower level. The primary downside of RAID is that it reduces the
available disk space. Another benefit of RAID is the ability to
tolerate disk failures.
RAID 10 is the recommended way to configure the Kafka filesystem,
as it removes overheads such as balancing
data across disks and provides disk failure tolerance.
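The two layouts above differ mainly in how log.dirs is set in server.properties. A sketch, with example mount points (the paths are assumptions):

```properties
# Option (a): multiple data directories (JBOD) — partitions are
# assigned round-robin across the listed directories
log.dirs=/data/kafka1,/data/kafka2,/data/kafka3

# Option (b): a single RAID 10 volume mounted at one path
# log.dirs=/data/kafka
```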
Kafka service filesystem recommendation
The recommendation is to run Kafka on XFS or
ext4. XFS typically performs well with little tuning compared to ext4, and
it has become the default filesystem for many Linux distributions.
The recommended setting is XFS.
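A sketch of formatting and mounting a dedicated Kafka data disk as XFS; the device name and mount point are assumptions, and the commands must be run as root:

```shell
# WARNING: mkfs destroys any existing data on the device
mkfs.xfs /dev/sdb

# Mount with noatime so reads do not trigger access-time metadata writes
mkdir -p /data/kafka
mount -o noatime /dev/sdb /data/kafka

# Persist the mount across reboots
echo '/dev/sdb  /data/kafka  xfs  noatime  0 0' >> /etc/fstab
```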
Kafka considerations and recommended configurations
The settings listed below are the main settings an admin will want to tune.
a) Java
The recommended JDK version is 1.8 with the G1 collector (older freely available versions
have disclosed security vulnerabilities).
Below are the recommended GC tuning
parameters:
-Xms6g -Xmx6g
-XX:MetaspaceSize=96m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
-XX:InitiatingHeapOccupancyPercent=35
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50
-XX:MaxMetaspaceFreeRatio=80
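These flags are typically supplied through the environment variables that Kafka's start scripts read (KAFKA_HEAP_OPTS for the heap, KAFKA_JVM_PERFORMANCE_OPTS for the rest). A sketch:

```shell
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:MetaspaceSize=96m -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 \
  -XX:MaxMetaspaceFreeRatio=80"
bin/kafka-server-start.sh config/server.properties
```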
b) OS-level configuration
1) File descriptors: raise the limit to more than 100K
2) Lower swappiness (vm.swappiness)
3) Heap of 6 to 8 GB
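Lowering swappiness keeps the kernel from swapping out the broker's heap. A sketch of applying it on a Linux host (run as root):

```shell
# Apply immediately
sysctl -w vm.swappiness=1

# Persist across reboots
echo 'vm.swappiness=1' >> /etc/sysctl.conf
```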
c) Configuration considerations
broker.id:-
An integer id that identifies a broker. No two brokers in the same Kafka
cluster can have the same id.
zookeeper.connect:-
The list of ZooKeeper hosts that
the broker registers with. It is recommended that you configure this with all the
hosts in your ZooKeeper cluster.
log.dirs:-
The directories in which the Kafka log data is located; the default is under
/tmp.
We recommend relocating the log data from the OS
directory to a dedicated data disk. The broker's application logs, by contrast, can share their location with other Hadoop
service logs.
num.partitions:-
The default number of log
partitions for auto-created topics (the default value is 1). We recommend
increasing this, as it is better to over-partition a topic. Over-partitioning a
topic leads to better data balancing as well as aiding consumer parallelism. For
keyed data in particular, you want to avoid changing the number of partitions
in a topic.
Recommended value is 3.
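Pulling the broker settings above together, a minimal server.properties sketch (the id, host names, and path are assumptions):

```properties
broker.id=1
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
log.dirs=/data/kafka
num.partitions=3
```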
d) Replication configs and redundancy
default.replication.factor:-
The default replication factor that applies to auto-created topics. We
recommend setting this to at least 2.
Recommended value is 3.
unclean.leader.election.enable:-
Indicates whether replicas that are not in the ISR set can be elected
leader as a last resort, even though doing so may result in data loss.
*** Recommended value is false, since unclean election can lead to
data loss. The setting should only be enabled by an admin for recovery and maintenance,
and only under certain conditions.
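In server.properties, the recommended values above look like this (a sketch):

```properties
default.replication.factor=3
unclean.leader.election.enable=false
```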
Kafka File descriptors
Kafka uses a very
large number of files. At the same time, Kafka uses a large number of sockets
to communicate with the clients. All of this requires a relatively high number
of available file descriptors.
Sadly, many modern
Linux distributions ship with a paltry 1,024 file descriptors allowed per
process. This is far too low for even a small Kafka node, let alone one that
hosts hundreds of partitions.
You should increase
your file descriptor count to something very large, such as 100,000. This
process is irritatingly difficult and highly dependent on your particular OS
and distribution. Consult the documentation for your OS to determine how best
to change the allowed file descriptor count.
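On many Linux distributions this is done in /etc/security/limits.conf; a sketch, assuming the broker runs as a user named kafka (the user name and limit are assumptions):

```shell
# Raise the per-process open-file limit for the kafka user (run as root)
cat >> /etc/security/limits.conf <<'EOF'
kafka  soft  nofile  128000
kafka  hard  nofile  128000
EOF

# Verify the limit seen by a new shell for that user
ulimit -n
```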
Kafka consumers and producers
Kafka concepts
Note:- All consumers should use the latest (new) consumer API in order to avoid writing offsets to ZooKeeper.
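With the new consumer API, consumers are configured against the brokers via bootstrap.servers rather than zookeeper.connect, and offsets are committed to the internal __consumer_offsets topic instead of ZooKeeper. A sketch of a consumer config (host names and group id are assumptions):

```properties
bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
group.id=example-consumer-group
```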
Apache Kafka's MirrorMaker as a DR solution for Kafka
MirrorMaker is a stand-alone tool for copying data between two Apache Kafka clusters. It is little more than a Kafka consumer and producer hooked together.
Data will be read from topics in the origin cluster and written to a topic with the same name in the destination cluster. You can run many such mirroring processes to increase throughput and for fault tolerance (if one process dies, the others will take over the additional load).
The origin and destination clusters are completely independent entities: they can have different numbers of partitions, and the offsets will not be the same. For this reason the mirror cluster is not really intended as a fault-tolerance mechanism (as the consumer position will be different). The MirrorMaker process will, however, retain and use the message key for partitioning, so order is preserved on a per-key basis.
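A sketch of launching a MirrorMaker process; the config file names and topic pattern are assumptions, and the flags match the classic (MirrorMaker 1) tool:

```shell
bin/kafka-mirror-maker.sh \
  --consumer.config source-cluster-consumer.properties \
  --producer.config target-cluster-producer.properties \
  --whitelist ".*" \
  --num.streams 4
```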
Download and configure Kafka from https://kafka.apache.org.
Upcoming blog: Apache Kafka admin tasks.