Add km module kafka

docs/zh/Kafka分享/Kafka Controller /Controller与Brokers之间的网络通信.md
## Preface

In [【kafka源码】Controller启动过程以及选举流程源码分析]() we walked through how a Broker, once elected Controller, initializes the controller context. I deliberately skipped the part about the network communication between the Controller and the Brokers because it deserves its own article. So today let's take a close look at **the network communication between the Controller and the Brokers**.

## Source Code Analysis

### 1. Entry point: ControllerChannelManager.startup()

Call chain:

-> `KafkaController.processStartup`

-> `KafkaController.elect()`

-> `KafkaController.onControllerFailover()`

-> `KafkaController.initializeControllerContext()`
```scala
def startup() = {
  // Call addNewBroker for every live broker
  controllerContext.liveOrShuttingDownBrokers.foreach(addNewBroker)

  brokerLock synchronized {
    // Start the request-send threads
    brokerStateInfo.foreach(brokerState => startRequestSendThread(brokerState._1))
  }
}
```
### 2. addNewBroker: building the connection info for each broker

> For every live broker, objects such as the `NetworkClient` and the `RequestSendThread` are created and wrapped into a `ControllerBrokerStateInfo` object,
> which is held in the map `brokerStateInfo` with key = brokerId and value = `ControllerBrokerStateInfo`.
```scala
private def addNewBroker(broker: Broker): Unit = {
  // part of the code omitted
  val threadName = threadNamePrefix match {
    case None => s"Controller-${config.brokerId}-to-broker-${broker.id}-send-thread"
    case Some(name) => s"$name:Controller-${config.brokerId}-to-broker-${broker.id}-send-thread"
  }

  val requestRateAndQueueTimeMetrics = newTimer(
    RequestRateAndQueueTimeMetricName, TimeUnit.MILLISECONDS, TimeUnit.SECONDS, brokerMetricTags(broker.id)
  )

  // Build the request-send thread
  val requestThread = new RequestSendThread(config.brokerId, controllerContext, messageQueue, networkClient,
    brokerNode, config, time, requestRateAndQueueTimeMetrics, stateChangeLogger, threadName)
  requestThread.setDaemon(false)

  val queueSizeGauge = newGauge(QueueSizeMetricName, () => messageQueue.size, brokerMetricTags(broker.id))
  // Wrap everything up and cache it in brokerStateInfo
  brokerStateInfo.put(broker.id, ControllerBrokerStateInfo(networkClient, brokerNode, messageQueue,
    requestThread, queueSizeGauge, requestRateAndQueueTimeMetrics, reconfigurableChannelBuilder))
}
```
1. Every live broker is wrapped into a `ControllerBrokerStateInfo` object and kept in the cache. The object contains a `RequestSendThread`, the request-send thread; when that thread actually runs is analyzed below.
2. `messageQueue`: a blocking queue holding the pending requests. Each element is a `QueueItem`, which wraps the `ApiKeys` of the request, the `AbstractControlRequest` request builder, the `AbstractResponse` callback, and `enqueueTimeMs`, the enqueue time.
3. `RequestSendThread`: the thread that sends the requests; all network traffic to the brokers goes through it. For example, in the figure below it sends the `UPDATE_METADATA` request to all brokers (including itself).

![在这里插入图片描述](https://img-blog.csdnimg.cn/20210616215954457.png)
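As a reference, the `QueueItem` described in point 2 has roughly the following shape (a simplified sketch based on the fields listed above, not a verbatim copy of the source):

```scala
// Sketch only: the shape of the items sitting in the RequestSendThread's blocking queue.
case class QueueItem(
  apiKey: ApiKeys,                                                      // which request type, e.g. UPDATE_METADATA
  request: AbstractControlRequest.Builder[_ <: AbstractControlRequest], // the request builder
  callback: AbstractResponse => Unit,                                   // invoked when the response comes back
  enqueueTimeMs: Long                                                   // used for the queue-time metric
)
```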
### 3. startRequestSendThread: starting the network request threads

> Start the request-send thread for every broker connection.

```scala
protected def startRequestSendThread(brokerId: Int): Unit = {
  val requestThread = brokerStateInfo(brokerId).requestSendThread
  if (requestThread.getState == Thread.State.NEW)
    requestThread.start()
}
```

The thread's work loop is shown below (part of the code omitted):
```scala
override def doWork(): Unit = {

  def backoff(): Unit = pause(100, TimeUnit.MILLISECONDS)

  // Take a pending request from the blocking queue (blocks if the queue is empty)
  val QueueItem(apiKey, requestBuilder, callback, enqueueTimeMs) = queue.take()
  requestRateAndQueueTimeMetrics.update(time.milliseconds() - enqueueTimeMs, TimeUnit.MILLISECONDS)

  var clientResponse: ClientResponse = null
  try {
    var isSendSuccessful = false
    while (isRunning && !isSendSuccessful) {
      // if a broker goes down for a long time, then at some point the controller's zookeeper listener will trigger a
      // removeBroker which will invoke shutdown() on this thread. At that point, we will stop retrying.
      try {
        // Check whether the connection to the broker is up; if not, retry after a backoff
        if (!brokerReady()) {
          isSendSuccessful = false
          backoff()
        }
        else {
          // Build the request
          val clientRequest = networkClient.newClientRequest(brokerNode.idString, requestBuilder,
            time.milliseconds(), true)
          // Send it over the network and wait for the response
          clientResponse = NetworkClientUtils.sendAndReceive(networkClient, clientRequest, time)
          isSendSuccessful = true
        }
      } catch {
        case e: Throwable =>
          // omitted in this excerpt: log a warning, close the connection and retry
          backoff()
      }
    }
    if (clientResponse != null) {
      val requestHeader = clientResponse.requestHeader
      val api = requestHeader.apiKey
      if (api != ApiKeys.LEADER_AND_ISR && api != ApiKeys.STOP_REPLICA && api != ApiKeys.UPDATE_METADATA)
        throw new KafkaException(s"Unexpected apiKey received: $apiKey")

      // extract the response body before invoking the callback
      val response = clientResponse.responseBody
      if (callback != null) {
        callback(response)
      }
    }
  } catch {
    case e: Throwable => // omitted
  }
}
```
1. Take a request from the queue `queue`; if there is one, process it, otherwise block.
2. Check whether the target broker of the request is reachable. If not, keep retrying; at some point the controller's zookeeper listener will trigger a `removeBroker`, which calls shutdown() on this thread, and the retries stop.
3. Send the request.
4. If the send fails, reconnect to the broker and resend.
5. On success, invoke the callback.
6. Note that <font color="red">if the ApiKeys in the Response to a Controller-initiated request is not one of `LEADER_AND_ISR`, `STOP_REPLICA` or `UPDATE_METADATA`, an exception is thrown and the callback is not invoked.</font> It is a bit odd that, if the Controller is only allowed to issue these requests, the check happens after the response comes back rather than before the request is sent. **My guess is that the Broker includes the ApiKeys in the Response so that the Controller's callback can branch on it, while still only exposing those three APIs to the brokers.**
### 4. Adding requests to the RequestSendThread's queue

> After the threads above are started, the queue contains no pending requests yet. So when do requests get added?

Every request addition eventually goes through `sendRequest`; a reverse lookup of its callers makes that clear:
```scala
def sendRequest(brokerId: Int, request: AbstractControlRequest.Builder[_ <: AbstractControlRequest],
                callback: AbstractResponse => Unit = null): Unit = {
  brokerLock synchronized {
    val stateInfoOpt = brokerStateInfo.get(brokerId)
    stateInfoOpt match {
      case Some(stateInfo) =>
        stateInfo.messageQueue.put(QueueItem(request.apiKey, request, callback, time.milliseconds()))
      case None =>
        warn(s"Not sending request $request to broker $brokerId, since it is offline.")
    }
  }
}
```
**Here is an example** 🌰: let's watch the Controller send an `UPDATE_METADATA` request to a Broker.

![在这里插入图片描述](https://img-blog.csdnimg.cn/2021061622385991.png)
![在这里插入图片描述](https://img-blog.csdnimg.cn/20210616223930488.png)

1. You can see that `sendRequest` is called with ApiKey = `UPDATE_METADATA`.
2. The callback is the one shown above: it puts an `UpdateMetadataResponseReceived` event into the event manager (`ControllerEventManager`).
3. When the request succeeds, the callback from step 2 runs and `UpdateMetadataResponseReceived` is added to the event manager, where it is processed in turn (queued).
4. It is handled as shown below; it does not do much — it just logs if the response carries an error.

![在这里插入图片描述](https://img-blog.csdnimg.cn/20210616224724206.png)
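To make the flow concrete, a hypothetical call would look roughly like this (`updateMetadataRequestBuilder`, `controllerChannelManager` and `eventManager` are placeholder names for illustration, not exact excerpts from the source):

```scala
// Sketch: enqueue an UPDATE_METADATA request and turn its response into a controller event.
controllerChannelManager.sendRequest(brokerId, updateMetadataRequestBuilder,
  (response: AbstractResponse) => {
    val updateMetadataResponse = response.asInstanceOf[UpdateMetadataResponse]
    // the callback only enqueues an event; the real handling happens on the controller event thread
    eventManager.put(UpdateMetadataResponseReceived(updateMetadataResponse, brokerId))
  })
```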
### 5. How a Broker receives the Controller's requests

> We said above that the Controller issues requests to all Brokers (including itself); so where do the Brokers receive them? Let's take a look.

We already covered this in [【kafka源码】TopicCommand之创建Topic源码解析](); the handling is the same.
Taking the same example 🌰 as above: once the request is sent, the Broker handles it in `apis.handle(request)` inside `KafkaRequestHandler.run`.

![在这里插入图片描述](https://img-blog.csdnimg.cn/20210617091101220.png)

You can see that every API request is enumerated there; we find the handler for `UPDATE_METADATA`.
We will not step into its logic here, as that is beyond the scope of this article.
### 6. A Broker going offline

Let's simulate a Broker crash by manually deleting its `/brokers/ids/{brokerId}` node in zk. Since the Controller has a `watch` on that node, it receives the change notification and calls `KafkaController.processBrokerChange()`:
```scala
private def processBrokerChange(): Unit = {
  if (!isActive) return
  val curBrokerAndEpochs = zkClient.getAllBrokerAndEpochsInCluster
  val curBrokerIdAndEpochs = curBrokerAndEpochs map { case (broker, epoch) => (broker.id, epoch) }
  val curBrokerIds = curBrokerIdAndEpochs.keySet
  val liveOrShuttingDownBrokerIds = controllerContext.liveOrShuttingDownBrokerIds
  val newBrokerIds = curBrokerIds -- liveOrShuttingDownBrokerIds
  val deadBrokerIds = liveOrShuttingDownBrokerIds -- curBrokerIds
  val bouncedBrokerIds = (curBrokerIds & liveOrShuttingDownBrokerIds)
    .filter(brokerId => curBrokerIdAndEpochs(brokerId) > controllerContext.liveBrokerIdAndEpochs(brokerId))
  val newBrokerAndEpochs = curBrokerAndEpochs.filter { case (broker, _) => newBrokerIds.contains(broker.id) }
  val bouncedBrokerAndEpochs = curBrokerAndEpochs.filter { case (broker, _) => bouncedBrokerIds.contains(broker.id) }
  val newBrokerIdsSorted = newBrokerIds.toSeq.sorted
  val deadBrokerIdsSorted = deadBrokerIds.toSeq.sorted
  val liveBrokerIdsSorted = curBrokerIds.toSeq.sorted
  val bouncedBrokerIdsSorted = bouncedBrokerIds.toSeq.sorted
  info(s"Newly added brokers: ${newBrokerIdsSorted.mkString(",")}, " +
    s"deleted brokers: ${deadBrokerIdsSorted.mkString(",")}, " +
    s"bounced brokers: ${bouncedBrokerIdsSorted.mkString(",")}, " +
    s"all live brokers: ${liveBrokerIdsSorted.mkString(",")}")

  newBrokerAndEpochs.keySet.foreach(controllerChannelManager.addBroker)
  bouncedBrokerIds.foreach(controllerChannelManager.removeBroker)
  bouncedBrokerAndEpochs.keySet.foreach(controllerChannelManager.addBroker)
  deadBrokerIds.foreach(controllerChannelManager.removeBroker)
  if (newBrokerIds.nonEmpty) {
    controllerContext.addLiveBrokersAndEpochs(newBrokerAndEpochs)
    onBrokerStartup(newBrokerIdsSorted)
  }
  if (bouncedBrokerIds.nonEmpty) {
    controllerContext.removeLiveBrokers(bouncedBrokerIds)
    onBrokerFailure(bouncedBrokerIdsSorted)
    controllerContext.addLiveBrokersAndEpochs(bouncedBrokerAndEpochs)
    onBrokerStartup(bouncedBrokerIdsSorted)
  }
  if (deadBrokerIds.nonEmpty) {
    controllerContext.removeLiveBrokers(deadBrokerIds)
    onBrokerFailure(deadBrokerIdsSorted)
  }

  if (newBrokerIds.nonEmpty || deadBrokerIds.nonEmpty || bouncedBrokerIds.nonEmpty) {
    info(s"Updated broker epochs cache: ${controllerContext.liveBrokerIdAndEpochs}")
  }
}
```
1. It fetches all broker information from zk and compares it with the broker information currently cached by the Controller (a worked example of the set arithmetic is sketched after this list).
2. For any newly online Broker, the broker-startup flow is executed.
3. For any removed Broker, the broker-offline flow is executed, e.g. `removeLiveBrokers`.

Once the node deletion is observed, the Controller considers that Broker offline; even if the Broker process is still healthy, it can no longer serve.
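A tiny worked example of the set arithmetic above (made-up broker ids and epochs, purely for illustration):

```scala
// Suppose the controller's cache knows brokers 1,2,3 and zk currently lists 2, 3 (with a higher epoch) and 4:
val liveOrShuttingDownBrokerIds = Set(1, 2, 3)
val curBrokerIdAndEpochs        = Map(2 -> 10L, 3 -> 25L, 4 -> 7L)   // broker 3 restarted, so its epoch was bumped
val cachedBrokerIdAndEpochs     = Map(1 -> 5L, 2 -> 10L, 3 -> 20L)
val curBrokerIds = curBrokerIdAndEpochs.keySet

val newBrokerIds     = curBrokerIds -- liveOrShuttingDownBrokerIds        // Set(4): newly started
val deadBrokerIds    = liveOrShuttingDownBrokerIds -- curBrokerIds        // Set(1): gone from zk
val bouncedBrokerIds = (curBrokerIds & liveOrShuttingDownBrokerIds)
  .filter(id => curBrokerIdAndEpochs(id) > cachedBrokerIdAndEpochs(id))   // Set(3): restarted ("bounced")
```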
### 7. Brokers going online and offline

This article focuses on **the network communication between the Controller and the Brokers**,
so the **broker online/offline flow** gets its own article: [【kafka源码】Brokers的上下线流程](https://shirenchuang.blog.csdn.net/article/details/117846476)
## Source Code Summary

This article is fairly simple: the Controller talks to the Brokers through the `RequestSendThread` threads.
The blocking request queue maintained by each `RequestSendThread` stays blocked while there is no task;
whenever a request needs to be sent, a task is simply added to that `queue`.

The Controller is itself a Broker, so it also receives and processes the requests it sends out.
## Q&A

### What happens if the Controller cannot connect to a Broker?

> It keeps retrying until zookeeper decides the Broker is gone and removes its node; the Controller is then notified and shuts down the `RequestSendThread` for that Broker, so the retries stop. If the zk–Broker connection is fine and only the logical request keeps failing, the Controller retries indefinitely.

### What happens if a child node under /brokers/ids/ is deleted manually in zk?

> After manually deleting `/brokers/ids/{brokerId}`, the Controller receives the change notification and runs the offline logic for that Broker. The Broker is then effectively outside the cluster: even though its process is still healthy, it can no longer serve. The only remedy is to restart the Broker so that it re-registers.
docs/zh/Kafka分享/Kafka Controller /Controller中的状态机.md
## Preface

>The Controller contains two state machines: the `ReplicaStateMachine` (replica state machine) and the `PartitionStateMachine` (partition state machine). They are responsible for the work that has to happen when a partition or replica changes state, and they make sure every transition from the previous state to the next one is legal. In many places the source code only performs a state transition, so knowing what each transition actually does makes the rest of the code much easier to read.
>
>----
>In the earlier article [【kafka源码】Controller启动过程以及选举流程源码分析]() we saw the calls
>`replicaStateMachine.startup()` and `partitionStateMachine.startup()`,
>which start the replica state machine and the partition state machine. Let's start from there and look at both state machines in detail.
## Source Code Analysis

<font color="red">If reading through the source feels too dry, jump straight to the summary section and what follows it.</font>

### ReplicaStateMachine (replica state machine)

After the Controller election succeeds, `ReplicaStateMachine.startup` is called to start the replica state machine.
```scala
def startup(): Unit = {
  // Initialize the state of every replica
  initializeReplicaState()
  val (onlineReplicas, offlineReplicas) = controllerContext.onlineAndOfflineReplicas
  handleStateChanges(onlineReplicas.toSeq, OnlineReplica)
  handleStateChanges(offlineReplicas.toSeq, OfflineReplica)
}
```

1. Initialize the state of every replica: if the replica is online its state becomes `OnlineReplica`, otherwise `ReplicaDeletionIneligible` (replica deletion failed). A replica counts as online when its Broker is online **and** the replica has not been marked offline (the map `replicasOnOfflineDirs` tracks replicas on failed log dirs); the latter usually contains topics marked for deletion.
2. Run the state-change handler.
#### ReplicaStateMachine state-change handler

>It ensures that every state transition goes from a legal previous state to the target state. The valid transitions are:
>1. `NonExistentReplica --> NewReplica:` send a LeaderAndIsr request with the current leader and isr to the new replica, and an UpdateMetadata request for the partition to every live broker
>2. `NewReplica -> OnlineReplica:` add the new replica to the assigned replica list if needed
>3. `OnlineReplica, OfflineReplica -> OnlineReplica:` send a LeaderAndIsr request with the current leader and isr to the new replica, and an UpdateMetadata request for the partition to every live broker
>4. `NewReplica, OnlineReplica, OfflineReplica, ReplicaDeletionIneligible -> OfflineReplica:` send a `StopReplicaRequest` to the replica;
>   remove the replica from the isr, send a LeaderAndIsr request (with the new isr) to the leader replica, and an UpdateMetadata request for the partition to every live broker
> 5. `OfflineReplica -> ReplicaDeletionStarted:` send a `StopReplicaRequest` (with the delete flag) to the replica
> 6. `ReplicaDeletionStarted -> ReplicaDeletionSuccessful:` mark the replica's state in the state machine
> 7. `ReplicaDeletionStarted -> ReplicaDeletionIneligible:` mark the replica's state in the state machine
> 8. `ReplicaDeletionSuccessful -> NonExistentReplica:` remove the replica from the in-memory partition replica assignment cache
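The legality check is driven by each state declaring which previous states may transition into it; conceptually it looks roughly like this (a simplified sketch, not a verbatim copy of the source):

```scala
// Sketch: every replica state knows which states may legally transition into it.
sealed trait ReplicaState {
  def state: Byte
  def validPreviousStates: Set[ReplicaState]
}

case object NewReplica extends ReplicaState {
  val state: Byte = 1
  val validPreviousStates: Set[ReplicaState] = Set(NonExistentReplica)
}

case object OnlineReplica extends ReplicaState {
  val state: Byte = 2
  val validPreviousStates: Set[ReplicaState] =
    Set(NewReplica, OnlineReplica, OfflineReplica, ReplicaDeletionIneligible)
}
```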
```scala
private def doHandleStateChanges(replicaId: Int, replicas: Seq[PartitionAndReplica], targetState: ReplicaState): Unit = {
  // Replicas without a recorded state are initialized to `NonExistentReplica`
  replicas.foreach(replica => controllerContext.putReplicaStateIfNotExists(replica, NonExistentReplica))
  // Validate that the requested transitions are legal
  val (validReplicas, invalidReplicas) = controllerContext.checkValidReplicaStateChange(replicas, targetState)
  invalidReplicas.foreach(replica => logInvalidTransition(replica, targetState))

  // code omitted, discussed in detail below
}
```
```scala
controllerBrokerRequestBatch.sendRequestsToBrokers(controllerContext.epoch)
```

1. Replicas without a recorded state are initialized to `NonExistentReplica`.
2. The requested state transitions are validated.
3. After the transition has been handled, an `UPDATE_METADATA` request may also be sent.
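For reference, callers drive these transitions through `handleStateChanges`, as in the `startup()` shown earlier; a hypothetical single-replica call (topic name and ids made up) would look like:

```scala
// Sketch: ask the replica state machine to move the replica of topic "test", partition 0, on broker 1 online.
val replica = PartitionAndReplica(new TopicPartition("test", 0), 1)
replicaStateMachine.handleStateChanges(Seq(replica), OnlineReplica)
```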
##### Previous state ==> OnlineReplica

States that may transition here:

1. `NewReplica`
2. `OnlineReplica`
3. `OfflineReplica`
4. `ReplicaDeletionIneligible`

###### NewReplica ==> OnlineReplica

>If needed, add the new replica to the assigned replica list;
>see for example [【kafka源码】TopicCommand之创建Topic源码解析]()
```scala
case NewReplica =>
  val assignment = controllerContext.partitionFullReplicaAssignment(partition)
  if (!assignment.replicas.contains(replicaId)) {
    error(s"Adding replica ($replicaId) that is not part of the assignment $assignment")
    val newAssignment = assignment.copy(replicas = assignment.replicas :+ replicaId)
    controllerContext.updatePartitionFullReplicaAssignment(partition, newAssignment)
  }
```
###### Other states ==> OnlineReplica

> Send a LeaderAndIsr request with the current leader and isr to the new replica, and an UpdateMetadata request for the partition to every live broker
```scala
case _ =>
  controllerContext.partitionLeadershipInfo.get(partition) match {
    case Some(leaderIsrAndControllerEpoch) =>
      controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(Seq(replicaId),
        replica.topicPartition,
        leaderIsrAndControllerEpoch,
        controllerContext.partitionFullReplicaAssignment(partition), isNew = false)
    case None =>
  }
```
##### Previous state ==> ReplicaDeletionIneligible

> Just update the replica's state to `ReplicaDeletionIneligible` in the in-memory map `replicaStates`.

##### Previous state ==> OfflineReplica

>Send a StopReplicaRequest to the replica;
>remove the replica from the isr, send a LeaderAndIsr request (with the new isr) to the leader replica, and an UpdateMetadata request for the partition to every live broker.
```scala
case OfflineReplica =>
  // Add the parameters for building the StopReplicaRequest; deletePartition = false means the partition is not deleted yet
  validReplicas.foreach { replica =>
    controllerBrokerRequestBatch.addStopReplicaRequestForBrokers(Seq(replicaId), replica.topicPartition, deletePartition = false)
  }
  val (replicasWithLeadershipInfo, replicasWithoutLeadershipInfo) = validReplicas.partition { replica =>
    controllerContext.partitionLeadershipInfo.contains(replica.topicPartition)
  }
  // Try to remove the replica from the isr of the affected partitions; removing it updates the partition state in Zookeeper.
  // Keep retrying until no partitions are left to retry:
  // read the partition data from /brokers/topics/test_create_topic13/partitions,
  // remove the replica, and write the result back to zk.
  val updatedLeaderIsrAndControllerEpochs = removeReplicasFromIsr(replicaId, replicasWithLeadershipInfo.map(_.topicPartition))
  updatedLeaderIsrAndControllerEpochs.foreach { case (partition, leaderIsrAndControllerEpoch) =>
    if (!controllerContext.isTopicQueuedUpForDeletion(partition.topic)) {
      val recipients = controllerContext.partitionReplicaAssignment(partition).filterNot(_ == replicaId)
      controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(recipients,
        partition,
        leaderIsrAndControllerEpoch,
        controllerContext.partitionFullReplicaAssignment(partition), isNew = false)
    }
    val replica = PartitionAndReplica(partition, replicaId)
    val currentState = controllerContext.replicaState(replica)
    logSuccessfulTransition(replicaId, partition, currentState, OfflineReplica)
    controllerContext.putReplicaState(replica, OfflineReplica)
  }

  replicasWithoutLeadershipInfo.foreach { replica =>
    val currentState = controllerContext.replicaState(replica)
    logSuccessfulTransition(replicaId, replica.topicPartition, currentState, OfflineReplica)
    controllerBrokerRequestBatch.addUpdateMetadataRequestForBrokers(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(replica.topicPartition))
    controllerContext.putReplicaState(replica, OfflineReplica)
  }
```

1. Add the parameters for building the StopReplicaRequest; `deletePartition = false` means the partition is not deleted yet.
2. Keep trying to remove the replica from the isr of the affected partitions until no partitions are left to retry: the partition data is read from `/brokers/topics/{TOPICNAME}/partitions`, recomputed, and written back to zk under `/brokers/topics/{TOPICNAME}/partitions/state/`; the replica's state in the in-memory state machine also becomes `OfflineReplica`.
3. Depending on the conditions, `LeaderAndIsrRequest` and `UpdateMetadataRequest` are sent.
4. The `StopReplicaRequests` are sent.
##### Previous state ==> ReplicaDeletionStarted

> Send a [StopReplicaRequest]() (with the delete flag) to the given replica:

```scala
controllerBrokerRequestBatch.addStopReplicaRequestForBrokers(Seq(replicaId), replica.topicPartition, deletePartition = true)
```

##### Previous state ==> NewReplica

>This transition is usually triggered when a Topic is created.
```scala
case NewReplica =>
  validReplicas.foreach { replica =>
    val partition = replica.topicPartition
    val currentState = controllerContext.replicaState(replica)

    controllerContext.partitionLeadershipInfo.get(partition) match {
      case Some(leaderIsrAndControllerEpoch) =>
        if (leaderIsrAndControllerEpoch.leaderAndIsr.leader == replicaId) {
          val exception = new StateChangeFailedException(s"Replica $replicaId for partition $partition cannot be moved to NewReplica state as it is being requested to become leader")
          logFailedStateChange(replica, currentState, OfflineReplica, exception)
        } else {
          controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(Seq(replicaId),
            replica.topicPartition,
            leaderIsrAndControllerEpoch,
            controllerContext.partitionFullReplicaAssignment(replica.topicPartition),
            isNew = true)
          logSuccessfulTransition(replicaId, partition, currentState, NewReplica)
          controllerContext.putReplicaState(replica, NewReplica)
        }
      case None =>
        logSuccessfulTransition(replicaId, partition, currentState, NewReplica)
        controllerContext.putReplicaState(replica, NewReplica)
    }
  }
```

1. Update the replica state in memory.
2. In some cases, send a LeaderAndIsr request with the current leader and isr to the new replica, and an UpdateMetadata request for the partition to every live broker.

##### Previous state ==> NonExistentReplica

1. `ReplicaDeletionSuccessful` (per transition 8 above: the replica is removed from the in-memory partition replica assignment cache)
### PartitionStateMachine (partition state machine)

`PartitionStateMachine.startup`

```scala
def startup(): Unit = {
  initializePartitionState()
  triggerOnlinePartitionStateChange()
}
```

`PartitionStateMachine.initializePartitionState()`

> Initialize the partition states
```scala
/**
 * Invoked on startup of the partition's state machine to set the initial state for all existing partitions in
 * zookeeper
 */
private def initializePartitionState(): Unit = {
  for (topicPartition <- controllerContext.allPartitions) {
    // check if leader and isr path exists for partition. If not, then it is in NEW state
    controllerContext.partitionLeadershipInfo.get(topicPartition) match {
      case Some(currentLeaderIsrAndEpoch) =>
        if (controllerContext.isReplicaOnline(currentLeaderIsrAndEpoch.leaderAndIsr.leader, topicPartition))
          // leader is alive
          controllerContext.putPartitionState(topicPartition, OnlinePartition)
        else
          controllerContext.putPartitionState(topicPartition, OfflinePartition)
      case None =>
        controllerContext.putPartitionState(topicPartition, NewPartition)
    }
  }
}
```

1. If a partition has no `LeaderIsr`, its state is `NewPartition`.
2. If a partition has a `LeaderIsr`, check whether the leader is alive:
   2.1 if it is alive, the state is `OnlinePartition`;
   2.2 otherwise it is `OfflinePartition`.
`PartitionStateMachine.triggerOnlinePartitionStateChange()`

>Try to move all partitions in `NewPartition` or `OfflinePartition` state to `OnlinePartition`, except partitions that belong to topics being deleted.

```scala
def triggerOnlinePartitionStateChange(): Unit = {
  val partitions = controllerContext.partitionsInStates(Set(OfflinePartition, NewPartition))
  triggerOnlineStateChangeForPartitions(partitions)
}

private def triggerOnlineStateChangeForPartitions(partitions: collection.Set[TopicPartition]): Unit = {
  // try to move all partitions in NewPartition or OfflinePartition state to OnlinePartition state except partitions
  // that belong to topics to be deleted
  val partitionsToTrigger = partitions.filter { partition =>
    !controllerContext.isTopicQueuedUpForDeletion(partition.topic)
  }.toSeq

  handleStateChanges(partitionsToTrigger, OnlinePartition, Some(OfflinePartitionLeaderElectionStrategy(false)))
  // TODO: If handleStateChanges catches an exception, it is not enough to bail out and log an error.
  // It is important to trigger leader election for those partitions.
}
```
#### PartitionStateMachine state-change handler

`PartitionStateMachine.doHandleStateChanges`
`controllerBrokerRequestBatch.sendRequestsToBrokers(controllerContext.epoch)`

>It ensures that every state transition goes from a legal previous state to the target state. The valid transitions are:
>1. `NonExistentPartition -> NewPartition:` load the assigned replicas from ZK into the controller cache
>2. `NewPartition -> OnlinePartition:` pick the first live replica as leader and all live replicas as the isr; write the partition's leader and isr to ZK; send a LeaderAndIsr request to every live replica and an UpdateMetadata request to every live broker
>3. `OnlinePartition, OfflinePartition -> OnlinePartition:` elect a new leader and isr for the partition plus the set of replicas that should receive the LeaderAndIsr request, and write the leader and isr to ZK;
>   for this partition, send a LeaderAndIsr request to each receiving replica and an UpdateMetadata request to every live broker
> 4. `NewPartition, OnlinePartition, OfflinePartition -> OfflinePartition:` mark the partition state as Offline
> 5. `OfflinePartition -> NonExistentPartition:` mark the partition state as NonExistentPartition
##### Previous state ==> NewPartition

>Load the assigned replicas from ZK into the controller cache.

##### Previous state ==> OnlinePartition

> Pick the first live replica as leader and all live replicas as the isr; write the partition's leader and isr to ZK; send a LeaderAndIsr request to every live replica and an UpdateMetadata request to every live Broker.

When a new Topic is created, the interesting method is `initializeLeaderAndIsrForPartitions`:
![在这里插入图片描述](https://img-blog.csdnimg.cn/20210611170020988.png)

0. Build the `leaderIsrAndControllerEpochs`; the leader is the first replica in the assignment.
1. Write the persistent node `/brokers/topics/{topicName}/partitions/` to zk, with no data.
2. Write the persistent node `/brokers/topics/{topicName}/partitions/{partitionId}` to zk, with no data.
3. Write the persistent node `/brokers/topics/{topicName}/partitions/{partitionId}/state` to zk, with the `leaderIsrAndControllerEpoch` as data.
4. Send a [`leaderAndIsrRequest`]() to the Brokers hosting the replicas.
5. Send an [`UPDATE_METADATA`]() request to all Brokers.
##### Previous state ==> OfflinePartition

>Mark the partition state as Offline; the state is kept in the map `partitionStates`. Valid previous states: `NewPartition, OnlinePartition, OfflinePartition`.

##### Previous state ==> NonExistentPartition

>Mark the partition state as NonExistentPartition; the state is kept in the map `partitionStates`. Valid previous state: `OfflinePartition`.
## Source Code Summary

## Q&A

docs/zh/Kafka分享/Kafka Controller /Controller启动过程以及选举流程源码分析.md
[TOC]

## Preface

>In this article we start digging into the `Controller` part of the Kafka source. The Controller is an important server-side component whose role resembles the Master in other distributed systems; the difference is that any Broker in a Kafka cluster can act as the Controller, but only one Controller is alive at any time. The Controller is responsible for a lot: keeping the cluster metadata consistent, electing partition leaders, handling brokers coming online and going offline, and so on.

## Source Code Analysis

As usual, we walk through the source first and then summarize.
<font color="red">If reading through the source feels too dry, jump straight to the **summary section and what follows it**.</font>

### 1. Entry point: KafkaServer.startup

When the kafka service starts, the first thing executed is `KafkaServer.startup`, which contains the whole startup flow; here we only look at the Controller part.
```scala
def startup(): Unit = {
  try {
    // part of the code omitted....
    /* start kafka controller */
    kafkaController = new KafkaController(config, zkClient, time, metrics, brokerInfo, brokerEpoch, tokenManager, threadNamePrefix)
    kafkaController.startup()
    // part of the code omitted....
  }
}
```
### 2. kafkaController.startup()

```scala
/**
 * Invoked when every kafka broker starts up. Note that this does not assume the current broker is the controller.
 * It merely registers the session expiration listener and starts the controller leader election process.
 */
def startup() = {
  // Register the state-change handler; the `StateChangeHandler` is put into the `stateChangeHandlers` map
  zkClient.registerStateChangeHandler(new StateChangeHandler {
    override val name: String = StateChangeHandlers.ControllerHandler
    override def afterInitializingSession(): Unit = {
      eventManager.put(RegisterBrokerAndReelect)
    }
    override def beforeInitializingSession(): Unit = {
      val queuedEvent = eventManager.clearAndPut(Expire)

      // Block initialization of the new session until the expiration event is being handled,
      // which ensures that all pending events have been processed before creating the new session
      queuedEvent.awaitProcessing()
    }
  })
  // Put a Startup event into the event manager's queue; it is not executed at this point yet
  eventManager.put(Startup)
  // Start the event manager, which starts a `ControllerEventThread` thread
  eventManager.start()
}
```

1. `zkClient.registerStateChangeHandler` registers a `StateChangeHandler`; the map `stateChangeHandlers` maintains the list of these handlers. The handler type has the three methods shown below; here `beforeInitializingSession` and `afterInitializingSession` are implemented. When exactly they are invoked (on zk data changes) is analyzed later.
2. `ControllerEventManager` is the Controller's event manager. It maintains a blocking queue `queue` holding all Controller events, which are executed in the order they were enqueued. The `eventManager.put(Startup)` call above puts a `Startup` event into the queue; every event extends `ControllerEvent`.
3. Start the event manager: it takes events from the pending queue `queue` and executes them, so the `Startup` event we just enqueued gets executed here.
### 3. ControllerEventThread: the event-processing thread

`eventManager.start()` ends up running the following:

```scala
class ControllerEventThread(name: String) extends ShutdownableThread(name = name, isInterruptible = false) {
  override def doWork(): Unit = {
    // Take an event from the pending queue; blocks when the queue is empty
    val dequeued = queue.take()
    dequeued.event match {
      case ShutdownEventThread => // The shutting down of the thread has been initiated at this point. Ignore this event.
      case controllerEvent =>
        // Each event carries its own ControllerState, all derived from ControllerState
        _state = controllerEvent.state
        eventQueueTimeHist.update(time.milliseconds() - dequeued.enqueueTimeMs)
        try {
          // process() delegates to the process method provided by the event itself
          def process(): Unit = dequeued.process(processor)

          // Look up the KafkaTimer for this state (metrics only); what matters is that process() runs
          rateAndTimeMetrics.get(state) match {
            case Some(timer) => timer.time { process() }
            case None => process()
          }
        } catch {
          case e: Throwable => error(s"Uncaught error processing event $controllerEvent", e)
        }

        _state = ControllerState.Idle
    }
  }
}
```

1. `val dequeued = queue.take()` takes an event from the pending queue; it blocks when there is nothing to process.
2. `dequeued.process(processor)` invokes the `process` method implemented by the concrete event, as shown below. Note that a `CountDownLatch(1)` is used here, so somewhere `processingStarted.await()` must be waiting for this `process()` to finish — the startUp method above does exactly that. ![在这里插入图片描述](https://img-blog.csdnimg.cn/20210610172026782.png)

![在这里插入图片描述](https://img-blog.csdnimg.cn/2021061017204798.png)
### 4. processStartup: the startup flow

The flow that starts the Controller:

```scala
private def processStartup(): Unit = {
  // Register the znode-change handler and check whether the Controller node exists in zk
  zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
  // Election logic
  elect()
}
```

1. Register a `ZNodeChangeHandler`: the map `zNodeChangeHandlers` stores the pair key=`/controller`, value=`ZNodeChangeHandler`; the `ZNodeChangeHandler` has the three interfaces shown below.
![在这里插入图片描述](https://img-blog.csdnimg.cn/20210610174800524.png)
2. Then an `ExistsRequest(/controller)` is sent to zk to check whether the `/controller` node exists; if it does not exist, a `watch` is registered on the node, as the code below shows.
![在这里插入图片描述](https://img-blog.csdnimg.cn/20210610180221768.png)
![在这里插入图片描述](https://img-blog.csdnimg.cn/2021061018095564.png)
Because the previous step stored key=`/controller` in the map `zNodeChangeHandlers`, the code above decides that a `watch` must be registered to monitor the `/controller` node.
How does kafka implement the watching? A custom `WATCH` is passed in when the `zookeeper` client is constructed:
![在这里插入图片描述](https://img-blog.csdnimg.cn/20210610183959605.png)
![在这里插入图片描述](https://img-blog.csdnimg.cn/20210610184031241.png)

3. Election: the election is simply the Brokers racing to become Controller; whoever creates the `/controller` node first becomes the Controller. Let's analyze the election in detail below.
### 5. Controller election: elect()

```scala
private def elect(): Unit = {
  // Read the /controller node from zk; if it does not exist, default to -1
  activeControllerId = zkClient.getControllerId.getOrElse(-1)
  // If we got a value, a Controller has already been elected
  if (activeControllerId != -1) {
    debug(s"Broker $activeControllerId has been elected as the controller, so stopping the election process.")
    return
  }

  try {

    // Try to write our own brokerId to zk as the Controller and bump the Controller epoch
    val (epoch, epochZkVersion) = zkClient.registerControllerAndIncrementControllerEpoch(config.brokerId)
    controllerContext.epoch = epoch
    controllerContext.epochZkVersion = epochZkVersion
    activeControllerId = config.brokerId
    //
    onControllerFailover()
  } catch {
    case _: Throwable =>
      // Try to resign from the Controller role
      maybeResign()
      // omitted...
  }
}
```

1. Read the `/controller` node from zk; if it does not exist, default to -1.
2. If a value is returned, a Controller has already registered successfully, so the election ends right there.
3. Try to write our own brokerId to zk as the Controller and bump the Controller epoch:
   - Read the zk node `/controller_epoch`, which counts how many times the Controller has changed; if it does not exist, create it (**persistent node**) with initial `controller_epoch=0` and `ControllerEpochZkVersion=0`.
   - Send a `MultiRequest` to zk containing two operations: create the `/controller` node with our own brokerId as content, and increment the value of `/controller_epoch` by 1.
   - If the write fails because the node already exists, another Broker has won the race. A `checkControllerAndEpoch` check confirms that another Controller got there first; if so, a `ControllerMovedException` is thrown, and the current Broker then tries to resign from the Controller role (it may have been the Controller before; after a Controller move every broker attempts to resign).
4. Once the Controller is decided, the post-election work is done in `onControllerFailover`.
### 6. After winning the election: onControllerFailover

Step into `KafkaController.onControllerFailover`:

```scala
private def onControllerFailover(): Unit = {

  // These are all ZNodeChildChangeHandler handlers (interface handleChildChange), registered for the events
  // `BrokerChange`, `TopicChange`, `TopicDeletion` and `LogDirEventNotification`
  val childChangeHandlers = Seq(brokerChangeHandler, topicChangeHandler, topicDeletionHandler, logDirEventNotificationHandler,
    isrChangeNotificationHandler)
  // All of these handlers are kept in the map `zNodeChildChangeHandlers`
  childChangeHandlers.foreach(zkClient.registerZNodeChildChangeHandler)
  // These are ZNodeChangeHandler handlers (create/update/delete node interfaces);
  // they correspond to the events `ReplicaLeaderElection` and `ZkPartitionReassignment`
  val nodeChangeHandlers = Seq(preferredReplicaElectionHandler, partitionReassignmentHandler)
  // All of these handlers are kept in the map `zNodeChangeHandlers`
  nodeChangeHandlers.foreach(zkClient.registerZNodeChangeHandlerAndCheckExistence)

  info("Deleting log dir event notifications")
  // Delete all log dir event notifications: read the zk node `/log_dir_event_notification` and delete all of its children
  zkClient.deleteLogDirEventNotifications(controllerContext.epochZkVersion)
  info("Deleting isr change notifications")
  // Delete all children under the node `/isr_change_notification`
  zkClient.deleteIsrChangeNotifications(controllerContext.epochZkVersion)
  info("Initializing controller context")
  initializeControllerContext()
  info("Fetching topic deletions in progress")
  val (topicsToBeDeleted, topicsIneligibleForDeletion) = fetchTopicDeletionsInProgress()
  info("Initializing topic deletion manager")
  topicDeletionManager.init(topicsToBeDeleted, topicsIneligibleForDeletion)

  // We need to send UpdateMetadataRequest after the controller context is initialized and before the state machines
  // are started. The is because brokers need to receive the list of live brokers from UpdateMetadataRequest before
  // they can process the LeaderAndIsrRequests that are generated by replicaStateMachine.startup() and
  // partitionStateMachine.startup().
  info("Sending update metadata request")
  sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set.empty)

  replicaStateMachine.startup()
  partitionStateMachine.startup()

  info(s"Ready to serve as the new controller with epoch $epoch")

  initializePartitionReassignments()
  topicDeletionManager.tryTopicDeletion()
  val pendingPreferredReplicaElections = fetchPendingPreferredReplicaElections()
  onReplicaElection(pendingPreferredReplicaElections, ElectionType.PREFERRED, ZkTriggered)
  info("Starting the controller scheduler")
  kafkaScheduler.startup()
  if (config.autoLeaderRebalanceEnable) {
    scheduleAutoLeaderRebalanceTask(delay = 5, unit = TimeUnit.SECONDS)
  }
  scheduleUpdateControllerMetricsTask()

  if (config.tokenAuthEnabled) {
    info("starting the token expiry check scheduler")
    tokenCleanScheduler.startup()
    tokenCleanScheduler.schedule(name = "delete-expired-tokens",
      fun = () => tokenManager.expireTokens,
      period = config.delegationTokenExpiryCheckIntervalMs,
      unit = TimeUnit.MILLISECONDS)
  }
}
```
1. The handlers for the events `BrokerChange`, `TopicChange`, `TopicDeletion` and `LogDirEventNotification` are kept in the map `zNodeChildChangeHandlers`.
2. The handlers for the events `ReplicaLeaderElection` and `ZkPartitionReassignment` are kept in the map `zNodeChangeHandlers`.
3. Delete all children of the zk node `/log_dir_event_notification`.
4. Delete all children of the zk node `/isr_change_notification`.
5. Initialize the Controller context, `initializeControllerContext()`:
   - Read `/brokers/ids` to get all live BrokerIDs, then read each broker's info from `/brokers/ids/{brokerId}` along with the node's epoch (its `cZxid`), and cache the data in memory.
   - Read `/brokers/topics` to get all Topics and fill the map `partitionModificationsHandlers` with key=topicName and value=the `PartitionModificationsHandler` of the node `/brokers/topics/{topicName}`. Effectively this adds one `PartitionModifications` event per Topic to the event queue `queue`; how that event is handled is analyzed below.
   - At the same time the `PartitionModificationsHandler`s are registered in the map `zNodeChangeHandlers` with key=`/brokers/topics/{topicName}`. As mentioned above, this map decides whether a `watch` must be registered in zk: as the code below shows, when reading zk data (`GetDataRequest`) the client checks whether the node key exists in `zNodeChangeHandlers` and, if so, registers a `watch` on the data.
   - Read the partition data of every topic from `/brokers/topics/{topicName}` and cache it in memory.
   - Register a broker-change handler `BrokerModificationsHandler` (also a `ZNodeChangeHandler`) for every broker; its event is `BrokerModifications`. Likewise, `zNodeChangeHandlers` keeps the `/brokers/ids/{brokerId}` entries for the corresponding `watch`es, and the map `brokerModificationsHandlers` keeps key=`brokerID`, value=`BrokerModificationsHandler`.
   - Read all topic-partition state from zk (node `/brokers/topics/{topicName}/partitions/{partitionId}/state`) and cache it in `controllerContext.partitionLeadershipInfo`.
   ![在这里插入图片描述](https://img-blog.csdnimg.cn/20210611201856997.png)
   - `controllerChannelManager.startup()`: covered in its own article, see [【kafka源码】Controller与Brokers之间的网络通信](). In short, a map is created holding one `RequestSendThread` per Broker; each thread has a blocking queue `queue` that serializes the requests to send and blocks when there is no task; whenever the Controller needs to send a request it simply adds a task to that `queue`.
6. Initialize the topic deletion manager, `topicDeletionManager.init()`:
   - Read the children of the zk node `/admin/delete_topics`, i.e. the Topics marked as deleted.
   - Start the deletion work for those Topics; for details see [【kafka源码】TopicCommand之删除Topic源码解析]().
7. `sendUpdateMetadataRequest`: send the `UPDATE_METADATA` request to the Brokers; for details see [【kafka源码】更新元数据`UPDATA_METADATA`请求源码分析 ]().
8. `replicaStateMachine.startup()`: start the replica state machine and fetch all online and offline replicas;
   ①. move online replicas to `OnlineReplica`: send a `LeaderAndIsr` request with the current leader and isr to the replica, and an `UpdateMetadata` request for the partition to every live broker;
   ②. move offline replicas to `OfflineReplica`: send a [StopReplicaRequest]() to the replica, remove it from the isr, send a [LeaderAndIsr]() request (with the new isr) to the leader replica, and an UpdateMetadata request for the partition to every live broker.
   For details see [【kafka源码】Controller中的状态机](https://shirenchuang.blog.csdn.net/article/details/117848213)
9. `partitionStateMachine.startup()`: start the partition state machine and fetch all online and offline partitions (judged by whether the Leader is online):
   1. if a partition has no `LeaderIsr`, its state is `NewPartition`;
   2. if it has a `LeaderIsr`, check whether the leader is alive:
      2.1 if alive, the state is `OnlinePartition`;
      2.2 otherwise `OfflinePartition`;
   3. then try to move all partitions in `NewPartition` or `OfflinePartition` state to `OnlinePartition`, except partitions of topics being deleted.

   PS: if a previous Topic creation was interrupted by a Controller change and never finished, these state transitions pick it up and complete the creation; see [【kafka源码】TopicCommand之创建Topic源码解析]().
   For the state machines see [【kafka源码】Controller中的状态机](https://shirenchuang.blog.csdn.net/article/details/117848213)
10. `initializePartitionReassignments`: initialize pending reassignments, including those submitted via `/admin/reassign_partitions`, which take precedence over any in-flight API reassignment. [【kafka源码】分区重分配 TODO..]()
11. `topicDeletionManager.tryTopicDeletion()`: try to resume unfinished Topic deletions; see [【kafka源码】TopicCommand之删除Topic源码解析](https://shirenchuang.blog.csdn.net/article/details/117847877)
12. Read `/admin/preferred_replica_election` and call `onReplicaElection()` to try to elect a leader replica for each of the given partitions; see [【kafka源码】Kafka的优先副本选举源码分析]().
13. `kafkaScheduler.startup()`: start some scheduled-task threads.
14. If `auto.leader.rebalance.enable=true`, start the leader-rebalance scheduled task (thread name `auto-leader-rebalance-task`).
15. If `delegation.token.master.key` is configured, start the token cleanup threads.
### 7. Controller re-election

When the `/controller` node is deleted from zk, the following is called and a new election takes place:

```scala
private def processReelect(): Unit = {
  // Try to resign first
  maybeResign()
  // Run the election
  elect()
}
```
## Source Code Summary

![在这里插入图片描述](https://img-blog.csdnimg.cn/20210615163539366.png)

PS: as you can see, once a Broker wins the Controller election it caches a lot of zk data in its own memory and takes on many responsibilities. If that Broker is already under heavy load, becoming Controller makes things worse, so ideally a relatively idle Broker should become Controller. How can that be achieved? By being able to designate a specific Broker as Controller;
this capability is available in <font color=red size=5>the project [didi/Logi-KafkaManager: 一站式Apache Kafka集群指标监控与运维管控平台](https://github.com/didi/Logi-KafkaManager)</font>.
## Q&A

### What happens if the zk node `/controller` is deleted directly?

>The Brokers immediately re-elect a Controller.

### Does editing the data under `/controller` successfully move the Controller?

Suppose the `/controller` node contains `{"version":1,"brokerid":3,"timestamp":"1623746563454"}` and I change the brokerid to 1 — does the Controller simply move to Broker-1?

>Answer: **No, the move does not happen, and worse, no Broker in the cluster holds the Controller role afterwards — a very serious situation.**
Looking at the source: when the data under `/controller` is modified, the Controller runs

```scala
private def processControllerChange(): Unit = {
  maybeResign()
}

private def maybeResign(): Unit = {
  val wasActiveBeforeChange = isActive
  zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
  activeControllerId = zkClient.getControllerId.getOrElse(-1)
  if (wasActiveBeforeChange && !isActive) {
    onControllerResignation()
  }
}
```

The code makes it obvious: after the data change, if the new broker id no longer matches the current Controller's broker id, `onControllerResignation` is executed and the current Controller resigns the role — with no new election triggered, which is why the cluster ends up without a Controller.
### What is /log_dir_event_notification for?

> When a `log.dir` directory becomes inaccessible (disk damage and the like) and reads/writes fail, an exception-notification flow is triggered:
> 1. The Broker detects the `log.dir` failure, does some cleanup, and creates a persistent sequential node `/log_dir_event_notification/log_dir_event_{sequence}` in zk whose data is the BrokerID, e.g.
>`/log_dir_event_notification/log_dir_event_0000000003`
>2. The Controller sees the zk change and transitions all replicas on the Broker read from `/log_dir_event_notification/log_dir_event_{sequence}` to OnlineReplica:
> 2.1 it sends a `LeaderAndIsrRequest` to all brokers so that they check the state of their replicas; if a replica's logDir is already offline, the broker answers with a KAFKA_STORAGE_ERROR;
> 2.2 afterwards the notification node is deleted.

### What is /isr_change_notification for?

> When an isr changes, data is written under this node; the Controller picks it up and sends out the corresponding notifications.

### What is /admin/preferred_replica_election for?

>Preferred replica election; for details see [kafka的优先副本选举流程 .]()
## Thoughts

### How could we make the Controller election prefer certain Brokers?

>We now know how much the Controller takes on; being both a Broker and the Controller can be a lot of pressure.
>So it would be nice to be able to designate which Broker takes the Controller role, and pick a lightly loaded one.

**So how do we implement it?**

>Vanilla Kafka does not support this, so implementing it means changing the source code.
>Once you understand the election, the change is simple, and there are many possible approaches.

For example: keep a dedicated zk node holding the candidate brokers, and prefer them during the Controller election.
When the Brokers start, each one checks whether it is a candidate; if it is not, it sleeps for two or three seconds (giving the candidates a head start),
so with high probability a candidate wins the election.
docs/zh/Kafka分享/Kafka Controller /Controller滴滴特性解读.md
## Preferred Controller election

> In vanilla kafka the Controller role is elected by every Broker racing to write the `Controller` node in zk,
> so any Broker may become Controller.
> But besides being a normal Broker, the Controller also carries the Controller-specific workload
> (see [【kafka源码】Controller启动过程以及选举流程源码分析]()).
> If that Broker is already under heavy load, becoming Controller makes it even heavier,
> so we would like the Controller role to land on lightly loaded Brokers, or even on a machine dedicated to the Controller role.
> Based on this requirement we modified the engine internally to support **preferred Controller election**.

## How it works

> A new node `/config/extension/candidates/` is added under the `/config` node;
> all BrokerIDs that should be preferred in the election are stored under it.
> For example:
> `/config/extension/candidates/0`
> ![在这里插入图片描述](https://img-blog.csdnimg.cn/20210618142553368.png)

When a Controller re-election happens, every Broker still races to write the `/controller` node, but it first reads the children of `/config/extension/candidates/`. Say it finds BrokerID=0 there: the Broker compares that with its own BrokerID and, if they differ, `sleeps 3 seconds` first. That way the Broker with BrokerId=0 wins the election with high probability; if that Broker is down, some other Broker gets elected.
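A minimal sketch of the idea (hypothetical helper name `maybeDelayElection`, not the actual patch):

```scala
// Sketch: before contending for /controller, non-candidate brokers wait a bit so candidates win the race.
private def maybeDelayElection(): Unit = {
  // children of /config/extension/candidates/ are the preferred controller broker ids
  val candidates: Set[Int] = zkClient.getChildren("/config/extension/candidates").map(_.toInt).toSet
  if (candidates.nonEmpty && !candidates.contains(config.brokerId)) {
    // give the candidate brokers a 3 second head start
    Thread.sleep(3000)
  }
}

// called right before the broker tries to create the /controller znode
maybeDelayElection()
elect()
```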
<font color=red>PS: multiple candidate Controllers can be configured under `/config/extension/candidates/`.</font>

## Operating it from the KM management platform

![在这里插入图片描述](https://img-blog.csdnimg.cn/20210618143157545.png)
@@ -0,0 +1,614 @@
## 1. Using the script

>See [【kafka运维】副本扩缩容、数据迁移、分区重分配]()

## 2. Source Code Analysis

<font color=red>If reading through the source feels too dry, jump straight to the summary and Q&A sections.</font>

### 2.1 `--generate`: generating an assignment plan
Configure the launcher with `--zookeeper xxxx:2181 --topics-to-move-json-file config/move-json-file.json --broker-list "0,1,2,3" --generate`
![在这里插入图片描述](https://img-blog.csdnimg.cn/20210619163019640.png)
Configure the `move-json-file.json` file:
![在这里插入图片描述](https://img-blog.csdnimg.cn/2021061916313395.png)
Start it and step through:
`ReassignPartitionsCommand.generateAssignment`
![在这里插入图片描述](https://img-blog.csdnimg.cn/20210619164009563.png)
1. Read the input parameters.
2. Check whether the BrokerIds passed via `--broker-list` contain duplicates; fail if so.
3. Start computing the assignment.

`ReassignPartitionsCommand.generateAssignment`
```scala
def generateAssignment(zkClient: KafkaZkClient, brokerListToReassign: Seq[Int], topicsToMoveJsonString: String, disableRackAware: Boolean): (Map[TopicPartition, Seq[Int]], Map[TopicPartition, Seq[Int]]) = {
  // Parse which Topics are involved
  val topicsToReassign = parseTopicsData(topicsToMoveJsonString)
  // Check for duplicate topics
  val duplicateTopicsToReassign = CoreUtils.duplicates(topicsToReassign)
  if (duplicateTopicsToReassign.nonEmpty)
    throw new AdminCommandFailedException("List of topics to reassign contains duplicate entries: %s".format(duplicateTopicsToReassign.mkString(",")))
  // Fetch the topics' current replica assignment from /brokers/topics/{topicName}
  val currentAssignment = zkClient.getReplicaAssignmentForTopics(topicsToReassign.toSet)

  val groupedByTopic = currentAssignment.groupBy { case (tp, _) => tp.topic }
  // Rack-awareness mode
  val rackAwareMode = if (disableRackAware) RackAwareMode.Disabled else RackAwareMode.Enforced
  val adminZkClient = new AdminZkClient(zkClient)
  val brokerMetadatas = adminZkClient.getBrokerMetadatas(rackAwareMode, Some(brokerListToReassign))

  val partitionsToBeReassigned = mutable.Map[TopicPartition, Seq[Int]]()
  groupedByTopic.foreach { case (topic, assignment) =>
    val (_, replicas) = assignment.head
    val assignedReplicas = AdminUtils.assignReplicasToBrokers(brokerMetadatas, assignment.size, replicas.size)
    partitionsToBeReassigned ++= assignedReplicas.map { case (partition, replicas) =>
      new TopicPartition(topic, partition) -> replicas
    }
  }
  (partitionsToBeReassigned, currentAssignment)
}
```
1. Check for duplicate topics; throw if there are any.
2. Read the topics' current replica assignment from the zk node `/brokers/topics/{topicName}`.
3. Read all online brokers from the zk node `brokers/ids` and intersect them with the brokers passed via `--broker-list`.
4. Fetch the broker metadata. With rack-awareness `RackAwareMode.Enforced` (the default), if the brokers obtained in step 3 do not either all have rack information or all lack it, an exception is thrown, because rack-aware assignment requires consistent rack information. What if you hit this? Set `RackAwareMode` to `RackAwareMode.Disabled` by adding the `--disable-rack-aware` flag.
5. Call `AdminUtils.assignReplicasToBrokers` to compute the assignment.
![在这里插入图片描述](https://img-blog.csdnimg.cn/20210619165427650.png)
We already analyzed this in [【kafka源码】创建Topic的时候是如何分区和副本的分配规则](), so no need to repeat it: `AdminUtils.assignReplicasToBrokers(broker metadata of the target brokers, partition count, replica count)`.
Note that the replica count comes from `assignment.head.replicas.size`, i.e. the replica count of the first partition; normally all partitions have the same replica count, but that is not guaranteed — they may have been configured differently.

<font color=red>Given that, couldn't we call this interface directly to implement other features — **for example growing or shrinking the replication factor**?</font>
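A hedged sketch of that idea — recomputing an assignment with a different replication factor and feeding it to the reassignment tool (the broker list, counts and variable names are made up for illustration):

```scala
// Sketch: reuse the same assignment algorithm to go from 2 replicas to 3 for a 3-partition topic.
val brokerMetadatas = adminZkClient.getBrokerMetadatas(RackAwareMode.Enforced, Some(Seq(0, 1, 2, 3)))
val newReplicationFactor = 3
// returns Map[partitionId, Seq[brokerId]]; writing it into a --reassignment-json-file
// and running --execute would then perform the replica expansion
val expanded = AdminUtils.assignReplicasToBrokers(brokerMetadatas, 3, newReplicationFactor)
```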
### 2.2 `--execute`: the execution phase

> Run the script with
> `--zookeeper xxx --reassignment-json-file config/reassignment-json-file.json --execute --throttle 10000`

`ReassignPartitionsCommand.executeAssignment`
```scala
def executeAssignment(zkClient: KafkaZkClient, adminClientOpt: Option[Admin], reassignmentJsonString: String, throttle: Throttle, timeoutMs: Long = 10000L): Unit = {
  // Validate and parse the json file
  val (partitionAssignment, replicaAssignment) = parseAndValidate(zkClient, reassignmentJsonString)
  val adminZkClient = new AdminZkClient(zkClient)
  val reassignPartitionsCommand = new ReassignPartitionsCommand(zkClient, adminClientOpt, partitionAssignment.toMap, replicaAssignment, adminZkClient)

  // If a reassignment is already in progress, only try to adjust the throttle
  if (zkClient.reassignPartitionsInProgress()) {
    reassignPartitionsCommand.maybeLimit(throttle)
  } else {
    // Print the current assignment so it can be rolled back later
    printCurrentAssignment(zkClient, partitionAssignment.map(_._1.topic))
    if (throttle.interBrokerLimit >= 0 || throttle.replicaAlterLogDirsLimit >= 0)
      println(String.format("Warning: You must run Verify periodically, until the reassignment completes, to ensure the throttle is removed. You can also alter the throttle by rerunning the Execute command passing a new value."))
    // Kick off the reassignment
    if (reassignPartitionsCommand.reassignPartitions(throttle, timeoutMs)) {
      println("Successfully started reassignment of partitions.")
    } else
      println("Failed to reassign partitions %s".format(partitionAssignment))
  }
}
```
1. Parse and validate the json file:
   - partitions/replicas must be non-empty and partitions must not be duplicated;
   - the listed `partition`s must exist (to add partitions use `kafka-topic`);
   - the broker ids in the file must all exist.
2. If a reassignment is already in progress (the node `/admin/reassign_partitions` exists), only check whether the throttle needs changing: if `--throttle` or `--replica-alter-log-dirs-throttle` was passed, apply the throttle settings and stop there.
3. If no reassignment is currently running (no `/admin/reassign_partitions` node), start the reassignment task.
#### 2.2.1 A task already exists: try to adjust the throttle

If the zk node `/admin/reassign_partitions` exists, a task is already running and the current invocation does not proceed; if the parameters
`--throttle` or
`--replica-alter-log-dirs-throttle`
were passed, the throttle is applied.

>This limits the throttle of the replica movement currently in progress. Note that the command can change the throttle, but if some brokers have already finished rebalancing it may not change all of the limits originally set; the limits therefore have to be removed later via `--verify`.

`maybeLimit`
```scala
def maybeLimit(throttle: Throttle): Unit = {
  if (throttle.interBrokerLimit >= 0 || throttle.replicaAlterLogDirsLimit >= 0) {
    // brokers of the current assignment
    val existingBrokers = existingAssignment().values.flatten.toSeq
    // brokers of the target assignment
    val proposedBrokers = proposedPartitionAssignment.values.flatten.toSeq ++ proposedReplicaAssignment.keys.toSeq.map(_.brokerId())
    // union of both, de-duplicated
    val brokers = (existingBrokers ++ proposedBrokers).distinct

    // For every involved broker, write the throttle config to the zk node /config/brokers/{brokerId}
    for (id <- brokers) {
      // fetch the broker's config from /config/brokers/{brokerId}
      val configs = adminZkClient.fetchEntityConfig(ConfigType.Broker, id.toString)
      if (throttle.interBrokerLimit >= 0) {
        configs.put(DynamicConfig.Broker.LeaderReplicationThrottledRateProp, throttle.interBrokerLimit.toString)
        configs.put(DynamicConfig.Broker.FollowerReplicationThrottledRateProp, throttle.interBrokerLimit.toString)
      }
      if (throttle.replicaAlterLogDirsLimit >= 0)
        configs.put(DynamicConfig.Broker.ReplicaAlterLogDirsIoMaxBytesPerSecondProp, throttle.replicaAlterLogDirsLimit.toString)

      adminZkClient.changeBrokerConfig(Seq(id), configs)
    }
  }
}
```
The `/config/brokers/{brokerId}` node holds broker-side dynamic configuration; it takes effect immediately without a broker restart.

1. If `--throttle` was passed, the brokers' config is read from `/config/brokers/{brokerId}`, the following two settings are added, and the result is written back to `/config/brokers/{brokerId}`:
   `leader.replication.throttled.rate`: caps the rate at which the leader replica serves FETCH requests
   `follower.replication.throttled.rate`: caps the rate at which follower replicas issue FETCH requests
2. If `--replica-alter-log-dirs-throttle` was passed, the following setting is written as well:
   `replica.alter.log.dirs.io.max.bytes.per.second`: limits the bandwidth used when moving data between log directories on the same broker.

For example, the data after the write:
```json
{"version":1,"config":{"leader.replication.throttled.rate":"1","follower.replication.throttled.rate":"1"}}
```

**Note: the throttle configuration is written for every broker involved in the reassignment.**
#### 2.2.2 No running task: start the replica reassignment

`ReassignPartitionsCommand.reassignPartitions`
```scala
def reassignPartitions(throttle: Throttle = NoThrottle, timeoutMs: Long = 10000L): Boolean = {
  // Write the throttle configuration
  maybeThrottle(throttle)
  try {
    // Validate that the partitions exist
    val validPartitions = proposedPartitionAssignment.groupBy(_._1.topic())
      .flatMap { case (topic, topicPartitionReplicas) =>
        validatePartition(zkClient, topic, topicPartitionReplicas)
      }
    if (validPartitions.isEmpty) false
    else {
      if (proposedReplicaAssignment.nonEmpty && adminClientOpt.isEmpty)
        throw new AdminCommandFailedException("bootstrap-server needs to be provided in order to reassign replica to the specified log directory")
      val startTimeMs = System.currentTimeMillis()

      // Send AlterReplicaLogDirsRequest to allow broker to create replica in the right log dir later if the replica has not been created yet.
      if (proposedReplicaAssignment.nonEmpty)
        alterReplicaLogDirsIgnoreReplicaNotAvailable(proposedReplicaAssignment, adminClientOpt.get, timeoutMs)

      // Create reassignment znode so that controller will send LeaderAndIsrRequest to create replica in the broker
      zkClient.createPartitionReassignment(validPartitions.map({case (key, value) => (new TopicPartition(key.topic, key.partition), value)}).toMap)

      // Send AlterReplicaLogDirsRequest again to make sure broker will start to move replica to the specified log directory.
      // It may take some time for controller to create replica in the broker. Retry if the replica has not been created.
      var remainingTimeMs = startTimeMs + timeoutMs - System.currentTimeMillis()
      val replicasAssignedToFutureDir = mutable.Set.empty[TopicPartitionReplica]
      while (remainingTimeMs > 0 && replicasAssignedToFutureDir.size < proposedReplicaAssignment.size) {
        replicasAssignedToFutureDir ++= alterReplicaLogDirsIgnoreReplicaNotAvailable(
          proposedReplicaAssignment.filter { case (replica, _) => !replicasAssignedToFutureDir.contains(replica) },
          adminClientOpt.get, remainingTimeMs)
        Thread.sleep(100)
        remainingTimeMs = startTimeMs + timeoutMs - System.currentTimeMillis()
      }
      replicasAssignedToFutureDir.size == proposedReplicaAssignment.size
    }
  } catch {
    case _: NodeExistsException =>
      val partitionsBeingReassigned = zkClient.getPartitionReassignment()
      throw new AdminCommandFailedException("Partition reassignment currently in " +
        "progress for %s. Aborting operation".format(partitionsBeingReassigned))
  }
}
```
1. `maybeThrottle(throttle)` sets the throttle configuration for the replica movement; this method is only used when the task is initialized:
```scala
private def maybeThrottle(throttle: Throttle): Unit = {
  if (throttle.interBrokerLimit >= 0)
    assignThrottledReplicas(existingAssignment(), proposedPartitionAssignment, adminZkClient)
  maybeLimit(throttle)
  if (throttle.interBrokerLimit >= 0 || throttle.replicaAlterLogDirsLimit >= 0)
    throttle.postUpdateAction()
  if (throttle.interBrokerLimit >= 0)
    println(s"The inter-broker throttle limit was set to ${throttle.interBrokerLimit} B/s")
  if (throttle.replicaAlterLogDirsLimit >= 0)
    println(s"The replica-alter-dir throttle limit was set to ${throttle.replicaAlterLogDirsLimit} B/s")
}
```
1.1 Write the topic-level throttle configuration to the node `/config/topics/{topicName}`:
![在这里插入图片描述](https://img-blog.csdnimg.cn/20210621170352338.png)
The computed leader and follower values are written to `/config/topics/{topicName}`:
leader: the partitions that gain new replicas; the value has the form partition:replica,partition:replica
follower: for each target TopicPartition, replicas = target replicas minus existing replicas; the value has the form partition:replica,partition:replica
`leader.replication.throttled.replicas`: leader
`follower.replication.throttled.replicas`: follower
![在这里插入图片描述](https://img-blog.csdnimg.cn/2021062117105676.png)
1.2 Run the flow of **《2.2.1 A task already exists: try to adjust the throttle》**.

2. Read `/broker/topics/{topicName}` from zk to verify that the given partitions exist; configuration for non-existent partitions is ignored and the flow continues.
3. If the replicas have not been created yet, send an `AlterReplicaLogDirsRequest` so that the broker can later create them in the right log directory; this relates to `log_dirs` (TODO....).
4. Write the reassignment data into the zk node `/admin/reassign_partitions`, e.g.:

```
{"version":1,"partitions":[{"topic":"test_create_topic1","partition":0,"replicas":[0,1,2,3]},{"topic":"test_create_topic1","partition":1,"replicas":[1,2,0,3]},{"topic":"test_create_topic1","partition":2,"replicas":[2,1,0,3]}]}
```

5. Send the `AlterReplicaLogDirsRequest` again to make sure the brokers start moving replicas to the specified log directory. It may take the controller a while to create the replicas on the broker; retry if they have not been created yet.
   1. Send the `alterReplicaLogDirs` request to the Brokers.
#### 2.2.3 The Controller watches the `/admin/reassign_partitions` node

`KafkaController.processZkPartitionReassignment`
```scala
private def processZkPartitionReassignment(): Set[TopicPartition] = {
  // We need to register the watcher if the path doesn't exist in order to detect future
  // reassignments and we get the `path exists` check for free
  if (isActive && zkClient.registerZNodeChangeHandlerAndCheckExistence(partitionReassignmentHandler)) {
    val reassignmentResults = mutable.Map.empty[TopicPartition, ApiError]
    val partitionsToReassign = mutable.Map.empty[TopicPartition, ReplicaAssignment]

    zkClient.getPartitionReassignment().foreach { case (tp, targetReplicas) =>
      maybeBuildReassignment(tp, Some(targetReplicas)) match {
        case Some(context) => partitionsToReassign.put(tp, context)
        case None => reassignmentResults.put(tp, new ApiError(Errors.NO_REASSIGNMENT_IN_PROGRESS))
      }
    }

    reassignmentResults ++= maybeTriggerPartitionReassignment(partitionsToReassign)
    val (partitionsReassigned, partitionsFailed) = reassignmentResults.partition(_._2.error == Errors.NONE)
    if (partitionsFailed.nonEmpty) {
      warn(s"Failed reassignment through zk with the following errors: $partitionsFailed")
      maybeRemoveFromZkReassignment((tp, _) => partitionsFailed.contains(tp))
    }
    partitionsReassigned.keySet
  } else {
    Set.empty
  }
}
```

1. Check that this broker is the Controller and that the node `/admin/reassign_partitions` exists.
2. `maybeTriggerPartitionReassignment` performs the reassignment; if a topic has already been marked for deletion, the flow stops for that topic.
3. `maybeRemoveFromZkReassignment` removes the failed partitions from zk (by overwriting the node data).
##### onPartitionReassignment

`KafkaController.onPartitionReassignment`
```scala
private def onPartitionReassignment(topicPartition: TopicPartition, reassignment: ReplicaAssignment): Unit = {
  // Pause deletion for topics that are being deleted
  topicDeletionManager.markTopicIneligibleForDeletion(Set(topicPartition.topic), reason = "topic reassignment in progress")
  // Update the current assignment
  updateCurrentReassignment(topicPartition, reassignment)

  val addingReplicas = reassignment.addingReplicas
  val removingReplicas = reassignment.removingReplicas

  if (!isReassignmentComplete(topicPartition, reassignment)) {
    // A1. Send LeaderAndIsr request to every replica in ORS + TRS (with the new RS, AR and RR).
    updateLeaderEpochAndSendRequest(topicPartition, reassignment)
    // A2. replicas in AR -> NewReplica
    startNewReplicasForReassignedPartition(topicPartition, addingReplicas)
  } else {
    // B1. replicas in AR -> OnlineReplica
    replicaStateMachine.handleStateChanges(addingReplicas.map(PartitionAndReplica(topicPartition, _)), OnlineReplica)
    // B2. Set RS = TRS, AR = [], RR = [] in memory.
    val completedReassignment = ReplicaAssignment(reassignment.targetReplicas)
    controllerContext.updatePartitionFullReplicaAssignment(topicPartition, completedReassignment)
    // B3. Send LeaderAndIsr request with a potential new leader (if current leader not in TRS) and
    //     a new RS (using TRS) and same isr to every broker in ORS + TRS or TRS
    moveReassignedPartitionLeaderIfRequired(topicPartition, completedReassignment)
    // B4. replicas in RR -> Offline (force those replicas out of isr)
    // B5. replicas in RR -> NonExistentReplica (force those replicas to be deleted)
    stopRemovedReplicasOfReassignedPartition(topicPartition, removingReplicas)
    // B6. Update ZK with RS = TRS, AR = [], RR = [].
    updateReplicaAssignmentForPartition(topicPartition, completedReassignment)
    // B7. Remove the ISR reassign listener and maybe update the /admin/reassign_partitions path in ZK to remove this partition from it.
    removePartitionFromReassigningPartitions(topicPartition, completedReassignment)
    // B8. After electing a leader in B3, the replicas and isr information changes, so resend the update metadata request to every broker
    sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(topicPartition))
    // signal delete topic thread if reassignment for some partitions belonging to topics being deleted just completed
    topicDeletionManager.resumeDeletionForTopics(Set(topicPartition.topic))
  }
}
```
1. 暂停一些正在删除的Topic操作
|
||||
2. 更新 Zk节点`brokers/topics/{topicName}`,和内存中的当前分配状态。如果重新分配已经在进行中,那么新的重新分配将取代它并且一些副本将被关闭。
|
||||
2.1 更新zk中的topic节点信息`brokers/topics/{topicName}`,这里会标记AR哪些副本是新增的,RR哪些副本是要删除的;例如:
|
||||

|
||||
2.2 更新当前内存
|
||||
2.3 如果**重新分配**已经在进行中,那么一些当前新增加的副本有可能被立即删除,在这种情况下,我们需要停止副本。
|
||||
2.4 注册一个监听节点`/brokers/topics/{topicName}/partitions/{分区号}/state`变更的处理器`PartitionReassignmentIsrChangeHandler`
|
||||
3. 如果该分区的重新分配还没有完成(根据`/brokers/topics/{topicName}/partitions/{分区号}/state`里面的isr来判断是否已经包含了新增的BrokerId了);则
|
||||
以下几个名称说明:
|
||||
`ORS`: OriginReplicas 原先的副本
|
||||
`TRS`: targetReplicas 将要变更成的目标副本
|
||||
`AR`: adding_replicas 正在添加的副本
|
||||
`RR`:removing_replicas 正在移除的副本
|
||||
3.1 向 ORS + TRS 中的每个副本发送` LeaderAndIsr `请求(带有新的 RS、AR 和 RR)。
|
||||
3.2 给新增加的AR副本 进行状态变更成`NewReplica` ; 这个过程有发送`LeaderAndIsrRequest`详细请看[【kafka源码】Controller中的状态机]()
|
||||
|
||||
#### 2.2.4 Controller监听节点`brokers/topics/{topicName}`变化,检查是否有新增分区
|
||||
这一个流程可以不必在意,因为在这里没有做任何事情;
|
||||
|
||||
>上面的 **2.2.3** 的第2小段中不是有将新增的和删掉的副本写入到了 zk中吗
|
||||
>例如:
|
||||
>```json
|
||||
>
|
||||
>{"version":2,"partitions":{"2":[0,1],"1":[0,1],"0":[0,1]},"adding_replicas":{"2":[1],"1":[1],"0":[1]},"removing_replicas":{}}
|
||||
>
|
||||
>```
|
||||
Controller监听到这个节点之后,执行方法`processPartitionModifications`
|
||||
`KafkaController.processPartitionModifications`
|
||||
```scala
|
||||
private def processPartitionModifications(topic: String): Unit = {
|
||||
def restorePartitionReplicaAssignment(
|
||||
topic: String,
|
||||
newPartitionReplicaAssignment: Map[TopicPartition, ReplicaAssignment]
|
||||
): Unit = {
|
||||
info("Restoring the partition replica assignment for topic %s".format(topic))
|
||||
|
||||
//从zk节点中获取所有分区
|
||||
val existingPartitions = zkClient.getChildren(TopicPartitionsZNode.path(topic))
|
||||
//找到已经存在的分区
|
||||
val existingPartitionReplicaAssignment = newPartitionReplicaAssignment
|
||||
.filter(p => existingPartitions.contains(p._1.partition.toString))
|
||||
.map { case (tp, _) =>
|
||||
tp -> controllerContext.partitionFullReplicaAssignment(tp)
|
||||
}.toMap
|
||||
|
||||
zkClient.setTopicAssignment(topic,
|
||||
existingPartitionReplicaAssignment,
|
||||
controllerContext.epochZkVersion)
|
||||
}
|
||||
|
||||
if (!isActive) return
|
||||
val partitionReplicaAssignment = zkClient.getFullReplicaAssignmentForTopics(immutable.Set(topic))
|
||||
val partitionsToBeAdded = partitionReplicaAssignment.filter { case (topicPartition, _) =>
|
||||
controllerContext.partitionReplicaAssignment(topicPartition).isEmpty
|
||||
}
|
||||
|
||||
if (topicDeletionManager.isTopicQueuedUpForDeletion(topic)) {
|
||||
if (partitionsToBeAdded.nonEmpty) {
|
||||
warn("Skipping adding partitions %s for topic %s since it is currently being deleted"
|
||||
.format(partitionsToBeAdded.map(_._1.partition).mkString(","), topic))
|
||||
|
||||
restorePartitionReplicaAssignment(topic, partitionReplicaAssignment)
|
||||
} else {
|
||||
// This can happen if existing partition replica assignment are restored to prevent increasing partition count during topic deletion
|
||||
info("Ignoring partition change during topic deletion as no new partitions are added")
|
||||
}
|
||||
} else if (partitionsToBeAdded.nonEmpty) {
|
||||
info(s"New partitions to be added $partitionsToBeAdded")
|
||||
partitionsToBeAdded.foreach { case (topicPartition, assignedReplicas) =>
|
||||
controllerContext.updatePartitionFullReplicaAssignment(topicPartition, assignedReplicas)
|
||||
}
|
||||
onNewPartitionCreation(partitionsToBeAdded.keySet)
|
||||
}
|
||||
}
|
||||
```
|
||||
1. 从`brokers/topics/{topicName}`中获取完整的分配信息,例如
|
||||
```json
|
||||
{
|
||||
"version": 2,
|
||||
"partitions": {
|
||||
"2": [0, 1],
|
||||
"1": [0, 1],
|
||||
"0": [0, 1]
|
||||
},
|
||||
"adding_replicas": {
|
||||
"2": [1],
|
||||
"1": [1],
|
||||
"0": [1]
|
||||
},
|
||||
"removing_replicas": {}
|
||||
}
|
||||
```
|
||||
2. 如果有需要新增的分区,如下操作
|
||||
2.1 如果当前Topic刚好在删掉队列中,那么就没有必要进行分区扩容了; 将zk的`brokers/topics/{topicName}`数据恢复回去
|
||||
2.2 如果不在删除队列中,则开始走新增分区的流程;关于新增分区的流程 在[【kafka源码】TopicCommand之创建Topic源码解析
|
||||
]()里面已经详细讲过了,跳转后请搜索关键词`onNewPartitionCreation`
|
||||
|
||||
3. 如果该Topic正在删除中,则跳过该Topic的处理;同时如果有AR(adding_replicas),则重写一下zk节点`/brokers/topics/{topicName}`的数据,相当于还原数据,移除掉里面的AR;
|
||||
|
||||
**这一步完全不用理会,因为 分区副本重分配不会出现新增分区的情况;**
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
#### 2.2.5 Controller监听zk节点`/brokers/topics/{topicName}/partitions/{分区号}/state`
|
||||
> 上面2.2.3 里面的 2.4不是有说过注册一个监听节点`/brokers/topics/{topicName}/partitions/{分区号}/state`变更的处理器`PartitionReassignmentIsrChangeHandler`
|
||||
>
|
||||
到底是什么时候这个节点有变化呢? 前面我们不是对副本们发送了`LEADERANDISR`的请求么, 当新增的副本去leader
|
||||
fetch数据开始同步之后,一旦数据同步完成、跟上了ISR的节奏,就会去修改这个节点;修改之后,下面的监听流程就开始执行了
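如果想直观地确认新增副本有没有进入 ISR,可以直接查看这个 state 节点的数据;下面是一个示意(zk 地址为假设值,Topic 名沿用上文示例的 test_create_topic1,分区号请按需替换):

```sh
# 查看分区的 leader / isr 信息;当新增的副本同步完成后,isr 数组中会出现它的 BrokerId
bin/zookeeper-shell.sh localhost:2181 \
  get /brokers/topics/test_create_topic1/partitions/0/state
```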
|
||||
|
||||
这里跟 **2.2.3** 中有调用同一个接口; 不过这个时候经过了`LeaderAndIsr`请求
|
||||
`kafkaController.processPartitionReassignmentIsrChange->onPartitionReassignment`
|
||||
```scala
|
||||
private def onPartitionReassignment(topicPartition: TopicPartition, reassignment: ReplicaAssignment): Unit = {
|
||||
// While a reassignment is in progress, deletion is not allowed
|
||||
topicDeletionManager.markTopicIneligibleForDeletion(Set(topicPartition.topic), reason = "topic reassignment in progress")
|
||||
|
||||
updateCurrentReassignment(topicPartition, reassignment)
|
||||
|
||||
val addingReplicas = reassignment.addingReplicas
|
||||
val removingReplicas = reassignment.removingReplicas
|
||||
|
||||
if (!isReassignmentComplete(topicPartition, reassignment)) {
|
||||
// A1. Send LeaderAndIsr request to every replica in ORS + TRS (with the new RS, AR and RR).
|
||||
updateLeaderEpochAndSendRequest(topicPartition, reassignment)
|
||||
// A2. replicas in AR -> NewReplica
|
||||
startNewReplicasForReassignedPartition(topicPartition, addingReplicas)
|
||||
} else {
|
||||
// B1. replicas in AR -> OnlineReplica
|
||||
replicaStateMachine.handleStateChanges(addingReplicas.map(PartitionAndReplica(topicPartition, _)), OnlineReplica)
|
||||
// B2. Set RS = TRS, AR = [], RR = [] in memory.
|
||||
val completedReassignment = ReplicaAssignment(reassignment.targetReplicas)
|
||||
controllerContext.updatePartitionFullReplicaAssignment(topicPartition, completedReassignment)
|
||||
// B3. Send LeaderAndIsr request with a potential new leader (if current leader not in TRS) and
|
||||
// a new RS (using TRS) and same isr to every broker in ORS + TRS or TRS
|
||||
moveReassignedPartitionLeaderIfRequired(topicPartition, completedReassignment)
|
||||
// B4. replicas in RR -> Offline (force those replicas out of isr)
|
||||
// B5. replicas in RR -> NonExistentReplica (force those replicas to be deleted)
|
||||
stopRemovedReplicasOfReassignedPartition(topicPartition, removingReplicas)
|
||||
// B6. Update ZK with RS = TRS, AR = [], RR = [].
|
||||
updateReplicaAssignmentForPartition(topicPartition, completedReassignment)
|
||||
// B7. Remove the ISR reassign listener and maybe update the /admin/reassign_partitions path in ZK to remove this partition from it.
|
||||
removePartitionFromReassigningPartitions(topicPartition, completedReassignment)
|
||||
// B8. After electing a leader in B3, the replicas and isr information changes, so resend the update metadata request to every broker
|
||||
sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(topicPartition))
|
||||
// signal delete topic thread if reassignment for some partitions belonging to topics being deleted just completed
|
||||
topicDeletionManager.resumeDeletionForTopics(Set(topicPartition.topic))
|
||||
}
|
||||
}
|
||||
```
|
||||
以下几个名称说明:
|
||||
`ORS`: OriginReplicas 原先的副本
|
||||
`RS`: Replicas 现在的副本
|
||||
`TRS`: targetReplicas 将要变更成的目标副本
|
||||
`AR`: adding_replicas 正在添加的副本
|
||||
`RR`:removing_replicas 正在移除的副本
|
||||
|
||||
1. 副本状态变更 -> `OnlineReplica`,将 AR 中的所有副本移动到 OnlineReplica 状态
|
||||
2. 在内存中设置 RS = TRS, AR = [], RR = []
|
||||
3. 向 ORS + TRS 或 TRS 中的每个Broker发送带有潜在新Leader(如果当前Leader不在 TRS 中)、新 RS(使用 TRS)和相同 isr 的 `LeaderAndIsr` 请求
|
||||
4. 我们可能会把 `LeaderAndIsr` 请求发送给比 TRS 更多的副本。将 RR 中的所有副本移动到 `OfflineReplica` 状态。状态转换的过程中,会把 RR 从 ZooKeeper 的 isr 中删除,并且仅向 Leader 发送一个 `LeaderAndIsr` 请求以通知它缩小后的 isr;之后,向 RR 中的副本发送 `StopReplica (delete = false)`,这个时候还没有真正地进行删除。
|
||||
5. 将 RR 中的所有副本移动到 `NonExistentReplica` 状态。这将向 RR 中的副本发送一个 `StopReplica (delete = true)` 以物理删除磁盘上的副本。这里的流程可以看看文章[【kafka源码】TopicCommand之删除Topic源码解析]()
|
||||
6. 用 RS=TRS, AR=[], RR=[] 更新 zk `/brokers/topics/{topicName}` 节点,更新partitions并移除AR(adding_replicas)、RR(removing_replicas),例如
|
||||
```json
|
||||
{"version":2,"partitions":{"2":[0,1],"1":[0,1],"0":[0,1]},"adding_replicas":{},"removing_replicas":{}}
|
||||
|
||||
```
|
||||
|
||||
7. 删除 ISR 重新分配侦听器`/brokers/topics/{topicName}/partitions/{分区号}/state`,并可能更新 ZK 中的 `/admin/reassign_partitions` 路径以从中删除此分区(如果存在)
|
||||
8. 选举leader后,replicas和isr信息发生变化。因此,向每个Broker重新发送`UPDATE_METADATA`更新元数据请求。
|
||||
9. 恢复删除线程`resumeDeletions`; 该操作在[【kafka源码】TopicCommand之删除Topic源码解析]()中分析过; 请移步阅读,并搜索关键字`resumeDeletions`
|
||||
|
||||
|
||||
|
||||
#### 2.2.6 Controller重新选举恢复 恢复任务
|
||||
> KafkaController.onControllerFailover() 里面 有调用接口`initializePartitionReassignments` 会恢复未完成的重分配任务
|
||||
|
||||
#### alterReplicaLogDirs请求
|
||||
> 副本跨路径迁移相关
|
||||
`KafkaApis.handleAlterReplicaLogDirsRequest`
|
||||
```scala
|
||||
def handleAlterReplicaLogDirsRequest(request: RequestChannel.Request): Unit = {
|
||||
val alterReplicaDirsRequest = request.body[AlterReplicaLogDirsRequest]
|
||||
val responseMap = {
|
||||
if (authorize(request, ALTER, CLUSTER, CLUSTER_NAME))
|
||||
replicaManager.alterReplicaLogDirs(alterReplicaDirsRequest.partitionDirs.asScala)
|
||||
else
|
||||
alterReplicaDirsRequest.partitionDirs.asScala.keys.map((_, Errors.CLUSTER_AUTHORIZATION_FAILED)).toMap
|
||||
}
|
||||
sendResponseMaybeThrottle(request, requestThrottleMs => new AlterReplicaLogDirsResponse(requestThrottleMs, responseMap.asJava))
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
|
||||
|
||||
### 2.3 `--verify` 验证结果分析
|
||||
|
||||
>校验执行情况, 顺便移除之前加过的限流配置
|
||||
>`--zookeeper xxxxx --reassignment-json-file config/reassignment-json-file.json --verify`
|
||||
>
|
||||
>
|
||||
源码在`ReassignPartitionsCommand.verifyAssignment`,逻辑比较简单,这里就不展开分析了;
|
||||
主要就是把之前写入的配置给清理掉
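把 2.2 和 2.3 两节串起来,一次典型的重分配大致就是下面三步;命令仅作示意,zk 地址、json 文件名和限流值都是假设值:

```sh
# 1. 生成迁移计划(--generate 只做计算,不会真正执行)
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --topics-to-move-json-file topics-to-move.json \
  --broker-list "0,1,2,3" --generate

# 2. 执行迁移,并对迁移流量限流(单位 B/s)
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file reassignment-json-file.json \
  --execute --throttle 10485760

# 3. 校验执行结果,同时清理上一步写入的限流配置
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file reassignment-json-file.json --verify
```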
|
||||
|
||||
|
||||
### 2.4 副本跨路径迁移
|
||||
>为什么线上Kafka机器各个磁盘间的占用不均匀,经常出现“一边倒”的情形? 这是因为Kafka只保证分区数量在各个磁盘上均匀分布,但它无法知晓每个分区实际占用空间,故很有可能出现某些分区消息数量巨大导致占用大量磁盘空间的情况。在1.1版本之前,用户对此毫无办法,因为1.1之前Kafka只支持分区数据在不同broker间的重分配,而无法做到在同一个broker下的不同磁盘间做重分配。1.1版本正式支持副本在不同路径间的迁移
|
||||
|
||||
**怎么在一台Broker上用多个路径存放分区呢?**
|
||||
|
||||
只需要在 `log.dirs` 配置中用逗号分隔出多个目录就行了
|
||||
```
|
||||
############################# Log Basics #############################
|
||||
|
||||
# A comma separated list of directories under which to store log files
|
||||
log.dirs=kafka-logs-5,kafka-logs-6,kafka-logs-7,kafka-logs-8
|
||||
|
||||
```
|
||||
|
||||
**注意:同一个Broker上的不同路径只会存放不同的分区,同一个分区的多个副本不会落在同一个Broker上;不然副本就失去容灾的意义了**
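配置好多个目录之后,如果想确认某个 Topic 的各个副本分别落在哪个目录,可以用自带的 kafka-log-dirs 脚本查看;下面是一个示意(Broker 地址为假设值,Topic 沿用下文示例中的 test_create_topic4):

```sh
# 查询指定 Topic 在各 Broker 上的日志目录分布,输出为 JSON
bin/kafka-log-dirs.sh --bootstrap-server localhost:9092 \
  --describe --topic-list test_create_topic4
```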
|
||||
|
||||
|
||||
**怎么针对跨路径迁移呢?**
|
||||
|
||||
迁移的json文件里有一个参数`log_dirs`;默认不传的话等价于`"log_dirs": ["any"]`(这个数组的元素个数要跟`replicas`中的副本个数保持一致)
|
||||
但是你想实现跨路径迁移,只需要在这里填入绝对路径就行了,例如下面
|
||||
|
||||
迁移的json文件示例
|
||||
```json
|
||||
{
|
||||
"version": 1,
|
||||
"partitions": [{
|
||||
"topic": "test_create_topic4",
|
||||
"partition": 2,
|
||||
"replicas": [0],
|
||||
"log_dirs": ["/Users/xxxxx/work/IdeaPj/source/kafka/kafka-logs-5"]
|
||||
}, {
|
||||
"topic": "test_create_topic4",
|
||||
"partition": 1,
|
||||
"replicas": [0],
|
||||
"log_dirs": ["/Users/xxxxx/work/IdeaPj/source/kafka/kafka-logs-6"]
|
||||
}]
|
||||
}
|
||||
```
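json 写好之后,仍然是交给 `kafka-reassign-partitions.sh` 执行;由于跨路径迁移需要向 Broker 发送 `alterReplicaLogDirs` 请求,所以当 `log_dirs` 填了绝对路径时,除了 `--zookeeper` 一般还需要带上 `--bootstrap-server`(以下仅为示意,地址与文件名为假设值):

```sh
# 执行副本跨路径迁移;log_dirs 不是 "any" 时必须提供 --bootstrap-server
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --bootstrap-server localhost:9092 \
  --reassignment-json-file move-log-dirs.json --execute
```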
|
||||
|
||||
|
||||
|
||||
|
||||
## 3.源码总结
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
## 4.Q&A
|
||||
|
||||
### 如果新增副本之后,会触发副本重新选举吗
|
||||
>**Question:** 如果原来副本分配方式是: `"replicas": [0,1]` 重新分配方式变更成 `"replicas": [0,1,2] `或者 `"replicas": [2,0,1]` Leader会变更吗?
|
||||
> **Answer:** 不会,只要没有涉及到原来的Leader的变更,就不会触发重新选举
|
||||
### 如果删除副本之后,会触发副本重新选举吗
|
||||
>**Question:** 如果原来副本分配方式是: `"replicas": [0,1,2]` 重新分配方式变更成 `"replicas": [0,1] `或者 `"replicas": [2,0]` 或者 `"replicas": [1,2] ` Leader会变更吗?
|
||||
> **Answer:** 不会,只要没有涉及到原来的Leader的变更,就不会触发重新选举 ;
|
||||
> 但是如果是之前的Leader被删除了,那就会触发重新选举了
|
||||
> 如果触发选举了,那么选举策略是什么?策略如下图所述
|
||||
> 
|
||||
|
||||
|
||||
|
||||
|
||||
### 在重新分配的过程中,如果执行删除操作会怎么样
|
||||
> 删除操作会等待,等待重新分配完成之后,继续进行删除操作
|
||||
> 可参考文章 [【kafka源码】TopicCommand之删除Topic源码解析]()中的 源码总结部分
|
||||
> 
|
||||
|
||||
|
||||
|
||||
### 副本增加是在哪个时机发生的
|
||||
> 
|
||||
>副本新增之后会开始与leader进行同步, 并修改节点`/brokers/topics/{topicName}/partitions/{分区号}/state` 的isr信息
|
||||
|
||||
### 副本删除是在哪个时机发生的
|
||||
>
|
||||
>副本的删除是一个副本状态转换的过程,具体请看 [【kafka源码】Controller中的状态机]()
|
||||
|
||||
|
||||
### 手动在zk中创建`/admin/reassign_partitions`节点能成功重分配吗
|
||||
> 可以但是没必要, 需要做好一些前置校验
|
||||
|
||||
### 限流配置详情
|
||||
> 里面有很多限流的配置, 关于限流相关 请看 [TODO.....]()
|
||||
|
||||
### 如果重新分配没有新增和删除副本,只是副本位置变更了
|
||||
> Q: 假设分区副本 [0,1,2] 变更为[2,1,0] 会把副本删除之后再新增吗? 会触发leader选举吗?
|
||||
> A: 不会, 副本没有增多和减少就不会有新增和删除副本的流程; 最终只是在zk节点`/brokers/topics/{topicName}` 修改了一下顺序而已, 产生的影响只是在下一次进行优先副本选举的时候, 让第一个副本成为Leader;
|
||||
### 重分配过程手动写入限流信息会生效吗
|
||||
>关于限流相关 请看 [TODO.....]()
|
||||
|
||||
|
||||
### 如果Controller角色重新选举 那重新分配任务还会继续吗
|
||||
> KafkaController.onControllerFailover() 里面 有调用接口`initializePartitionReassignments` 会恢复未完成的重分配任务
|
||||
@@ -0,0 +1,411 @@
|
||||
|
||||
|
||||
## 脚本参数
|
||||
|
||||
`sh bin/kafka-topics.sh --help` 查看更具体参数
|
||||
|
||||
下面只是列出了跟` --alter` 相关的参数
|
||||
|
||||
| 参数 |描述 |例子|
|
||||
|--|--|--|
|
||||
|`--bootstrap-server`|指定连接到的kafka服务; 如果有这个参数,则 `--zookeeper`可以不需要|--bootstrap-server localhost:9092 |
|
||||
|`--replica-assignment `|副本分区分配方式;修改topic的时候可以自己指定副本分配情况; |`--replica-assignment id0:id1:id2,id3:id4:id5,id6:id7:id8`;表示该Topic一共有3个Partition(以","分隔),每个Partition均有3个Replica(以":"分隔),分别对应到相应的BrokerId上|
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
## Alter Topic脚本
|
||||
|
||||
|
||||
## 分区扩容
|
||||
**zk方式(不推荐)**
|
||||
```sh
|
||||
bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic topic1 --partitions 2
|
||||
```
|
||||
|
||||
**kafka版本 >= 2.2 支持下面方式(推荐)**
|
||||
**单个Topic扩容**
|
||||
>`bin/kafka-topics.sh --bootstrap-server broker_host:port --alter --topic test_create_topic1 --partitions 4`
|
||||
|
||||
**批量扩容** (将所有正则表达式匹配到的Topic分区扩容到4个)
|
||||
>`sh bin/kafka-topics.sh --topic ".*?" --bootstrap-server 172.23.248.85:9092 --alter --partitions 4`
|
||||
>
|
||||
`".*?"` 正则表达式的意思是匹配所有; 您可按需匹配
|
||||
|
||||
**PS:** 当某个Topic当前的分区数已经大于或等于指定的分区数时,该Topic会抛出异常;但是不会影响其他Topic正常扩容;
|
||||
|
||||
---
|
||||
|
||||
相关可选参数
|
||||
| 参数 |描述 |例子|
|
||||
|--|--|--|
|
||||
|`--replica-assignment `|副本分区分配方式;创建topic的时候可以自己指定副本分配情况; |`--replica-assignment` BrokerId-0:BrokerId-1:BrokerId-2,BrokerId-1:BrokerId-2:BrokerId-0,BrokerId-2:BrokerId-1:BrokerId-0 ; 这个意思是有三个分区和三个副本,对应分配的Broker; 逗号隔开标识分区;冒号隔开表示副本|
|
||||
|
||||
**PS: 虽然这里配置的是全部的分区副本分配配置,但是正在生效的是新增的分区;**
|
||||
比如: 以前3分区1副本是这样的
|
||||
| Broker-1 |Broker-2 |Broker-3|Broker-4|
|
||||
|--|--|--|--|
|
||||
|0 | 1 |2| |
|
||||
现在新增一个分区,`--replica-assignment` 2,1,3,4 ; 看这个意思好像是把0,1号分区互相换个Broker
|
||||
| Broker-1 |Broker-2 |Broker-3|Broker-4|
|
||||
|--|--|--|--|
|
||||
|1 | 0 |2|3|
|
||||
但是实际上不会这样做,Controller在处理的时候会把前面3个截掉; 只取新增的分区分配方式,原来的还是不会变
|
||||
| Broker-1 |Broker-2 |Broker-3|Broker-4|
|
||||
|--|--|--|--|
|
||||
|0 | 1 |2|3|
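对应到命令行,上面“只取新增分区的分配方式”的行为可以用类似下面的命令来验证(仅作示意,Broker 地址与 Topic 名为假设值):

```sh
# 原来 3 个分区;扩容到 4 个分区并指定分配方式 2,1,3,4
# 前 3 项会被 Controller 截掉,只有第 4 项(Broker-4)对新增的分区生效
bin/kafka-topics.sh --bootstrap-server localhost:9092 --alter \
  --topic test_create_topic1 --partitions 4 --replica-assignment 2,1,3,4
```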
|
||||
|
||||
## 源码解析
|
||||
> <font color=red>如果觉得源码解析过程比较枯燥乏味,可以直接跳到 **源码总结及其后面部分**</font>
|
||||
|
||||
因为在 [【kafka源码】TopicCommand之创建Topic源码解析]() 里面已经分析得比较详细, 故本文只着重分析不同的重点部分;
|
||||
|
||||
### 1. `TopicCommand.alterTopic`
|
||||
```scala
|
||||
override def alterTopic(opts: TopicCommandOptions): Unit = {
|
||||
val topic = new CommandTopicPartition(opts)
|
||||
val topics = getTopics(opts.topic, opts.excludeInternalTopics)
|
||||
//校验Topic是否存在
|
||||
ensureTopicExists(topics, opts.topic)
|
||||
//获取一下该topic的一些基本信息
|
||||
val topicsInfo = adminClient.describeTopics(topics.asJavaCollection).values()
|
||||
adminClient.createPartitions(topics.map {topicName =>
|
||||
//判断是否有参数 replica-assignment 指定分区分配方式
|
||||
if (topic.hasReplicaAssignment) {
|
||||
val startPartitionId = topicsInfo.get(topicName).get().partitions().size()
|
||||
val newAssignment = {
|
||||
val replicaMap = topic.replicaAssignment.get.drop(startPartitionId)
|
||||
new util.ArrayList(replicaMap.map(p => p._2.asJava).asJavaCollection).asInstanceOf[util.List[util.List[Integer]]]
|
||||
}
|
||||
topicName -> NewPartitions.increaseTo(topic.partitions.get, newAssignment)
|
||||
} else {
|
||||
|
||||
topicName -> NewPartitions.increaseTo(topic.partitions.get)
|
||||
}}.toMap.asJava).all().get()
|
||||
}
|
||||
```
|
||||
1. 校验Topic是否存在
|
||||
2. 如果设置了`--replica-assignment`参数,则只会计算新增分区的分配方式,并不会修改原本已经分配好的分区结构。从源码可以看出来:假如之前的分配方式是`3,3,3`(3个分区,每个分区的1个副本都在BrokerId-3上),现在传入的参数是`3,3,3,3`(多出来一个分区),这个时候会把原有的3个截取掉,只保留最后一个`3`(表示在Broker-3上新增一个分区)
|
||||
3. 如果没有传入参数`--replica-assignment`,则后面会用默认分配策略分配
|
||||
|
||||
#### 客户端发起请求createPartitions
|
||||
|
||||
`KafkaAdminClient.createPartitions` 省略部分代码
|
||||
```java
|
||||
@Override
|
||||
public CreatePartitionsResult createPartitions(Map<String, NewPartitions> newPartitions,
|
||||
final CreatePartitionsOptions options) {
|
||||
final Map<String, KafkaFutureImpl<Void>> futures = new HashMap<>(newPartitions.size());
|
||||
for (String topic : newPartitions.keySet()) {
|
||||
futures.put(topic, new KafkaFutureImpl<>());
|
||||
}
|
||||
runnable.call(new Call("createPartitions", calcDeadlineMs(now, options.timeoutMs()),
|
||||
new ControllerNodeProvider()) {
|
||||
//省略部分代码
|
||||
@Override
|
||||
void handleFailure(Throwable throwable) {
|
||||
completeAllExceptionally(futures.values(), throwable);
|
||||
}
|
||||
}, now);
|
||||
return new CreatePartitionsResult(new HashMap<>(futures));
|
||||
}
|
||||
```
|
||||
1. 从源码中可以看到,是向`ControllerNodeProvider`提供的Controller节点发起了`createPartitions`请求
|
||||
|
||||
|
||||
### 2. Controller角色的服务端接受createPartitions请求处理逻辑
|
||||
>
|
||||
`KafkaApis.handleCreatePartitionsRequest`
|
||||
```scala
|
||||
def handleCreatePartitionsRequest(request: RequestChannel.Request): Unit = {
|
||||
val createPartitionsRequest = request.body[CreatePartitionsRequest]
|
||||
|
||||
//部分代码省略..
|
||||
|
||||
//如果当前不是Controller角色直接抛出异常
|
||||
if (!controller.isActive) {
|
||||
val result = createPartitionsRequest.data.topics.asScala.map { topic =>
|
||||
(topic.name, new ApiError(Errors.NOT_CONTROLLER, null))
|
||||
}.toMap
|
||||
sendResponseCallback(result)
|
||||
} else {
|
||||
// Special handling to add duplicate topics to the response
|
||||
val topics = createPartitionsRequest.data.topics.asScala
|
||||
val dupes = topics.groupBy(_.name)
|
||||
.filter { _._2.size > 1 }
|
||||
.keySet
|
||||
val notDuped = topics.filterNot(topic => dupes.contains(topic.name))
|
||||
val authorizedTopics = filterAuthorized(request, ALTER, TOPIC, notDuped.map(_.name))
|
||||
val (authorized, unauthorized) = notDuped.partition { topic => authorizedTopics.contains(topic.name) }
|
||||
|
||||
val (queuedForDeletion, valid) = authorized.partition { topic =>
|
||||
controller.topicDeletionManager.isTopicQueuedUpForDeletion(topic.name)
|
||||
}
|
||||
|
||||
val errors = dupes.map(_ -> new ApiError(Errors.INVALID_REQUEST, "Duplicate topic in request.")) ++
|
||||
unauthorized.map(_.name -> new ApiError(Errors.TOPIC_AUTHORIZATION_FAILED, "The topic authorization is failed.")) ++
|
||||
queuedForDeletion.map(_.name -> new ApiError(Errors.INVALID_TOPIC_EXCEPTION, "The topic is queued for deletion."))
|
||||
|
||||
adminManager.createPartitions(createPartitionsRequest.data.timeoutMs,
|
||||
valid,
|
||||
createPartitionsRequest.data.validateOnly,
|
||||
request.context.listenerName, result => sendResponseCallback(result ++ errors))
|
||||
}
|
||||
}
|
||||
|
||||
```
|
||||
1. 检验自身是不是Controller角色,不是的话就抛出异常终止流程
|
||||
2. 鉴权
|
||||
3. 调用` adminManager.createPartitions`
|
||||
3.1 从zk中获取`/brokers/ids/`下的Brokers列表元信息
|
||||
3.2 从zk获取`/brokers/topics/{topicName}`已经存在的副本分配方式,并判断是否有正在进行副本重分配的进程在执行,如果有的话就抛出异常结束流程
|
||||
3.3 如果从zk获取`/brokers/topics/{topicName}`数据不存在则抛出异常 `The topic '$topic' does not exist`
|
||||
3.4 检查修改的分区数是否比原来的分区数大,如果比原来还小或者等于原来分区数则抛出异常结束流程
|
||||
3.5 如果传入的参数`--replica-assignment` 中有不存在的BrokerId;则抛出异常`Unknown broker(s) in replica assignment`结束流程
|
||||
3.6 如果传入的`--partitions`数量 与`--replica-assignment`中新增的部分数量不匹配则抛出异常`Increasing the number of partitions by...` 结束流程
|
||||
3.7 调用`adminZkClient.addPartitions`
|
||||
|
||||
|
||||
#### ` adminZkClient.addPartitions` 添加分区
|
||||
|
||||
|
||||
1. 校验`--partitions`数量是否比存在的分区数大,否则异常`The number of partitions for a topic can only be increased`
|
||||
2. 如果传入了`--replica-assignment` ,则对副本进行一些简单的校验
|
||||
3. 调用`AdminUtils.assignReplicasToBrokers`分配副本 ; 这个我们在[【kafka源码】TopicCommand之创建Topic源码解析]() 也分析过; 具体请看[【kafka源码】创建Topic的时候是如何分区和副本的分配规则](); 当然这里由于我们是新增的分区,只会将新增的分区进行分配计算
|
||||
4. 得到分配规则之后,调用`adminZkClient.writeTopicPartitionAssignment` 写入
|
||||
|
||||
#### adminZkClient.writeTopicPartitionAssignment将分区信息写入zk中
|
||||

|
||||
|
||||
我们在 [【kafka源码】TopicCommand之创建Topic源码解析]()的时候也分析过这段代码,但是那个时候调用的是`zkClient.createTopicAssignment` 创建接口
|
||||
这里我们是调用` zkClient.setTopicAssignment` 写入接口, 写入当然会覆盖掉原有的信息,所以写入的时候会把原来分区信息获取到,重新写入;
|
||||
|
||||
1. 获取Topic原有分区副本分配信息
|
||||
2. 将原有的和现在要添加的组装成一个数据对象写入到zk节点`/brokers/topics/{topicName}`中
|
||||
|
||||
|
||||
### 3. Controller监控节点`/brokers/topics/{topicName}` ,真正在Broker上将分区写入磁盘
|
||||
监听到节点信息变更之后调用下面的接口;
|
||||
`KafkaController.processPartitionModifications`
|
||||
```scala
|
||||
private def processPartitionModifications(topic: String): Unit = {
|
||||
def restorePartitionReplicaAssignment(
|
||||
topic: String,
|
||||
newPartitionReplicaAssignment: Map[TopicPartition, ReplicaAssignment]
|
||||
): Unit = {
|
||||
info("Restoring the partition replica assignment for topic %s".format(topic))
|
||||
|
||||
val existingPartitions = zkClient.getChildren(TopicPartitionsZNode.path(topic))
|
||||
val existingPartitionReplicaAssignment = newPartitionReplicaAssignment
|
||||
.filter(p => existingPartitions.contains(p._1.partition.toString))
|
||||
.map { case (tp, _) =>
|
||||
tp -> controllerContext.partitionFullReplicaAssignment(tp)
|
||||
}.toMap
|
||||
|
||||
zkClient.setTopicAssignment(topic,
|
||||
existingPartitionReplicaAssignment,
|
||||
controllerContext.epochZkVersion)
|
||||
}
|
||||
|
||||
if (!isActive) return
|
||||
val partitionReplicaAssignment = zkClient.getFullReplicaAssignmentForTopics(immutable.Set(topic))
|
||||
val partitionsToBeAdded = partitionReplicaAssignment.filter { case (topicPartition, _) =>
|
||||
controllerContext.partitionReplicaAssignment(topicPartition).isEmpty
|
||||
}
|
||||
|
||||
if (topicDeletionManager.isTopicQueuedUpForDeletion(topic)) {
|
||||
if (partitionsToBeAdded.nonEmpty) {
|
||||
warn("Skipping adding partitions %s for topic %s since it is currently being deleted"
|
||||
.format(partitionsToBeAdded.map(_._1.partition).mkString(","), topic))
|
||||
|
||||
restorePartitionReplicaAssignment(topic, partitionReplicaAssignment)
|
||||
} else {
|
||||
// This can happen if existing partition replica assignment are restored to prevent increasing partition count during topic deletion
|
||||
info("Ignoring partition change during topic deletion as no new partitions are added")
|
||||
}
|
||||
} else if (partitionsToBeAdded.nonEmpty) {
|
||||
info(s"New partitions to be added $partitionsToBeAdded")
|
||||
partitionsToBeAdded.foreach { case (topicPartition, assignedReplicas) =>
|
||||
controllerContext.updatePartitionFullReplicaAssignment(topicPartition, assignedReplicas)
|
||||
}
|
||||
onNewPartitionCreation(partitionsToBeAdded.keySet)
|
||||
}
|
||||
}
|
||||
|
||||
```
|
||||
1. 判断是否Controller,不是则直接结束流程
|
||||
2. 获取`/brokers/topics/{topicName}`节点信息,再与内存中当前的分区分配信息做对比,看看有没有新增的分区;新增的分区此时还没有对应的`/brokers/topics/{topicName}/partitions/{分区号}/state`节点;
|
||||
3. 如果当前的TOPIC正在被删除中,那么就没有必要执行扩分区了
|
||||
4. 将新增加的分区信息加载到内存中
|
||||
5. 调用接口`KafkaController.onNewPartitionCreation`
|
||||
|
||||
#### KafkaController.onNewPartitionCreation 新增分区
|
||||
从这里开始 , 后面的流程就跟创建Topic的对应流程一样了;
|
||||
|
||||
> 该接口主要是针对新增分区和副本的一些状态流转过程; 在[【kafka源码】TopicCommand之创建Topic源码解析]() 也同样分析过
|
||||
|
||||
```scala
|
||||
/**
|
||||
* This callback is invoked by the topic change callback with the list of failed brokers as input.
|
||||
* It does the following -
|
||||
* 1. Move the newly created partitions to the NewPartition state
|
||||
* 2. Move the newly created partitions from NewPartition->OnlinePartition state
|
||||
*/
|
||||
private def onNewPartitionCreation(newPartitions: Set[TopicPartition]): Unit = {
|
||||
info(s"New partition creation callback for ${newPartitions.mkString(",")}")
|
||||
partitionStateMachine.handleStateChanges(newPartitions.toSeq, NewPartition)
|
||||
replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions).toSeq, NewReplica)
|
||||
partitionStateMachine.handleStateChanges(
|
||||
newPartitions.toSeq,
|
||||
OnlinePartition,
|
||||
Some(OfflinePartitionLeaderElectionStrategy(false))
|
||||
)
|
||||
replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions).toSeq, OnlineReplica)
|
||||
}
|
||||
```
|
||||
1. 将待创建的分区状态流转为`NewPartition`;
|
||||

|
||||
2. 将待创建的副本 状态流转为`NewReplica`;
|
||||

|
||||
3. 将分区状态从刚刚的`NewPartition`流转为`OnlinePartition`
|
||||
0. 获取`leaderIsrAndControllerEpochs`; Leader为副本的第一个;
|
||||
1. 向zk中写入`/brokers/topics/{topicName}/partitions/` 持久节点; 无数据
|
||||
2. 向zk中写入`/brokers/topics/{topicName}/partitions/{分区号}` 持久节点; 无数据
|
||||
3. 向zk中写入`/brokers/topics/{topicName}/partitions/{分区号}/state` 持久节点; 数据为`leaderIsrAndControllerEpoch`
|
||||
4. 向副本所属Broker发送[`leaderAndIsrRequest`]()请求
|
||||
5. 向所有Broker发送[`UPDATE_METADATA` ]()请求
|
||||
4. 将副本状态从刚刚的`NewReplica`流转为`OnlineReplica` ,更新下内存
|
||||
|
||||
关于分区状态机和副本状态机详情请看[【kafka源码】Controller中的状态机](TODO)
|
||||
|
||||
### 4. Broker收到LeaderAndIsrRequest 创建本地Log
|
||||
>上面步骤中有说到向副本所属Broker发送[`leaderAndIsrRequest`]()请求,那么这里做了什么呢
|
||||
>其实主要做的是 创建本地Log
|
||||
>
|
||||
代码太多,这里我们直接定位到只跟创建Topic相关的关键代码来分析
|
||||
`KafkaApis.handleLeaderAndIsrRequest->replicaManager.becomeLeaderOrFollower->ReplicaManager.makeLeaders...LogManager.getOrCreateLog`
|
||||
|
||||
```scala
|
||||
/**
|
||||
* 如果日志已经存在,只返回现有日志的副本否则如果 isNew=true 或者如果没有离线日志目录,则为给定的主题和给定的分区创建日志 否则抛出 KafkaStorageException
|
||||
*/
|
||||
def getOrCreateLog(topicPartition: TopicPartition, config: LogConfig, isNew: Boolean = false, isFuture: Boolean = false): Log = {
|
||||
logCreationOrDeletionLock synchronized {
|
||||
getLog(topicPartition, isFuture).getOrElse {
|
||||
// create the log if it has not already been created in another thread
|
||||
if (!isNew && offlineLogDirs.nonEmpty)
|
||||
throw new KafkaStorageException(s"Can not create log for $topicPartition because log directories ${offlineLogDirs.mkString(",")} are offline")
|
||||
|
||||
val logDirs: List[File] = {
|
||||
val preferredLogDir = preferredLogDirs.get(topicPartition)
|
||||
|
||||
if (isFuture) {
|
||||
if (preferredLogDir == null)
|
||||
throw new IllegalStateException(s"Can not create the future log for $topicPartition without having a preferred log directory")
|
||||
else if (getLog(topicPartition).get.dir.getParent == preferredLogDir)
|
||||
throw new IllegalStateException(s"Can not create the future log for $topicPartition in the current log directory of this partition")
|
||||
}
|
||||
|
||||
if (preferredLogDir != null)
|
||||
List(new File(preferredLogDir))
|
||||
else
|
||||
nextLogDirs()
|
||||
}
|
||||
|
||||
val logDirName = {
|
||||
if (isFuture)
|
||||
Log.logFutureDirName(topicPartition)
|
||||
else
|
||||
Log.logDirName(topicPartition)
|
||||
}
|
||||
|
||||
val logDir = logDirs
|
||||
.toStream // to prevent actually mapping the whole list, lazy map
|
||||
.map(createLogDirectory(_, logDirName))
|
||||
.find(_.isSuccess)
|
||||
.getOrElse(Failure(new KafkaStorageException("No log directories available. Tried " + logDirs.map(_.getAbsolutePath).mkString(", "))))
|
||||
.get // If Failure, will throw
|
||||
|
||||
val log = Log(
|
||||
dir = logDir,
|
||||
config = config,
|
||||
logStartOffset = 0L,
|
||||
recoveryPoint = 0L,
|
||||
maxProducerIdExpirationMs = maxPidExpirationMs,
|
||||
producerIdExpirationCheckIntervalMs = LogManager.ProducerIdExpirationCheckIntervalMs,
|
||||
scheduler = scheduler,
|
||||
time = time,
|
||||
brokerTopicStats = brokerTopicStats,
|
||||
logDirFailureChannel = logDirFailureChannel)
|
||||
|
||||
if (isFuture)
|
||||
futureLogs.put(topicPartition, log)
|
||||
else
|
||||
currentLogs.put(topicPartition, log)
|
||||
|
||||
info(s"Created log for partition $topicPartition in $logDir with properties " + s"{${config.originals.asScala.mkString(", ")}}.")
|
||||
// Remove the preferred log dir since it has already been satisfied
|
||||
preferredLogDirs.remove(topicPartition)
|
||||
|
||||
log
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
1. 如果日志已经存在,直接返回现有日志;否则,当 isNew=true 或者没有离线日志目录时,为给定主题的给定分区创建日志;不满足条件则抛出`KafkaStorageException`
|
||||
|
||||
详细请看 [【kafka源码】LeaderAndIsrRequest请求]()
|
||||
|
||||
|
||||
## 源码总结
|
||||
看图说话
|
||||

|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
## Q&A
|
||||
|
||||
### 如果自定义的分配Broker不存在会怎么样
|
||||
> 会抛出异常`Unknown broker(s) in replica assignment`, 因为在执行的时候会去zk获取当前的在线Broker列表,然后判断是否在线;
|
||||
|
||||
### 如果设置的分区数不等于 `--replica-assignment`中新增的数目会怎么样
|
||||
>会抛出异常`Increasing the number of partitions by..`结束流程
|
||||
|
||||
### 如果写入`/brokers/topics/{topicName}`之后 Controller监听到请求正好挂掉怎么办
|
||||
> Controller挂掉会发生重新选举,选举成功之后, 检查到`/brokers/topics/{topicName}`之后发现没有生成对应的分区,会自动执行接下来的流程;
|
||||
|
||||
|
||||
### 如果我手动在zk中写入节点`/brokers/topics/{topicName}/partitions/{分区号}/state` 会怎么样
|
||||
> Controller并没有监听这个节点,所以不会有变化; 但是当Controller发生重新选举的时候,
|
||||
> **被删除的节点会被重新添加回来;**
|
||||
>但是**写入的节点 就不会被删除了**;写入的节点信息会被保存在Controller内存中;
|
||||
>同样这会影响到分区扩容
|
||||
>
|
||||
>
|
||||
> ----
|
||||
> 例子🌰:
|
||||
> 当前分区3个,副本一个,手贱在zk上添加了一个节点如下图:
|
||||
> 
|
||||
> 这个时候我想扩展一个分区; 然后执行了脚本, 虽然`/brokers/topics/test_create_topic3`节点数据变更了, 但是Broker在处理`LeaderAndIsrRequest`请求时并没有创建本地Log文件; 这是因为源码读取到zk下面partitions的节点数量, 和新增之后的分区数量相比没有变化, 它就认为本次请求没有变更, 也就不会执行创建本地Log文件了;
|
||||
> 如果判断有变更,还是会去创建的;
|
||||
> 手贱zk写入N个partition节点 + 扩充N个分区 = Log文件不会被创建
|
||||
> 手贱zk写入N个partition节点 + 扩充>N个分区 = 正常扩容
|
||||
|
||||
### 如果直接修改节点/brokers/topics/{topicName}中的配置会怎么样
|
||||
>如果该节点信息是`{"version":2,"partitions":{"2":[1],"1":[1],"0":[1]},"adding_replicas":{},"removing_replicas":{}}` 看数据,说明3个分区1个副本都在Broker-1上;
|
||||
>我在zk上修改成`{"version":2,"partitions":{"2":[2],"1":[1],"0":[0]},"adding_replicas":{},"removing_replicas":{}}`
|
||||
>想将分区分配到 Broker-0,Broker-1,Broker-2上
|
||||
>TODO。。。
|
||||
|
||||
|
||||
|
||||
---
|
||||
<font color=red size=5>Tips:如果关于本篇文章你有疑问,可以在评论区留下,我会在**Q&A**部分进行解答 </font>
|
||||
|
||||
|
||||
|
||||
<font color=red size=2>PS: 文章阅读的源码版本是kafka-2.5 </font>
|
||||
597
docs/zh/Kafka分享/Kafka Controller /TopicCommand之创建Topic源码解析.md
Normal file
@@ -0,0 +1,597 @@
|
||||
|
||||
## 脚本参数
|
||||
|
||||
`sh bin/kafka-topics.sh --help` 查看更具体参数
|
||||
|
||||
下面只是列出了跟` --create` 相关的参数
|
||||
|
||||
| 参数 |描述 |例子|
|
||||
|--|--|--|
|
||||
|`--bootstrap-server`|指定连接到的kafka服务; 如果有这个参数,则 `--zookeeper`可以不需要|--bootstrap-server localhost:9092 |
|
||||
|`--zookeeper`|弃用, 通过zk的连接方式连接到kafka集群;|--zookeeper localhost:2181 或者localhost:2181/kafka|
|
||||
|`--replication-factor `|副本数量,注意不能大于broker数量;如果不提供,则会用集群中默认配置|--replication-factor 3 |
|
||||
|`--partitions`|分区数量;当创建或者修改topic的时候,用这个来指定分区数;如果创建的时候没有提供参数,则用集群中默认值; 注意如果是修改的时候,分区比之前小会有问题|--partitions 3 |
|
||||
|`--replica-assignment `|副本分区分配方式;创建topic的时候可以自己指定副本分配情况; |`--replica-assignment` BrokerId-0:BrokerId-1:BrokerId-2,BrokerId-1:BrokerId-2:BrokerId-0,BrokerId-2:BrokerId-1:BrokerId-0 ; 这个意思是有三个分区和三个副本,对应分配的Broker; 逗号隔开标识分区;冒号隔开表示副本|
|
||||
| `--config `<String: name=value> |用来设置topic级别的配置以覆盖默认配置;**只在--create 和--bootstrap-server 同时使用时候生效**; 可以配置的参数列表请看文末附件 |例如覆盖两个配置 `--config retention.bytes=123455 --config retention.ms=600001`|
|
||||
|`--command-config` <String: command 文件路径> |用来配置客户端Admin Client启动配置,**只在--bootstrap-server 同时使用时候生效**;|例如:设置请求的超时时间 `--command-config config/producer.properties `; 然后在文件中配置 request.timeout.ms=300000|
|
||||
|`--create`|命令方式; 表示当前请求是创建Topic|`--create`|
|
||||
|
||||
|
||||
|
||||
|
||||
## 创建Topic脚本
|
||||
**zk方式(不推荐)**
|
||||
```shell
|
||||
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 3 --topic test
|
||||
```
|
||||
<font color="red">需要注意的是--zookeeper后面接的是kafka的zk配置, 假如你配置的是localhost:2181/kafka 带命名空间的这种,不要漏掉了 </font>
|
||||
|
||||
**kafka版本 >= 2.2 支持下面方式(推荐)**
|
||||
```shell
|
||||
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 3 --partitions 3 --topic test
|
||||
```
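结合上面表格中的 `--config` 和 `--command-config` 参数,一个带覆盖配置的创建命令大致长这样(仅作示意,`config/client.properties` 是假设的客户端配置文件,里面可以放 request.timeout.ms=300000 之类的配置):

```sh
# 创建 Topic 的同时覆盖两个 Topic 级别配置,并指定客户端配置文件
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 \
  --replication-factor 3 --partitions 3 --topic test \
  --config retention.bytes=123455 --config retention.ms=600001 \
  --command-config config/client.properties
```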
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
当前分析的kafka源码版本为 `kafka-2.5`
|
||||
|
||||
## 创建Topic 源码分析
|
||||
<font color="red">温馨提示: 如果阅读源码略显枯燥,你可以直接看源码总结以及后面部分</font>
|
||||
|
||||
首先我们找到源码入口处, 查看一下 `kafka-topic.sh`脚本的内容
|
||||
`exec $(dirname $0)/kafka-run-class.sh kafka.admin.TopicCommand "$@"`
|
||||
最终是执行了`kafka.admin.TopicCommand`这个类,找到这个地方之后就可以断点调试源码了,用IDEA启动
|
||||

|
||||
记得配置一下入参
|
||||
比如: `--create --bootstrap-server 127.0.0.1:9092 --partitions 3 --topic test_create_topic3`
|
||||

|
||||
|
||||
|
||||
### 1. 源码入口
|
||||

|
||||
上面的源码主要作用是
|
||||
1. 根据是否有传入参数`--zookeeper` 来判断创建哪一种 对象`topicService`
|
||||
如果传入了`--zookeeper` 则创建 类 `ZookeeperTopicService`的对象
|
||||
否则创建类`AdminClientTopicService`的对象(我们主要分析这个对象)
|
||||
2. 根据传入的参数类型判断是创建topic还是删除等等其他 判断依据是 是否在参数里传入了`--create`
|
||||
|
||||
|
||||
### 2. 创建AdminClientTopicService 对象
|
||||
> `val topicService = new AdminClientTopicService(createAdminClient(commandConfig, bootstrapServer))`
|
||||
|
||||
#### 2.1 先创建 Admin
|
||||
```scala
|
||||
object AdminClientTopicService {
|
||||
def createAdminClient(commandConfig: Properties, bootstrapServer: Option[String]): Admin = {
|
||||
bootstrapServer match {
|
||||
case Some(serverList) => commandConfig.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, serverList)
|
||||
case None =>
|
||||
}
|
||||
Admin.create(commandConfig)
|
||||
}
|
||||
|
||||
def apply(commandConfig: Properties, bootstrapServer: Option[String]): AdminClientTopicService =
|
||||
new AdminClientTopicService(createAdminClient(commandConfig, bootstrapServer))
|
||||
}
|
||||
```
|
||||
|
||||
1. 如果有入参`--command-config` ,则将这个文件里面的参数都放到map `commandConfig`里面, 并且也加入`bootstrap.servers`的参数;假如配置文件里面已经有了`bootstrap.servers`配置,那么会将其覆盖
|
||||
2. 将上面的`commandConfig` 作为入参调用`Admin.create(commandConfig)`创建 Admin; 这个时候调用的Client模块的代码了, 从这里我们就可以看出,我们调用`kafka-topic.sh`脚本实际上是kafka模拟了一个客户端`Client`来创建Topic的过程;
|
||||

|
||||
|
||||
|
||||
|
||||
### 3. AdminClientTopicService.createTopic 创建Topic
|
||||
` topicService.createTopic(opts)`
|
||||
|
||||
```scala
|
||||
case class AdminClientTopicService private (adminClient: Admin) extends TopicService {
|
||||
|
||||
override def createTopic(topic: CommandTopicPartition): Unit = {
|
||||
//如果配置了副本副本数--replication-factor 一定要大于0
|
||||
if (topic.replicationFactor.exists(rf => rf > Short.MaxValue || rf < 1))
|
||||
throw new IllegalArgumentException(s"The replication factor must be between 1 and ${Short.MaxValue} inclusive")
|
||||
//如果配置了--partitions 分区数 必须大于0
|
||||
if (topic.partitions.exists(partitions => partitions < 1))
|
||||
throw new IllegalArgumentException(s"The partitions must be greater than 0")
|
||||
|
||||
//查询是否已经存在该Topic
|
||||
if (!adminClient.listTopics().names().get().contains(topic.name)) {
|
||||
val newTopic = if (topic.hasReplicaAssignment)
|
||||
//如果指定了--replica-assignment参数;则按照指定的来分配副本
|
||||
new NewTopic(topic.name, asJavaReplicaReassignment(topic.replicaAssignment.get))
|
||||
else {
|
||||
new NewTopic(
|
||||
topic.name,
|
||||
topic.partitions.asJava,
|
||||
topic.replicationFactor.map(_.toShort).map(Short.box).asJava)
|
||||
}
|
||||
|
||||
// 将配置--config 解析成一个配置map
|
||||
val configsMap = topic.configsToAdd.stringPropertyNames()
|
||||
.asScala
|
||||
.map(name => name -> topic.configsToAdd.getProperty(name))
|
||||
.toMap.asJava
|
||||
|
||||
newTopic.configs(configsMap)
|
||||
//调用adminClient创建Topic
|
||||
val createResult = adminClient.createTopics(Collections.singleton(newTopic))
|
||||
createResult.all().get()
|
||||
println(s"Created topic ${topic.name}.")
|
||||
} else {
|
||||
throw new IllegalArgumentException(s"Topic ${topic.name} already exists")
|
||||
}
|
||||
}
|
||||
```
|
||||
1. 检查各项入参是否有问题
|
||||
2. `adminClient.listTopics()`,然后比较是否已经存在待创建的Topic;如果存在抛出异常;
|
||||
3. 判断是否配置了参数`--replica-assignment` ; 如果配置了,那么Topic就会按照指定的方式来配置副本情况
|
||||
4. 解析配置`--config ` 配置放到` configsMap`中; `configsMap`给到`NewTopic`对象
|
||||
5. 调用`adminClient.createTopics`创建Topic; 它是如何创建Topic的呢?往下分析源码
|
||||
|
||||
#### 3.1 KafkaAdminClient.createTopics(NewTopic) 创建Topic
|
||||
|
||||
```java
|
||||
@Override
|
||||
public CreateTopicsResult createTopics(final Collection<NewTopic> newTopics,
|
||||
final CreateTopicsOptions options) {
|
||||
|
||||
//省略部分源码...
|
||||
Call call = new Call("createTopics", calcDeadlineMs(now, options.timeoutMs()),
|
||||
new ControllerNodeProvider()) {
|
||||
|
||||
@Override
|
||||
public CreateTopicsRequest.Builder createRequest(int timeoutMs) {
|
||||
return new CreateTopicsRequest.Builder(
|
||||
new CreateTopicsRequestData().
|
||||
setTopics(topics).
|
||||
setTimeoutMs(timeoutMs).
|
||||
setValidateOnly(options.shouldValidateOnly()));
|
||||
}
|
||||
|
||||
@Override
|
||||
public void handleResponse(AbstractResponse abstractResponse) {
|
||||
//省略
|
||||
}
|
||||
|
||||
@Override
|
||||
void handleFailure(Throwable throwable) {
|
||||
completeAllExceptionally(topicFutures.values(), throwable);
|
||||
}
|
||||
};
|
||||
|
||||
}
|
||||
```
|
||||
这个代码里面主要看下Call里面的接口; 先不管Kafka如何跟服务端进行通信的细节; 我们主要关注创建Topic的逻辑;
|
||||
1. `createRequest`会构造一个请求参数`CreateTopicsRequest` 例如下图
|
||||

|
||||
2. 选择ControllerNodeProvider这个节点发起网络请求
|
||||

|
||||
可以清楚的看到, 创建Topic这个操作是需要Controller来执行的;
|
||||

|
||||
|
||||
|
||||
|
||||
|
||||
### 4. 发起网络请求
|
||||
[==>服务端客户端网络模型 ](TODO)
|
||||
|
||||
### 5. Controller角色的服务端接受请求处理逻辑
|
||||
首先找到服务端处理客户端请求的 **源码入口** ⇒ `KafkaRequestHandler.run()`
|
||||
|
||||
|
||||
主要看里面的 `apis.handle(request)` 方法; 可以看到客户端的请求都在`request.bodyAndSize()`里面
|
||||

|
||||
#### 5.1 KafkaApis.handle(request) 根据请求传递Api调用不同接口
|
||||
进入方法可以看到根据`request.header.apiKey` 调用对应的方法,客户端传过来的是`CreateTopics`
|
||||

|
||||
|
||||
#### 5.2 KafkaApis.handleCreateTopicsRequest 处理创建Topic的请求
|
||||
|
||||
```java
|
||||
|
||||
def handleCreateTopicsRequest(request: RequestChannel.Request): Unit = {
|
||||
// 部分代码省略
|
||||
//如果当前Broker不是属于Controller的话,就抛出异常
|
||||
if (!controller.isActive) {
|
||||
createTopicsRequest.data.topics.asScala.foreach { topic =>
|
||||
results.add(new CreatableTopicResult().setName(topic.name).
|
||||
setErrorCode(Errors.NOT_CONTROLLER.code))
|
||||
}
|
||||
sendResponseCallback(results)
|
||||
} else {
|
||||
// 部分代码省略
|
||||
}
|
||||
adminManager.createTopics(createTopicsRequest.data.timeoutMs,
|
||||
createTopicsRequest.data.validateOnly,
|
||||
toCreate,
|
||||
authorizedForDescribeConfigs,
|
||||
handleCreateTopicsResults)
|
||||
}
|
||||
}
|
||||
|
||||
```
|
||||
1. 判断当前处理的broker是不是Controller,如果不是Controller的话直接抛出异常,从这里可以看出,CreateTopic这个操作必须是Controller来进行, 出现这种情况有可能是客户端发起请求的时候Controller已经变更;
|
||||
2. 鉴权 [【Kafka源码】kafka鉴权机制]()
|
||||
3. 调用`adminManager.createTopics()`
|
||||
|
||||
#### 5.3 adminManager.createTopics()
|
||||
> 创建主题并等待主题完全创建完成,回调函数将会在超时、错误、或者主题创建完成时触发
|
||||
|
||||
该方法过长,省略部分代码
|
||||
```scala
|
||||
def createTopics(timeout: Int,
|
||||
validateOnly: Boolean,
|
||||
toCreate: Map[String, CreatableTopic],
|
||||
includeConfigsAndMetatadata: Map[String, CreatableTopicResult],
|
||||
responseCallback: Map[String, ApiError] => Unit): Unit = {
|
||||
|
||||
// 1. map over topics creating assignment and calling zookeeper
|
||||
val brokers = metadataCache.getAliveBrokers.map { b => kafka.admin.BrokerMetadata(b.id, b.rack) }
|
||||
val metadata = toCreate.values.map(topic =>
|
||||
try {
|
||||
//省略部分代码
|
||||
//检查Topic是否存在
|
||||
//检查 --replica-assignment参数和 (--partitions || --replication-factor ) 不能同时使用
|
||||
// 如果(--partitions || --replication-factor ) 没有设置,则使用 Broker的配置(这个Broker肯定是Controller)
|
||||
// 计算分区副本分配方式
|
||||
|
||||
createTopicPolicy match {
|
||||
case Some(policy) =>
|
||||
//省略部分代码
|
||||
adminZkClient.validateTopicCreate(topic.name(), assignments, configs)
|
||||
if (!validateOnly)
|
||||
adminZkClient.createTopicWithAssignment(topic.name, configs, assignments)
|
||||
|
||||
case None =>
|
||||
if (validateOnly)
|
||||
//校验创建topic的参数准确性
|
||||
adminZkClient.validateTopicCreate(topic.name, assignments, configs)
|
||||
else
|
||||
//把topic相关数据写入到zk中
|
||||
adminZkClient.createTopicWithAssignment(topic.name, configs, assignments)
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
```
|
||||
1. 做一些校验检查
|
||||
①.检查Topic是否存在
|
||||
②. 检查` --replica-assignment`参数和 (`--partitions || --replication-factor` ) 不能同时使用
|
||||
③.如果(`--partitions || --replication-factor` ) 没有设置,则使用 Broker的配置(这个Broker肯定是Controller)
|
||||
④.计算分区副本分配方式
|
||||
|
||||
2. `createTopicPolicy` 根据Broker是否配置了创建Topic的自定义校验策略; 使用方式是自定义实现`org.apache.kafka.server.policy.CreateTopicPolicy`接口;并 在服务器配置 `create.topic.policy.class.name=自定义类`; 比如我就想所有创建Topic的请求分区数都要大于10; 那么这里就可以实现你的需求了
|
||||
3. `createTopicWithAssignment`把topic相关数据写入到zk中; 进去分析一下
|
||||
|
||||
|
||||
|
||||
#### 5.4 写入zookeeper数据
|
||||
我们进入到` adminZkClient.createTopicWithAssignment(topic.name, configs, assignments)
|
||||
`看看有哪些数据写入到了zk中;
|
||||
```scala
|
||||
def createTopicWithAssignment(topic: String,
|
||||
config: Properties,
|
||||
partitionReplicaAssignment: Map[Int, Seq[Int]]): Unit = {
|
||||
validateTopicCreate(topic, partitionReplicaAssignment, config)
|
||||
|
||||
// 将topic单独的配置写入到zk中
|
||||
zkClient.setOrCreateEntityConfigs(ConfigType.Topic, topic, config)
|
||||
|
||||
// 将topic分区相关信息写入zk中
|
||||
writeTopicPartitionAssignment(topic, partitionReplicaAssignment.mapValues(ReplicaAssignment(_)).toMap, isUpdate = false)
|
||||
}
|
||||
|
||||
```
|
||||
源码就不再深入了,这里直接详细说明一下
|
||||
|
||||
**写入Topic配置信息**
|
||||
1. 先调用`SetDataRequest`请求往节点` /config/topics/Topic名称` 写入数据; 这里
|
||||
一般这个时候都会返回 `NONODE (NoNode)`;节点不存在; 假如zk已经存在节点就直接覆盖掉
|
||||
2. 节点不存在的话,就发起`CreateRequest`请求,写入数据; 并且节点类型是**持久节点**
|
||||
|
||||
这里写入的数据,是我们入参时候传的topic配置`--config`; 这里的配置会覆盖默认配置
|
||||
|
||||
**写入Topic分区副本信息**
|
||||
1. 将已经分配好的副本分配策略写入到 `/brokers/topics/Topic名称` 中; 节点类型 **持久节点**
|
||||

|
||||
|
||||
**具体跟zk交互的地方在**
|
||||
`ZookeeperClient.send()` 这里包装了很多跟zk的交互;
|
||||

|
||||
### 6. Controller监听 `/brokers/topics/Topic名称`, 通知Broker将分区写入磁盘
|
||||
> Controller 有监听zk上的一些节点; 在上面的流程中已经在zk中写入了 `/brokers/topics/Topic名称` ; 这个时候Controller就监听到了这个变化并相应;
|
||||
|
||||
`KafkaController.processTopicChange`
|
||||
```scala
|
||||
|
||||
private def processTopicChange(): Unit = {
|
||||
//如果处理的不是Controller角色就返回
|
||||
if (!isActive) return
|
||||
//从zk中获取 `/brokers/topics 所有Topic
|
||||
val topics = zkClient.getAllTopicsInCluster
|
||||
//找出哪些是新增的
|
||||
val newTopics = topics -- controllerContext.allTopics
|
||||
//找出哪些Topic在zk上被删除了
|
||||
val deletedTopics = controllerContext.allTopics -- topics
|
||||
controllerContext.allTopics = topics
|
||||
|
||||
|
||||
registerPartitionModificationsHandlers(newTopics.toSeq)
|
||||
val addedPartitionReplicaAssignment = zkClient.getFullReplicaAssignmentForTopics(newTopics)
|
||||
deletedTopics.foreach(controllerContext.removeTopic)
|
||||
addedPartitionReplicaAssignment.foreach {
|
||||
case (topicAndPartition, newReplicaAssignment) => controllerContext.updatePartitionFullReplicaAssignment(topicAndPartition, newReplicaAssignment)
|
||||
}
|
||||
info(s"New topics: [$newTopics], deleted topics: [$deletedTopics], new partition replica assignment " +
|
||||
s"[$addedPartitionReplicaAssignment]")
|
||||
if (addedPartitionReplicaAssignment.nonEmpty)
|
||||
onNewPartitionCreation(addedPartitionReplicaAssignment.keySet)
|
||||
}
|
||||
```
|
||||
1. 从zk中获取 `/brokers/topics` 下的所有Topic,跟当前Controller内存中的所有Topic(`controllerContext.allTopics`)做差异对比;就可以找到新增的Topic,以及在zk中被删除了的Topic(被删除的Topic会从当前内存中remove掉)
|
||||
2. 从zk中获取`/brokers/topics/{TopicName}` 给定主题的副本分配。并保存在内存中
|
||||
|
||||
3. 执行`onNewPartitionCreation`;分区状态开始流转
|
||||
|
||||
#### 6.1 onNewPartitionCreation 状态流转
|
||||
> 关于Controller的状态机 详情请看: [【kafka源码】Controller中的状态机](TODO)
|
||||
|
||||
```scala
|
||||
/**
|
||||
* This callback is invoked by the topic change callback with the list of failed brokers as input.
|
||||
* It does the following -
|
||||
* 1. Move the newly created partitions to the NewPartition state
|
||||
* 2. Move the newly created partitions from NewPartition->OnlinePartition state
|
||||
*/
|
||||
private def onNewPartitionCreation(newPartitions: Set[TopicPartition]): Unit = {
|
||||
info(s"New partition creation callback for ${newPartitions.mkString(",")}")
|
||||
partitionStateMachine.handleStateChanges(newPartitions.toSeq, NewPartition)
|
||||
replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions).toSeq, NewReplica)
|
||||
partitionStateMachine.handleStateChanges(
|
||||
newPartitions.toSeq,
|
||||
OnlinePartition,
|
||||
Some(OfflinePartitionLeaderElectionStrategy(false))
|
||||
)
|
||||
replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions).toSeq, OnlineReplica)
|
||||
}
|
||||
```
|
||||
1. 将待创建的分区状态流转为`NewPartition`;
|
||||

|
||||
2. 将待创建的副本 状态流转为`NewReplica`;
|
||||

|
||||
3. 将分区状态从刚刚的`NewPartition`流转为`OnlinePartition`
|
||||
0. 获取`leaderIsrAndControllerEpochs`; Leader为副本的第一个;
|
||||
1. 向zk中写入`/brokers/topics/{topicName}/partitions/` 持久节点; 无数据
|
||||
2. 向zk中写入`/brokers/topics/{topicName}/partitions/{分区号}` 持久节点; 无数据
|
||||
3. 向zk中写入`/brokers/topics/{topicName}/partitions/{分区号}/state` 持久节点; 数据为`leaderIsrAndControllerEpoch`
|
||||
4. 向副本所属Broker发送[`leaderAndIsrRequest`]()请求
|
||||
5. 向所有Broker发送[`UPDATE_METADATA` ]()请求
|
||||
4. 将副本状态从刚刚的`NewReplica`流转为`OnlineReplica` ,更新下内存
|
||||
|
||||
关于分区状态机和副本状态机详情请看[【kafka源码】Controller中的状态机](TODO)
|
||||
|
||||
### 7. Broker收到LeaderAndIsrRequest 创建本地Log
|
||||
>上面步骤中有说到向副本所属Broker发送[`leaderAndIsrRequest`]()请求,那么这里做了什么呢
|
||||
>其实主要做的是 创建本地Log
|
||||
>
|
||||
代码太多,这里我们直接定位到只跟创建Topic相关的关键代码来分析
|
||||
`KafkaApis.handleLeaderAndIsrRequest->replicaManager.becomeLeaderOrFollower->ReplicaManager.makeLeaders...LogManager.getOrCreateLog`
|
||||
|
||||
```scala
|
||||
/**
|
||||
* 如果日志已经存在,只返回现有日志的副本否则如果 isNew=true 或者如果没有离线日志目录,则为给定的主题和给定的分区创建日志 否则抛出 KafkaStorageException
|
||||
*/
|
||||
def getOrCreateLog(topicPartition: TopicPartition, config: LogConfig, isNew: Boolean = false, isFuture: Boolean = false): Log = {
|
||||
logCreationOrDeletionLock synchronized {
|
||||
getLog(topicPartition, isFuture).getOrElse {
|
||||
// create the log if it has not already been created in another thread
|
||||
if (!isNew && offlineLogDirs.nonEmpty)
|
||||
throw new KafkaStorageException(s"Can not create log for $topicPartition because log directories ${offlineLogDirs.mkString(",")} are offline")
|
||||
|
||||
val logDirs: List[File] = {
|
||||
val preferredLogDir = preferredLogDirs.get(topicPartition)
|
||||
|
||||
if (isFuture) {
|
||||
if (preferredLogDir == null)
|
||||
throw new IllegalStateException(s"Can not create the future log for $topicPartition without having a preferred log directory")
|
||||
else if (getLog(topicPartition).get.dir.getParent == preferredLogDir)
|
||||
throw new IllegalStateException(s"Can not create the future log for $topicPartition in the current log directory of this partition")
|
||||
}
|
||||
|
||||
if (preferredLogDir != null)
|
||||
List(new File(preferredLogDir))
|
||||
else
|
||||
nextLogDirs()
|
||||
}
|
||||
|
||||
val logDirName = {
|
||||
if (isFuture)
|
||||
Log.logFutureDirName(topicPartition)
|
||||
else
|
||||
Log.logDirName(topicPartition)
|
||||
}
|
||||
|
||||
val logDir = logDirs
|
||||
.toStream // to prevent actually mapping the whole list, lazy map
|
||||
.map(createLogDirectory(_, logDirName))
|
||||
.find(_.isSuccess)
|
||||
.getOrElse(Failure(new KafkaStorageException("No log directories available. Tried " + logDirs.map(_.getAbsolutePath).mkString(", "))))
|
||||
.get // If Failure, will throw
|
||||
|
||||
val log = Log(
|
||||
dir = logDir,
|
||||
config = config,
|
||||
logStartOffset = 0L,
|
||||
recoveryPoint = 0L,
|
||||
maxProducerIdExpirationMs = maxPidExpirationMs,
|
||||
producerIdExpirationCheckIntervalMs = LogManager.ProducerIdExpirationCheckIntervalMs,
|
||||
scheduler = scheduler,
|
||||
time = time,
|
||||
brokerTopicStats = brokerTopicStats,
|
||||
logDirFailureChannel = logDirFailureChannel)
|
||||
|
||||
if (isFuture)
|
||||
futureLogs.put(topicPartition, log)
|
||||
else
|
||||
currentLogs.put(topicPartition, log)
|
||||
|
||||
info(s"Created log for partition $topicPartition in $logDir with properties " + s"{${config.originals.asScala.mkString(", ")}}.")
|
||||
// Remove the preferred log dir since it has already been satisfied
|
||||
preferredLogDirs.remove(topicPartition)
|
||||
|
||||
log
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
1. 如果日志已经存在,直接返回现有日志;否则,当 isNew=true 或者没有离线日志目录时,为给定主题的给定分区创建日志;不满足条件则抛出`KafkaStorageException`
|
||||
|
||||
详细请看 [【kafka源码】LeaderAndIsrRequest请求]()
|
||||
|
||||
|
||||
## 源码总结
|
||||
> 如果上面的源码分析,你不想看,那么你可以直接看这里的简洁叙述
|
||||
|
||||
1. 根据是否有传入参数`--zookeeper` 来判断创建哪一种 对象`topicService`
|
||||
如果传入了`--zookeeper` 则创建 类 `ZookeeperTopicService`的对象
|
||||
否则创建类`AdminClientTopicService`的对象(我们主要分析这个对象)
|
||||
2. 如果有入参`--command-config` ,则将这个文件里面的参数都放到map类型的 `commandConfig`里面, 并且也加入`bootstrap.servers`的参数;假如配置文件里面已经有了`bootstrap.servers`配置,那么会将其覆盖
|
||||
3. 将上面的`commandConfig `作为入参调用`Admin.create(commandConfig)`创建 Admin; 这个时候调用的Client模块的代码了, 从这里我们就可以猜测,我们调用`kafka-topic.sh`脚本实际上是kafka模拟了一个客户端Client来创建Topic的过程;
|
||||
4. 一些异常检查
|
||||
①.如果配置了副本副本数--replication-factor 一定要大于0
|
||||
②.如果配置了--partitions 分区数 必须大于0
|
||||
③.去zk查询是否已经存在该Topic
|
||||
5. 判断是否配置了参数`--replica-assignment` ; 如果配置了,那么Topic就会按照指定的方式来配置副本情况
|
||||
6. 解析配置`--config ` 配置放到`configsMap`中; configsMap给到NewTopic对象
|
||||
7. **将上面所有的参数包装成一个请求参数`CreateTopicsRequest` ;然后找到是`Controller`的节点发起请求(`ControllerNodeProvider`)**
|
||||
8. 服务端收到请求之后,开始根据`CreateTopicsRequest`来调用创建Topic的方法; 不过首先要判断一下自己这个时候是不是`Controller`; 有可能这个时候Controller重新选举了; 这个时候要抛出异常
|
||||
9. 服务端进行一下请求参数检查
|
||||
①.检查Topic是否存在
|
||||
②.检查 `--replica-assignment`参数和 (`--partitions` || `--replication-factor` ) 不能同时使用
|
||||
10. 如果(`--partitions` || `--replication-factor` ) 没有设置,则使用 Broker的默认配置(这个Broker肯定是Controller)
|
||||
11. 计算分区副本分配方式;如果传入了 `--replica-assignment`,则会按照自定义参数进行组装;否则的话系统会自动计算分配方式; 具体详情请看 [【kafka源码】创建Topic的时候是如何分区和副本的分配规则 ]()
|
||||
12. `createTopicPolicy `根据Broker是否配置了创建Topic的自定义校验策略; 使用方式是自定义实现`org.apache.kafka.server.policy.CreateTopicPolicy`接口;并 在服务器配置 `create.topic.policy.class.name`=自定义类; 比如我就想所有创建Topic的请求分区数都要大于10; 那么这里就可以实现你的需求了
|
||||
13. **zk中写入Topic配置信息** 发起`CreateRequest`请求,这里写入的数据,是我们入参时候传的topic配置`--config`; 这里的配置会覆盖默认配置;并且节点类型是持久节点;**path** = `/config/topics/Topic名称`
|
||||
14. **zk中写入Topic分区副本信息** 发起`CreateRequest`请求 ,将已经分配好的副本分配策略 写入到 `/brokers/topics/Topic名称 `中; 节点类型 持久节点
|
||||
15. `Controller`监听zk上面的topic信息; 根据zk上变更的topic信息;计算出新增/删除了哪些Topic; 然后拿到新增Topic的 副本分配信息; 并做一些状态流转
|
||||
16. 向新增Topic所在Broker发送`leaderAndIsrRequest`请求,
|
||||
17. Broker收到`发送leaderAndIsrRequest请求`; 创建副本Log文件;
|
||||
|
||||

|
||||
|
||||
|
||||
## Q&A
|
||||
|
||||
|
||||
### 创建Topic的时候 在Zk上创建了哪些节点
|
||||
>接受客户端请求阶段:
|
||||
>1. topic的配置信息 ` /config/topics/Topic名称` 持久节点
|
||||
>2. topic的分区信息`/brokers/topics/Topic名称` 持久节点
|
||||
>
|
||||
>Controller监听zk节点`/brokers/topics`变更阶段
|
||||
>1. `/brokers/topics/{topicName}/partitions/ `持久节点; 无数据
|
||||
>2. 向zk中写入`/brokers/topics/{topicName}/partitions/{分区号}` 持久节点; 无数据
|
||||
>3. 向zk中写入`/brokers/topics/{topicName}/partitions/{分区号}/state` 持久节点;
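这些节点都可以用 zookeeper-shell 直接确认;下面以前文用过的 test_create_topic3 为例(zk 地址为假设值):

```sh
# 客户端写入的两类节点
bin/zookeeper-shell.sh localhost:2181 get /config/topics/test_create_topic3
bin/zookeeper-shell.sh localhost:2181 get /brokers/topics/test_create_topic3

# Controller 处理之后生成的分区节点及状态节点
bin/zookeeper-shell.sh localhost:2181 ls /brokers/topics/test_create_topic3/partitions
bin/zookeeper-shell.sh localhost:2181 get /brokers/topics/test_create_topic3/partitions/0/state
```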
|
||||
|
||||
### 创建Topic的时候 什么时候在Broker磁盘上创建的日志文件
|
||||
>当Controller监听zk节点`/brokers/topics`变更之后,将新增的Topic 解析好的分区状态流转
|
||||
>`NonExistentPartition`->`NewPartition`->`OnlinePartition` 当流转到`OnlinePartition`的时候会向分区分配到的Broker发送一个`leaderAndIsrRequest`请求,当Broker们收到这个请求之后,根据请求参数做一些处理,其中就包括检查自身有没有这个分区副本的本地Log;如果没有的话就重新创建;
|
||||
### 如果我没有指定分区数或者副本数,那么会如何创建
|
||||
>我们都知道,如果我们没有指定分区数或者副本数, 则默认使用Broker的配置, 那么这么多Broker,假如不小心默认值配置不一样,那究竟使用哪一个呢? 那肯定是哪台机器执行创建topic的过程,就是使用谁的配置;
|
||||
**所以是谁执行的?** 那肯定是Controller啊! 上面的源码我们分析到了,创建的过程,会指定Controller这台机器去进行;
|
||||
|
||||
|
||||
### 如果我手动删除了`/brokers/topics/`下的某个节点会怎么样?
|
||||
>在Controller中的内存中更新一下相关信息
|
||||
>其他Broker呢?TODO.
|
||||
|
||||
### 如果我手动在zk中添加`/brokers/topics/{TopicName}`节点会怎么样
|
||||
>**先说结论:** 根据上面分析过的源码画出的时序图可以指定; 客户端发起创建Topic的请求,本质上是去zk里面写两个数据
|
||||
>1. topic的配置信息 ` /config/topics/Topic名称` 持久节点
|
||||
>2. topic的分区信息`/brokers/topics/Topic名称` 持久节点
|
||||
>所以我们绕过这一步骤直接去写入数据,可以达到一样的效果;不过我们的数据需要保证准确
|
||||
>因为在这一步已经没有了一些基本的校验了; 假如这一步我们写入的副本BrokerId不存在会怎样?从时序图中可以看到,`leaderAndIsrRequest`请求就不会被发送到那个不存在的BrokerId上,那台机器自然也就不会创建Log文件;
|
||||
>
|
||||
>
|
||||
>**下面不妨让我们来验证一下;**
|
||||
>创建一个节点`/brokers/topics/create_topic_byhand_zk` 节点数据为下面数据;
|
||||
>```
|
||||
>{"version":2,"partitions":{"2":[3],"1":[3],"0":[3]},"adding_replicas":{},"removing_replicas":{}}
|
||||
>```
|
||||
>
|
||||
>这里我用的工具`PRETTYZOO`手动创建的,你也可以用命令行创建;
|
||||
>创建完成之后我们再看看本地有没有生成一个Log文件
|
||||
>
|
||||
>可以看到我们指定的Broker,已经生成了对应的分区副本Log文件;
|
||||
>而且zk中也写入了其他的数据
|
||||
>`在我们写入zk数据的时候,就已经确定好了每个分区的Leader是谁,那就是第一个副本默认为Leader`
|
||||
>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
### 如果写入`/brokers/topics/{TopicName}`节点之后Controller挂掉了会怎么样
|
||||
> **先说结论**:Controller 重新选举的时候,会有一些初始化的操作; 会把创建过程继续下去
|
||||
|
||||
> 然后我们来模拟这么一个过程,先停止集群,然后再zk中写入`/brokers/topics/{TopicName}`节点数据; 然后再启动一台Broker;
|
||||
> **源码分析:** 我们之前分析过[Controller的启动过程与选举]() 有提到过,这里再提一下Controller当选之后有一个地方处理这个事情
|
||||
> ```
|
||||
> replicaStateMachine.startup()
|
||||
> partitionStateMachine.startup()
|
||||
> ```
|
||||
> 启动状态机的过程是不是跟上面的**6.1 onNewPartitionCreation 状态流转** 的过程很像; 最终都把状态流转到了`OnlinePartition`; 伴随着的是不是发起了`leaderAndIsrRequest`请求; 是不是Broker收到请求之后,就创建了本地Log文件
|
||||
>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
## 附件
|
||||
|
||||
### --config 可生效参数
|
||||
请以`sh bin/kafka-topics.sh --help` 为准
|
||||
```xml
|
||||
configurations:
|
||||
cleanup.policy
|
||||
compression.type
|
||||
delete.retention.ms
|
||||
file.delete.delay.ms
|
||||
flush.messages
|
||||
flush.ms
|
||||
  follower.replication.throttled.replicas
|
||||
index.interval.bytes
|
||||
leader.replication.throttled.replicas
|
||||
max.compaction.lag.ms
|
||||
max.message.bytes
|
||||
message.downconversion.enable
|
||||
message.format.version
|
||||
message.timestamp.difference.max.ms
|
||||
message.timestamp.type
|
||||
min.cleanable.dirty.ratio
|
||||
min.compaction.lag.ms
|
||||
min.insync.replicas
|
||||
preallocate
|
||||
retention.bytes
|
||||
retention.ms
|
||||
segment.bytes
|
||||
segment.index.bytes
|
||||
segment.jitter.ms
|
||||
segment.ms
|
||||
unclean.leader.election.enable
|
||||
```
|
||||
|
||||
|
||||
---
|
||||
<font color=red size=5>Tips:如果关于本篇文章你有疑问,可以在评论区留下,我会在**Q&A**部分进行解答 </font>
|
||||
|
||||
|
||||
|
||||
<font color=red size=2>PS: 文章阅读的源码版本是kafka-2.5 </font>
|
||||
|
||||
420
docs/zh/Kafka分享/Kafka Controller /TopicCommand之删除Topic源码解析.md
Normal file
@@ -0,0 +1,420 @@
|
||||
|
||||
|
||||
## 删除Topic命令
|
||||
>bin/kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic test
|
||||
|
||||
|
||||
|
||||
支持正则表达式匹配Topic来进行删除,只需要将topic 用双引号包裹起来
|
||||
例如: 删除以`create_topic_byhand_zk`为开头的topic;
|
||||
>bin/kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic "create_topic_byhand_zk.*"
|
||||
> `.`表示任意匹配除换行符 \n 之外的任何单字符。要匹配 . ,请使用 \. 。
|
||||
> `*`:匹配前面的子表达式零次或多次。要匹配 * 字符,请使用 \*。
|
||||
> `.*`:匹配任意字符串。
|
||||
|
||||
**删除任意Topic (慎用)**
|
||||
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic ".*?"
|
||||
>
|
||||
更多的用法请[参考正则表达式](https://www.runoob.com/regexp/regexp-syntax.html)
|
||||
|
||||
## 源码解析
|
||||
<font color="red">如果觉得阅读源码解析太枯燥,请直接看 **源码总结及其后面部分**</font>
|
||||
### 1. 客户端发起删除Topic的请求
|
||||
在[【kafka源码】TopicCommand之创建Topic源码解析]() 里面已经分析过了整个请求流程; 所以这里就不再详细的分析请求的过程了,直接看重点;
|
||||

|
||||
**向Controller发起 `deleteTopics`请求**
|
||||
|
||||
### 2. Controller处理deleteTopics的请求
|
||||
`KafkaApis.handle`
|
||||
`AdminManager.deleteTopics`
|
||||
```scala
|
||||
/**
|
||||
* Delete topics and wait until the topics have been completely deleted.
|
||||
* The callback function will be triggered either when timeout, error or the topics are deleted.
|
||||
*/
|
||||
def deleteTopics(timeout: Int,
|
||||
topics: Set[String],
|
||||
responseCallback: Map[String, Errors] => Unit): Unit = {
|
||||
|
||||
// 1. map over topics calling the asynchronous delete
|
||||
val metadata = topics.map { topic =>
|
||||
try {
|
||||
// zk中写入数据 标记要被删除的topic /admin/delete_topics/Topic名称
|
||||
adminZkClient.deleteTopic(topic)
|
||||
DeleteTopicMetadata(topic, Errors.NONE)
|
||||
} catch {
|
||||
case _: TopicAlreadyMarkedForDeletionException =>
|
||||
// swallow the exception, and still track deletion allowing multiple calls to wait for deletion
|
||||
DeleteTopicMetadata(topic, Errors.NONE)
|
||||
case e: Throwable =>
|
||||
error(s"Error processing delete topic request for topic $topic", e)
|
||||
DeleteTopicMetadata(topic, Errors.forException(e))
|
||||
}
|
||||
}
|
||||
|
||||
// 2. 如果客户端传过来的timeout<=0或者 写入zk数据过程异常了 则执行下面的,直接返回异常
|
||||
if (timeout <= 0 || !metadata.exists(_.error == Errors.NONE)) {
|
||||
val results = metadata.map { deleteTopicMetadata =>
|
||||
// ignore topics that already have errors
|
||||
if (deleteTopicMetadata.error == Errors.NONE) {
|
||||
(deleteTopicMetadata.topic, Errors.REQUEST_TIMED_OUT)
|
||||
} else {
|
||||
(deleteTopicMetadata.topic, deleteTopicMetadata.error)
|
||||
}
|
||||
}.toMap
|
||||
responseCallback(results)
|
||||
} else {
|
||||
// 3. else pass the topics and errors to the delayed operation and set the keys
|
||||
val delayedDelete = new DelayedDeleteTopics(timeout, metadata.toSeq, this, responseCallback)
|
||||
val delayedDeleteKeys = topics.map(new TopicKey(_)).toSeq
|
||||
// try to complete the request immediately, otherwise put it into the purgatory
|
||||
topicPurgatory.tryCompleteElseWatch(delayedDelete, delayedDeleteKeys)
|
||||
}
|
||||
}
|
||||
|
||||
```
|
||||
1. 在zk中写入节点`/admin/delete_topics/Topic名称`, 标记要被删除的Topic
|
||||
2. 如果客户端传过来的timeout<=0或者 写入zk数据过程异常了 则直接返回异常
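顺便提一下,被标记删除的Topic可以直接在zk中查看,示意命令如下(zk地址请按实际环境替换):
>bin/zookeeper-shell.sh localhost:2181 ls /admin/delete_topics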
|
||||
|
||||
|
||||
### 3. Controller监听zk变更 执行删除Topic流程
|
||||
`KafkaController.processTopicDeletion`
|
||||
|
||||
```scala
|
||||
private def processTopicDeletion(): Unit = {
|
||||
if (!isActive) return
|
||||
var topicsToBeDeleted = zkClient.getTopicDeletions.toSet
|
||||
val nonExistentTopics = topicsToBeDeleted -- controllerContext.allTopics
|
||||
if (nonExistentTopics.nonEmpty) {
|
||||
warn(s"Ignoring request to delete non-existing topics ${nonExistentTopics.mkString(",")}")
|
||||
zkClient.deleteTopicDeletions(nonExistentTopics.toSeq, controllerContext.epochZkVersion)
|
||||
}
|
||||
topicsToBeDeleted --= nonExistentTopics
|
||||
if (config.deleteTopicEnable) {
|
||||
if (topicsToBeDeleted.nonEmpty) {
|
||||
info(s"Starting topic deletion for topics ${topicsToBeDeleted.mkString(",")}")
|
||||
// 标记暂时不可删除的Topic
|
||||
topicsToBeDeleted.foreach { topic =>
|
||||
val partitionReassignmentInProgress =
|
||||
controllerContext.partitionsBeingReassigned.map(_.topic).contains(topic)
|
||||
if (partitionReassignmentInProgress)
|
||||
topicDeletionManager.markTopicIneligibleForDeletion(Set(topic),
|
||||
reason = "topic reassignment in progress")
|
||||
}
|
||||
// add topic to deletion list
|
||||
topicDeletionManager.enqueueTopicsForDeletion(topicsToBeDeleted)
|
||||
}
|
||||
} else {
|
||||
// If delete topic is disabled remove entries under zookeeper path : /admin/delete_topics
|
||||
info(s"Removing $topicsToBeDeleted since delete topic is disabled")
|
||||
zkClient.deleteTopicDeletions(topicsToBeDeleted.toSeq, controllerContext.epochZkVersion)
|
||||
}
|
||||
}
|
||||
|
||||
```
|
||||
1. 如果`/admin/delete_topics/`下面的节点有不存在的Topic,则清理掉
|
||||
2. 如果配置了`delete.topic.enable=false`不可删除Topic的话,则将`/admin/delete_topics/`下面的节点全部删除,然后流程结束
|
||||
3. `delete.topic.enable=true`; 将主题标记为不符合删除条件,放到`topicsIneligibleForDeletion`中; 不符合删除条件的是:**Topic分区正在进行分区重分配**
|
||||
4. 将Topic添加到删除Topic列表`topicsToBeDeleted`中;
|
||||
5. 然后调用`TopicDeletionManager.resumeDeletions()`方法执行删除操作
|
||||
|
||||
#### 3.1 resumeDeletions 执行删除方法
|
||||
`TopicDeletionManager.resumeDeletions()`
|
||||
|
||||
```scala
|
||||
private def resumeDeletions(): Unit = {
|
||||
val topicsQueuedForDeletion = Set.empty[String] ++ controllerContext.topicsToBeDeleted
|
||||
val topicsEligibleForRetry = mutable.Set.empty[String]
|
||||
val topicsEligibleForDeletion = mutable.Set.empty[String]
|
||||
|
||||
if (topicsQueuedForDeletion.nonEmpty)
|
||||
topicsQueuedForDeletion.foreach { topic =>
|
||||
// if all replicas are marked as deleted successfully, then topic deletion is done
|
||||
//如果所有副本都被标记为删除成功了,然后执行删除Topic成功操作;
|
||||
if (controllerContext.areAllReplicasInState(topic, ReplicaDeletionSuccessful)) {
|
||||
// clear up all state for this topic from controller cache and zookeeper
|
||||
//执行删除Topic成功之后的操作;
|
||||
completeDeleteTopic(topic)
|
||||
info(s"Deletion of topic $topic successfully completed")
|
||||
} else if (!controllerContext.isAnyReplicaInState(topic, ReplicaDeletionStarted)) {
|
||||
// if you come here, then no replica is in TopicDeletionStarted and all replicas are not in
|
||||
// TopicDeletionSuccessful. That means, that either given topic haven't initiated deletion
|
||||
// or there is at least one failed replica (which means topic deletion should be retried).
|
||||
if (controllerContext.isAnyReplicaInState(topic, ReplicaDeletionIneligible)) {
|
||||
topicsEligibleForRetry += topic
|
||||
}
|
||||
}
|
||||
|
||||
// Add topic to the eligible set if it is eligible for deletion.
|
||||
if (isTopicEligibleForDeletion(topic)) {
|
||||
info(s"Deletion of topic $topic (re)started")
|
||||
topicsEligibleForDeletion += topic
|
||||
}
|
||||
}
|
||||
|
||||
// topic deletion retry will be kicked off
|
||||
if (topicsEligibleForRetry.nonEmpty) {
|
||||
retryDeletionForIneligibleReplicas(topicsEligibleForRetry)
|
||||
}
|
||||
|
||||
// topic deletion will be kicked off
|
||||
if (topicsEligibleForDeletion.nonEmpty) {
|
||||
//删除Topic,发送UpdateMetadata请求
|
||||
onTopicDeletion(topicsEligibleForDeletion)
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
1. 重点看看`onTopicDeletion`方法,标记所有待删除分区;向Brokers发送`updateMetadataRequest`请求,告知Brokers这个主题正在被删除,并将Leader设置为`LeaderAndIsrLeaderDuringDelete`;
|
||||
1. 将待删除的Topic的所有分区,执行分区状态机的转换 ;当前状态-->`OfflinePartition`->`NonExistentPartition` ; 这两个状态转换只是在当前Controller内存中更新了一下状态; 关于状态机请看 [【kafka源码】Controller中的状态机TODO....]();
|
||||
2. `client.sendMetadataUpdate(topics.flatMap(controllerContext.partitionsForTopic))` 向待删除Topic分区发送`UpdateMetadata`请求; 这个时候更新了什么数据呢? 
|
||||
看上面图片源码, 发送`UpdateMetadata`请求的时候把分区的Leader= -2; 表示这个分区正在被删除;那么所有正在被删除的分区就被找到了;拿到这些待删除分区之后干嘛呢?
|
||||
1. 更新一下限流相关信息
|
||||
2. 调用`groupCoordinator.handleDeletedPartitions(deletedPartitions)`: 清除给定的`deletedPartitions`的组偏移量以及执行偏移量删除的函数;就是现在该分区不能提供服务啦,不能被消费啦
|
||||
|
||||
详细请看 [Kafka的元数据更新UpdateMetadata]()
|
||||
|
||||
4. 调用`TopicDeletionManager.onPartitionDeletion`接口如下;
|
||||
|
||||
#### 3.2 TopicDeletionManager.onPartitionDeletion
|
||||
1. 将所有Dead replicas 副本直接移动到`ReplicaDeletionIneligible`状态,如果某些副本已死,也将相应的主题标记为不适合删除,因为它无论如何都不会成功完成
|
||||
2. 副本状态转换成`OfflineReplica`; 这个时候会对该Topic的所有副本所在Broker发起[`StopReplicaRequest` ]()请求;(参数`deletePartitions = false`,表示还不执行删除操作); 以便他们停止向`Leader`发送`fetch`请求; 关于状态机请看 [【kafka源码】Controller中的状态机TODO....]();
|
||||
3. 副本状态转换成 `ReplicaDeletionStarted`状态,这个时候会对该Topic的所有副本所在Broker发起[`StopReplicaRequest` ]()请求;(参数`deletePartitions = true`,表示执行删除操作)。这将发送带有 deletePartition=true 的 [`StopReplicaRequest` ]()。并将删除相应分区的所有副本中的所有持久数据
|
||||
|
||||
|
||||
### 4. Brokers 接受StopReplica请求
|
||||
最终调用的是接口
|
||||
`ReplicaManager.stopReplica` ==> `LogManager.asyncDelete`
|
||||
|
||||
>将给定主题分区“logdir”的目录重命名为“logdir.uuid.delete”,并将其添加到删除队列中
|
||||
>例如 :
|
||||
>
|
||||
|
||||
```scala
|
||||
def asyncDelete(topicPartition: TopicPartition, isFuture: Boolean = false): Log = {
|
||||
val removedLog: Log = logCreationOrDeletionLock synchronized {
|
||||
//将待删除的partition在 Logs中删除掉
|
||||
if (isFuture)
|
||||
futureLogs.remove(topicPartition)
|
||||
else
|
||||
currentLogs.remove(topicPartition)
|
||||
}
|
||||
if (removedLog != null) {
|
||||
//我们需要等到要删除的日志上没有更多的清理任务,然后才能真正删除它。
|
||||
if (cleaner != null && !isFuture) {
|
||||
cleaner.abortCleaning(topicPartition)
|
||||
cleaner.updateCheckpoints(removedLog.dir.getParentFile)
|
||||
}
|
||||
//重命名topic副本文件夹 命名规则 topic-uuid-delete
|
||||
removedLog.renameDir(Log.logDeleteDirName(topicPartition))
|
||||
checkpointRecoveryOffsetsAndCleanSnapshot(removedLog.dir.getParentFile, ArrayBuffer.empty)
|
||||
checkpointLogStartOffsetsInDir(removedLog.dir.getParentFile)
|
||||
//将Log添加到待删除Log队列中,等待删除
|
||||
addLogToBeDeleted(removedLog)
|
||||
|
||||
} else if (offlineLogDirs.nonEmpty) {
|
||||
throw new KafkaStorageException(s"Failed to delete log for ${if (isFuture) "future" else ""} $topicPartition because it may be in one of the offline directories ${offlineLogDirs.mkString(",")}")
|
||||
}
|
||||
removedLog
|
||||
}
|
||||
```
|
||||
#### 4.1 日志清理定时线程
|
||||
>上面我们知道最终是将待删除的Log添加到了`logsToBeDeleted`这个队列中; 这个队列就是待删除Log队列,有一个线程 `kafka-delete-logs`专门来处理的;我们来看看这个线程怎么工作的
|
||||
|
||||
`LogManager.startup` 启动的时候 ,启动了一个定时线程
|
||||
```scala
|
||||
scheduler.schedule("kafka-delete-logs", // will be rescheduled after each delete logs with a dynamic period
|
||||
deleteLogs _,
|
||||
delay = InitialTaskDelayMs,
|
||||
unit = TimeUnit.MILLISECONDS)
|
||||
```
|
||||
|
||||
**删除日志的线程**
|
||||
```scala
|
||||
/**
|
||||
* Delete logs marked for deletion. Delete all logs for which `currentDefaultConfig.fileDeleteDelayMs`
|
||||
* has elapsed after the delete was scheduled. Logs for which this interval has not yet elapsed will be
|
||||
* considered for deletion in the next iteration of `deleteLogs`. The next iteration will be executed
|
||||
* after the remaining time for the first log that is not deleted. If there are no more `logsToBeDeleted`,
|
||||
* `deleteLogs` will be executed after `currentDefaultConfig.fileDeleteDelayMs`.
|
||||
* 删除标记为删除的日志文件;
|
||||
* file.delete.delay.ms 文件延迟删除时间 默认60000毫秒
|
||||
*
|
||||
*/
|
||||
private def deleteLogs(): Unit = {
|
||||
var nextDelayMs = 0L
|
||||
try {
|
||||
def nextDeleteDelayMs: Long = {
|
||||
if (!logsToBeDeleted.isEmpty) {
|
||||
val (_, scheduleTimeMs) = logsToBeDeleted.peek()
|
||||
scheduleTimeMs + currentDefaultConfig.fileDeleteDelayMs - time.milliseconds()
|
||||
} else
|
||||
currentDefaultConfig.fileDeleteDelayMs
|
||||
}
|
||||
|
||||
while ({nextDelayMs = nextDeleteDelayMs; nextDelayMs <= 0}) {
|
||||
val (removedLog, _) = logsToBeDeleted.take()
|
||||
if (removedLog != null) {
|
||||
try {
|
||||
//立即彻底删除此日志目录和文件系统中的所有内容
|
||||
removedLog.delete()
|
||||
info(s"Deleted log for partition ${removedLog.topicPartition} in ${removedLog.dir.getAbsolutePath}.")
|
||||
} catch {
|
||||
case e: KafkaStorageException =>
|
||||
error(s"Exception while deleting $removedLog in dir ${removedLog.dir.getParent}.", e)
|
||||
}
|
||||
}
|
||||
}
|
||||
} catch {
|
||||
case e: Throwable =>
|
||||
error(s"Exception in kafka-delete-logs thread.", e)
|
||||
} finally {
|
||||
try {
|
||||
scheduler.schedule("kafka-delete-logs",
|
||||
deleteLogs _,
|
||||
delay = nextDelayMs,
|
||||
unit = TimeUnit.MILLISECONDS)
|
||||
} catch {
|
||||
case e: Throwable =>
|
||||
if (scheduler.isStarted) {
|
||||
// No errors should occur unless scheduler has been shutdown
|
||||
error(s"Failed to schedule next delete in kafka-delete-logs thread", e)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
`file.delete.delay.ms` 决定延迟多久之后真正删除
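如果想调整某个Topic的延迟删除时间,可以参考下面这条命令(仅作示意,zk地址、Topic名称和数值请按实际环境替换):
>bin/kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name test --alter --add-config file.delete.delay.ms=30000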
|
||||
|
||||
|
||||
### 5.StopReplica 请求成功 执行回调接口
|
||||
> Topic删除完成, 清理相关信息
|
||||
触发这个接口的地方是: 每个Broker执行删除`StopReplica`成功之后,都会执行一个回调函数;`TopicDeletionStopReplicaResponseReceived` ; 当然调用方是Controller,回调到的也就是Controller;
|
||||
|
||||
传入回调函数的地方
|
||||

|
||||
|
||||
|
||||
|
||||
执行回调函数 `KafkaController.processTopicDeletionStopReplicaResponseReceived`
|
||||
|
||||
1. 如果回调有异常,删除失败则将副本状态转换成==》`ReplicaDeletionIneligible`,并且重新执行`resumeDeletions`方法;
|
||||
2. 如果回调正常,则变更状态 `ReplicaDeletionStarted`==》`ReplicaDeletionSuccessful`;并且重新执行`resumeDeletions`方法;
|
||||
3. `resumeDeletions`方法会判断所有副本是否均被删除,如果全部删除了就会执行下面的`completeDeleteTopic`代码;否则会继续删除未被成功删除的副本
|
||||
```scala
|
||||
private def completeDeleteTopic(topic: String): Unit = {
|
||||
// deregister partition change listener on the deleted topic. This is to prevent the partition change listener
|
||||
// firing before the new topic listener when a deleted topic gets auto created
|
||||
client.mutePartitionModifications(topic)
|
||||
val replicasForDeletedTopic = controllerContext.replicasInState(topic, ReplicaDeletionSuccessful)
|
||||
// controller will remove this replica from the state machine as well as its partition assignment cache
|
||||
replicaStateMachine.handleStateChanges(replicasForDeletedTopic.toSeq, NonExistentReplica)
|
||||
controllerContext.topicsToBeDeleted -= topic
|
||||
controllerContext.topicsWithDeletionStarted -= topic
|
||||
client.deleteTopic(topic, controllerContext.epochZkVersion)
|
||||
controllerContext.removeTopic(topic)
|
||||
}
|
||||
```
|
||||
|
||||
1. 清理内存中相关信息
|
||||
2. 取消注册被删除Topic的相关节点监听器;节点是`/brokers/topics/Topic名称`
|
||||
3. 删除zk中的数据包括;`/brokers/topics/Topic名称`、`/config/topics/Topic名称` 、`/admin/delete_topics/Topic名称`
|
||||
|
||||
|
||||
|
||||
|
||||
### 6. Controller启动时候 尝试继续处理待删除的Topic
|
||||
我们之前分析Controller上线的时候有看到
|
||||
`KafkaController.onControllerFailover`
|
||||
以下省略部分代码
|
||||
```scala
|
||||
private def onControllerFailover(): Unit = {
|
||||
// 获取哪些Topic需要被删除,哪些暂时还不能删除
|
||||
val (topicsToBeDeleted, topicsIneligibleForDeletion) = fetchTopicDeletionsInProgress()
|
||||
|
||||
info("Initializing topic deletion manager")
|
||||
//Topic删除管理器初始化
|
||||
topicDeletionManager.init(topicsToBeDeleted, topicsIneligibleForDeletion)
|
||||
|
||||
//Topic删除管理器 尝试开始删除Topic
|
||||
topicDeletionManager.tryTopicDeletion()
|
||||
|
||||
```
|
||||
#### 6.1 获取需要被删除的Topic和暂时不能删除的Topic
|
||||
` fetchTopicDeletionsInProgress`
|
||||
1. `topicsToBeDeleted`所有需要被删除的Topic从zk中`/admin/delete_topics` 获取
|
||||
2. `topicsIneligibleForDeletion`有一部分Topic还暂时不能被删除:
|
||||
①. Topic任意分区正在进行副本重分配
|
||||
②. Topic任意分区副本存在不在线的情况(只要Topic有一个副本所在的Broker异常就不能删除)
|
||||
3. 将得到的数据保存在`controllerContext`内存中
|
||||
|
||||
|
||||
#### 6.2 topicDeletionManager.init初始化删除管理器
|
||||
1. 如果服务器配置`delete.topic.enable=false`不允许删除topic的话,则删除`/admin/delete_topics` 中的节点; 这个节点下面的数据是标记topic需要被删除的意思;
|
||||
|
||||
#### 6.3 topicDeletionManager.tryTopicDeletion尝试恢复删除
|
||||
这里又回到了上面分析过的`resumeDeletions`啦;恢复删除操作
|
||||
```scala
|
||||
def tryTopicDeletion(): Unit = {
|
||||
if (isDeleteTopicEnabled) {
|
||||
resumeDeletions()
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
## 源码总结
|
||||
整个Topic删除, 请看下图
|
||||

|
||||
|
||||
|
||||
几个注意点:
|
||||
1. Controller 也是Broker
|
||||
2. Controller发起删除请求的时候,只是跟相关联的Broker发起删除请求;
|
||||
3. Broker不在线或者删除失败,Controller会持续进行删除操作; 或者Broker上线之后继续进行删除操作
|
||||
|
||||
|
||||
## Q&A
|
||||
<font color="red">列举在此主题下比较常见的问题; 如果读者有其他问题可以在评论区评论, 博主会不定期更新</font>
|
||||
|
||||
|
||||
|
||||
### 什么时候在/admin/delete_topics写入节点的
|
||||
>客户端发起删除操作deleteTopics的时候,Controller响应deleteTopics请求, 这个时候Controller就将待删除Topic写入了zk的`/admin/delete_topics/Topic名称`节点中了;
|
||||
### 什么时候真正执行删除Topic磁盘日志
|
||||
>Controller监听到zk节点`/admin/delete_topics`变更之后,向待删除Topic副本所在的Broker发送删除Topic(StopReplica)的请求; Broker收到请求之后将待删除副本的Log目录重命名为带`-delete`后缀; 然后会有专门的日志清理线程来进行真正的删除操作; 延迟多久删除是靠`file.delete.delay.ms`来决定的;默认是60000毫秒 = 一分钟
|
||||
|
||||
### 为什么正在重新分配的Topic不能被删除
|
||||
> 正在重新分配的Topic,你都不知道它具体会落在哪个地方,所以肯定也就不知道啥时候删除啊;
|
||||
> 等分配完毕之后,就会继续删除流程
|
||||
|
||||
|
||||
### 如果在`/admin/delete_topics/`中手动写入一个节点会不会正常删除
|
||||
> 如果写入的节点,并不是一个真实存在的Topic;则将会直接被删除
|
||||
> 当然要注意如果配置了`delete.topic.enable=false`不可删除Topic的话,则将`/admin/delete_topics/`下面的节点全部删除,然后流程结束
|
||||
> 如果写入的节点是一个真实存在的Topic; 则将会执行删除Topic的流程; 本质上跟用Kafka客户端执行删除Topic操作没有什么不同
|
||||
|
||||
|
||||
|
||||
### 如果直接删除ZK上的`/brokers/topics/{topicName}`节点会怎样
|
||||
>TODO...
|
||||
|
||||
### Controller通知Brokers 执行StopReplica是通知所有的Broker还是只通知跟被删除Topic有关联的Broker?
|
||||
> **只是通知跟被删除Topic有关联的Broker;**
|
||||
> 请看下图源码,可以看到所有需要被`StopReplica`的副本都是被过滤了一遍,获取它们所在的BrokerId; 最后调用的时候也是`sendRequest(brokerId, stopReplicaRequest)` ;根据获取到的BrokerId发起的请求
|
||||
> 
|
||||
|
||||
### 删除过程有Broker不在线 或者执行失败怎么办
|
||||
>Controller会继续删除操作;或者等Broker上线之后继续删除操作; 总之一定会保证所有的分区副本都被删除(被标记为`-delete`)之后,才会把zk上的数据清理掉;
|
||||
|
||||
### ReplicaStateMachine 副本状态机
|
||||
> 请看 [【kafka源码】Controller中的状态机TODO]()
|
||||
|
||||
### 在重新分配的过程中,如果执行删除操作会怎么样
|
||||
> 删除操作会等待,等待重新分配完成之后,继续进行删除操作
|
||||
> 
|
||||
|
||||
|
||||
Finally: 本文阅读源码为 `Kafka-2.5`
|
||||
149
docs/zh/Kafka分享/Kafka Controller /分区和副本的分配规则.md
Normal file
@@ -0,0 +1,149 @@
|
||||
|
||||
我们有分析过[TopicCommand之创建Topic源码解析]();
|
||||
因为篇幅太长所以 关于分区分配的问题单独开一篇文章写;
|
||||
|
||||
|
||||
## 源码分析
|
||||
**创建Topic的源码入口 `AdminManager.createTopics()`**
|
||||
|
||||
以下只列出了分区分配相关代码其他省略
|
||||
```scala
|
||||
|
||||
def createTopics(timeout: Int,
|
||||
validateOnly: Boolean,
|
||||
toCreate: Map[String, CreatableTopic],
|
||||
includeConfigsAndMetatadata: Map[String, CreatableTopicResult],
|
||||
responseCallback: Map[String, ApiError] => Unit): Unit = {
|
||||
|
||||
// 1. map over topics creating assignment and calling zookeeper
|
||||
val brokers = metadataCache.getAliveBrokers.map { b => kafka.admin.BrokerMetadata(b.id, b.rack) }
|
||||
|
||||
val metadata = toCreate.values.map(topic =>
|
||||
try {
|
||||
val assignments = if (topic.assignments().isEmpty) {
|
||||
AdminUtils.assignReplicasToBrokers(
|
||||
brokers, resolvedNumPartitions, resolvedReplicationFactor)
|
||||
} else {
|
||||
val assignments = new mutable.HashMap[Int, Seq[Int]]
|
||||
// Note: we don't check that replicaAssignment contains unknown brokers - unlike in add-partitions case,
|
||||
// this follows the existing logic in TopicCommand
|
||||
topic.assignments.asScala.foreach {
|
||||
case assignment => assignments(assignment.partitionIndex()) =
|
||||
assignment.brokerIds().asScala.map(a => a: Int)
|
||||
}
|
||||
assignments
|
||||
}
|
||||
trace(s"Assignments for topic $topic are $assignments ")
|
||||
|
||||
}
|
||||
|
||||
```
|
||||
1. 以上有两种方式:一种是我们没有指定分区分配的情况,也就是没有使用参数`--replica-assignment`;另一种是自己指定了分区分配
|
||||
|
||||
### 1. 自己指定了分区分配规则
|
||||
从源码中得知, 会把我们指定的规则直接进行包装,**注意它并没有去检查你指定的Broker是否存在;**
|
||||
|
||||
### 2. 自动分配 AdminUtils.assignReplicasToBrokers
|
||||

|
||||
1. 参数检查: 分区数>0; 副本数>0; 副本数<=Broker数 (如果自己未定义会直接使用Broker中的配置)
|
||||
2. 根据是否有 机架信息来进行不同方式的分配;
|
||||
3. 要么整个集群都有机架信息,要么整个集群都没有机架信息; 否则抛出异常
|
||||
|
||||
|
||||
#### 无机架方式分配
|
||||
`AdminUtils.assignReplicasToBrokersRackUnaware`
|
||||
```scala
|
||||
/**
|
||||
* 副本分配时,有三个原则:
|
||||
* 1. 将副本平均分布在所有的 Broker 上;
|
||||
* 2. partition 的多个副本应该分配在不同的 Broker 上;
|
||||
* 3. 如果所有的 Broker 有机架信息的话, partition 的副本应该分配到不同的机架上。
|
||||
*
|
||||
* 为实现上面的目标,在没有机架感知的情况下,应该按照下面两个原则分配 replica:
|
||||
* 1. 从 broker.list 随机选择一个 Broker,使用 round-robin 算法分配每个 partition 的第一个副本;
|
||||
* 2. 对于这个 partition 的其他副本,逐渐增加 Broker.id 来选择 replica 的分配。
|
||||
*/
|
||||
|
||||
private def assignReplicasToBrokersRackUnaware(nPartitions: Int,
|
||||
replicationFactor: Int,
|
||||
brokerList: Seq[Int],
|
||||
fixedStartIndex: Int,
|
||||
startPartitionId: Int): Map[Int, Seq[Int]] = {
|
||||
val ret = mutable.Map[Int, Seq[Int]]()
|
||||
// 这里是上一层传递过了的所有 存活的Broker列表的ID
|
||||
val brokerArray = brokerList.toArray
|
||||
//默认随机选一个index开始
|
||||
val startIndex = if (fixedStartIndex >= 0) fixedStartIndex else rand.nextInt(brokerArray.length)
|
||||
//默认从0这个分区号开始
|
||||
var currentPartitionId = math.max(0, startPartitionId)
|
||||
var nextReplicaShift = if (fixedStartIndex >= 0) fixedStartIndex else rand.nextInt(brokerArray.length)
|
||||
for (_ <- 0 until nPartitions) {
|
||||
if (currentPartitionId > 0 && (currentPartitionId % brokerArray.length == 0))
|
||||
nextReplicaShift += 1
|
||||
val firstReplicaIndex = (currentPartitionId + startIndex) % brokerArray.length
|
||||
val replicaBuffer = mutable.ArrayBuffer(brokerArray(firstReplicaIndex))
|
||||
for (j <- 0 until replicationFactor - 1)
|
||||
replicaBuffer += brokerArray(replicaIndex(firstReplicaIndex, nextReplicaShift, j, brokerArray.length))
|
||||
ret.put(currentPartitionId, replicaBuffer)
|
||||
currentPartitionId += 1
|
||||
}
|
||||
ret
|
||||
}
|
||||
```
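为了更直观地理解上面的轮询分配过程,下面给一个最小化的演示代码(纯属示意:假设5台Broker、10个分区、3副本,并把startIndex和nextReplicaShift固定为0,真实代码里这两个值是随机的):

```scala
object AssignDemo extends App {
  val nBrokers = 5          // 假设的Broker数量,BrokerId为0~4
  val nPartitions = 10      // 假设的分区数
  val replicationFactor = 3 // 假设的副本数

  // 与源码中replicaIndex等价的计算:在第一个副本之后错开放置其余副本
  def replicaIndex(first: Int, shift: Int, i: Int, n: Int): Int =
    (first + 1 + (shift + i) % (n - 1)) % n

  var nextReplicaShift = 0
  for (p <- 0 until nPartitions) {
    if (p > 0 && p % nBrokers == 0) nextReplicaShift += 1
    val first = p % nBrokers // startIndex固定为0时,第一个副本就是轮询选出来的Broker
    val replicas = first +: (0 until replicationFactor - 1).map(j => replicaIndex(first, nextReplicaShift, j, nBrokers))
    println(s"partition-$p -> ${replicas.mkString(",")}")
  }
}
```

可以看到每个分区的第一个副本按round-robin落在不同的Broker上,其余副本依次错开,从而尽量做到均衡。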
|
||||
|
||||
|
||||
#### 有机架方式分配
|
||||
|
||||
```scala
|
||||
private def assignReplicasToBrokersRackAware(nPartitions: Int,
|
||||
replicationFactor: Int,
|
||||
brokerMetadatas: Seq[BrokerMetadata],
|
||||
fixedStartIndex: Int,
|
||||
startPartitionId: Int): Map[Int, Seq[Int]] = {
|
||||
val brokerRackMap = brokerMetadatas.collect { case BrokerMetadata(id, Some(rack)) =>
|
||||
id -> rack
|
||||
}.toMap
|
||||
val numRacks = brokerRackMap.values.toSet.size
|
||||
val arrangedBrokerList = getRackAlternatedBrokerList(brokerRackMap)
|
||||
val numBrokers = arrangedBrokerList.size
|
||||
val ret = mutable.Map[Int, Seq[Int]]()
|
||||
val startIndex = if (fixedStartIndex >= 0) fixedStartIndex else rand.nextInt(arrangedBrokerList.size)
|
||||
var currentPartitionId = math.max(0, startPartitionId)
|
||||
var nextReplicaShift = if (fixedStartIndex >= 0) fixedStartIndex else rand.nextInt(arrangedBrokerList.size)
|
||||
for (_ <- 0 until nPartitions) {
|
||||
if (currentPartitionId > 0 && (currentPartitionId % arrangedBrokerList.size == 0))
|
||||
nextReplicaShift += 1
|
||||
val firstReplicaIndex = (currentPartitionId + startIndex) % arrangedBrokerList.size
|
||||
val leader = arrangedBrokerList(firstReplicaIndex)
|
||||
val replicaBuffer = mutable.ArrayBuffer(leader)
|
||||
val racksWithReplicas = mutable.Set(brokerRackMap(leader))
|
||||
val brokersWithReplicas = mutable.Set(leader)
|
||||
var k = 0
|
||||
for (_ <- 0 until replicationFactor - 1) {
|
||||
var done = false
|
||||
while (!done) {
|
||||
val broker = arrangedBrokerList(replicaIndex(firstReplicaIndex, nextReplicaShift * numRacks, k, arrangedBrokerList.size))
|
||||
val rack = brokerRackMap(broker)
|
||||
// Skip this broker if
|
||||
// 1. there is already a broker in the same rack that has assigned a replica AND there is one or more racks
|
||||
// that do not have any replica, or
|
||||
// 2. the broker has already assigned a replica AND there is one or more brokers that do not have replica assigned
|
||||
if ((!racksWithReplicas.contains(rack) || racksWithReplicas.size == numRacks)
|
||||
&& (!brokersWithReplicas.contains(broker) || brokersWithReplicas.size == numBrokers)) {
|
||||
replicaBuffer += broker
|
||||
racksWithReplicas += rack
|
||||
brokersWithReplicas += broker
|
||||
done = true
|
||||
}
|
||||
k += 1
|
||||
}
|
||||
}
|
||||
ret.put(currentPartitionId, replicaBuffer)
|
||||
currentPartitionId += 1
|
||||
}
|
||||
ret
|
||||
}
|
||||
```
|
||||
|
||||
## 源码总结
|
||||
|
||||
51
docs/zh/Kafka分享/Kafka云平台简介/Kafka云平台简介.md
Normal file
@@ -0,0 +1,51 @@
|
||||
# `Logi-Kafka` 云平台-简介
|
||||
|
||||
[TOC]
|
||||
|
||||
## 1、产品架构
|
||||
|
||||

|
||||
|
||||
- 资源层:`Logi-Kafka` 云平台最底层是资源层,是MySQL、Zookeeper及一些容器和物理机;
|
||||
- 引擎层:资源层之上是引擎层,这块主要是在社区`Kafka`基础上,增加了磁盘过载保护、指标埋点等40+优化改进后的`Kafka`消息队列服务;
|
||||
- 网关层:引擎层再之上是网关层,网关层主要是对Kafka-Topic的消费与发送进行权限管控、流量管控。以及还有,客户端接入时的服务发现和降级等功能;
|
||||
- 服务层:网关层再往上是服务层,基于滴滴内部Kafka平台的服务的经验沉淀,服务层具备Topic管理、集群管理等一套比较完善的监控及管控服务能力;
|
||||
- 平台层:最顶层是平台层,基于服务层的能力及用户角色权限的管控,平台层面向不同用户分别展示了用户控制台、运维控制台及一些开放的接口。
|
||||
|
||||
|
||||
## 2、模块功能
|
||||
|
||||

|
||||
|
||||
- Kafka集群(`Kafka-Brokers`):在`Apache-Kafka`的基础上,增加磁盘过载保护、指标体系细化及性能优化等特性后的Kafka。
|
||||
|
||||
- Kafka网关(`Kafka-Gateway`):滴滴自研的具备服务发现、流量控制、服务降级及安全管控等能力的Kafka集群网关。备注:部分网关的能力被嵌入于`Kafka-Broker`中。
|
||||
|
||||
- Kafka管控平台(`Kafka-Manager`):滴滴自研的面向`Kafka`的普通用户、研发人员及运维人员的一站式`Kafka`集群监控 & 运维管控平台。
|
||||
|
||||
|
||||
|
||||
|
||||
介绍完云平台整体架构之后,我们再来大致介绍一下各模块的功能及他们之间的交互。
|
||||
|
||||
- Kafka集群(`Kafka-Brokers`)
|
||||
1、承接`Kafka`客户端的发送及消费请求并进行处理。
|
||||
**2、`Kafka`网关中的流控和安全的能力嵌入于其中。**
|
||||
3、从`Kafka-Manager`定时同步权限和用户信息。
|
||||
4、将`Topic`的连接信息`POST`到`Kafka-Manager`。
|
||||
5、Kafka网关-服务发现模块会到`Kafka`集群同步元信息。
|
||||
6、强依赖于`Zookeeper`。
|
||||
|
||||
- Kafka网关(`Kafka-Gateway`) 之 服务发现
|
||||
1、`Kafka`统一对外的服务地址。
|
||||
2、`Kafka`客户端启动时,会首先请求服务发现,从而获取`Topic`的元信息。
|
||||
3、从`Kafka-Manager`定时同步各个集群的实际服务地址和流控降级信息。
|
||||
|
||||
- Kafka管控平台(`Kafka-Manager`)
|
||||
1、用户管控平台。
|
||||
2、服务发现和`Kafka`集群会定期进行集群实际服务地址、用户信息及权限信息进行同步。
|
||||
3、从`Kafka`集群中获取集群元信息及指标信息。
|
||||
|
||||
## 3、总结
|
||||
|
||||
本节概要介绍了一下滴滴`Logi-Kafka`云平台产品的整体架构 以及 相关模块的大体功能及之间的相互交互关系。
|
||||
BIN
docs/zh/Kafka分享/Kafka云平台简介/assets/kafka_cloud_arch.jpg
Normal file
|
After Width: | Height: | Size: 213 KiB |
BIN
docs/zh/Kafka分享/Kafka云平台简介/assets/kafka_server_arch.png
Normal file
|
After Width: | Height: | Size: 878 KiB |
112
docs/zh/Kafka分享/Kafka常见问题解答/Kafka服务端_日志清理策略/Kafka服务端-日志清理策略.md
Normal file
@@ -0,0 +1,112 @@
|
||||
结论:按照时间的过期策略,时间是按照时间索引文件的最后一条数据里面记录的时间作为是否过期的判断标准。那么另外一个问题,这个时间是怎么写入的,继续看下面
|
||||
|
||||
```scala
|
||||
/**
|
||||
* If topic deletion is enabled, delete any log segments that have either expired due to time based retention
|
||||
* or because the log size is > retentionSize.
|
||||
* Whether or not deletion is enabled, delete any log segments that are before the log start offset
|
||||
*/
|
||||
// 清理策略
|
||||
def deleteOldSegments(): Int = {
|
||||
if (config.delete) {
|
||||
// 清理策略:保存时间、保存大小、开始offset
|
||||
deleteRetentionMsBreachedSegments() + deleteRetentionSizeBreachedSegments() + deleteLogStartOffsetBreachedSegments()
|
||||
} else {
|
||||
deleteLogStartOffsetBreachedSegments()
|
||||
}
|
||||
}
|
||||
|
||||
// 调用按照时间的清理策略
|
||||
private def deleteRetentionMsBreachedSegments(): Int = {
|
||||
if (config.retentionMs < 0) return 0
|
||||
val startMs = time.milliseconds
|
||||
// segment按照 startMs - segment.largestTimestamp > config.retentionMs 的策略进行清理.
|
||||
// 那么我们再看一下 segment.largestTimestamp 的时间是怎么获取的
|
||||
deleteOldSegments((segment, _) => startMs - segment.largestTimestamp > config.retentionMs,
|
||||
reason = s"retention time ${config.retentionMs}ms breach")
|
||||
}
|
||||
|
||||
/**
|
||||
* The largest timestamp this segment contains.
|
||||
*/
|
||||
// LogSegment的代码,可以发现如果maxTimestampSoFar>=0时,就是maxTimestampSoFar,否则是最近一次修改时间
|
||||
// 那么maxTimestampSoFar是怎么获取的呢
|
||||
def largestTimestamp = if (maxTimestampSoFar >= 0) maxTimestampSoFar else lastModified
|
||||
|
||||
|
||||
// maxTimestampSoFar相当于是时间索引的最后一个entry的时间,那么我们继续看一下timeIndex.lastEntry是什么时间
|
||||
def maxTimestampSoFar: Long = {
|
||||
if (_maxTimestampSoFar.isEmpty)
|
||||
_maxTimestampSoFar = Some(timeIndex.lastEntry.timestamp)
|
||||
_maxTimestampSoFar.get
|
||||
}
|
||||
|
||||
// 获取时间索引的最后一个entry里面的时间&offset
|
||||
// 在同一个时间索引文件里面,时间字段是单调递增的,因此这里获取到的是时间索引里面最大的那个时间。
|
||||
// 那么这个时间是怎么写入的呢?我们继续往下看,看时间索引的写入这块
|
||||
private def lastEntryFromIndexFile: TimestampOffset = {
|
||||
inLock(lock) {
|
||||
_entries match {
|
||||
case 0 => TimestampOffset(RecordBatch.NO_TIMESTAMP, baseOffset)
|
||||
case s => parseEntry(mmap, s - 1)
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
时间索引文件这个时间是如何写入的?
|
||||
结论:如果配置了LOG_APPEND_TIME,那么就是写入服务器的时间。如果是配置CREATE_TIME,那么就是record时间里面的最大的那一个。
|
||||
|
||||
```java
|
||||
// 从TimeIndex类的maybeAppend方法一步一步的向上查找,查看里面的时间数据的写入,我们可以发现
|
||||
// 这个时间是在 LogValidator.validateMessagesAndAssignOffsets 这个方法里面生成的
|
||||
// 遍历records
|
||||
for (batch <- records.batches.asScala) {
|
||||
validateBatch(topicPartition, firstBatch, batch, origin, toMagicValue, brokerTopicStats)
|
||||
val recordErrors = new ArrayBuffer[ApiRecordError](0)
|
||||
for ((record, batchIndex) <- batch.asScala.view.zipWithIndex) {
|
||||
validateRecord(batch, topicPartition, record, batchIndex, now, timestampType,
|
||||
timestampDiffMaxMs, compactedTopic, brokerTopicStats).foreach(recordError => recordErrors += recordError)
|
||||
// we fail the batch if any record fails, so we stop appending if any record fails
|
||||
if (recordErrors.isEmpty)
|
||||
// 拼接offset,这里还会计算那个时间戳
|
||||
builder.appendWithOffset(offsetCounter.getAndIncrement(), record)
|
||||
}
|
||||
processRecordErrors(recordErrors)
|
||||
}
|
||||
//
|
||||
private long appendLegacyRecord(long offset, long timestamp, ByteBuffer key, ByteBuffer value, byte magic) throws IOException {
|
||||
ensureOpenForRecordAppend();
|
||||
if (compressionType == CompressionType.NONE && timestampType == TimestampType.LOG_APPEND_TIME)
|
||||
// 定义了LOG_APPEND_TIME,则使用logAppendTime,
|
||||
timestamp = logAppendTime;
|
||||
|
||||
int size = LegacyRecord.recordSize(magic, key, value);
|
||||
AbstractLegacyRecordBatch.writeHeader(appendStream, toInnerOffset(offset), size);
|
||||
|
||||
if (timestampType == TimestampType.LOG_APPEND_TIME)
|
||||
timestamp = logAppendTime;
|
||||
long crc = LegacyRecord.write(appendStream, magic, timestamp, key, value, CompressionType.NONE, timestampType);
|
||||
// 时间计算
|
||||
recordWritten(offset, timestamp, size + Records.LOG_OVERHEAD);
|
||||
return crc;
|
||||
}
|
||||
|
||||
// 最值的计算
|
||||
private void recordWritten(long offset, long timestamp, int size) {
|
||||
if (numRecords == Integer.MAX_VALUE)
|
||||
throw new IllegalArgumentException("Maximum number of records per batch exceeded, max records: " + Integer.MAX_VALUE);
|
||||
if (offset - baseOffset > Integer.MAX_VALUE)
|
||||
throw new IllegalArgumentException("Maximum offset delta exceeded, base offset: " + baseOffset +
|
||||
", last offset: " + offset);
|
||||
numRecords += 1;
|
||||
uncompressedRecordsSizeInBytes += size;
|
||||
lastOffset = offset;
|
||||
if (magic > RecordBatch.MAGIC_VALUE_V0 && timestamp > maxTimestamp) {
|
||||
// 时间更新,最后时间索引记录的是maxTimestamp这个字段
|
||||
maxTimestamp = timestamp;
|
||||
offsetOfMaxTimestamp = offset;
|
||||
}
|
||||
}
|
||||
```
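基于上面的结论可以得到一个实践上的提示(这里只是个人建议):如果生产端的CREATE_TIME不可控(比如客户端时钟不准、或者发送端自己指定了时间戳),按时间的过期清理就可能提前或者推迟;这种情况下可以考虑把Topic的时间戳类型改成LogAppendTime,让过期判断以写入服务端的时间为准,示意命令如下(zk地址与Topic名称按实际环境替换):

```bash
bin/kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name test \
  --alter --add-config message.timestamp.type=LogAppendTime
```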
|
||||
42
docs/zh/Kafka分享/Kafka常见问题解答/Kafka服务端_权限控制/Kafka服务端_权限控制.md
Normal file
@@ -0,0 +1,42 @@
|
||||
# Kafka服务端—权限控制
|
||||
|
||||
[TOC]
|
||||
|
||||
|
||||
资源类型:
|
||||
- UNKNOWN: 未知
|
||||
- ANY:任意的资源
|
||||
- TOPIC:Topic
|
||||
- GROUP:消费组
|
||||
- CLUSTER:整个集群
|
||||
- TRANSACTIONAL_ID:事务ID
|
||||
- DELEGATION_TOKEN:Token
|
||||
|
||||
|
||||
资源操作:
|
||||
- UNKNOWN:未知
|
||||
- ANY:任意的操作
|
||||
- ALL:所有的操作
|
||||
- READ:读
|
||||
- WRITE:写
|
||||
- CREATE:创建
|
||||
- DELETE:删除
|
||||
- ALTER:修改
|
||||
- DESCRIBE:描述,查看
|
||||
- CLUSTER_ACTION:集群动作
|
||||
- DESCRIBE_CONFIGS:查看配置
|
||||
- ALTER_CONFIGS:修改配置
|
||||
- IDEMPOTENT_WRITE:幂等写
|
||||
|
||||
|
||||
资源书写类型:
|
||||
- UNKNOWN:未知
|
||||
- ANY:任意
|
||||
- MATCH:满足LITERAL、PREFIXED或者*的任意中的一个即可
|
||||
- LITERAL:全匹配,完全按照原文匹配
|
||||
- PREFIXED:前缀匹配
|
||||
|
||||
|
||||
认证结果:
|
||||
- ALLOWED:允许
|
||||
- DENIED:拒绝
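结合上面的资源类型与操作,下面给一个添加ACL的示意命令(仅作示意,zk地址、用户与Topic名称请按实际环境替换):

```bash
# 允许用户 alice 以 LITERAL(全匹配)的方式读取 test 这个Topic
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 \
  --add --allow-principal User:alice --operation Read --topic test
```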
|
||||
180
docs/zh/Kafka分享/Kafka常见问题解答/Kafka集群滚动重启实践/Kafka集群滚动重启实践.md
Normal file
@@ -0,0 +1,180 @@
|
||||
# Kafka集群平稳滚动重启实践
|
||||
|
||||
[TOC]
|
||||
|
||||
## 0、前言
|
||||
|
||||
Kafka集群的滚动重启,是一件非常危险的事情,操作不当的情况下可能会导致Kafka集群不可服务。即便操作上准确无误,也可能因为业务方服务非常敏感、服务健壮性不足、使用的客户端存在BUG等原因,导致业务方业务受损。
|
||||
|
||||
基于以上的原因以及我们以往的经验,我们梳理了一下在对Kafka集群滚动重启中,需要做的事情以及注意的点。
|
||||
|
||||
|
||||
## 1、用户告知
|
||||
|
||||
### 1.1、告知内容
|
||||
|
||||
提前告知用户:
|
||||
- 我们要做什么;
|
||||
- 为什么要做;
|
||||
- 可能的影响,及简单处理方式,比如因为leader会切换,node客户端可能会消费中断,需要重启等;
|
||||
- 联系人;
|
||||
- 操作时间;**在操作时间选择上,建议选择业务方在的工作时间,方便出问题后能及时协同处理**
|
||||
|
||||
告知内容例子:
|
||||
```
|
||||
标题:
|
||||
[2021-11-11] XXX-Kafka 集群升级至 kafka_2.12-xxxx
|
||||
|
||||
变更原因:
|
||||
1、性能优化
|
||||
|
||||
变更内容:
|
||||
1、集群单机连接数限制调整到 1200
|
||||
|
||||
变更影响:
|
||||
升级过程中会有leader切换,理论上无影响,有问题及时联系kafka服务号
|
||||
|
||||
联系人:
|
||||
xxxxx@xxxx.com
|
||||
|
||||
计划时间:
|
||||
2021-11-11T10:00:00+08:00 至 2021-11-11T16:30:00+08:00
|
||||
```
|
||||
|
||||
### 1.2、相关建议
|
||||
|
||||
- 增加对自身服务监控,比如监控服务对应的Topic的流量,监控消费的Lag等指标,以便出现问题时能被及时发现;
|
||||
-
|
||||
|
||||
---
|
||||
|
||||
## 2、滚动重启
|
||||
|
||||
**真正操作前,建议演练一下。**
|
||||
|
||||
---
|
||||
|
||||
### 2.1、整体操作流程
|
||||
|
||||
- 1、再次通知用户,我们现在要开始进行重启操作,有问题随时联系;
|
||||
- 2、重启**非Controller**的一台Broker;
|
||||
- 3、观察重启后指标等是否都正常,如果出现异常则进行相应的处理;
|
||||
- 4、告知用户我们已重启一台,xxx分钟后,要操作剩余所有的机器,让用户注意自身服务是否正常,有问题随时反馈;
|
||||
- 5、xxx分钟后,剩余机器逐台重启,**Kafka-Controller放在最后重启**;
|
||||
- 6、操作完成后,告知用户已操作完成,让用户关注自身服务是否正常,有问题随时反馈;
|
||||
|
||||
---
|
||||
|
||||
|
||||
### 2.2、单台操作流程
|
||||
|
||||
单台操作时,主要分两部分,第一部分时操作进行重启,第二部分是重启完成之后观察服务是否正常。
|
||||
|
||||
#### 2.2.1、重启
|
||||
|
||||
**第一步:停服务**
|
||||
|
||||
```bash
|
||||
# 以kill的方式,停Kafka服务。
|
||||
|
||||
# 强调:不能以kill -9的方式停服!!!
|
||||
```
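下面是一个停服务的示意做法(与 bin/kafka-server-stop.sh 的逻辑基本一致,仅供参考):

```bash
# 找到Kafka进程号,以SIGTERM的方式优雅停止(切记不要用 kill -9)
ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep | awk '{print $1}' | xargs kill -s TERM
```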
|
||||
|
||||
**第二步:修改配置**
|
||||
|
||||
```bash
|
||||
# 对本次重启需要进行修改的配置进行修改;
|
||||
|
||||
# 强烈要求将本次修改的配置的具体操作步骤罗列出来;
|
||||
```
|
||||
|
||||
**第三步:Broker限流**
|
||||
|
||||
```bash
|
||||
# 如果停服务和启动服务之间的时间间隔非常久,导致启动后需要同步非常多的数据,则在启动服务之前,我们需要做好副本同步的限流,否则可能会把Leader的带宽打满,拉挂其他Broker等。
|
||||
|
||||
# 这里需要同步的数据量怎样算多,没有一个非常准确的值,只要是可能将Leader带宽打满、拉挂其他Broker的,都算是数据量大。
|
||||
```
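下面是一个设置副本同步限流的示意命令(Broker id、限流值与地址请按实际环境替换;限流真正生效还依赖Topic上的`leader/follower.replication.throttled.replicas`配置):

```bash
# 对 broker 1001 设置 leader/follower 副本同步限流,单位为字节/秒
bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type brokers --entity-name 1001 \
  --alter --add-config leader.replication.throttled.rate=104857600,follower.replication.throttled.rate=104857600
```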
|
||||
|
||||
**第四步:起服务**
|
||||
|
||||
```bash
|
||||
# 启动Kafka服务,然后观察服务是否正常
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
#### 2.2.2、观察
|
||||
|
||||
|
||||
**第一步:观察启动日志**
|
||||
|
||||
```bash
|
||||
# 查看server.log文件,看到该日志后表示Kafka服务端已启动完成
|
||||
[2021-11-17 14:07:22,459][INFO][main]: [KafkaServer id=2] started
|
||||
|
||||
# 查看server.log文件,检查是否存在ERROR及FATAL日志,如果出现这些日志,需要暂停升级并分析出现这些日志的影响。
|
||||
```
|
||||
|
||||
**第二步:观察服务监控**
|
||||
|
||||
如果可以做到下列指标的监控的话,建议都在监控系统中,配置上这些监控。
|
||||
|
||||
这一步正常来说只要配置上了,如果出现异常监控系统会主动通知,不需要我们细致的去看,所以虽然列的比较多,但是操作的时候不太需要主动去看所有的指标。
|
||||
|
||||
```bash
|
||||
# 服务存活监控;
|
||||
# 错误日志监控;
|
||||
# GC监控;
|
||||
# 脏选举监控;
|
||||
# ISR收缩速度监控;
|
||||
# leader=-1监控;
|
||||
# 网络处理线程负载监控;
|
||||
# 请求处理线程负载监控;
|
||||
# 副本未同步监控;
|
||||
# 系统负载监控(CPU、磁盘IO、磁盘容量、网络带宽、网络丢包、TCP连接数、TCP连接增速、文件句柄数);
|
||||
```
|
||||
|
||||
**第三步:检查变更是否生效**
|
||||
|
||||
这一步骤没有什么好说的,就是检查是否生效。
|
||||
|
||||
|
||||
|
||||
**第四步:观察流量是否正常**
|
||||
|
||||
```
|
||||
观察一:存在Broker组的概念,则可以观察重启所在的Broker组的整个流量和重启之前是否基本一致。
|
||||
|
||||
观察二:重点选取几个Broker上的Topic,观察流量是否出现异常,比如突然没有流入或流出流量了。
|
||||
```
|
||||
|
||||
|
||||
**第五步:等待副本同步完成**
|
||||
|
||||
```bash
|
||||
# 查看整个集群的副本同步状态,确保整个集群都是处于已同步的状态。该信息可以通过LogiKM查看。
|
||||
|
||||
# 实际上是不需要整个集群所有的Broker处于已同步的状态,只需要是落在所重启的Broker上的所有的分区都处于同步状态即可,但是这个不太好判断,因此简单粗暴的就是看整个集群都处于同步状态。
|
||||
```
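也可以用下面这类命令粗略检查是否还有未同步的分区(示意用法,地址按实际环境替换):

```bash
# 输出为空,基本可以认为副本都已同步
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
```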
|
||||
|
||||
### 2.3、其他重要说明
|
||||
|
||||
- 如果重启中需要同步非常大的数据量,Broker本身负载也较高,则建议重启操作要避开leader rebalance的时间;
|
||||
- 重启的过程中,会进行leader的切换,最后一台操作完成之后,需要进行leader rebalance;
|
||||
|
||||
|
||||
---
|
||||
|
||||
|
||||
## 3、信息记录
|
||||
|
||||
操作中,很难避免就不出现任何问题,出现问题时就需要我们做好相关的记录,比如记录:
|
||||
|
||||
- 1、重要的业务及其Topic;
|
||||
- 2、敏感的业务及其Topic;
|
||||
- 3、特殊客户端的业务及其Topic;
|
||||
- 4、不合理使用的业务及其Topic;
|
||||
|
||||
后续我们可以将这些Topic进行重点保障,以及再次进行操作的时候,我们能够更准确的触达到用户。
|
||||
|
||||
173
docs/zh/Kafka分享/Kafka开发环境搭建/Kafka开发环境搭建.md
Normal file
@@ -0,0 +1,173 @@
|
||||
# `Kafka` 本地开发环境搭建
|
||||
|
||||
[TOC]
|
||||
|
||||
## 1、环境准备
|
||||
|
||||
本地开发环境搭建之前,需要准备好如下环境。
|
||||
|
||||
- JDK 11
|
||||
- IDEA 2020.x
|
||||
- Gradle 5.5.1
|
||||
- Zookeeper
|
||||
|
||||
## 2、开发环境配置
|
||||
|
||||
环境准备好之后,我们便开始进行开发环境的配置。
|
||||
|
||||
**步骤一:打开工程,修改IDEA的配置**
|
||||
|
||||
修改Gradle配置
|
||||

|
||||
|
||||
修改Java Compiler配置,如下图所示增加`--add-exports=java.base/sun.nio.ch=ALL-UNNAMED`
|
||||

|
||||
|
||||
**步骤二:进行编译,生成消息协议文件**
|
||||
|
||||
```
|
||||
./gradlew assemble
|
||||
```
|
||||
|
||||

|
||||
|
||||
**步骤三:修改build.gradle文件**
|
||||
|
||||
修改`artifactory` 及 开启`1.8`的兼容
|
||||
|
||||

|
||||
|
||||
|
||||

|
||||
|
||||
**步骤四:修改启动配置**
|
||||
|
||||
```java
|
||||
// 部分看不清的补充说明
|
||||
|
||||
// VM配置
|
||||
// 日志输出位置、log4j配置文件位置、认证文件的位置
|
||||
-Dkafka.logs.dir=logs -Dlog4j.configuration=file:config/log4j.properties -Djava.security.auth.login.config=config/kafka_server_jaas.conf
|
||||
|
||||
// 参数配置
|
||||
config/server.properties
|
||||
```
|
||||
|
||||

|
||||
|
||||
**步骤五:开始编译**
|
||||
|
||||
点击`IDEA`正上方绿色的类似锤子的按钮,开始进行编译。
|
||||
|
||||
编译中:
|
||||

|
||||
|
||||
编译完成:
|
||||

|
||||
|
||||
**步骤六:配置Kafka配置文件**
|
||||
在步骤三中,我们设置了Kafka本地启动涉及到的`server.properties`和`log4j.properties`等文件,这里需要修改的主要是`server.properties`。
|
||||
|
||||
```java
|
||||
// server.properties 中主要需要修改的配置
|
||||
zookeeper.connect=xxxx
|
||||
gateway.url=xxxx
|
||||
cluster.id=xxxx
|
||||
|
||||
// 其他相关的配置可按需进行调整
|
||||
```
|
||||
|
||||
server.properties配置
|
||||

|
||||
|
||||
log4j.properties配置
|
||||

|
||||
|
||||
**步骤七:启动Kafka**
|
||||
|
||||

|
||||
|
||||
|
||||
至此,Kafka本地开发环境便搭建完成了。
|
||||
|
||||
|
||||
## 3、日常命令
|
||||
|
||||
```java
|
||||
// 编译
|
||||
./gradlew assemble
|
||||
|
||||
// 打包,打包完成之后会在core/build/distributions生成打包之后的.tgz文件
|
||||
./gradlew clean releaseTarGz
|
||||
|
||||
// 更多具体的命令可以看2.5版本源码包里面的cmd.txt文件
|
||||
```
|
||||
|
||||
## 4、Kafka 工程代码结构
|
||||
|
||||
主要代码在`clients`和`core`这两个地方。`clients`主要是Java客户端代码。`core`是Kafka服务端代码,也是最重要的代码。
|
||||
|
||||
本次主要介绍一下`core`模块,`clients`模块会在后续进行介绍。`core`模块主要有两部分代码,一部分是社区原生的代码,还有一部分是我们滴滴加入的一些代码。
|
||||
|
||||
### 4.1 Kafka-Core
|
||||
|
||||
这部分`core`模块里面主要是原生的`kafka scala`代码。
|
||||
|
||||
首先看一下图:
|
||||

|
||||
|
||||
|
||||
模块的说明:
|
||||
|
||||
| 模块 | 说明
|
||||
| :-------- |:--------:|
|
||||
| admin | 管理员运维操作相关模块
|
||||
| api | 该模块主要负责交互数据的组装,客户端与服务端交互数据编解码
|
||||
| cluster | Cluster、broker等几个实体类
|
||||
| common | 通用模块,主要是异常类和错误验证
|
||||
| contoroller | Controller相关模块
|
||||
| coordinator | 消费的Coordinator和事物的Coordinator
|
||||
| log | Kafka文件存储模块
|
||||
| metrics | 监控指标metrics模块
|
||||
| network | 网络事件处理模块
|
||||
| security | 安全模块
|
||||
| server | 服务端主模块,业务请求处理入口
|
||||
| tools/utils | 工具相关模块
|
||||
| zk/zookeeper | ZK相关模块
|
||||
|
||||
|
||||
|
||||
### 4.2 Kafka-Core-DiDi
|
||||
|
||||
这部分`core`模块里面主要是我们滴滴扩展的`kafka java`代码。
|
||||
|
||||
首先看一下图:
|
||||
|
||||

|
||||
|
||||
|
||||
模块的说明:
|
||||
|
||||
| 模块 | 说明
|
||||
| :-------- |:--------:|
|
||||
| cache | 缓存模块,主要缓存权限和用户信息并进行同步等
|
||||
| config | 配置模块
|
||||
| jmx | jmx相关模块
|
||||
| metrics | 滴滴Kafka特有的指标
|
||||
| partition | 旧版的分区禁用模块,代码基本废弃了
|
||||
| report | 上报模块,主要上报Topic连接信息
|
||||
| security | 安全管控模块
|
||||
| server | 服务端能力增强模块,包括磁盘过载保护等
|
||||
| util | 工具类
|
||||
|
||||
## 5、环境搭建问题记录
|
||||
1. 启动程序报错:
|
||||
Error:scalac: ‘jvm-11’ is not a valid choice for ‘-target’和scalac: bad option: ‘-target:jvm-11’
|
||||
解决办法:
|
||||
1. 项目的根目录下找到.idea文件夹
|
||||
2. 找到文件夹中scala_compiler.xml文件
|
||||
3. 注释掉其中的 `<parameter value="-target:jvm-11" />`
|
||||
4. 最后重启IDEA即可
|
||||
## 6、总结
|
||||
|
||||
本次介绍了Kafka本地开发环境的搭建以及Kafka相关模块的说明。有兴趣的同学可以尝试着搭建一套本地开发环境,方便后续的学习、日常的开发及问题的定位等。
|
||||
BIN
docs/zh/Kafka分享/Kafka开发环境搭建/assets/dev_app_debug_config.jpg
Normal file
|
After Width: | Height: | Size: 188 KiB |
BIN
docs/zh/Kafka分享/Kafka开发环境搭建/assets/dev_change_artifactory.jpg
Normal file
|
After Width: | Height: | Size: 531 KiB |
BIN
docs/zh/Kafka分享/Kafka开发环境搭建/assets/dev_change_gradle_config.jpg
Normal file
|
After Width: | Height: | Size: 275 KiB |
|
After Width: | Height: | Size: 275 KiB |
BIN
docs/zh/Kafka分享/Kafka开发环境搭建/assets/dev_compatible_8.jpg
Normal file
|
After Width: | Height: | Size: 518 KiB |
BIN
docs/zh/Kafka分享/Kafka开发环境搭建/assets/dev_finished_compile.jpg
Normal file
|
After Width: | Height: | Size: 670 KiB |
BIN
docs/zh/Kafka分享/Kafka开发环境搭建/assets/dev_log4j_properties.jpg
Normal file
|
After Width: | Height: | Size: 673 KiB |
BIN
docs/zh/Kafka分享/Kafka开发环境搭建/assets/dev_server_properties.jpg
Normal file
|
After Width: | Height: | Size: 580 KiB |
BIN
docs/zh/Kafka分享/Kafka开发环境搭建/assets/dev_start_build.jpg
Normal file
|
After Width: | Height: | Size: 263 KiB |
BIN
docs/zh/Kafka分享/Kafka开发环境搭建/assets/dev_start_compile.jpg
Normal file
|
After Width: | Height: | Size: 600 KiB |
BIN
docs/zh/Kafka分享/Kafka开发环境搭建/assets/dev_start_kafka.jpg
Normal file
|
After Width: | Height: | Size: 757 KiB |
BIN
docs/zh/Kafka分享/Kafka开发环境搭建/assets/kafka_core_didi_module.png
Normal file
|
After Width: | Height: | Size: 1.8 MiB |
BIN
docs/zh/Kafka分享/Kafka开发环境搭建/assets/kafka_core_module.png
Normal file
|
After Width: | Height: | Size: 1.6 MiB |
301
docs/zh/Kafka分享/Kafka控制器_处理Broker上下线/Kafka控制器_处理Broker上下线.md
Normal file
@@ -0,0 +1,301 @@
|
||||
# Kafka控制器—处理Broker上下线
|
||||
|
||||
[TOC]
|
||||
|
||||
## 1、前言
|
||||
|
||||
Broker的上下线,除了Broker自身的启停之外呢,Controller还需要对Broker的上下线做元信息的同步等。
|
||||
|
||||
Controller在感知Broker上下线的过程中,主要做了:
|
||||
1. 更新本地缓存的元信息;
|
||||
2. 下线的关闭连接,新增的增加连接;
|
||||
3. 调整下线副本的状态;
|
||||
4. 调整需要重新选举Leader的分区;
|
||||
5. 进行元信息的同步;
|
||||
|
||||
以上就是Controller做的相关事情,下面我们在细看一下具体的流程。
|
||||
|
||||
|
||||
## 2、处理上下线
|
||||
|
||||
### 2.1、感知方式
|
||||
|
||||
1. Broker正常上线:Controller感知ZK节点的变化来感知到Broker的上线。
|
||||
2. Broker正常下线:Broker主动发送ControlShutdown请求给Controller进行处理后再退出,退出后Controller感知到ZK节点变化后再次进行处理。
|
||||
3. Broker异常下线:Controller感知ZK节点的变化来感知Broker的下线。
|
||||
|
||||
那么归结起来,处理上下线就两个流程,一个是通过ZK进行上下线的处理。还有一个是处理ControlShutdown请求来进行Broker下线的处理。
|
||||
|
||||
|
||||
### 2.2、通过ZK感知Broker上下线
|
||||
|
||||
#### 2.2.1、大体流程
|
||||
|
||||

|
||||
|
||||
|
||||
#### 2.2.2、AddBroker & RemoveBroker
|
||||
|
||||
这块流程非常的简单,这里就不画相关说明图了,我们直接来看一下代码。
|
||||
|
||||
**AddBroker**
|
||||
|
||||
```Java
|
||||
def addBroker(broker: Broker): Unit = {
|
||||
// be careful here. Maybe the startup() API has already started the request send thread
|
||||
brokerLock synchronized {
|
||||
if (!brokerStateInfo.contains(broker.id)) {
|
||||
addNewBroker(broker)
|
||||
startRequestSendThread(broker.id)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private def addNewBroker(broker: Broker): Unit = {
|
||||
// 日志及配置等
|
||||
|
||||
// 创建NetworkClient
|
||||
val (networkClient, reconfigurableChannelBuilder) = {
|
||||
val channelBuilder = ChannelBuilders.clientChannelBuilder(。。。。。。)
|
||||
val reconfigurableChannelBuilder = channelBuilder match {。。。。。。。}
|
||||
val selector = new Selector(。。。。。。)
|
||||
val networkClient = new NetworkClient(。。。。。。)
|
||||
(networkClient, reconfigurableChannelBuilder)
|
||||
}
|
||||
val threadName = threadNamePrefix match {
|
||||
case None => s"Controller-${config.brokerId}-to-broker-${broker.id}-send-thread"
|
||||
case Some(name) => s"$name:Controller-${config.brokerId}-to-broker-${broker.id}-send-thread"
|
||||
}
|
||||
|
||||
// metrics
|
||||
|
||||
// 创建请求网络IO处理线程
|
||||
val requestThread = new RequestSendThread(config.brokerId, controllerContext, messageQueue, networkClient,
|
||||
brokerNode, config, time, requestRateAndQueueTimeMetrics, stateChangeLogger, threadName)
|
||||
requestThread.setDaemon(false)
|
||||
|
||||
// metrics
|
||||
|
||||
// 缓存创建的信息
|
||||
brokerStateInfo.put(broker.id, ControllerBrokerStateInfo(networkClient, brokerNode, messageQueue,
|
||||
requestThread, queueSizeGauge, requestRateAndQueueTimeMetrics, reconfigurableChannelBuilder))
|
||||
}
|
||||
```
|
||||
|
||||
**RemoveBroker**
|
||||
|
||||
```Java
|
||||
def removeBroker(brokerId: Int): Unit = {
|
||||
brokerLock synchronized {
|
||||
removeExistingBroker(brokerStateInfo(brokerId))
|
||||
}
|
||||
}
|
||||
|
||||
private def removeExistingBroker(brokerState: ControllerBrokerStateInfo): Unit = {
|
||||
try {
|
||||
// 关闭相关新建的对象
|
||||
brokerState.reconfigurableChannelBuilder.foreach(config.removeReconfigurable)
|
||||
brokerState.requestSendThread.shutdown()
|
||||
brokerState.networkClient.close()
|
||||
brokerState.messageQueue.clear()
|
||||
removeMetric(QueueSizeMetricName, brokerMetricTags(brokerState.brokerNode.id))
|
||||
removeMetric(RequestRateAndQueueTimeMetricName, brokerMetricTags(brokerState.brokerNode.id))
|
||||
brokerStateInfo.remove(brokerState.brokerNode.id)
|
||||
} catch {
|
||||
case e: Throwable => error("Error while removing broker by the controller", e)
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### 2.2.3、处理Broker上线(onBrokerStartup)
|
||||
|
||||
##### 2.2.3.1、大体流程
|
||||
|
||||

|
||||
|
||||
##### 2.2.3.2、相关代码
|
||||
|
||||
```Java
|
||||
private def onBrokerStartup(newBrokers: Seq[Int]): Unit = {
|
||||
info(s"New broker startup callback for ${newBrokers.mkString(",")}")
|
||||
newBrokers.foreach(controllerContext.replicasOnOfflineDirs.remove)
|
||||
val newBrokersSet = newBrokers.toSet
|
||||
val existingBrokers = controllerContext.liveOrShuttingDownBrokerIds -- newBrokers
|
||||
|
||||
// 发送空的元信息到已存在的broker上
|
||||
sendUpdateMetadataRequest(existingBrokers.toSeq, Set.empty)
|
||||
|
||||
// 发送完整的元信息到新增的Broker上
|
||||
sendUpdateMetadataRequest(newBrokers, controllerContext.partitionLeadershipInfo.keySet)
|
||||
|
||||
// 获取所有在新增Broker上的副本
|
||||
val allReplicasOnNewBrokers = controllerContext.replicasOnBrokers(newBrokersSet)
|
||||
|
||||
// 变更副本状态
|
||||
replicaStateMachine.handleStateChanges(allReplicasOnNewBrokers.toSeq, OnlineReplica)
|
||||
|
||||
// 变更分区状态
|
||||
partitionStateMachine.triggerOnlinePartitionStateChange()
|
||||
|
||||
// 恢复迁移
|
||||
maybeResumeReassignments { (_, assignment) =>
|
||||
assignment.targetReplicas.exists(newBrokersSet.contains)
|
||||
}
|
||||
|
||||
// 恢复删除
|
||||
val replicasForTopicsToBeDeleted = allReplicasOnNewBrokers.filter(p => topicDeletionManager.isTopicQueuedUpForDeletion(p.topic))
|
||||
if (replicasForTopicsToBeDeleted.nonEmpty) {
|
||||
// 日志
|
||||
topicDeletionManager.resumeDeletionForTopics(replicasForTopicsToBeDeleted.map(_.topic))
|
||||
}
|
||||
|
||||
// 注册监听
|
||||
registerBrokerModificationsHandler(newBrokers)
|
||||
}
|
||||
```
|
||||
|
||||
#### 2.2.4、处理Broker下线(onBrokerFailure)
|
||||
|
||||
##### 2.2.4.1、大体流程
|
||||
|
||||

|
||||
|
||||
|
||||
##### 2.2.4.2、相关代码
|
||||
|
||||
```Java
|
||||
private def onBrokerFailure(deadBrokers: Seq[Int]): Unit = {
|
||||
info(s"Broker failure callback for ${deadBrokers.mkString(",")}")
|
||||
// 缓存中移除dead-broker
|
||||
|
||||
// 获取到dead-broker上相关的副本
|
||||
val allReplicasOnDeadBrokers = controllerContext.replicasOnBrokers(deadBrokers.toSet)
|
||||
|
||||
// 相关副本状态处理
|
||||
onReplicasBecomeOffline(allReplicasOnDeadBrokers)
|
||||
|
||||
// 取消Broker节点被修改的事件的监听
|
||||
unregisterBrokerModificationsHandler(deadBrokers)
|
||||
}
|
||||
|
||||
private def onReplicasBecomeOffline(newOfflineReplicas: Set[PartitionAndReplica]): Unit = {
|
||||
// 被影响的副本中,区分是要被删除的和不用被删除的
|
||||
val (newOfflineReplicasForDeletion, newOfflineReplicasNotForDeletion) =
|
||||
newOfflineReplicas.partition(p => topicDeletionManager.isTopicQueuedUpForDeletion(p.topic))
|
||||
|
||||
// 获取broker下线后将无leader的分区
|
||||
val partitionsWithoutLeader = controllerContext.partitionLeadershipInfo.filter(partitionAndLeader =>
|
||||
!controllerContext.isReplicaOnline(partitionAndLeader._2.leaderAndIsr.leader, partitionAndLeader._1) &&
|
||||
!topicDeletionManager.isTopicQueuedUpForDeletion(partitionAndLeader._1.topic)).keySet
|
||||
|
||||
// 无leader的分区进行状态切换,及leader选举
|
||||
partitionStateMachine.handleStateChanges(partitionsWithoutLeader.toSeq, OfflinePartition)
|
||||
partitionStateMachine.triggerOnlinePartitionStateChange()
|
||||
|
||||
// 不删除的副本的状态切换
|
||||
replicaStateMachine.handleStateChanges(newOfflineReplicasNotForDeletion.toSeq, OfflineReplica)
|
||||
|
||||
if (newOfflineReplicasForDeletion.nonEmpty) {
|
||||
// 需要删除的副本的Topic标记删除失败
|
||||
topicDeletionManager.failReplicaDeletion(newOfflineReplicasForDeletion)
|
||||
}
|
||||
|
||||
// 如果没有leader变化的分区,则对所有broker进行空的元信息同步
|
||||
if (partitionsWithoutLeader.isEmpty) {
|
||||
sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set.empty)
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
### 2.3、Broker主动下线——处理ControlShutdown请求
|
||||
|
||||
#### 2.3.1、大体流程
|
||||
|
||||

|
||||
|
||||
#### 2.3.2、相关代码
|
||||
|
||||
```Java
|
||||
// 逐层调用
|
||||
def handleControlledShutdownRequest(request: RequestChannel.Request): Unit = {
|
||||
//////
|
||||
}
|
||||
|
||||
private def processControlledShutdown(id: Int, brokerEpoch: Long, controlledShutdownCallback: Try[Set[TopicPartition]] => Unit): Unit = {
|
||||
//////
|
||||
}
|
||||
|
||||
// 执行下线请求的处理
|
||||
private def doControlledShutdown(id: Int, brokerEpoch: Long): Set[TopicPartition] = {
|
||||
if (!isActive) {
|
||||
throw new ControllerMovedException("Controller moved to another broker. Aborting controlled shutdown")
|
||||
}
|
||||
|
||||
// epoch值异常,抛出异常。broker不存在,抛出异常等
|
||||
|
||||
// 加入shuttingdown中
|
||||
controllerContext.shuttingDownBrokerIds.add(id)
|
||||
|
||||
// 获取本次broker下线影响到的分区
|
||||
val partitionsToActOn = controllerContext.partitionsOnBroker(id).filter { partition =>
|
||||
controllerContext.partitionReplicaAssignment(partition).size > 1 &&
|
||||
controllerContext.partitionLeadershipInfo.contains(partition) &&
|
||||
!topicDeletionManager.isTopicQueuedUpForDeletion(partition.topic)
|
||||
}
|
||||
|
||||
// 分区区分是leader分区还是follower分区
|
||||
val (partitionsLedByBroker, partitionsFollowedByBroker) = partitionsToActOn.partition { partition =>
|
||||
controllerContext.partitionLeadershipInfo(partition).leaderAndIsr.leader == id
|
||||
}
|
||||
|
||||
// leader分区进行leader重新选举等
|
||||
partitionStateMachine.handleStateChanges(partitionsLedByBroker.toSeq, OnlinePartition, Some(ControlledShutdownPartitionLeaderElectionStrategy))
|
||||
try {
|
||||
brokerRequestBatch.newBatch()
|
||||
partitionsFollowedByBroker.foreach { partition =>
|
||||
brokerRequestBatch.addStopReplicaRequestForBrokers(Seq(id), partition, deletePartition = false)
|
||||
}
|
||||
brokerRequestBatch.sendRequestsToBrokers(epoch)
|
||||
} catch {
|
||||
case e: IllegalStateException =>
|
||||
handleIllegalState(e)
|
||||
}
|
||||
|
||||
// Follower分区的副本,调整状态为OfflineReplica
|
||||
replicaStateMachine.handleStateChanges(partitionsFollowedByBroker.map(partition =>
|
||||
PartitionAndReplica(partition, id)).toSeq, OfflineReplica)
|
||||
|
||||
def replicatedPartitionsBrokerLeads() = {
|
||||
// 获取获取落在broker上的leader分区
|
||||
}
|
||||
replicatedPartitionsBrokerLeads().toSet
|
||||
}
|
||||
```
|
||||
|
||||
## 3、常见问题
|
||||
|
||||
### 3.1、元信息同步的范围
|
||||
|
||||
遵守的基本原则:
|
||||
1. Topic的Leader及Follower的信息没有变化时,基本上只需要发送UpdateMetadata请求,会发送到所有的Broker。
|
||||
2. 如果Topic的Leader或Follower的信息发生变化了,则会对迁移到的相关Broker发送LeaderAndIsr请求以更新副本之间的同步状态。此外还会对整个集群的Broker发送UpdateMetadata请求,从而保证集群每个Broker上缓存的元信息是一致的。
|
||||
3. 牵扯到副本的暂停副本同步的时候,会对相关的Broker发送StopReplica的请求。
|
||||
|
||||
此外呢,我们在代码中也可以看到,有时候还会发送空的UpdateMetadata请求到Broker。
|
||||
|
||||
这个的主要原因是:
|
||||
UpdateMetadata请求,除了同步Topic元信息之外,还会同步集群的Broker信息。所以最后一个原则:
|
||||
- 即使Topic都没有变化,但是Broker发生变化的时候,也会发送UpdateMetadata请求。
|
||||
|
||||
|
||||
### 3.2、元信息同步性能
|
||||
|
||||
上述的操作的主流程上,除了和ZK可能存在部分的网络IO之外,不会存在和集群其他的Broker的直接的网络IO。
|
||||
|
||||
因此,基本上秒级或者更短的时间可处理完。
|
||||
|
||||
|
||||
## 4、总结
|
||||
|
||||
本次分享了Broker上下线过程中,Controller需要做的事情,然后对常见的问题进行了讨论。以上就是本次分享的全部内容,谢谢大家。
|
||||
|
After Width: | Height: | Size: 311 KiB |
|
After Width: | Height: | Size: 319 KiB |
|
After Width: | Height: | Size: 382 KiB |
|
After Width: | Height: | Size: 290 KiB |
@@ -0,0 +1,251 @@
|
||||
# KIP-415—增量Rebalance协议
|
||||
|
||||
[TOC]
|
||||
|
||||
## 1、背景
|
||||
|
||||
Kafka为了让消费数据这个过程在Kafka集群中尽可能的均衡,Kafka设计了消费客户端的Rebalance功能,Rebalance能够帮助Kafka客户端尽可能的实现负载均衡。
|
||||
|
||||
但是在Kafka 2.3版本之前,Rebalance各种分配策略基本都是基于Eager协议(包括RangeAssignor,RoundRobinAssignor等),也就是大家熟悉的旧的Rebalance。旧的Rebalance一直以来都为人诟病,因为Rebalance过程会触发Stop The World(STW),此时对应Topic的资源都会处于不可用的状态,小规模的集群还好,如果是大规模的集群,比如几百个节点的Consumer消费客户端等,那么重平衡就是一场灾难。
|
||||
|
||||
在2.x的时候,社区意识到需要对现有的Rebalance做出改变。所以在Kafka 2.3版本,首先在Kafka Connect应用了Cooperative协议,然后在Kafka 2.4版本的时候,在Kafka Consumer Client中也添加了该协议的支持。
|
||||
|
||||
本次分享,我们就来对这个特性进行一个简单的介绍。
|
||||
|
||||
## 2、增量Rebalance协议
|
||||
|
||||
### 2.1、Eager协议 与 Cooperative协议 的Rebalance过程
|
||||
|
||||
**Eager协议**
|
||||
|
||||

|
||||
|
||||
网上转载的示意图:
|
||||
|
||||

|
||||
|
||||
|
||||
**Cooperative协议**
|
||||
|
||||

|
||||
|
||||
|
||||
网上转载的示意图:
|
||||
|
||||

|
||||
|
||||
### 2.2、代码实现
|
||||
|
||||
客户端这块的代码实现上,整体和Eager协议差不多,仅仅是在一些关键点上做了改动,具体如下:
|
||||
|
||||
#### 2.2.1、JoinGroup前
|
||||
```Java
|
||||
@Override
|
||||
protected void onJoinPrepare(int generation, String memberId) {
|
||||
// 相关日志等
|
||||
|
||||
final Set<TopicPartition> revokedPartitions;
|
||||
if (generation == Generation.NO_GENERATION.generationId && memberId.equals(Generation.NO_GENERATION.memberId)) {
|
||||
// 。。。 错误的情况
|
||||
} else {
|
||||
switch (protocol) {
|
||||
case EAGER:
|
||||
// EAGER协议,放弃了所有的分区
|
||||
revokedPartitions = new HashSet<>(subscriptions.assignedPartitions());
|
||||
exception = invokePartitionsRevoked(revokedPartitions);
|
||||
|
||||
subscriptions.assignFromSubscribed(Collections.emptySet());
|
||||
|
||||
break;
|
||||
|
||||
case COOPERATIVE:
|
||||
// COOPERATIVE协议,仅放弃不在subscription中的分区
|
||||
// 不被放弃的分区,还是处于一个可用的状态(FETCHING状态)
|
||||
Set<TopicPartition> ownedPartitions = new HashSet<>(subscriptions.assignedPartitions());
|
||||
revokedPartitions = ownedPartitions.stream()
|
||||
.filter(tp -> !subscriptions.subscription().contains(tp.topic()))
|
||||
.collect(Collectors.toSet());
|
||||
|
||||
if (!revokedPartitions.isEmpty()) {
|
||||
exception = invokePartitionsRevoked(revokedPartitions);
|
||||
|
||||
ownedPartitions.removeAll(revokedPartitions);
|
||||
subscriptions.assignFromSubscribed(ownedPartitions);
|
||||
}
|
||||
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
isLeader = false;
|
||||
subscriptions.resetGroupSubscription();
|
||||
|
||||
if (exception != null) {
|
||||
throw new KafkaException("User rebalance callback throws an error", exception);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
#### 2.2.2、SyncGroup前
|
||||
|
||||
SyncGroup之前,就是使用Cooperative协议的分配器,对分区进行分配。在2.5版本中,CooperativeStickyAssignor支持Cooperative协议,具体的代码可以看CooperativeStickyAssignor这个类,这里就不展开介绍了。
|
||||
|
||||
|
||||
|
||||
#### 2.2.3、SyncGroup后
|
||||
|
||||
在一轮的Rebalance结束之后,最后会重新设置分配的状态。
|
||||
|
||||
```Java
|
||||
@Override
|
||||
protected void onJoinComplete(int generation,
|
||||
String memberId,
|
||||
String assignmentStrategy,
|
||||
ByteBuffer assignmentBuffer) {
|
||||
// 公共部分
|
||||
|
||||
final AtomicReference<Exception> firstException = new AtomicReference<>(null);
|
||||
Set<TopicPartition> addedPartitions = new HashSet<>(assignedPartitions);
|
||||
addedPartitions.removeAll(ownedPartitions);
|
||||
|
||||
if (protocol == RebalanceProtocol.COOPERATIVE) {// COOPERATIVE协议单独多处理的部分
|
||||
// revokedPartitions是需要放弃的分区,ownedPartitions是上一次拥有的分区,assignedPartitions是本次分配的分区
|
||||
Set<TopicPartition> revokedPartitions = new HashSet<>(ownedPartitions);
|
||||
revokedPartitions.removeAll(assignedPartitions);
|
||||
|
||||
log.info("Updating assignment with\n" +
|
||||
"now assigned partitions: {}\n" +
|
||||
"compare with previously owned partitions: {}\n" +
|
||||
"newly added partitions: {}\n" +
|
||||
"revoked partitions: {}\n",
|
||||
Utils.join(assignedPartitions, ", "),
|
||||
Utils.join(ownedPartitions, ", "),
|
||||
Utils.join(addedPartitions, ", "),
|
||||
Utils.join(revokedPartitions, ", ")
|
||||
);
|
||||
|
||||
if (!revokedPartitions.isEmpty()) {
|
||||
// 如果存在需要放弃的分区,则触发re-join等
|
||||
firstException.compareAndSet(null, invokePartitionsRevoked(revokedPartitions));
|
||||
|
||||
// if revoked any partitions, need to re-join the group afterwards
|
||||
log.debug("Need to revoke partitions {} and re-join the group", revokedPartitions);
|
||||
requestRejoin();
|
||||
}
|
||||
}
|
||||
|
||||
// 其他公共调用
|
||||
```
|
||||
|
||||
|
||||
### 2.3、使用例子
|
||||
|
||||
在Kafka集群支持该协议的前提下,仅需在Kafka消费客户端的配置中加上这个配置即可。
|
||||
|
||||
```Java
|
||||
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, Collections.singletonList(CooperativeStickyAssignor.class));
|
||||
```
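下面给出一个更完整一点的使用示意(其中bootstrap地址、topic名、消费组名均为假设值),展示在普通的KafkaConsumer上开启CooperativeStickyAssignor的方式:

```Java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CooperativeConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // 假设的地址
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cg_demo");                 // 假设的消费组
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // 关键配置:使用支持Cooperative协议的分配器
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                Collections.singletonList(CooperativeStickyAssignor.class));

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo_topic"));      // 假设的topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d%n", record.partition(), record.offset());
                }
            }
        }
    }
}
```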
|
||||
|
||||
|
||||
### 2.4、客户端日志
|
||||
|
||||
**客户端一**
|
||||
|
||||
```Java
|
||||
// 第一轮:
|
||||
|
||||
// 仅有一个客户端时,所有的分区都分配给该客户端
|
||||
2021-06-08 20:17:50.252 [main] INFO o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Finished assignment for group at generation 9: {consumer-cg_logi_kafka_test_1-1-56a695ad-68c2-4e09-88a2-759e3854e366=Assignment(partitions=[kmo_community-0, kmo_community-1, kmo_community-2])}
|
||||
|
||||
// 第一轮仅有一个客户端的时候,所有分区都分配该客户端
|
||||
2021-06-08 20:17:50.288 [main] DEBUG o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Executing onJoinComplete with generation 9 and memberId consumer-cg_logi_kafka_test_1-1-56a695ad-68c2-4e09-88a2-759e3854e366
|
||||
2021-06-08 20:17:50.288 [main] INFO o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Updating assignment with
|
||||
now assigned partitions: kmo_community-0, kmo_community-1, kmo_community-2
|
||||
compare with previously owned partitions:
|
||||
newly added partitions: kmo_community-0, kmo_community-1, kmo_community-2
|
||||
revoked partitions:
|
||||
|
||||
// 第二轮:
|
||||
|
||||
// 存在两个客户端的时候,有一个分区没有分配给任何客户端
|
||||
2021-06-08 20:18:26.431 [main] INFO o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Finished assignment for group at generation 10: {consumer-cg_logi_kafka_test_1-1-56a695ad-68c2-4e09-88a2-759e3854e366=Assignment(partitions=[kmo_community-1, kmo_community-2]), consumer-cg_logi_kafka_test_1-1-6ea3c93c-d878-4451-81f7-fc6c41d12963=Assignment(partitions=[])}
|
||||
|
||||
// 放弃了kmo_community-0分区,但是1,2分区继续保留消费
|
||||
2021-06-08 20:18:26.465 [main] DEBUG o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Executing onJoinComplete with generation 10 and memberId consumer-cg_logi_kafka_test_1-1-56a695ad-68c2-4e09-88a2-759e3854e366
|
||||
2021-06-08 20:18:26.465 [main] INFO o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Updating assignment with
|
||||
now assigned partitions: kmo_community-1, kmo_community-2
|
||||
compare with previously owned partitions: kmo_community-0, kmo_community-1, kmo_community-2
|
||||
newly added partitions:
|
||||
revoked partitions: kmo_community-0
|
||||
|
||||
// 第三轮:
|
||||
|
||||
// 存在两个客户端的时候,上一轮没有被分配出去的分区,重新分配给了新的消费客户端
|
||||
2021-06-08 20:18:29.548 [main] INFO o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Finished assignment for group at generation 11: {consumer-cg_logi_kafka_test_1-1-56a695ad-68c2-4e09-88a2-759e3854e366=Assignment(partitions=[kmo_community-1, kmo_community-2]), consumer-cg_logi_kafka_test_1-1-6ea3c93c-d878-4451-81f7-fc6c41d12963=Assignment(partitions=[kmo_community-0])}
|
||||
|
||||
// 第三轮rebalance的时候,该客户端没有任何变化
|
||||
2021-06-08 20:18:29.583 [main] INFO o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Updating assignment with
|
||||
now assigned partitions: kmo_community-1, kmo_community-2
|
||||
compare with previously owned partitions: kmo_community-1, kmo_community-2
|
||||
newly added partitions:
|
||||
revoked partitions:
|
||||
|
||||
```
|
||||
|
||||
|
||||
|
||||
**客户端二**
|
||||
|
||||
客户端二是在客户端一稳定运行之后上线的。
|
||||
|
||||
```Java
|
||||
|
||||
// 第二轮:
|
||||
|
||||
// 第二轮rebalance的时候,没有分配到任何分区
|
||||
2021-06-08 20:18:26.467 [main] DEBUG o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Executing onJoinComplete with generation 10 and memberId consumer-cg_logi_kafka_test_1-1-6ea3c93c-d878-4451-81f7-fc6c41d12963
|
||||
2021-06-08 20:18:26.468 [main] INFO o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Updating assignment with
|
||||
now assigned partitions:
|
||||
compare with previously owned partitions:
|
||||
newly added partitions:
|
||||
revoked partitions:
|
||||
|
||||
// 第三轮:
|
||||
|
||||
// 第三轮rebalance的时候,分配到了kmo_community-0
|
||||
2021-06-08 20:18:29.584 [main] DEBUG o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Executing onJoinComplete with generation 11 and memberId consumer-cg_logi_kafka_test_1-1-6ea3c93c-d878-4451-81f7-fc6c41d12963
|
||||
2021-06-08 20:18:29.584 [main] INFO o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Updating assignment with
|
||||
now assigned partitions: kmo_community-0
|
||||
compare with previously owned partitions:
|
||||
newly added partitions: kmo_community-0
|
||||
revoked partitions:
|
||||
```
|
||||
|
||||
|
||||
## 3、常见问题
|
||||
|
||||
### 3.1、为什么SyncGroup后,如果存在revokedPartitions分区的时候,还要进行Re-Join操作?
|
||||
|
||||
现在的做法:
|
||||
- 在进行分配的时候,如果将分区X从客户端1夺走,并不会立即将其分配给客户端2。因此在这一轮Rebalance结束之后,如果存在revokedPartitions,就还需要再进行一轮Rebalance。
|
||||
|
||||
那么为什么不修改成:
|
||||
- 在进行分配的时候,如果将分区X从客户端1夺走,就立即将其分配给客户端2。这样的话,在Rebalance结束之后,即便存在revokedPartitions,也不需要再进行一轮Rebalance了。
|
||||
|
||||
如果这样修改的话,可能存在的问题是,分区X在分配给了客户端2时,还在被客户端1使用,那么客户端得去处理分区X同时被客户端1和客户端2消费的情况,这种情况**并不容易正确处理**,因此没有采用这种方案。
|
||||
|
||||
采用增量Rebalance方式,同时串行化进行分区的放弃和分配,和Eager的Rebalance协议的大体处理流程基本一致,因此实现相对比较简单,不需要去考虑前面提到的竞争问题,而且收益也还可以。
|
||||
|
||||
|
||||
|
||||
## 4、总结
|
||||
|
||||
本次分享简要介绍了一下KIP-429: Kafka Consumer Incremental Rebalance Protocol,功能还是非常实用的,大家在使用增量Rebalance协议进行消费的时候,遇到什么问题也欢迎一起交流。
|
||||
|
||||
|
||||
|
||||
## 5、参考
|
||||
|
||||
[KIP-429: Kafka Consumer Incremental Rebalance Protocol](https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol)
|
||||
|
After Width: | Height: | Size: 402 KiB |
|
After Width: | Height: | Size: 410 KiB |
|
After Width: | Height: | Size: 305 KiB |
|
After Width: | Height: | Size: 222 KiB |
10
docs/zh/Kafka分享/Kafka服务端_API请求_Fetch/Kafka服务端_API请求_Fetch.md
Normal file
@@ -0,0 +1,10 @@
|
||||
# Kafka服务端—API请求—Fetch请求
|
||||
|
||||
[TOC]
|
||||
|
||||
## 1、前言
|
||||
|
||||
|
||||
## 2、Fetch请求
|
||||
|
||||
|
||||
@@ -0,0 +1,114 @@
|
||||
# Kafka Broker 元信息变化请求处理
|
||||
|
||||
[TOC]
|
||||
|
||||
## 1、前言
|
||||
|
||||
Kafka Controller 主要通过 LEADER_AND_ISR、STOP_REPLICA 和 UPDATE_METADATA 三类请求,进行元信息变化的通知。
|
||||
|
||||
因此,Kafka Broker 主要也是通过接收 Kafka Controller 发出来的这三个请求,来调整自身的状态。
|
||||
|
||||
本期分享主要介绍 Kafka Broker 如何处理这三类请求,以便后续在分享 Kafka Controller 的状态转变、Topic增删改查等专题的时候,对 Kafka Broker 所做的事情有更加清晰快速的认识。
|
||||
|
||||
## 2、实现概述
|
||||
|
||||
这三个请求对象,都是继承自 AbstractControlRequest ,具体请求类之间的关系如下图所示:
|
||||
|
||||
<img src="./assets/abstract_control_request_class_related_entry.jpg" width="738px" height="400px">
|
||||
|
||||
- AbstractControlRequest:包含controllerId、controllerEpoch、brokerEpoch这几个字段,用于告知请求来自哪个Controller以及对应的版本信息,确保只有最新的Controller发出来的请求可以被处理(示意代码见本列表之后)。
|
||||
|
||||
|
||||
- UpdateMetadataRequest:同步集群元信息的请求,同步的信息包括分区信息和存活的broker信息。所有的Broker都会存储一份完整的集群元信息,因此客户端随便请求哪一台Broker都可以获取到Topic的元信息。
|
||||
|
||||
|
||||
- StopReplicaRequest:通知停副本同步的请求,此外还带有一个是否将副本删除的字段。一般Broker下线、Topic删除、Topic缩副本、Topic迁移等,Kafka Controller都会发送该请求。
|
||||
|
||||
|
||||
- LeaderAndIsrRequest:通知分区状态(Leader、AR、ISR等)的请求
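针对上面提到的“只有最新的Controller发出来的请求可以被处理”,下面用一段示意性的Java代码(非Kafka源码,类名与字段名均为示意)说明这种基于epoch的新旧请求判断思路:

```Java
// 示意代码:利用epoch做新旧Controller的“围栏”(fencing)判断,非Kafka源码
public class ControlRequestFencingSketch {

    static class AbstractControlRequestLike {
        final int controllerId;
        final int controllerEpoch;   // Controller的代数,每次选举递增
        final long brokerEpoch;      // Broker在ZK注册时得到的epoch

        AbstractControlRequestLike(int controllerId, int controllerEpoch, long brokerEpoch) {
            this.controllerId = controllerId;
            this.controllerEpoch = controllerEpoch;
            this.brokerEpoch = brokerEpoch;
        }
    }

    // Broker侧的校验:只接受不小于本地记录的controllerEpoch,且brokerEpoch未过期的请求
    static boolean shouldAccept(AbstractControlRequestLike req,
                                int cachedControllerEpoch,
                                long currentBrokerEpoch) {
        if (req.controllerEpoch < cachedControllerEpoch) {
            return false; // 来自旧Controller的请求,直接丢弃
        }
        if (req.brokerEpoch < currentBrokerEpoch) {
            return false; // 请求携带的brokerEpoch已过期(例如Broker重启过)
        }
        return true;
    }
}
```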
|
||||
|
||||
|
||||
**问题一:这里我们发现 UPDATE_METADATA 请求和 LEADER_AND_ISR 请求,他们请求的数据格式基本上是一致的,这块为什么要这么设计,为什么不设计成一个接口呢?**
|
||||
|
||||
**问题二:这里的Node、Broker还有EndPoint的区别是什么?**
|
||||
|
||||
---
|
||||
|
||||
## 3、UPDATE_METADATA
|
||||
|
||||
### 3.1、UPDATE_METADATA 功能概述
|
||||
|
||||
首先区分一下 METADATA 请求和 UPDATE_METADATA 请求:
|
||||
- METADATA:大部分是客户端发起,请求获取Topic元信息的。
|
||||
- UPDATE_METADATA:大部分是Controller发出来,对Broker上的元信息进行更新。
|
||||
|
||||
|
||||
### 3.2、UPDATE_METADATA 大体流程
|
||||
|
||||
<img src="./assets/update_metadata_summary_flow_chat.jpg" width="552px" height="500px">
|
||||
|
||||
|
||||
### 3.3、UPDATE_METADATA 代码详读
|
||||
|
||||
#### 3.3.1、存储的元信息
|
||||
|
||||
```scala
|
||||
// 在package kafka.server中的MetadataCache中,存储了如下信息:
|
||||
// 分区状态(UpdateMetadataPartitionState)
|
||||
// controllerId
|
||||
// 存活的broker信息
|
||||
// 存活的节点
|
||||
case class MetadataSnapshot(partitionStates: mutable.AnyRefMap[String, mutable.LongMap[UpdateMetadataPartitionState]],
|
||||
controllerId: Option[Int],
|
||||
aliveBrokers: mutable.LongMap[Broker],
|
||||
aliveNodes: mutable.LongMap[collection.Map[ListenerName, Node]])
|
||||
```
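MetadataCache大致采用“整份快照、整体替换”的方式维护这些信息:读请求拿到一份不可变的快照即可无锁读取,处理UPDATE_METADATA时再整体替换引用。下面是一段与源码无直接对应、仅示意这种读写方式的Java代码(类型与字段均为假设):

```Java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// 示意代码:快照式缓存——读线程始终看到一份完整且不可变的快照,更新时整体替换引用
public class MetadataSnapshotSketch {

    static final class Snapshot {
        final Map<String, Integer> partitionLeaders; // 假设:topic-partition -> leaderId
        final Integer controllerId;

        Snapshot(Map<String, Integer> partitionLeaders, Integer controllerId) {
            this.partitionLeaders = Collections.unmodifiableMap(partitionLeaders);
            this.controllerId = controllerId;
        }
    }

    private volatile Snapshot current = new Snapshot(new HashMap<>(), null);

    // 读:拿到引用之后就是一致的视图,不需要加锁
    Integer leaderOf(String topicPartition) {
        return current.partitionLeaders.get(topicPartition);
    }

    // 写:处理UPDATE_METADATA时,基于旧快照构造新快照,然后一次性替换
    synchronized void applyUpdate(Map<String, Integer> changedLeaders, Integer controllerId) {
        Map<String, Integer> merged = new HashMap<>(current.partitionLeaders);
        merged.putAll(changedLeaders);
        current = new Snapshot(merged, controllerId);
    }
}
```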
|
||||
|
||||
#### 3.3.2、Quota分配策略
|
||||
|
||||
按照Leader的分布,按Leader的比例数量分配Quota。因此存在的问题是,如果Topic的分区流量不均衡,那么可能当Topic的整体流量还没有达到限流值的时候,就已经被限流了。
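举一个简单的算例(数值均为假设):若某Topic的限流配额为30MB/s,3个Leader分区分布为Broker A上2个、Broker B上1个,则按Leader数量比例,A分到20MB/s、B分到10MB/s;此时如果流量集中打在B上的那个分区,B侧的10MB/s一到就会触发限流,即使Topic整体流量远没有达到30MB/s。下面的Java片段演示这个按比例拆分的计算:

```Java
import java.util.LinkedHashMap;
import java.util.Map;

// 示意代码:按各Broker上Leader分区数量的比例拆分Topic的限流配额(数值均为假设)
public class QuotaSplitSketch {
    public static void main(String[] args) {
        double topicQuotaBytesPerSec = 30 * 1024 * 1024;          // 假设Topic总配额为30MB/s
        Map<String, Integer> leaderCountByBroker = new LinkedHashMap<>();
        leaderCountByBroker.put("brokerA", 2);
        leaderCountByBroker.put("brokerB", 1);

        int totalLeaders = leaderCountByBroker.values().stream().mapToInt(Integer::intValue).sum();
        leaderCountByBroker.forEach((broker, leaders) -> {
            double share = topicQuotaBytesPerSec * leaders / totalLeaders;
            System.out.printf("%s 分到的配额 = %.0f bytes/s%n", broker, share);
        });
    }
}
```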
|
||||
|
||||
<img src="./assets/update_metadata_change_quota.jpg" width="680px" height="400px">
|
||||
|
||||
---
|
||||
|
||||
## 4、STOP_REPLICA
|
||||
|
||||
### 4.1、STOP_REPLICA 功能概述
|
||||
|
||||
正如名字一样,该请求的主要功能就是用于停Broker的副本的。
|
||||
|
||||
### 4.2、STOP_REPLICA 大体流程
|
||||
|
||||
<img src="./assets/stop_replica_summary_flow_chat.jpg" width="653px" height="400px">
|
||||
|
||||
## 5、LEADER_AND_ISR
|
||||
|
||||
### 5.1、LEADER_AND_ISR 功能概述
|
||||
|
||||
Leader_And_Isr请求的主要功能就是将分区的leader和follower切换的消息通知给broker,然后broker进行Leader和Follower的切换。
|
||||
|
||||
|
||||
### 5.2、LEADER_AND_ISR 大体流程
|
||||
|
||||
<img src="./assets/leader_and_isr_summary_flow_chat.jpg" width="700px" height="600px">
|
||||
|
||||
### 5.3、makeLeader 详细说明
|
||||
|
||||
<img src="./assets/leader_and_isr_make_leader_flow_chat.jpg" width="799px" height="600px">
|
||||
|
||||
|
||||
### 5.4、makeFollower 详细说明
|
||||
|
||||
<img src="./assets/leader_and_isr_make_follower_flow_chat.jpg" width="799px" height="600px">
|
||||
|
||||
---
|
||||
|
||||
## 6、日常问题
|
||||
|
||||
### 6.1、问题一:UPDATE_METADATA 和 LEADER_AND_ISR之间的区别?
|
||||
|
||||
从Kafka Broker的角度看,确实两个请求的数据基本是一样的。
|
||||
|
||||
### 6.2、问题二:这里的Node、Broker还有EndPoint的区别是什么?
|
||||
|
||||
|
After Width: | Height: | Size: 451 KiB |
|
After Width: | Height: | Size: 303 KiB |
|
After Width: | Height: | Size: 318 KiB |
|
After Width: | Height: | Size: 404 KiB |
|
After Width: | Height: | Size: 238 KiB |
|
After Width: | Height: | Size: 921 KiB |
|
After Width: | Height: | Size: 344 KiB |
66
docs/zh/Kafka分享/Kafka服务端_Broker上线/Kafka服务端_Broker上线.md
Normal file
@@ -0,0 +1,66 @@
|
||||
# Kafka服务端—Broker上线
|
||||
|
||||
[TOC]
|
||||
|
||||
## 1、前言
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
## 2、上线概述
|
||||
|
||||
### 2.1、上线流程
|
||||
|
||||

|
||||
|
||||
|
||||
### 2.2、组件说明
|
||||
|
||||
|组件名称|用途|
|
||||
|:----|:----|
|
||||
|KafkaZkClient|自封装的ZK客户端,操作ZK节点及注册监听器等|
|
||||
|LogManager|Log文件管理器,Kafka分区副本的数据存储于Log文件|
|
||||
|MetadataCache|元信息缓存,每台Broker缓存的Kafka集群的元信息|
|
||||
|SocketServer|Socket服务器,用于在网络层请求的接收与发送|
|
||||
|ReplicaManager|副本管理器,管理分区副本之间的同步|
|
||||
|KafkaController|Kafka控制器,控制Kafka元信息的同步等|
|
||||
|GroupCoordinator|消费组协调器,协调消费客户端分区的分配及记录消费进度|
|
||||
|TransactionCoordinator|事务协调器|
|
||||
|KafkaApis|API处理层,处理Kafka-Gateway及Kafka相关的API请求|
|
||||
|KafkaRequestHandlePool|工作线程池,在网络层收到请求后交由工作线程进行处理|
|
||||
|
||||
|
||||
### 2.3、组件说明
|
||||
|
||||
|
||||
|
||||
## 3、相关组件详解
|
||||
|
||||
### 3.1、获取Broker元信息及Offline的目录
|
||||
|
||||
这块比较简单,获取Broker元信息及Offline的目录,就是去读取每个数据目录下面的meta.properties文件里面的数据。
|
||||
|
||||
具体读写元信息的过程可以看一下`BrokerMetadataCheckpoint`这个类。
|
||||
|
||||
|
||||
```Java
|
||||
// meta.properties文件里面数据的例子:
|
||||
#
|
||||
#Wed Jun 23 18:24:10 CST 2021
|
||||
broker.id=1
|
||||
version=0
|
||||
cluster.id=4
|
||||
```
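作为补充示意,用标准库的java.util.Properties就可以读出这类文件的内容(下面的文件路径为假设值,并非BrokerMetadataCheckpoint的实际实现):

```Java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// 示意代码:读取数据目录下的meta.properties(路径为假设值)
public class MetaPropertiesReader {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        try (InputStream in = new FileInputStream("/data/kafka-logs/meta.properties")) {
            props.load(in);
        }
        int brokerId = Integer.parseInt(props.getProperty("broker.id"));
        String clusterId = props.getProperty("cluster.id");
        int version = Integer.parseInt(props.getProperty("version", "0"));
        System.out.printf("broker.id=%d, cluster.id=%s, version=%d%n", brokerId, clusterId, version);
    }
}
```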
|
||||
|
||||
## 4、常见问题
|
||||
|
||||
|
||||
|
||||
|
||||
## 5、总结
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
After Width: | Height: | Size: 243 KiB |
46
docs/zh/Kafka分享/Kafka服务端_Broker下线/Kafka服务端_Broker下线.md
Normal file
@@ -0,0 +1,46 @@
|
||||
# Kafka服务端—Broker下线
|
||||
|
||||
[TOC]
|
||||
|
||||
## 1、前言
|
||||
|
||||
|
||||
## 2、下线概述
|
||||
|
||||
### 2.1、下线流程
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
|组件名称|用途|
|
||||
|:----|:----|
|
||||
|KafkaZkClient|自封装的ZK客户端,操作ZK节点及注册监听器等|
|
||||
|LogManager|Log文件管理器,Kafka分区副本的数据存储于Log文件|
|
||||
|MetadataCache|元信息缓存,每台Broker缓存的Kafka集群的元信息|
|
||||
|SocketServer|Socket服务器,用于在网络层请求的接收与发送|
|
||||
|ReplicaManager|副本管理器,管理分区副本之间的同步|
|
||||
|KafkaController|Kafka控制器,控制Kafka元信息的同步等|
|
||||
|GroupCoordinator|消费组协调器,协调消费客户端分区的分配及记录消费进度|
|
||||
|TransactionCoordinator|事务协调器|
|
||||
|KafkaApis|API处理层,处理Kafka-Gateway及Kafka相关的API请求|
|
||||
|KafkaRequestHandlePool|工作线程池,在网络层收到请求后交由工作线程进行处理|
|
||||
|
||||
|
||||
## 3、下线流程
|
||||
|
||||
### 3.1、流程概述
|
||||
|
||||
|
||||
|
||||
## 4、常见问题
|
||||
|
||||
|
||||
|
||||
|
||||
## 5、总结
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
After Width: | Height: | Size: 114 KiB |
406
docs/zh/Kafka分享/Kafka服务端_副本管理_副本同步/Kafka服务端_副本管理_副本同步.md
Normal file
@@ -0,0 +1,406 @@
|
||||
# Kafka服务端—副本管理—副本同步
|
||||
|
||||
[TOC]
|
||||
|
||||
## 1、前言
|
||||
|
||||
|
||||
|
||||
## 2、相关原理
|
||||
|
||||
### 2.1、基本概念
|
||||
|
||||
Kafka每个分区下可能有很多个副本(Replica)用于实现冗余,从而进一步实现高可用。副本根据角色的不同,可以分为:
|
||||
1. Leader副本:响应Clients端读写请求的副本;
|
||||
2. Follower副本:被动的备份Leader副本中的数据,不能响应Clients端的写请求,读请求在高版本可以支持。
|
||||
3. ISR副本:包含了Leader副本和所有与Leader副本保持同步的Follower副本。
|
||||
4. AR副本:包含了Leader副本和所有的在线或者不在线的Follower副本,AR即分区正常情况下的所有副本。
|
||||
|
||||
|
||||
每个Kafka副本对象都有两个重要的属性:LEO和HW,其中:
|
||||
LEO:日志末端位移(Log End Offset),记录了该副本底层日志(log)中下一条消息的位移值。注意是下一条消息。也就是说,如果LEO=10,那么表示该副本保存了10条消息,位移值范围是[0, 9]。
|
||||
HW:位移的水位值(High Watermark)。对于同一个副本对象而言,其HW值不会大于LEO值。小于等于HW值的所有消息都被认为是“已备份”的。
|
||||
|
||||
具体的如图所示:
|
||||
|
||||

|
||||
|
||||
Kafka在副本
|
||||
|
||||
|
||||
### 2.2、状态更新
|
||||
|
||||
在副本同步中,状态更新主要更新的是HW和LEO,本节主要介绍一下LEO和HW的更新逻辑。
|
||||
|
||||
|
||||
这里需要首先言明的是,Kafka有两套Follower副本的LEO,
|
||||
1. 一套LEO保存在Follower副本所在Broker的副本管理器中;
|
||||
2. 另一套LEO保存在Leader副本所在Broker的副本管理器中。即Leader副本所在的机器上保存了所有Follower副本的LEO。
|
||||
|
||||
那为什么要有两套呢?这是因为Kafka使用前者帮助Follower副本更新其HW值;而利用后者帮助Leader副本更新其HW。
|
||||
|
||||
|
||||
大致的更新如图所示:
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
### 2.3、线程模型
|
||||
|
||||
副本同步的Fetch线程是按照(brokerId, fetcherId)的维度创建的,因此每个线程内所同步的分区,其Leader都在同一个Broker上。
|
||||
|
||||
**Broker之间的副本同步线程模型(1:1)**
|
||||
|
||||
两个Broker之间的副本同步线程数为2的情况下,分区在各个副本同步线程上可能的分配情况如下图所示。
|
||||
|
||||

|
||||
|
||||
1. 单个Topic维度上,分区均匀的分布在副本同步线程上。
|
||||
2. 整体上,可能存在线程分配的分区多少的问题。
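单个Topic的分区之所以能比较均匀地落到各个副本同步线程上,是因为分区到fetcher线程编号的映射大致是“topic哈希加分区号,再对线程数取模”。下面是一段示意性的Java代码(并非逐字对应Kafka源码):

```Java
// 示意代码:分区到副本同步(fetcher)线程编号的映射方式,非逐字对应Kafka源码
public class FetcherIdSketch {

    static int fetcherId(String topic, int partition, int numFetchersPerBroker) {
        // 用topic哈希加上分区号再取模,使同一Topic的分区尽量均匀分布到各个fetcher线程
        return Math.abs(31 * topic.hashCode() + partition) % numFetchersPerBroker;
    }

    public static void main(String[] args) {
        int numFetchers = 2; // 假设每个源Broker配置2个副本同步线程
        for (int p = 0; p < 6; p++) {
            System.out.printf("topicA-%d -> fetcher-%d%n", p, fetcherId("topicA", p, numFetchers));
        }
    }
}
```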
|
||||
|
||||
|
||||
|
||||
**Broker之间的副本同步线程模型(N:N)**
|
||||
|
||||
<img src="./assets/kafka_broker_rm_fetch_thread_module_nn.jpg" width="400px" height="400px">
|
||||
|
||||
1. Broker上存在两两副本同步的情况。
|
||||
2. 每个副本同步的连接都如上图所示。
|
||||
|
||||
|
||||
|
||||
## 3、具体行为
|
||||
|
||||
|
||||
### 3.1、Leader行为
|
||||
|
||||
在副本同步过程中,Leader侧主要做两件事情,分别是:
|
||||
1. 维护Follower的副本同步状态。
|
||||
2. 如果处于同步时,则尝试ExpandIsr。
|
||||
|
||||

|
||||
|
||||
#### 3.1.1、读取数据
|
||||
|
||||
|
||||
|
||||
#### 3.1.2、更新状态
|
||||
|
||||
**大体流程**
|
||||
|
||||

|
||||
|
||||
**主要代码**
|
||||
|
||||
```Java
|
||||
|
||||
```
|
||||
|
||||
### 3.2、Follower行为
|
||||
|
||||
Follower在副本同步,主要做两件事,分别是:
|
||||
1. 如果需要,则对日志进行切断操作。
|
||||
2. 如果需要,则对日志进行同步操作。
|
||||
|
||||
```Java
|
||||
override def doWork(): Unit = {
|
||||
val startTimeMs = Time.SYSTEM.milliseconds
|
||||
maybeTruncate() // 日志切断
|
||||
RequestChannel.setSession(Session.AdminSession)
|
||||
maybeFetch() // 日志拉取
|
||||
RequestChannel.cleanSession
|
||||
fetcherStats.totalTime.update(Time.SYSTEM.milliseconds - startTimeMs);
|
||||
}
|
||||
```
|
||||
|
||||
#### 3.2.1、日志截断
|
||||
|
||||
|
||||
|
||||
|
||||
#### 3.2.2、副本拉取
|
||||
|
||||
**大体流程**
|
||||
|
||||

|
||||
|
||||
|
||||
**相关代码**
|
||||
|
||||
```Java
|
||||
// 尝试进行Fetch
|
||||
private def maybeFetch(): Unit = {
|
||||
val fetchRequestOpt = inLock(partitionMapLock) {
|
||||
// 构造Fetch请求的Builder,Builder里面包含本次将Fetch哪些分区的数据。
|
||||
// Fetch的分区会考虑是否被限流了,如果限流了,则本次Fetch忽略该分区。
|
||||
val ResultWithPartitions(fetchRequestOpt, partitionsWithError) = buildFetch(partitionStates.partitionStateMap.asScala)
|
||||
|
||||
handlePartitionsWithErrors(partitionsWithError, "maybeFetch")
|
||||
if (fetchRequestOpt.isEmpty) {
|
||||
trace(s"There are no active partitions. Back off for $fetchBackOffMs ms before sending a fetch request")
|
||||
partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
|
||||
}
|
||||
fetchRequestOpt
|
||||
}
|
||||
fetchRequestOpt.foreach { case ReplicaFetch(sessionPartitions, fetchRequest) =>
|
||||
processFetchRequest(sessionPartitions, fetchRequest)
|
||||
}
|
||||
}
|
||||
|
||||
// 处理Fetch请求,包括1、发送Fetch请求,2、处理Fetch请求的Response
|
||||
private def processFetchRequest(sessionPartitions: util.Map[TopicPartition, FetchRequest.PartitionData],
|
||||
fetchRequest: FetchRequest.Builder): Unit = {
|
||||
val partitionsWithError = mutable.Set[TopicPartition]()
|
||||
var responseData: Map[TopicPartition, FetchData] = Map.empty
|
||||
|
||||
try {
|
||||
trace(s"Sending fetch request $fetchRequest")
|
||||
val startTimeMs = System.currentTimeMillis()
|
||||
// 发送fetch请求到leader,同时记录操作的耗时
|
||||
responseData = fetchFromLeader(fetchRequest)
|
||||
fetcherStats.requestTime.update(Time.SYSTEM.milliseconds - startTimeMs);
|
||||
} catch {
|
||||
// 错误日志及处理
|
||||
}
|
||||
fetcherStats.requestRate.mark()
|
||||
|
||||
if (responseData.nonEmpty) { // 存在数据,则进行处理
|
||||
inLock(partitionMapLock) {
|
||||
responseData.foreach { case (topicPartition, partitionData) =>
|
||||
Option(partitionStates.stateValue(topicPartition)).foreach { currentFetchState =>
|
||||
val fetchPartitionData = sessionPartitions.get(topicPartition)
|
||||
if (fetchPartitionData != null && fetchPartitionData.fetchOffset == currentFetchState.fetchOffset && currentFetchState.isReadyForFetch) {
|
||||
// 存在fetchPartitionData,并且fetch的offset和需要的一致
|
||||
// 同时分区的状态是readyFetch的,则进行下面的处理
|
||||
// 否则本次fetch的数据会被丢弃
|
||||
val requestEpoch = if (fetchPartitionData.currentLeaderEpoch.isPresent) Some(fetchPartitionData.currentLeaderEpoch.get().toInt) else None
|
||||
partitionData.error match {
|
||||
case Errors.NONE =>
|
||||
try {
|
||||
// 数据落盘,并返回落盘的信息
|
||||
val logAppendInfoOpt = processPartitionData(topicPartition, currentFetchState.fetchOffset,
|
||||
partitionData)
|
||||
|
||||
logAppendInfoOpt.foreach { logAppendInfo =>
|
||||
// 。。。。。。
|
||||
// metrics信息中,更新lag值
|
||||
fetcherLagStats.getAndMaybePut(topicPartition).lag = lag
|
||||
|
||||
// 更新其他信息,包括FetchState和metrics
|
||||
val newFetchState = PartitionFetchState(nextOffset, Some(lag), currentFetchState.currentLeaderEpoch, state = Fetching)
|
||||
partitionStates.updateAndMoveToEnd(topicPartition, newFetchState)
|
||||
fetcherStats.byteRate.mark(validBytes)
|
||||
}
|
||||
}
|
||||
} catch {
|
||||
// 异常处理
|
||||
}
|
||||
// 错误处理
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (partitionsWithError.nonEmpty) {
|
||||
// 对错误的分区进行统一处理
|
||||
handlePartitionsWithErrors(partitionsWithError, "processFetchRequest")
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
## 4、ISR扩缩
|
||||
|
||||
当Follower没有跟上Leader的时候,Leader会进行ISR的收缩。
|
||||
|
||||
ISR收缩在Broker这块主要分两步进行:
|
||||
1. 检测出哪些分区的ISR需要进行收缩了,同时将收缩后的ISR信息写到ZK。
|
||||
2. 将ISR收缩的事件注册到ZK,然后Controller通过ZK感知到ISR收缩了,从而进行ISR收缩的元信息的广播处理。
|
||||
|
||||
### 4.1、ISR收缩
|
||||
|
||||

|
||||
|
||||
|
||||
### 4.2、ISR扩张
|
||||
|
||||
|
||||
### 4.3、传播ISR收缩
|
||||
|
||||

|
||||
|
||||
### 4.4、相关代码
|
||||
|
||||
|
||||
**后台检查ISR是否收缩及进行广播的线程**
|
||||
|
||||
```Java
|
||||
// ReplicaManager创建好了之后,会调用startup启动相关线程
|
||||
def startup(): Unit = {
|
||||
// 检查分区是否需要收缩ISR的线程,如果收缩了,则进行收缩。
|
||||
// follower将在 config.replicaLagTimeMaxMs 到 config.replicaLagTimeMaxMs * 1.5 的时间范围内被移出ISR中
|
||||
scheduler.schedule("isr-expiration", maybeShrinkIsr _, period = config.replicaLagTimeMaxMs / 2, unit = TimeUnit.MILLISECONDS)
|
||||
|
||||
// Isr收缩之后的广播告知线程
|
||||
scheduler.schedule("isr-change-propagation", maybePropagateIsrChanges _, period = 2500L, unit = TimeUnit.MILLISECONDS)
|
||||
|
||||
// 其他相关的处理线程 。。。。。。
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
**检查是否收缩**
|
||||
```Java
|
||||
private def maybeShrinkIsr(): Unit = {
|
||||
trace("Evaluating ISR list of partitions to see which replicas can be removed from the ISR")
|
||||
|
||||
// 仅对Online的分区进行判断
|
||||
allPartitions.keys.foreach { topicPartition =>
|
||||
nonOfflinePartition(topicPartition).foreach(_.maybeShrinkIsr())
|
||||
}
|
||||
}
|
||||
|
||||
// 检查是否收缩
|
||||
def maybeShrinkIsr(): Unit = {
|
||||
val needsIsrUpdate = inReadLock(leaderIsrUpdateLock) {
|
||||
needsShrinkIsr() // 判断是否需要ShrinkIsr
|
||||
}
|
||||
val leaderHWIncremented = needsIsrUpdate && inWriteLock(leaderIsrUpdateLock) {
|
||||
leaderLogIfLocal match {
|
||||
case Some(leaderLog) =>
|
||||
val outOfSyncReplicaIds = getOutOfSyncReplicas(replicaLagTimeMaxMs) // 获取OSR的副本,这里的replicaLagTimeMaxMs默认是30秒
|
||||
if (outOfSyncReplicaIds.nonEmpty) {
|
||||
val newInSyncReplicaIds = inSyncReplicaIds -- outOfSyncReplicaIds
|
||||
// 如果有收缩,则打印收缩的日志
|
||||
|
||||
// 更新ZK上的ISR信息
|
||||
shrinkIsr(newInSyncReplicaIds)
|
||||
|
||||
// 如果HW需要增大,则对其进行增大
|
||||
maybeIncrementLeaderHW(leaderLog)
|
||||
} else {
|
||||
false
|
||||
}
|
||||
|
||||
case None => false // do nothing if no longer leader
|
||||
}
|
||||
}
|
||||
|
||||
// some delayed operations may be unblocked after HW changed
|
||||
if (leaderHWIncremented)
|
||||
tryCompleteDelayedRequests()
|
||||
}
|
||||
|
||||
// Follower是否OSR的判断
|
||||
private def isFollowerOutOfSync(replicaId: Int,
|
||||
leaderEndOffset: Long,
|
||||
currentTimeMs: Long,
|
||||
maxLagMs: Long): Boolean = {
|
||||
val followerReplica = getReplicaOrException(replicaId)
|
||||
// follower的leo和leader的leo不一致,
|
||||
// 并且followerReplica的 lastCaughtUpTimeMs 在 maxLagMs 没有被更新了
|
||||
followerReplica.logEndOffset != leaderEndOffset &&
|
||||
(currentTimeMs - followerReplica.lastCaughtUpTimeMs) > maxLagMs
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
**检查是否广播**
|
||||
|
||||
```Java
|
||||
def maybePropagateIsrChanges(): Unit = {
|
||||
val now = System.currentTimeMillis()
|
||||
isrChangeSet synchronized {
|
||||
if (isrChangeSet.nonEmpty &&
|
||||
(lastIsrChangeMs.get() + ReplicaManager.IsrChangePropagationBlackOut < now ||
|
||||
lastIsrPropagationMs.get() + ReplicaManager.IsrChangePropagationInterval < now)) {
|
||||
// 注册ISR收缩的事件
|
||||
zkClient.propagateIsrChanges(isrChangeSet)
|
||||
isrChangeSet.clear()
|
||||
lastIsrPropagationMs.set(now)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// 在 /isr_change_notification 目录下面,创建一个以 isr_change_ 打头的,顺序递增的持久化的ZK节点
|
||||
// 节点内存储的是进行ISR搜索的topic-partiton
|
||||
def propagateIsrChanges(isrChangeSet: collection.Set[TopicPartition]): Unit = {
|
||||
val isrChangeNotificationPath: String = createSequentialPersistentPath(IsrChangeNotificationSequenceZNode.path(),
|
||||
IsrChangeNotificationSequenceZNode.encode(isrChangeSet))
|
||||
debug(s"Added $isrChangeNotificationPath for $isrChangeSet")
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
**_lastCaughtUpTimeMs更新策略**
|
||||
|
||||
```Java
|
||||
def updateFetchState(followerFetchOffsetMetadata: LogOffsetMetadata,
|
||||
followerStartOffset: Long,
|
||||
followerFetchTimeMs: Long,
|
||||
leaderEndOffset: Long,
|
||||
lastSentHighwatermark: Long): Unit = {
|
||||
if (followerFetchOffsetMetadata.messageOffset >= leaderEndOffset)
|
||||
// 如果fetch的offset大于leader的end offset,则直接取二者的最大值
|
||||
_lastCaughtUpTimeMs = math.max(_lastCaughtUpTimeMs, followerFetchTimeMs)
|
||||
else if (followerFetchOffsetMetadata.messageOffset >= lastFetchLeaderLogEndOffset)
|
||||
// 如果fetch的offset大于上一次leader的end offset,则表示本次拉取跟上了上一次的leader的end offset
|
||||
// 这里表示,如果leader数据增长的非常快,follower的fetch的进度处于:
|
||||
// 本次fetch的位置比上一次fetch时的leader的leo还小的状态,
|
||||
// 则_lastCaughtUpTimeMs将一直处于不能被更新的状态,从而最终导致ISR收缩。
|
||||
_lastCaughtUpTimeMs = math.max(_lastCaughtUpTimeMs, lastFetchTimeMs)
|
||||
|
||||
_logStartOffset = followerStartOffset
|
||||
_logEndOffsetMetadata = followerFetchOffsetMetadata
|
||||
lastFetchLeaderLogEndOffset = leaderEndOffset
|
||||
lastFetchTimeMs = followerFetchTimeMs
|
||||
updateLastSentHighWatermark(lastSentHighwatermark)
|
||||
trace(s"Updated state of replica to $this")
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
## 5、常见问题
|
||||
|
||||
|
||||
### 5.1、副本同步的网络连接数
|
||||
|
||||
每个副本同步线程,都持有一个网络连接,因此我们只需要关注副本同步的线程数,就可以知道网络连接数了。
|
||||
|
||||
通常情况下,比如一个Region内如果有10台Broker,然后配置的副本同步的线程数为10,那么**任意一台**Broker的副本同步线程数就是 (10 - 1) * 10 = 90 个,网络连接数也是90。
|
||||
|
||||
|
||||
|
||||
### 5.2、如何避免分区饥饿
|
||||
|
||||
**Follower侧**
|
||||
|
||||
使用了LinkedHashMap存储分区及分区的状态。组装Fetch请求的数据的时候,按序逐个获取LinkedHashMap中的分区。
|
||||
|
||||
在收到Fetch请求的Response的数据之后,将分区移动到LinkedHashMap的最后面。
|
||||
|
||||
从而,整体上维护了一个类似队列的结构来进行分区数据的Fetch,上一次Fetch到数据的分区会被移动到队列的最后面。
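下面用一段Java代码示意这种“取完数据就挪到队尾”的做法(仅为示意,与源码中PartitionStates的实现并非一一对应):

```Java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

// 示意代码:用LinkedHashMap实现“本次Fetch到数据的分区挪到最后”,避免个别分区一直抢占Fetch机会
public class FetchOrderSketch {

    private final LinkedHashMap<String, Long> fetchOffsets = new LinkedHashMap<>(); // 分区 -> 下次fetch的offset

    void addPartition(String partition, long offset) {
        fetchOffsets.put(partition, offset);
    }

    // 组装Fetch请求时,按当前顺序取前maxPartitions个分区
    List<String> partitionsForNextFetch(int maxPartitions) {
        List<String> selected = new ArrayList<>();
        for (String p : fetchOffsets.keySet()) {
            if (selected.size() >= maxPartitions) break;
            selected.add(p);
        }
        return selected;
    }

    // 收到Response后,把拿到数据的分区移动到末尾(先删除再插入即可改变LinkedHashMap的插入顺序)
    void moveToEnd(String partition, long nextOffset) {
        fetchOffsets.remove(partition);
        fetchOffsets.put(partition, nextOffset);
    }

    public static void main(String[] args) {
        FetchOrderSketch sketch = new FetchOrderSketch();
        sketch.addPartition("t-0", 0L);
        sketch.addPartition("t-1", 0L);
        sketch.addPartition("t-2", 0L);
        System.out.println(sketch.partitionsForNextFetch(2)); // [t-0, t-1]
        sketch.moveToEnd("t-0", 100L);
        System.out.println(sketch.partitionsForNextFetch(2)); // [t-1, t-2]
    }
}
```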
|
||||
|
||||
|
||||
**Leader侧**
|
||||
|
||||
|
||||
|
||||
## 6、相关指标
|
||||
|
||||
|说明|ObjectName例子|
|
||||
|:---|:---|
|
||||
|查看具体某个分区进行副本同步时的Lag|kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=ReplicaFetcherThread-0-2,topic=__consumer_offsets,partition=1|
|
||||
|查看某个副本同步线程的同步流量|kafka.server:type=FetcherStats,name=BytesPerSec,clientId=ReplicaFetcherThread-0-2,brokerHost=10.179.149.201,brokerPort=7093|
|
||||
|
||||
|
||||
## 7、总结
|
||||
|
||||
|
||||
BIN
docs/zh/Kafka分享/Kafka服务端_副本管理_副本同步/assets/kafka_broker_arch.jpg
Normal file
|
After Width: | Height: | Size: 229 KiB |
|
After Width: | Height: | Size: 111 KiB |
|
After Width: | Height: | Size: 296 KiB |
|
After Width: | Height: | Size: 106 KiB |
|
After Width: | Height: | Size: 66 KiB |
|
After Width: | Height: | Size: 396 KiB |
|
After Width: | Height: | Size: 182 KiB |
|
After Width: | Height: | Size: 103 KiB |
|
After Width: | Height: | Size: 336 KiB |
BIN
docs/zh/Kafka分享/Kafka服务端_副本管理_副本同步/assets/kb_rm_leo_hw.jpg
Normal file
|
After Width: | Height: | Size: 38 KiB |
|
After Width: | Height: | Size: 124 KiB |
76
docs/zh/Kafka分享/Kafka服务端_副本管理_概述/Kafka服务端_副本管理_概述.md
Normal file
@@ -0,0 +1,76 @@
|
||||
# Kafka服务端—副本管理—概述
|
||||
|
||||
[TOC]
|
||||
|
||||
## 1、前言
|
||||
|
||||
在具体的介绍副本管理模块之前,我们先来看一下Kafka服务端的架构:
|
||||
|
||||

|
||||
|
||||
副本管理处于API层和Log子系统之间,主要的功能是管理Broker上所有的副本。功能包括:
|
||||
1. 副本操作,即处理API层的向副本追加数据的Produce请求、切换副本的Leader和Follower的LeaderAndIsr请求等。
|
||||
2. 副本同步:Follower副本向远端的Leader副本进行数据同步,同时如果本地持有的是Leader副本的话,还管理着ISR信息等。
|
||||
|
||||
本次我们将对副本管理模块做概要介绍,后续我们还会针对其中的副本同步模块做深入的剖析。下面,正式开始副本模块的概要分享。
|
||||
|
||||
|
||||
## 2、类图
|
||||
|
||||
首先,我们来看一下副本管理模块的类图:
|
||||
|
||||

|
||||
|
||||
1. ReplicaManager:副本管理,对外副本管理类;
|
||||
2. LogManager:管理Log对象,然后通过Log对象实现对数据的读写。
|
||||
3. ReplicaFetchManager:副本拉取管理,管理副本同步的线程,及在同步过程中的状态变化。
|
||||
4. ReplicaAlterLogDirsManager:副本修改存储路径管理,管理修改路径后的副本同步线程及状态。
|
||||
5. Partition:分区信息,管理AR信息等,此外还持有Replica对象用于管理Follower副本同步的状态。
|
||||
6. ReplicaFetcherBlockingSend:网络IO模块。
|
||||
|
||||
|
||||
## 3、功能
|
||||
|
||||
### 3.1、副本操作
|
||||
|
||||
在Kafka服务端对外的API请求中,除了一些Controller、Coordinator还有安全管控相关的API请求之外,其余剩下的基本都是和ReplicaManager相关的API接口。
|
||||
|
||||
这些请求都是通过ReplicaManager间接的去对Broker上的副本进行增删改查等动作。具体相关的API请求包括:
|
||||
|
||||
1. Produce-追加数据
|
||||
2. Fetch-读取数据
|
||||
3. ListOffsets-查询offset信息
|
||||
4. LeaderAndIsr-切换副本为Leader或Follower
|
||||
5. StopReplica-停止副本
|
||||
6. UpdateMetadata-更新元信息
|
||||
7. DeleteRecords-删除记录
|
||||
8. OffsetForLeaderEpoch-获取Offset对应的LeaderEpoch信息
|
||||
9. AlterReplicaLogDirs-修改副本Log的存储目录
|
||||
10. DescribeLogDirs-查询Log所属目录信息
|
||||
11. ElectLeaders-选举Leaders
|
||||
|
||||
每一类API请求都非常的复杂,而且牵扯到LogManager等模块,因此我们会对每一类请求的行为做单独的分享,这里就不再做更多介绍了。
|
||||
|
||||
虽然这里不会对每一类请求做详细的说明,但是这里也加一下相关请求的分享链接:
|
||||
|
||||
- [元信息变更请求详解(LeaderAndIsr、StopReplica、UpdateMetadata)](../Kafka服务端_元信息变更请求处理/Kafka服务端_元信息变更请求处理.md)
|
||||
|
||||
|
||||
|
||||
### 3.2、副本同步
|
||||
|
||||
副本管理模块除了对副本做一些CRUD等操作之外呢,还有一个比较重要的功能需要被单独拿出来,那就是副本同步。
|
||||
|
||||
|
||||
Kafka分区下有可能有很多个副本用于实现冗余,从而进一步实现高可用,而多个副本之间的数据的一致性的管理就是依靠副本同步进行实现的。
|
||||
|
||||
具体的可以见如下这篇文章:
|
||||
|
||||
- [Kafka服务端_副本管理_副本同步](../Kafka服务端_副本管理_副本同步/Kafka服务端_副本管理_副本同步.md)
|
||||
|
||||
|
||||
|
||||
## 4、总结
|
||||
|
||||
本节仅概要介绍一下副本管理模块的功能,后续我们再针对副本操作和副本同步做详细的介绍,本次分享的内容到此结束,谢谢大家。
|
||||
|
||||
BIN
docs/zh/Kafka分享/Kafka服务端_副本管理_概述/assets/kafka_broker_arch.jpg
Normal file
|
After Width: | Height: | Size: 229 KiB |
|
After Width: | Height: | Size: 296 KiB |
33
docs/zh/Kafka分享/Kafka服务端_消费组协调器/Kafka服务端_消费组协调器.md
Normal file
@@ -0,0 +1,33 @@
|
||||
# Kafka-GroupCoordinator 详解
|
||||
|
||||
[TOC]
|
||||
|
||||
## 1、GroupCoordinator介绍
|
||||
|
||||
### 1.1、功能介绍
|
||||
|
||||
### 1.2、相关请求
|
||||
|
||||
```java
|
||||
// 消费offset提交 & 消费offset获取
|
||||
case ApiKeys.OFFSET_COMMIT => handleOffsetCommitRequest(request)
|
||||
case ApiKeys.OFFSET_FETCH => handleOffsetFetchRequest(request)
|
||||
|
||||
// 获取coordinator, 包括group-coordinator 和 transaction-coordinator
|
||||
case ApiKeys.FIND_COORDINATOR => handleFindCoordinatorRequest(request)
|
||||
|
||||
case ApiKeys.JOIN_GROUP => handleJoinGroupRequest(request)
|
||||
case ApiKeys.HEARTBEAT => handleHeartbeatRequest(request)
|
||||
case ApiKeys.LEAVE_GROUP => handleLeaveGroupRequest(request)
|
||||
case ApiKeys.SYNC_GROUP => handleSyncGroupRequest(request)
|
||||
case ApiKeys.DESCRIBE_GROUPS => handleDescribeGroupRequest(request)
|
||||
case ApiKeys.LIST_GROUPS => handleListGroupsRequest(request)
|
||||
case ApiKeys.OFFSET_FOR_LEADER_EPOCH => handleOffsetForLeaderEpochRequest(request)
|
||||
case ApiKeys.DELETE_GROUPS => handleDeleteGroupsRequest(request)
|
||||
case ApiKeys.OFFSET_DELETE => handleOffsetDeleteRequest(request)
|
||||
|
||||
|
||||
|
||||
```
|
||||
|
||||
## 2、
|
||||
747
docs/zh/Kafka分享/Kafka消费客户端_协调器分析/Kafka消费客户端_协调器分析.md
Normal file
@@ -0,0 +1,747 @@
|
||||
# Kafka消费客户端——协调器分析(Group-Coordinator)
|
||||
|
||||
<!-- [TOC] -->
|
||||
|
||||

|
||||
|
||||
## 1、前言
|
||||
|
||||
前面我们对Kafka消费客户端进行了整体的介绍,接下来我们将针对Kafka中的消费组协调器中的客户端侧进行详细的分析。
|
||||
|
||||
在细致介绍之前,先来看一下Kafka消费组协调器的功能。Kafka消费组协调器主要有两个功能,分别是:
|
||||
1. 协调多个消费客户端之间的消费分区的分配;
|
||||
2. 管理分区的消费进度;
|
||||
|
||||
因此,本节我们将分享协调器是如何对多个消费客户端进行协调的。
|
||||
|
||||
|
||||
## 2、消费客户端—消费协调
|
||||
|
||||
### 2.1、协调原理(Rebalance原理)
|
||||
|
||||
|
||||
#### 2.1.1、基本原理
|
||||
|
||||
**1、基本原理**
|
||||
|
||||

|
||||
|
||||
|
||||
**2、能力猜想**
|
||||
|
||||
基于上述的原理,我们现在思考一下,要做到上述的协调能力,客户端需要具备哪些能力?
|
||||
|
||||
- 1、客户端支持寻找到服务端的Coordinator;
|
||||
- 2、客户端支持加入消费组;
|
||||
- 3、客户端支持离开消费组;
|
||||
- 4、客户端支持处理协调结果的能力;
|
||||
- 5、客户端异常能被及时感知发现,及时将分配到的分区转移给其他客户端(性能);
|
||||
|
||||
**3、协调请求协议**
|
||||
|
||||
完成了刚才的猜想之后,我们再来看一下客户端的实际实现。客户端主要实现了五类协议,具体如下:
|
||||
|
||||
1. FindCoordinator请求:消费客户端寻找消费组对应的Group-Coordinator;
|
||||
2. JoinGroup请求:消费客户端请求加入消费组;
|
||||
3. LeaveGroup请求:消费客户端主动告知Group-Coordinator我要离开消费组;
|
||||
4. SyncGroup请求:leader消费客户端通过Group-Coordinator将分区分配方案告知其他的消费客户端;
|
||||
5. Heartbeat请求:消费客户端定期给Group-Coordinator发送心跳来表明自己还活着;
|
||||
|
||||
|
||||
**4、客户端状态**
|
||||
|
||||
```Java
|
||||
protected enum MemberState {
|
||||
UNJOINED, // the client is not part of a group
|
||||
REBALANCING, // the client has begun rebalancing
|
||||
STABLE, // the client has joined and is sending heartbeats
|
||||
}
|
||||
```
|
||||
|
||||
**5、服务端消费组状态**
|
||||
|
||||

|
||||
|
||||
#### 2.1.2、Rebalance时机及处理
|
||||
|
||||
**Rebalance触发时机**
|
||||
|
||||
|
||||
1. 组成员数发生变化;
|
||||
- 组成员增加(客户端上线);
|
||||
- 组成员减少(客户端正常或异常下线);
|
||||
- 组成员心跳异常(包括两次poll操作时间太久导致心跳认为客户端不正常);
|
||||
2. 订阅的Topic发生变化;
|
||||
- 更改订阅的Topic;
|
||||
- 订阅的Topic扩分区;
|
||||
|
||||
##### 2.1.2.1、组成员变化
|
||||
|
||||
**组成员增加**
|
||||
|
||||

|
||||
|
||||
**组成员减少**
|
||||
|
||||

|
||||
|
||||
##### 2.1.2.2、Topic发生变化
|
||||
|
||||
**更改订阅的Topic**
|
||||
|
||||

|
||||
|
||||
**Topic扩分区**
|
||||
|
||||
1. Kafka消费客户端在元信息更新时,会感知到Topic分区的变化;
|
||||
2. 在进行poll数据的时候,发现这个变化之后,会重新发起JoinGroup的请求;
|
||||
|
||||
|
||||
|
||||
### 2.2、场景分析
|
||||
|
||||
本节主要结合实际的场景,详细的分析实际的处理过程及具体的源码实现。
|
||||
|
||||
#### 2.2.1、消费客户端—上线
|
||||
|
||||
在具体的看代码之前,我们先根据客户端的日志,看一下客户端加入消费组的过程。
|
||||
|
||||
##### 2.2.1.1、大体流程
|
||||
|
||||
**1. 寻找Coordinator(发送FindCoordinator请求);**
|
||||
**2. 开始申请加入消费组(发送JoinGroup请求);**
|
||||
**3. 进行分区分配并借助Coordinator同步结果(发送SyncGroup请求);**
|
||||
**4. 成功加入消费组,并开始心跳线程;**
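从使用者的角度看,触发上面这套流程只需要普通的subscribe加poll调用,FindCoordinator、JoinGroup、SyncGroup以及心跳都发生在poll内部或后台线程中。下面是一个最小示例(其中地址、topic、消费组均为假设值),并通过ConsumerRebalanceListener观察分区被分配/回收的时刻:

```Java
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class JoinFlowDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // 假设值
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cg_demo");               // 假设值
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // 通过RebalanceListener可以直接观察到SyncGroup完成后分区被分配/被回收的时刻
        consumer.subscribe(Collections.singletonList("demo_topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println("revoked: " + partitions);
            }
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("assigned: " + partitions);
            }
        });
        // 第一次poll会完成FindCoordinator、JoinGroup、SyncGroup,之后心跳由后台线程维持
        consumer.poll(Duration.ofSeconds(5));
        consumer.close();
    }
}
```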
|
||||
|
||||
|
||||
|
||||
##### 2.2.1.2、FindCoordinator流程
|
||||
|
||||
**时序图**
|
||||
|
||||

|
||||
|
||||
|
||||
**相关代码**
|
||||
|
||||
```Java
|
||||
protected synchronized boolean ensureCoordinatorReady(final Timer timer) {
|
||||
if (!coordinatorUnknown()) // 检查coordinator是否正常
|
||||
return true;
|
||||
|
||||
do {
|
||||
// ...... 判断先前寻找coordinator异常并做处理
|
||||
|
||||
// 寻找coordinator
|
||||
final RequestFuture<Void> future = lookupCoordinator();
|
||||
client.poll(future, timer); // 真正的IO
|
||||
|
||||
// 对寻找的结果进行处理
|
||||
} while (coordinatorUnknown() && timer.notExpired()); // 循环直到coordinator正常 或者 超时
|
||||
|
||||
return !coordinatorUnknown();
|
||||
}
|
||||
|
||||
protected synchronized RequestFuture<Void> lookupCoordinator() {
|
||||
if (findCoordinatorFuture == null) { // 当前已无处理中的find-coordinator请求
|
||||
// 寻找节点
|
||||
Node node = this.client.leastLoadedNode();
|
||||
if (node == null) {
|
||||
log.debug("No broker available to send FindCoordinator request");
|
||||
return RequestFuture.noBrokersAvailable();
|
||||
} else {
|
||||
findCoordinatorFuture = sendFindCoordinatorRequest(node); // 发送请求
|
||||
// 记录异常信息,留给上层处理
|
||||
findCoordinatorFuture.addListener(new RequestFutureListener<Void>() {
|
||||
@Override
|
||||
public void onSuccess(Void value) {} // do nothing
|
||||
@Override
|
||||
public void onFailure(RuntimeException e) {
|
||||
findCoordinatorException = e;
|
||||
}
|
||||
});
|
||||
}
|
||||
}
|
||||
return findCoordinatorFuture;
|
||||
}
|
||||
|
||||
// 构造FindCoordinator请求
|
||||
// 发送到NetworkNetworkClient
|
||||
// 增加回调的处理类
|
||||
private RequestFuture<Void> sendFindCoordinatorRequest(Node node) {
|
||||
|
||||
log.debug("Sending FindCoordinator request to broker {}", node);
|
||||
FindCoordinatorRequest.Builder requestBuilder =
|
||||
new FindCoordinatorRequest.Builder(
|
||||
new FindCoordinatorRequestData()
|
||||
.setKeyType(CoordinatorType.GROUP.id())
|
||||
.setKey(this.rebalanceConfig.groupId));
|
||||
return client.send(node, requestBuilder)
|
||||
.compose(new FindCoordinatorResponseHandler());
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
##### 2.2.1.3、JoinGroup流程
|
||||
|
||||
**时序图**
|
||||
|
||||

|
||||
|
||||
**相关代码**
|
||||
|
||||
```Java
|
||||
// 确保消费组active,非active的话则会将其调整为active
|
||||
boolean ensureActiveGroup(final Timer timer) {
|
||||
if (!ensureCoordinatorReady(timer)) {
|
||||
// coordinator异常则直接退出
|
||||
return false;
|
||||
}
|
||||
|
||||
startHeartbeatThreadIfNeeded(); // 启动心跳
|
||||
return joinGroupIfNeeded(timer); // 加入消费组
|
||||
}
|
||||
|
||||
// 加入消费组
|
||||
boolean joinGroupIfNeeded(final Timer timer) {
|
||||
while (rejoinNeededOrPending()) {
|
||||
if (!ensureCoordinatorReady(timer)) {
|
||||
// coordinator异常则直接退出
|
||||
return false;
|
||||
}
|
||||
|
||||
if (needsJoinPrepare) {
|
||||
// Join之前如果需要预处理则进行预处理,commit-offset等
|
||||
needsJoinPrepare = false;
|
||||
onJoinPrepare(generation.generationId, generation.memberId);
|
||||
}
|
||||
|
||||
// 开始加入消费组
|
||||
final RequestFuture<ByteBuffer> future = initiateJoinGroup();
|
||||
client.poll(future, timer); // 真正进行网络IO,等待请求完成或超时
|
||||
if (!future.isDone()) {
|
||||
// we ran out of time
|
||||
return false;
|
||||
}
|
||||
|
||||
if (future.succeeded()) {
|
||||
// 此时基本已成功加入消费组了,这里我们在分析SyncGroup的时候再单独分析
|
||||
} else {
|
||||
final RuntimeException exception = future.exception();
|
||||
log.info("Join group failed with {}", exception.toString());
|
||||
resetJoinGroupFuture();
|
||||
if (exception instanceof UnknownMemberIdException || exception instanceof RebalanceInProgressException || exception instanceof IllegalGenerationException || exception instanceof MemberIdRequiredException)
|
||||
// 这些异常都可通过重试解决,因此这里continue进行重试了
|
||||
continue;
|
||||
else if (!future.isRetriable())
|
||||
throw exception;
|
||||
|
||||
timer.sleep(rebalanceConfig.retryBackoffMs);
|
||||
}
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
// 开始加入消费组
|
||||
private synchronized RequestFuture<ByteBuffer> initiateJoinGroup() {
|
||||
if (joinFuture == null) { // 没有JoinGroup的请求在处理中
|
||||
// 暂停心跳
|
||||
disableHeartbeatThread();
|
||||
|
||||
// 设置状态、lastRebalanceStartMs。。。。。。
|
||||
state = MemberState.REBALANCING;
|
||||
|
||||
// 发送JoinGroup请求
|
||||
joinFuture = sendJoinGroupRequest();
|
||||
joinFuture.addListener(new RequestFutureListener<ByteBuffer>() {
|
||||
@Override
|
||||
public void onSuccess(ByteBuffer value) {
|
||||
// SyncGroup请求处理成功之后调用的,这块等待分享SyncGroup的时候单独介绍
|
||||
}
|
||||
|
||||
@Override
|
||||
public void onFailure(RuntimeException e) {
|
||||
// JoinGroup异常和SyncGroup异常的时候调用
|
||||
synchronized (AbstractCoordinator.this) {
|
||||
// 加入异常,后续会重新加入
|
||||
recordRebalanceFailure();
|
||||
}
|
||||
}
|
||||
});
|
||||
}
|
||||
return joinFuture;
|
||||
}
|
||||
|
||||
// 发送JoinGroup请求
|
||||
RequestFuture<ByteBuffer> sendJoinGroupRequest() {
|
||||
if (coordinatorUnknown())
|
||||
return RequestFuture.coordinatorNotAvailable();
|
||||
|
||||
// 构造JoinGroup请求的builder
|
||||
log.info("(Re-)joining group");
|
||||
JoinGroupRequest.Builder requestBuilder = new JoinGroupRequest.Builder(......);
|
||||
|
||||
log.debug("Sending JoinGroup ({}) to coordinator {}", requestBuilder, this.coordinator);
|
||||
|
||||
// 将JoinGroup请求发送到NetworkClient,并增加处理完成后处理器
|
||||
return client.send(coordinator, requestBuilder, joinGroupTimeoutMs)
|
||||
.compose(new JoinGroupResponseHandler());
|
||||
}
|
||||
|
||||
// JoinGroup请求Response的处理器
|
||||
private class JoinGroupResponseHandler extends CoordinatorResponseHandler<JoinGroupResponse, ByteBuffer> {
|
||||
@Override
|
||||
public void handle(JoinGroupResponse joinResponse, RequestFuture<ByteBuffer> future) {
|
||||
Errors error = joinResponse.error();
|
||||
if (error == Errors.NONE) { // response无异常
|
||||
if (isProtocolTypeInconsistent(joinResponse.data().protocolType())) {
|
||||
// 协议不一致
|
||||
} else {
|
||||
log.debug("Received successful JoinGroup response: {}", joinResponse);
|
||||
sensors.joinSensor.record(response.requestLatencyMs());
|
||||
|
||||
synchronized (AbstractCoordinator.this) {
|
||||
if (state != MemberState.REBALANCING) {
|
||||
future.raise(new UnjoinedGroupException());
|
||||
} else {
|
||||
AbstractCoordinator.this.generation = new Generation(
|
||||
joinResponse.data().generationId(),
|
||||
joinResponse.data().memberId(), joinResponse.data().protocolName());
|
||||
// 发送SyncGroup的请求,后续我们在SyncGroup的时候,再单独分析这块
|
||||
if (joinResponse.isLeader()) {
|
||||
// leader会对分区进行分配
|
||||
onJoinLeader(joinResponse).chain(future);
|
||||
} else {
|
||||
onJoinFollower().chain(future);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// 注意这整块没有触发上层listenr的onSuccess的调用
|
||||
// 这里是在通过chain(future)的调用,将onSuccess的触发交给了SyncGroup的Response处理类。
|
||||
}
|
||||
}
|
||||
|
||||
// 各种异常的处理,处理后会raise一个错误
|
||||
else if (error == Errors.COORDINATOR_LOAD_IN_PROGRESS) {
|
||||
} else if (error == Errors.UNKNOWN_MEMBER_ID) {
|
||||
} else if (error == Errors.COORDINATOR_NOT_AVAILABLE || error == Errors.NOT_COORDINATOR) {
|
||||
} else if (error == Errors.FENCED_INSTANCE_ID) {
|
||||
} else if (error == Errors.INCONSISTENT_GROUP_PROTOCOL || error == Errors.INVALID_SESSION_TIMEOUT || error == Errors.INVALID_GROUP_ID || error == Errors.GROUP_AUTHORIZATION_FAILED || error == Errors.GROUP_MAX_SIZE_REACHED) {
|
||||
} else if (error == Errors.UNSUPPORTED_VERSION) {
|
||||
} else if (error == Errors.MEMBER_ID_REQUIRED) {
|
||||
} else {
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
##### 2.2.1.4、SyncGroup流程
|
||||
|
||||
**时序图**
|
||||
|
||||

|
||||
|
||||
**相关代码**
|
||||
|
||||
```Java
|
||||
// 如果收到的JoinGroupResponse表明当前客户端是leader,则进行分区分配并同步给coordinator
|
||||
private RequestFuture<ByteBuffer> onJoinLeader(JoinGroupResponse joinResponse) {
|
||||
try {
|
||||
// 对成员进行分配
|
||||
|
||||
// 构造请求 。。。。。。
|
||||
|
||||
log.debug("Sending leader SyncGroup to coordinator {} at generation {}: {}", this.coordinator, this.generation, requestBuilder);
|
||||
|
||||
// 发送SyncGroup请求
|
||||
return sendSyncGroupRequest(requestBuilder);
|
||||
} catch (RuntimeException e) {
|
||||
return RequestFuture.failure(e);
|
||||
}
|
||||
}
|
||||
|
||||
// 收到的JoinGroupResponse表明当前客户端是follower
|
||||
private RequestFuture<ByteBuffer> onJoinFollower() {
|
||||
// 发送空的分配数据
|
||||
return sendSyncGroupRequest(requestBuilder);
|
||||
}
|
||||
|
||||
private RequestFuture<ByteBuffer> sendSyncGroupRequest(SyncGroupRequest.Builder requestBuilder) {
|
||||
if (coordinatorUnknown())
|
||||
return RequestFuture.coordinatorNotAvailable();
|
||||
// 发送SyncGroup请求并增加请求结果的处理器
|
||||
return client.send(coordinator, requestBuilder)
|
||||
.compose(new SyncGroupResponseHandler());
|
||||
}
|
||||
|
||||
// SyncGroup请求的结果处理器
|
||||
private class SyncGroupResponseHandler extends CoordinatorResponseHandler<SyncGroupResponse, ByteBuffer> {
|
||||
@Override
|
||||
public void handle(SyncGroupResponse syncResponse,
|
||||
RequestFuture<ByteBuffer> future) {
|
||||
Errors error = syncResponse.error();
|
||||
if (error == Errors.NONE) {
|
||||
if (isProtocolTypeInconsistent(syncResponse.data.protocolType())) {
|
||||
} else if (isProtocolNameInconsistent(syncResponse.data.protocolName())) {
|
||||
// 存在异常,则raise异常
|
||||
} else {
|
||||
log.debug("Received successful SyncGroup response: {}", syncResponse);
|
||||
sensors.syncSensor.record(response.requestLatencyMs());
|
||||
// 结束future,同时这里的分配策略存储在future中
|
||||
future.complete(ByteBuffer.wrap(syncResponse.data.assignment()));
|
||||
}
|
||||
} else {
|
||||
// 存在异常的情况下,则标记需要进行rejoin
|
||||
requestRejoin();
|
||||
|
||||
// raise异常
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// SyncGroup处理完成之后,我们继续看上层
|
||||
boolean joinGroupIfNeeded(final Timer timer) {
|
||||
while (rejoinNeededOrPending()) {
|
||||
// 各种处理,JoinGroup中已经进行介绍,这里不再赘述
|
||||
|
||||
final RequestFuture<ByteBuffer> future = initiateJoinGroup();
|
||||
|
||||
client.poll(future, timer); // JoinGroup及SyncGroup的请求会阻塞在这个地方,等待超时或结束
|
||||
if (!future.isDone()) {
|
||||
// we ran out of time
|
||||
return false;
|
||||
}
|
||||
|
||||
if (future.succeeded()) {
|
||||
//
|
||||
// JoinGroup 及 SyncGroup 成功,已获取到分配策略,此时开始进行分配结果的处理
|
||||
//
|
||||
Generation generationSnapshot;
|
||||
synchronized (AbstractCoordinator.this) {
|
||||
generationSnapshot = this.generation;
|
||||
}
|
||||
|
||||
if (generationSnapshot != Generation.NO_GENERATION) {
|
||||
// 获取分配策略
|
||||
ByteBuffer memberAssignment = future.value().duplicate();
|
||||
|
||||
// 结束Join工作
|
||||
onJoinComplete(generationSnapshot.generationId, generationSnapshot.memberId, generationSnapshot.protocolName, memberAssignment);
|
||||
|
||||
// 相关状态重置
|
||||
resetJoinGroupFuture();
|
||||
needsJoinPrepare = true;
|
||||
} else {
|
||||
// 加入异常的处理
|
||||
}
|
||||
} else {
|
||||
// 加入异常的处理
|
||||
}
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
// 结束Join
|
||||
protected void onJoinComplete(int generation,
|
||||
String memberId,
|
||||
String assignmentStrategy,
|
||||
ByteBuffer assignmentBuffer) {
|
||||
log.debug("Executing onJoinComplete with generation {} and memberId {}", generation, memberId);
|
||||
|
||||
// 进行一些检查及一些重新的初始化 。。。。。。
|
||||
|
||||
// 反序列化分配策略
|
||||
Assignment assignment = ConsumerProtocol.deserializeAssignment(assignmentBuffer);
|
||||
|
||||
Set<TopicPartition> assignedPartitions = new HashSet<>(assignment.partitions());
|
||||
|
||||
// 分配的分区不match订阅的信息
|
||||
if (!subscriptions.checkAssignmentMatchedSubscription(assignedPartitions)) {
|
||||
requestRejoin(); //标记要重新join
|
||||
return;
|
||||
}
|
||||
|
||||
if (protocol == RebalanceProtocol.COOPERATIVE) {
|
||||
// 新增的增量rebalance的协议,后续单独分析
|
||||
}
|
||||
|
||||
// 更新信息,及标记新增Topic要元信息同步
|
||||
maybeUpdateJoinedSubscription(assignedPartitions);
|
||||
|
||||
if (autoCommitEnabled)
|
||||
// 更新下次自动commit offset时间
|
||||
this.nextAutoCommitTimer.updateAndReset(autoCommitIntervalMs);
|
||||
|
||||
// 其他
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
#### 2.2.2、消费客户端—下线
|
||||
|
||||
##### 2.2.2.1、下线客户端
|
||||
|
||||
客户端下线分为正常下线和异常崩溃,异常崩溃的话就直接退出了,因此这里仅对正常下线的流程进行分析。
|
||||
|
||||
**正常下线时序图**
|
||||
|
||||

|
||||
|
||||
**正常下线代码**
|
||||
|
||||
```Java
|
||||
// 调用KafkaConsumer的close方法
|
||||
public void close(Duration timeout) {
|
||||
// 一些判断及加锁
|
||||
try {
|
||||
if (!closed) {
|
||||
// 关闭客户端
|
||||
close(timeout.toMillis(), false);
|
||||
}
|
||||
} finally {
|
||||
// ......
|
||||
}
|
||||
}
|
||||
|
||||
private void close(long timeoutMs, boolean swallowException) {
|
||||
// 日志及信息记录
|
||||
try {
|
||||
if (coordinator != null)
|
||||
// 关闭coordinator
|
||||
coordinator.close(time.timer(Math.min(timeoutMs, requestTimeoutMs)));
|
||||
} catch (Throwable t) {
|
||||
firstException.compareAndSet(null, t);
|
||||
log.error("Failed to close coordinator", t);
|
||||
}
|
||||
// 关闭相关组件及异常处理 。。。。。。
|
||||
}
|
||||
|
||||
// ConsumerCoordinator.close方法
|
||||
public void close(final Timer timer) {
|
||||
client.disableWakeups();
|
||||
try {
|
||||
maybeAutoCommitOffsetsSync(timer); // 提交offset
|
||||
while (pendingAsyncCommits.get() > 0 && timer.notExpired()) {
|
||||
ensureCoordinatorReady(timer);
|
||||
client.poll(timer);
|
||||
invokeCompletedOffsetCommitCallbacks();
|
||||
}
|
||||
} finally {
|
||||
// 调用AbstractCoordinator的close方法
|
||||
super.close(timer);
|
||||
}
|
||||
}
|
||||
|
||||
// AbstractCoordinator.close方法
|
||||
protected void close(Timer timer) {
|
||||
try {
|
||||
closeHeartbeatThread(); // 关闭心跳线程
|
||||
} finally {
|
||||
synchronized (this) {
|
||||
if (rebalanceConfig.leaveGroupOnClose) {
|
||||
// 离开消费组准备
|
||||
onLeavePrepare();
|
||||
|
||||
// 离开消费组
|
||||
maybeLeaveGroup("the consumer is being closed");
|
||||
}
|
||||
|
||||
// 等待请求处理完成
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
public synchronized RequestFuture<Void> maybeLeaveGroup(String leaveReason) {
|
||||
RequestFuture<Void> future = null;
|
||||
if (isDynamicMember() && !coordinatorUnknown() &&
|
||||
state != MemberState.UNJOINED && generation.hasMemberId()) {
|
||||
// 构造LeaveGroup请求
|
||||
|
||||
// 发送请求
|
||||
future = client.send(coordinator, request).compose(new LeaveGroupResponseHandler());
|
||||
client.pollNoWakeup();
|
||||
}
|
||||
resetGenerationOnLeaveGroup();
|
||||
return future;
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
##### 2.2.2.3、剩余客户端
|
||||
|
||||
刚刚分析了下线客户端的行为,那么剩余的客户端是如何感知其他客户端下线的呢?
|
||||
|
||||
简单来说就是:
|
||||
1. 某一客户端下线;
|
||||
2. 服务端Group-Coordinator通过LeaveGroup请求或者是Heartbeat请求,发现某一客户端异常时,将消费组设置成Rebalance的状态;
|
||||
3. 剩余客户端通过心跳线程,感知到当前消费组的状态是Rebalance的,则poll的时候,会重新进行JoinGroup、SyncGroup的过程;
|
||||
|
||||
poll的时候,JoinGroup、SyncGroup的处理流程我们在前面都已经进行了详细的分析,这里我们来看一下HeartBeat请求的响应的处理。
|
||||
|
||||
**Heartbeat请求的响应的处理**
|
||||
|
||||
```Java
|
||||
private class HeartbeatResponseHandler extends CoordinatorResponseHandler<HeartbeatResponse, Void> {
|
||||
private final Generation sentGeneration;
|
||||
|
||||
private HeartbeatResponseHandler(final Generation generation) {
|
||||
this.sentGeneration = generation;
|
||||
}
|
||||
|
||||
@Override
|
||||
public void handle(HeartbeatResponse heartbeatResponse, RequestFuture<Void> future) {
|
||||
sensors.heartbeatSensor.record(response.requestLatencyMs());
|
||||
Errors error = heartbeatResponse.error();
|
||||
if (error == Errors.NONE) {
|
||||
// 心跳无错误信息
|
||||
log.debug("Received successful Heartbeat response");
|
||||
future.complete(null);
|
||||
} else if (error == Errors.COORDINATOR_NOT_AVAILABLE || error == Errors.NOT_COORDINATOR) {
|
||||
// 服务端coordinator未知
|
||||
log.info("Attempt to heartbeat failed since coordinator {} is either not started or not valid",
|
||||
coordinator());
|
||||
markCoordinatorUnknown();
|
||||
future.raise(error);
|
||||
} else if (error == Errors.REBALANCE_IN_PROGRESS) {
|
||||
// 正在进行rebalance中,这里会设置状态,即要求rejoin
|
||||
// 心跳就是通过这个异常来感知到其他客户端的上下线
|
||||
log.info("Attempt to heartbeat failed since group is rebalancing");
|
||||
requestRejoin();
|
||||
future.raise(error);
|
||||
} else if (error == Errors.ILLEGAL_GENERATION) {
|
||||
// generation信息异常,如果不是正在rebalance,则会进行重新join
|
||||
log.info("Attempt to heartbeat failed since generation {} is not current", sentGeneration.generationId);
|
||||
resetGenerationOnResponseError(ApiKeys.HEARTBEAT, error);
|
||||
future.raise(error);
|
||||
} else if (error == Errors.FENCED_INSTANCE_ID) {
|
||||
log.error("Received fatal exception: group.instance.id gets fenced");
|
||||
future.raise(error);
|
||||
} else if (error == Errors.UNKNOWN_MEMBER_ID) {
|
||||
log.info("Attempt to heartbeat failed for since member id {} is not valid.", sentGeneration.memberId);
|
||||
resetGenerationOnResponseError(ApiKeys.HEARTBEAT, error);
|
||||
future.raise(error);
|
||||
} else if (error == Errors.GROUP_AUTHORIZATION_FAILED) {
|
||||
future.raise(GroupAuthorizationException.forGroupId(rebalanceConfig.groupId));
|
||||
} else {
|
||||
future.raise(new KafkaException("Unexpected error in heartbeat response: " + error.message()));
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## 3、常见问题解答
|
||||
|
||||
### 3.1、客户端协调器相关的配置主要有哪些?作用是什么?
|
||||
|
||||
消费组rebalance相关的配置主要有如下:
|
||||
|
||||
|配置|含义|默认值|说明|
|
||||
|:----|:------|:------|:-------|
|
||||
|`session.timeout.ms`|session超时时间|10000毫秒|心跳线程正常情况下,和coordinator允许的最大的没有交互时间。session超时时会标记coordinator为未知,从而重新开始FindCoordinator|
|
||||
|`heartbeat.interval.ms`|心跳周期|3000毫秒|心跳周期,但是是非严格的周期。建议值是小于session.timeout.ms的三分之一|
|
||||
|`retry.backoff.ms`|请求回退时间|100毫秒||
|
||||
|`max.poll.interval.ms`|两次poll操作允许的最大间隔|300000毫秒|两次poll操作允许的最大时间,超过这个时间之后,心跳线程会向Coordinator发送LeaveGroup请求|
|
||||
|
||||
这里比较重要的是`max.poll.interval.ms`这个配置,我们在调用KafkaConsumer客户端的poll方法的时候,如果数据处理的时间太久,导致两次poll的时间超过了设置的300秒,那么将触发rebalance。
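下面给出这几个配置在消费客户端上的设置方式作为示意(数值仅为示例,均可按需调整):

```Java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

// 示意代码:与协调器/rebalance相关的几个常用配置(数值仅为示意)
public class RebalanceConfigDemo {
    public static Properties buildProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 10000);      // session超时时间
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 3000);    // 心跳周期,建议小于session超时的三分之一
        props.put(ConsumerConfig.RETRY_BACKOFF_MS_CONFIG, 100);          // 请求回退时间
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000);   // 两次poll操作允许的最大间隔
        return props;
    }
}
```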
|
||||
|
||||
|
||||
### 3.2、多个Topic的消费客户端使用同一个消费组,是否会有影响?
|
||||
|
||||
多个业务的客户端消费不同的Topic,但是使用了同一个消费组的时候,如果出现任意一个客户端上下线,都会引起整体的rebalance。
|
||||
|
||||
因此建议在做消费的时候,消费组尽量不要和其他业务重复。
|
||||
|
||||
|
||||
## 4、总结
|
||||
|
||||
Kafka消费客户端之间的消费协调是通过FindCoordinator、JoinGroup、SyncGroup、LeaveGroup以及Heartbeat几个请求与服务端的Group-Coordinator进行配合,从而实现协调的。
|
||||
|
||||
|
||||
## 5、附录
|
||||
|
||||
### 5.1、消费客户端-正常上线日志
|
||||
|
||||
```Java
|
||||
// 往bootstrap.server的机器发送FindCoordinator请求
|
||||
2021-06-02 17:09:24.407 [main] DEBUG o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Sending FindCoordinator request to broker 10.179.149.194:7093 (id: -3 rack: null)
|
||||
|
||||
// 寻找到消费组对应的Group-Coordinator
|
||||
2021-06-02 17:09:25.086 [main] DEBUG o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Received FindCoordinator response ClientResponse(receivedTimeMs=1622624965085, latencyMs=427, disconnected=false, requestHeader=RequestHeader(apiKey=FIND_COORDINATOR, apiVersion=3, clientId=consumer-cg_logi_kafka_test_1-1, correlationId=0), responseBody=FindCoordinatorResponseData(throttleTimeMs=0, errorCode=0, errorMessage='NONE', nodeId=2, host='10.179.149.201', port=7093))
|
||||
|
||||
// 获取到消费组对应的Group-Coordinator
|
||||
2021-06-02 17:09:25.087 [main] INFO o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Discovered group coordinator 10.179.149.201:7093 (id: 2147483645 rack: null)
|
||||
|
||||
// 开始进行Join的准备
|
||||
2021-06-02 17:09:25.090 [main] DEBUG o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Executing onJoinPrepare with generation -1 and memberId
|
||||
|
||||
// 启动心跳线程
|
||||
2021-06-02 17:09:25.090 [kafka-coordinator-heartbeat-thread | cg_logi_kafka_test_1] DEBUG o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Heartbeat thread started
|
||||
|
||||
// 提交空的offset信息
|
||||
2021-06-02 17:09:25.091 [main] DEBUG o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Sending synchronous auto-commit of offsets {}
|
||||
|
||||
// 还没有启动好,这里将心跳线程设置成了disable,即不进行心跳
|
||||
2021-06-02 17:09:25.091 [main] DEBUG o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Disabling heartbeat thread
|
||||
|
||||
// 开始进行JoinGroup
|
||||
2021-06-02 17:09:25.091 [main] INFO o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] (Re-)joining group
|
||||
|
||||
// Join the consumer group with the current subscription
|
||||
2021-06-02 17:09:25.092 [main] DEBUG o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Joining group with current subscription: [kmo_comminity]
|
||||
|
||||
// Send the JoinGroup request
|
||||
2021-06-02 17:09:25.093 [main] DEBUG o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Sending JoinGroup (JoinGroupRequestData(groupId='cg_logi_kafka_test_1', sessionTimeoutMs=10000, rebalanceTimeoutMs=300000, memberId='', groupInstanceId=null, protocolType='consumer', protocols=[JoinGroupRequestProtocol(name='range', metadata=[0, 1, 0, 0, 0, 1, 0, 13, 107, 109, 111, 95, 99, 111, 109, 109, 105, 110, 105, 116, 121, -1, -1, -1, -1, 0, 0, 0, 0])])) to coordinator 10.179.149.201:7093 (id: 2147483645 rack: null)
|
||||
|
||||
// The JoinGroup request fails because there is no member id yet; although it fails, the Group Coordinator returns a memberId (generated by the Group Coordinator) for this consumer client
|
||||
2021-06-02 17:09:25.446 [main] INFO o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Join group failed with org.apache.kafka.common.errors.MemberIdRequiredException: The group member needs to have a valid member id before actually entering a consumer group
|
||||
|
||||
// Heartbeat thread remains disabled
|
||||
2021-06-02 17:09:25.447 [main] DEBUG o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Disabling heartbeat thread
|
||||
|
||||
// Retry JoinGroup
|
||||
2021-06-02 17:09:25.448 [main] INFO o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] (Re-)joining group
|
||||
|
||||
// Join the consumer group with the current subscription
|
||||
2021-06-02 17:09:25.448 [main] DEBUG o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Joining group with current subscription: [kmo_comminity]
|
||||
|
||||
// Send the JoinGroup request again, this time carrying the memberId
|
||||
2021-06-02 17:09:25.449 [main] DEBUG o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Sending JoinGroup (JoinGroupRequestData(groupId='cg_logi_kafka_test_1', sessionTimeoutMs=10000, rebalanceTimeoutMs=300000, memberId='consumer-cg_logi_kafka_test_1-1-e444cf99-2a47-494b-8d6a-aa061c12db1b', groupInstanceId=null, protocolType='consumer', protocols=[JoinGroupRequestProtocol(name='range', metadata=[0, 1, 0, 0, 0, 1, 0, 13, 107, 109, 111, 95, 99, 111, 109, 109, 105, 110, 105, 116, 121, -1, -1, -1, -1, 0, 0, 0, 0])])) to coordinator 10.179.149.201:7093 (id: 2147483645 rack: null)
|
||||
|
||||
// JoinGroup succeeds; the response carries the current group's information, including the member list
|
||||
2021-06-02 17:09:28.501 [main] DEBUG o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Received successful JoinGroup response: JoinGroupResponseData(throttleTimeMs=0, errorCode=0, generationId=1, protocolType='consumer', protocolName='range', leader='consumer-cg_logi_kafka_test_1-1-e444cf99-2a47-494b-8d6a-aa061c12db1b', memberId='consumer-cg_logi_kafka_test_1-1-e444cf99-2a47-494b-8d6a-aa061c12db1b', members=[JoinGroupResponseMember(memberId='consumer-cg_logi_kafka_test_1-1-e444cf99-2a47-494b-8d6a-aa061c12db1b', groupInstanceId=null, metadata=[0, 1, 0, 0, 0, 1, 0, 13, 107, 109, 111, 95, 99, 111, 109, 109, 105, 110, 105, 116, 121, -1, -1, -1, -1, 0, 0, 0, 0])])
|
||||
|
||||
|
||||
2021-06-02 17:09:28.502 [main] DEBUG o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Performing assignment using strategy range with subscriptions {consumer-cg_logi_kafka_test_1-1-e444cf99-2a47-494b-8d6a-aa061c12db1b=org.apache.kafka.clients.consumer.ConsumerPartitionAssignor$Subscription@56ace400}
|
||||
|
||||
|
||||
2021-06-02 17:09:28.503 [main] INFO o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Finished assignment for group at generation 1: {consumer-cg_logi_kafka_test_1-1-e444cf99-2a47-494b-8d6a-aa061c12db1b=Assignment(partitions=[kmo_comminity-0, kmo_comminity-1, kmo_comminity-2])}
|
||||
|
||||
// Send the SyncGroup request; since this client is the group leader, it also carries the partition assignment result
|
||||
2021-06-02 17:09:28.504 [main] DEBUG o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Sending leader SyncGroup to coordinator 10.179.149.201:7093 (id: 2147483645 rack: null) at generation Generation{generationId=1, memberId='consumer-cg_logi_kafka_test_1-1-e444cf99-2a47-494b-8d6a-aa061c12db1b', protocol='range'}: SyncGroupRequestData(groupId='cg_logi_kafka_test_1', generationId=1, memberId='consumer-cg_logi_kafka_test_1-1-e444cf99-2a47-494b-8d6a-aa061c12db1b', groupInstanceId=null, protocolType='consumer', protocolName='range', assignments=[SyncGroupRequestAssignment(memberId='consumer-cg_logi_kafka_test_1-1-e444cf99-2a47-494b-8d6a-aa061c12db1b', assignment=[0, 1, 0, 0, 0, 1, 0, 13, 107, 109, 111, 95, 99, 111, 109, 109, 105, 110, 105, 116, 121, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, -1, -1, -1, -1])])
|
||||
|
||||
// SyncGroup response received
|
||||
2021-06-02 17:09:28.555 [main] DEBUG o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Received successful SyncGroup response: org.apache.kafka.common.requests.SyncGroupResponse@4b3c354a
|
||||
|
||||
// Successfully joined the consumer group
|
||||
2021-06-02 17:09:28.555 [main] INFO o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Successfully joined group with generation 1
|
||||
|
||||
// Enable the heartbeat thread
|
||||
2021-06-02 17:09:28.555 [main] DEBUG o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Enabling heartbeat thread
|
||||
|
||||
|
||||
2021-06-02 17:09:28.556 [main] DEBUG o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Executing onJoinComplete with generation 1 and memberId consumer-cg_logi_kafka_test_1-1-e444cf99-2a47-494b-8d6a-aa061c12db1b
|
||||
|
||||
|
||||
2021-06-02 17:09:28.559 [main] INFO o.a.k.c.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Adding newly assigned partitions: kmo_comminity-0, kmo_comminity-1, kmo_comminity-2
|
||||
|
||||
|
||||
// ...... offset-commit related logs
|
||||
|
||||
// Heartbeat thread sends a heartbeat to the Group Coordinator
|
||||
2021-06-02 17:09:31.555 [kafka-coordinator-heartbeat-thread | cg_logi_kafka_test_1] DEBUG o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Sending Heartbeat request with generation 1 and member id consumer-cg_logi_kafka_test_1-1-e444cf99-2a47-494b-8d6a-aa061c12db1b to coordinator 10.179.149.201:7093 (id: 2147483645 rack: null)
|
||||
|
||||
// Heartbeat response received
|
||||
2021-06-02 17:09:31.671 [main] DEBUG o.a.k.c.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-cg_logi_kafka_test_1-1, groupId=cg_logi_kafka_test_1] Received successful Heartbeat response
|
||||
```
|
||||
BIN
docs/zh/Kafka分享/Kafka消费客户端_协调器分析/assets/consumer_close.jpg
Normal file
|
After Width: | Height: | Size: 212 KiB |
|
After Width: | Height: | Size: 604 KiB |
|
After Width: | Height: | Size: 409 KiB |
BIN
docs/zh/Kafka分享/Kafka消费客户端_协调器分析/assets/consumer_group_state.jpg
Normal file
|
After Width: | Height: | Size: 199 KiB |
BIN
docs/zh/Kafka分享/Kafka消费客户端_协调器分析/assets/consumer_join_group.jpg
Normal file
|
After Width: | Height: | Size: 544 KiB |
BIN
docs/zh/Kafka分享/Kafka消费客户端_协调器分析/assets/consumer_sync_group.jpg
Normal file
|
After Width: | Height: | Size: 500 KiB |
BIN
docs/zh/Kafka分享/Kafka消费客户端_协调器分析/assets/contents.jpg
Normal file
|
After Width: | Height: | Size: 222 KiB |
|
After Width: | Height: | Size: 191 KiB |
|
After Width: | Height: | Size: 145 KiB |
|
After Width: | Height: | Size: 266 KiB |
|
After Width: | Height: | Size: 204 KiB |
51
docs/zh/Kafka分享/Kafka消费客户端_数据拉取/Kafka消费客户端_数据拉取.md
Normal file
@@ -0,0 +1,51 @@
|
||||
# Kafka Consumer Client — Data Fetching
|
||||
|
||||
[TOC]
|
||||
|
||||
## 1. Introduction

This section walks through how the Kafka consumer client fetches data.

Note: only the data-fetching process is covered here; the rebalance-related parts of the fetch path are not discussed.

## 2. Data fetching

Data fetching roughly breaks down into the following steps (see the sketch after this list):
1. Read records from the previously fetched buffer;
2. Send a Fetch request;
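The sketch below is only a simplified, conceptual illustration of this two-step flow, not the actual client internals: each `poll` first drains records that already sit in the local buffer of completed fetches, then issues new fetch requests so the buffer is refilled for the next call. All class and method names here are invented for the illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Conceptual model only: ConceptualFetcher and sendFetchRequests are illustrative names,
// not the real Kafka client classes.
public class ConceptualFetcher {

    // Responses that have already arrived and are waiting to be handed to the application
    private final Deque<List<String>> completedFetches = new ArrayDeque<>();

    /** Step 1: drain buffered records; step 2: request more data for the next round. */
    public List<String> poll() {
        List<String> records = new ArrayList<>();

        // Step 1: read from the pre-fetched buffer first
        while (!completedFetches.isEmpty()) {
            records.addAll(completedFetches.poll());
        }

        // Step 2: send Fetch requests so the buffer is refilled for the next poll
        sendFetchRequests();
        return records;
    }

    private void sendFetchRequests() {
        // In the real client this builds per-broker fetch requests and the responses
        // are added back into the buffer asynchronously.
    }
}
```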
|
||||
|
||||
|
||||
|
||||

|
||||
|
||||
|
||||
### 2.1 Reading cached data

Reading data from the cache roughly involves the following steps:
1. Read from the cache;

#### 2.1.1 Cache structure
|
||||
|
||||

|
||||
|
||||
|
||||
#### 2.1.2 Reading the data
|
||||
|
||||
|
||||
|
||||
### 2.2 Sending the Fetch request
|
||||
|
||||

|
||||
|
||||
|
||||
### 2.3、
|
||||
|
||||
|
||||
## 3. Offset management
|
||||
|
||||
|
||||
### 3.1 Initial value
|
||||
|
||||
|
||||
### 3.2 Periodic commit
|
||||
|
||||
BIN
docs/zh/Kafka分享/Kafka消费客户端_数据拉取/assets/consumer_data_cache.jpg
Normal file
|
After Width: | Height: | Size: 101 KiB |
|
After Width: | Height: | Size: 379 KiB |
|
After Width: | Height: | Size: 452 KiB |
129
docs/zh/Kafka分享/Kafka消费客户端_整体概述/Kafka消费客户端_整体概述.md
Normal file
@@ -0,0 +1,129 @@
|
||||
# Kafka Consumer Client — Overview
|
||||
|
||||
[TOC]
|
||||
|
||||
## 1. Introduction

This section gives a high-level overview of the Kafka consumer client, in the following order:
1. A simple consumer client example;
2. The consumer client consumption model;
3. The consumer client class diagram;
4. The consumer client threading model and thread processing flow;
5. Summary.

## 2. Consumer client — example
|
||||
|
||||
```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class SimpleConsumer {
    private static String topicName = "kafka_topic";
    private static String group = "kafka_consumer_group";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.0.1:9092,192.168.0.2:9092,192.168.0.3:9092"); // Kafka broker addresses
        props.put("group.id", group);
        props.put("auto.offset.reset", "earliest"); // earliest/latest: where to start consuming; earliest reads historical data, latest only new data
        props.put("enable.auto.commit", "true");    // automatic offset commit
        props.put("auto.commit.interval.ms", "1000"); // auto-commit interval

        // Choose deserializers according to your actual use case
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList(topicName)); // multiple topics can be consumed by passing them in one list

        while (true) {
            try {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("offset = " + record.offset() + ", key = " + record.key() + ", value = " + record.value());
                }
            } catch (Throwable e) {
                // TODO handle/log the exception
            }
        }
    }
}
```
|
||||
|
||||
## 3. Consumer client — consumption model
|
||||
|
||||

|
||||
|
||||
- Each consumer group consumes one complete copy of the topic's data.
- Multiple clients using the same consumer group share the topic's data, with each client consuming a subset of the topic's partitions.
|
||||
|
||||
|
||||
|
||||
## 4. Consumer client — class diagram

Having seen the consumer client example, let's now look at the consumer client's class diagram.
|
||||
|
||||

|
||||
|
||||
The consumer client mainly consists of three parts:
- KafkaConsumer: the consumer client itself, the public-facing Kafka consumer class;
- Fetcher: performs the data fetching;
- ConsumerCoordinator: the consumer coordinator, which manages consumption state and coordinates multiple consumers;

In short, the user consumes through KafkaConsumer; KafkaConsumer coordinates the partitions consumed by the group's clients through ConsumerCoordinator, and pulls data through the Fetcher.

A few other important classes in the diagram are worth a brief mention:
- ConsumerPartitionAssignor: the interface for partition-assignment strategies, with several concrete strategy implementations underneath.
- ConsumerMetadata: the consumer's metadata, extending the Metadata class and storing the cluster metadata.
- ConsumerNetworkClient: wraps a NetworkClient and mainly performs request I/O.
- AbstractCoordinator: the abstract coordinator class, containing the heartbeat handling and the handlers for the various requests used during a rebalance.
|
||||
|
||||
|
||||
## 5. Consumer client — threading model

Having covered the consumer client's class diagram, this section looks at the Kafka consumer client from the perspective of its threading model.

Before going into the details, let's first introduce the two ways the Kafka consumer client can consume (a small sketch follows after this list):
- assign mode: consume explicitly specified partitions; it has no awareness of partition expansion, so if you need that you have to detect it yourself.
- subscribe mode: rely on the Group Coordinator to manage the partition assignment across multiple consumer clients; changes in the set of clients automatically trigger partition reassignment.
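Purely as an illustration (the topic name and partition numbers are placeholders), a minimal sketch contrasting the two modes:

```java
import java.util.Arrays;
import java.util.Collections;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class AssignVsSubscribe {

    // assign mode: the application pins specific partitions itself; no group coordination,
    // and newly added partitions are not picked up automatically.
    static void assignMode(KafkaConsumer<String, String> consumer) {
        consumer.assign(Arrays.asList(
                new TopicPartition("my-topic", 0),
                new TopicPartition("my-topic", 1)));
    }

    // subscribe mode: the Group Coordinator distributes partitions across the group members,
    // and any membership change automatically triggers a rebalance.
    static void subscribeMode(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("my-topic"));
    }
}
```

Note that a given consumer instance must stick to one of the two modes; mixing `assign` and `subscribe` on the same instance raises an `IllegalStateException`.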
|
||||
|
||||
Compared with subscribe mode, assign mode is fairly simple, so this article mainly covers subscribe mode. When consuming in subscribe mode, the Kafka consumer client mainly has two threads: the main thread and the heartbeat thread.
- Main thread: everything other than what the heartbeat thread does, including sending and receiving the consumption-related requests and handling the rebalance state.
- Heartbeat thread: the thread that keeps the heartbeat with the Group Coordinator.

Now let's get into the details.
|
||||
|
||||
|
||||
### 5.1 Consumer client — main thread

For the main thread we cover two parts: first the initialization, then the data-fetching loop. Let's start with the initialization process.

#### 5.1.1 Initialization flow

Note: the red boxes in the class diagram mark the initialization of some representative components during startup; the numbers in front of the red boxes indicate the initialization order.
|
||||
|
||||

|
||||
|
||||
#### 5.1.2 Data fetching

Note: the parts in red boxes will be explained in detail later.
|
||||
|
||||

|
||||
|
||||
|
||||
### 5.2 Consumer client — heartbeat thread

Besides the main thread that pulls data, the consumer client also runs a heartbeat thread that maintains the heartbeat with the Group Coordinator.
|
||||
|
||||
<img src="./assets/consumer_heartbeat_summary.jpg" width="491px" height="609px">
|
||||
|
||||
## 6. Summary

This article gave an overall picture of the Kafka consumer client, covering main-thread initialization, main-thread data fetching and the heartbeat thread's processing flow. Next we will analyze in detail the concrete data-fetching process during consumption and the client coordinator's processing flow.

Thank you.
|
||||
BIN
docs/zh/Kafka分享/Kafka消费客户端_整体概述/assets/consumer_class_uml.jpg
Normal file
|
After Width: | Height: | Size: 532 KiB |
BIN
docs/zh/Kafka分享/Kafka消费客户端_整体概述/assets/consumer_fetch_module.jpg
Normal file
|
After Width: | Height: | Size: 229 KiB |
|
After Width: | Height: | Size: 245 KiB |
BIN
docs/zh/Kafka分享/Kafka消费客户端_整体概述/assets/consumer_init_flow.jpg
Normal file
|
After Width: | Height: | Size: 556 KiB |
BIN
docs/zh/Kafka分享/Kafka消费客户端_整体概述/assets/consumer_poll_summary.jpg
Normal file
|
After Width: | Height: | Size: 263 KiB |
131
docs/zh/Kafka分享/Kafka生产者客户端_元信息管理/Kafka生产者客户端(简化).md
Normal file
@@ -0,0 +1,131 @@
|
||||
# KafkaProducer
|
||||
<!-- TOC -->
|
||||
|
||||
- [KafkaProducer](#kafkaproducer)
|
||||
- [1. Introduction](#1-introduction)
|
||||
|
||||
<!-- /TOC -->
|
||||
|
||||
|
||||
## 1. Introduction

This article gives a brief introduction to the producer's design architecture and message-sending flow, based on the Kafka 2.5 community version.
Later we will publish a dedicated series of articles & videos on `Kafka source-code analysis combined with real-world practice`, so stay tuned...

Before we start, let's look at a simple producer client code example:
|
||||
```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerTest {
    private static String topicName;
    private static int msgNum;

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "xxx.xxx.xxx.xxx:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("compression.type", "lz4");
        props.put("linger.ms", 500);
        props.put("batch.size", 100000);
        props.put("max.in.flight.requests.per.connection", 1);
        topicName = "test";
        msgNum = 100; // number of messages to send (placeholder value; the original left it unspecified)

        Producer<String, String> producer = new KafkaProducer<>(props);
        for (int i = 0; i < msgNum; i++) {
            String msg = i + " This is producer test.";
            producer.send(new ProducerRecord<String, String>(topicName, msg));
        }
        producer.close();
    }
}
```
|
||||
In the example above, we instantiate a key-less `ProducerRecord` message object, then instantiate a `KafkaProducer` object and send the message by calling its `send` method.

Let's look at the architecture of message sending in the KafkaProducer client:
|
||||
|
||||

|
||||
|
||||
Let's first look at the threading model of the `KafkaProducer` design.

The producer client mainly consists of two threads: the `main thread` and the `Sender thread`.

The `main thread` does not send messages itself; it only appends them to the record accumulator. Its core logic lives in the `KafkaProducer.send` method.

Main logic:

1. The message first passes through the producer interceptor chain, i.e. the ordered list of producer interceptor implementations specified by the client parameter `{interceptor.classes}`. Interceptors are used to do some preparation before a message is sent, such as filtering out messages that do not match some rule or modifying message content, and they can also run customized logic before the send callback, for example collecting statistics. To write your own, implement the `org.apache.kafka.clients.producer.ProducerInterceptor` interface (see the sketch below).
This corresponds to step ① in the flow diagram above.
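As an illustration only (the class name and the prefix rule are invented for this sketch), a minimal interceptor that tags every value with a prefix and counts acknowledgements might look like this:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// Hypothetical example interceptor; registered via the interceptor.classes parameter.
public class PrefixCountingInterceptor implements ProducerInterceptor<String, String> {
    private final AtomicLong successCount = new AtomicLong();

    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        // Runs on the main thread before serialization/partitioning: modify or filter the record here.
        return new ProducerRecord<>(record.topic(), record.partition(), record.timestamp(),
                record.key(), "prefix-" + record.value(), record.headers());
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        // Runs before the user callback; useful for statistics.
        if (exception == null) {
            successCount.incrementAndGet();
        }
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}
```

It would then be enabled with `props.put("interceptor.classes", PrefixCountingInterceptor.class.getName());`.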
|
||||
|
||||
2. Look up the target topic's metadata from the metadata cache; if it is not there, the `Sender thread` is immediately woken up to fetch and update the metadata.

3. Serialize the message `key` and `value`. The serializers for `key` and `value` are specified by the producer parameters `{key.serializer}` and `{value.serializer}`; to provide your own, implement the `org.apache.kafka.common.serialization.Serializer` interface.
This corresponds to step ② in the flow diagram above.

4. Compute the partition the message will be sent to. If the producer specified a partition when sending, that partition number is used directly; otherwise the configured `partitioner` does the work (if none is configured, Kafka's default `DefaultPartitioner` is used). The partitioner is specified by the producer parameter `{partitioner.class}`; to provide your own, implement the `org.apache.kafka.clients.producer.Partitioner` interface (see the sketch below).
This corresponds to step ③ in the flow diagram above.
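Again purely as an illustration (the class name and the routing rule are invented here), a custom partitioner that sends key-less records to partition 0 and otherwise hashes the key could look like this:

```java
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Hypothetical partitioner; configured via the partitioner.class parameter.
public class KeyHashPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // arbitrary rule for this sketch: key-less records all go to partition 0
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}
```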
|
||||
|
||||
5. Append the message to the record accumulator.
This corresponds to step ④ in the flow diagram above.

6. Conditionally wake up the `Sender thread`.

Sequence diagram of the `send(ProducerRecord<K, V> record)` method:
|
||||
|
||||

|
||||
|
||||
The `Sender` thread:
|
||||
|
||||
```java
/**
 * The background thread that handles the sending of produce requests to the Kafka cluster. This thread makes metadata
 * requests to renew its view of the cluster and then sends produce requests to the appropriate nodes.
 */
public class Sender implements Runnable {
    // ...
}
```
|
||||
|
||||
From the official javadoc, it is clear that the `Sender thread` mainly does two things:

1. Issue cluster metadata update requests to refresh the cluster metadata.

2. Process the producer's send requests, picking the appropriate broker node to send them to.

`Sender` is a thread class implementing `Runnable`, so when reading the source code you can go straight to the execution logic of its `run` method.

How the `Sender thread` processes messages:
|
||||
|
||||
1. Drain messages from the record accumulator.
This corresponds to step ⑤ in the flow diagram above.

2. After a series of processing steps, wrap them into `ClientRequest` objects.
This corresponds to step ⑥ in the flow diagram above.

3. Hand them to `NetworkClient.send`, which caches the send request and calls `Selector.send` to add it to the queue of pending requests, waiting to be processed by the poll loop.
This corresponds to step ⑦ in the flow diagram above.

4. Call `NetworkClient.poll`, which ultimately calls `Selector.poll` to perform the actual network I/O.
This corresponds to step ⑧ in the flow diagram above.
|
||||
|
||||
`The record accumulator ---- RecordAccumulator`:

What the record accumulator is for:

It buffers messages so that the `Sender` thread can send them in batches, reducing the resource cost of network transfers and thereby improving performance.

The record accumulator maintains a mapping from `TopicPartition` to a `double-ended queue of ProducerBatch objects holding that partition's messages`. Messages appended to the accumulator by the `main thread` are grouped by topic partition and added to the `ProducerBatch` at the tail of the corresponding deque (a conceptual sketch follows below).
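The sketch below models only that core mapping; it is not the real `RecordAccumulator`, which additionally handles batch sizing, linger time, a `BufferPool` and thread-safety concerns.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.common.TopicPartition;

// Conceptual model of the accumulator's core structure, not the real implementation.
public class ConceptualAccumulator {

    // One deque of batches per topic partition; new records go into the batch at the tail.
    private final Map<TopicPartition, Deque<List<byte[]>>> batches = new ConcurrentHashMap<>();

    public void append(TopicPartition tp, byte[] record) {
        Deque<List<byte[]>> deque = batches.computeIfAbsent(tp, k -> new ArrayDeque<>());
        synchronized (deque) {
            List<byte[]> lastBatch = deque.peekLast();
            if (lastBatch == null || lastBatch.size() >= 16) { // placeholder "batch full" condition
                lastBatch = new ArrayList<>();
                deque.addLast(lastBatch);
            }
            lastBatch.add(record);
        }
    }
}
```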
|
||||
|
||||
The record accumulator uses Kafka's own buffer pool, `BufferPool`, to reuse `java.nio.ByteBuffer` instances when allocating and releasing memory for message batches.

The size of the buffer pool determines how much message data the accumulator can buffer. It is set by the producer client parameter `{buffer.memory}`. If the producer needs to send to many partitions, increasing this parameter appropriately can raise overall throughput.
|
||||
|
||||
`Network model - KafkaSelector`
Kafka's `Selector` is an enhancement of the native Java selector, tailored to Kafka's needs, and it performs the real network I/O. At its core it uses Kafka's own abstractions — `NetworkSend`, `NetworkReceive` and `KafkaChannel` — to complete reads and writes, which ultimately come down to `nio` `Channel` and `ByteBuffer` interactions at the lower level.

After the Kafka broker has processed a produce request, the response comes back over the network to the `KafkaSelector`; this corresponds to step ⑨ in the flow diagram above.

The `KafkaSelector` then hands the response to the `NetworkClient` for processing; this corresponds to step ⑩ in the flow diagram above.

After processing the response, the `NetworkClient` removes successfully sent batches from the record accumulator and, for those that failed, decides whether they can be retried; this corresponds to step ⑪ in the flow diagram above.
|
||||
1239
docs/zh/Kafka分享/Kafka生产者客户端_元信息管理/Kafka生产者客户端.md
Normal file
BIN
docs/zh/Kafka分享/Kafka生产者客户端_元信息管理/assets/append_diagram.png
Normal file
|
After Width: | Height: | Size: 175 KiB |
BIN
docs/zh/Kafka分享/Kafka生产者客户端_元信息管理/assets/arraydeque.png
Normal file
|
After Width: | Height: | Size: 938 KiB |
BIN
docs/zh/Kafka分享/Kafka生产者客户端_元信息管理/assets/callback_1.png
Normal file
|
After Width: | Height: | Size: 889 KiB |
BIN
docs/zh/Kafka分享/Kafka生产者客户端_元信息管理/assets/callback_2.png
Normal file
|
After Width: | Height: | Size: 583 KiB |
BIN
docs/zh/Kafka分享/Kafka生产者客户端_元信息管理/assets/compression_type_1.png
Normal file
|
After Width: | Height: | Size: 1.4 MiB |
BIN
docs/zh/Kafka分享/Kafka生产者客户端_元信息管理/assets/compression_type_2.png
Normal file
|
After Width: | Height: | Size: 1.9 MiB |