Chapter 12. Finding information on Kafka restarts
After the Cluster Operator restarts a Kafka pod in an OpenShift cluster, it emits an OpenShift event into the pod’s namespace explaining why the pod restarted. For help in understanding cluster behavior, you can check restart events from the command line.
You can export and monitor restart events using metrics collection tools like Prometheus. Use the metrics tool with an event exporter that can export the output in a suitable format.
12.1. Reasons for a restart event
The Cluster Operator initiates a restart event for a specific reason. You can check the reason by fetching information on the restart event.
The reason given depends on whether you are using StrimziPodSet or StatefulSet resources for the creation and management of pods.
StrimziPodSet | StatefulSet | Description |
---|---|---|
CaCertHasOldGeneration | CaCertHasOldGeneration | The pod is still using a server certificate signed with an old CA, so needs to be restarted as part of the certificate update. |
CaCertRemoved | CaCertRemoved | Expired CA certificates have been removed, and the pod is restarted to run with the current certificates. |
CaCertRenewed | CaCertRenewed | CA certificates have been renewed, and the pod is restarted to run with the updated certificates. |
ClientCaCertKeyReplaced | ClientCaCertKeyReplaced | The key used to sign the clients CA certificates has been replaced, and the pod is being restarted as part of the CA renewal process. |
ClusterCaCertKeyReplaced | ClusterCaCertKeyReplaced | The key used to sign the cluster’s CA certificates has been replaced, and the pod is being restarted as part of the CA renewal process. |
ConfigChangeRequiresRestart | ConfigChangeRequiresRestart | Some Kafka configuration properties are changed dynamically, but others require that the broker be restarted. |
CustomListenerCaCertChanged | CustomListenerCaCertChanged | The CA certificate used to secure the Kafka network listeners has changed, and the pod is restarted to use it. |
FileSystemResizeNeeded | FileSystemResizeNeeded | The file system size has been increased, and a restart is needed to apply it. |
KafkaCertificatesChanged | KafkaCertificatesChanged | One or more TLS certificates used by the Kafka broker have been updated, and a restart is needed to use them. |
ManualRollingUpdate | ManualRollingUpdate | A user annotated the pod, or the |
PodForceRestartOnError | PodForceRestartOnError | An error occurred that requires a pod restart to rectify. |
PodHasOldRevision | JbodVolumesChanged | A disk was added or removed from the Kafka volumes, and a restart is needed to apply the change. When using |
PodHasOldRevision | PodHasOldGeneration | The |
PodStuck | PodStuck | The pod is still pending, and is not scheduled or cannot be scheduled, so the operator has restarted the pod in a final attempt to get it running. |
PodUnresponsive | PodUnresponsive | AMQ Streams was unable to connect to the pod, which can indicate a broker not starting correctly, so the operator restarted it in an attempt to resolve the issue. |
12.2. Restart event filters
When checking restart events from the command line, you can specify a field-selector to filter on OpenShift event fields.
The following fields are available when filtering events with field-selector.
regardingObject.kind
- The object that was restarted. For restart events, the kind is always Pod.
regarding.namespace
- The namespace that the pod belongs to.
regardingObject.name
- The pod’s name, for example, strimzi-cluster-kafka-0.
regardingObject.uid
- The unique ID of the pod.
reason
- The reason the pod was restarted, for example, JbodVolumesChanged.
reportingController
- The reporting component, which is always strimzi.io/cluster-operator for AMQ Streams restart events.
source
- An older version of reportingController. The reporting component is always strimzi.io/cluster-operator for AMQ Streams restart events.
type
- The event type, which is either Warning or Normal. For AMQ Streams restart events, the type is Normal.
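Several fields can be combined in one field-selector by separating the field=value pairs with commas. The following is a minimal sketch of assembling such a string in a shell script. The helper name build_field_selector is hypothetical and is not part of oc, and the script only prints the resulting command instead of running it.

```shell
#!/bin/sh
# build_field_selector is a hypothetical helper (not part of oc).
# It joins its arguments, each a field=value pair, with commas.
build_field_selector() {
  selector=""
  for pair in "$@"; do
    if [ -z "$selector" ]; then
      selector="$pair"
    else
      selector="$selector,$pair"
    fi
  done
  printf '%s\n' "$selector"
}

# Combine three of the fields listed above.
SELECTOR=$(build_field_selector \
  "reportingController=strimzi.io/cluster-operator" \
  "reason=PodForceRestartOnError" \
  "type=Normal")

# Print the command rather than running it, so the sketch works without a cluster.
echo "oc -n kafka get events --field-selector $SELECTOR"
```

The kafka namespace is taken from the examples in this chapter. Replace it with the namespace of your own Kafka cluster.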
In older versions of OpenShift, the fields using the regarding prefix might use an involvedObject prefix instead. reportingController was previously called reportingComponent.
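If a script has to work against both newer and older clusters, the renaming described above can be applied to a selector string mechanically. The following sketch assumes that the only differences are the prefix and the reportingController name, as stated above. The helper name to_legacy_selector is hypothetical, and whether a given cluster accepts the rewritten selector is not verified here.

```shell
#!/bin/sh
# to_legacy_selector is a hypothetical helper (not part of oc).
# It rewrites the newer field names to the older ones described above:
#   regardingObject.* and regarding.*  ->  involvedObject.*
#   reportingController                ->  reportingComponent
to_legacy_selector() {
  printf '%s\n' "$1" | sed \
    -e 's/regardingObject\./involvedObject./g' \
    -e 's/regarding\./involvedObject./g' \
    -e 's/reportingController=/reportingComponent=/g'
}

to_legacy_selector "regardingObject.name=strimzi-cluster-kafka-0,reportingController=strimzi.io/cluster-operator"
```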
12.3. Checking Kafka restarts
Use an oc command to list restart events initiated by the Cluster Operator. To return only events emitted by the Cluster Operator, set it as the reporting component using the reportingController or source event field.
Prerequisites
- The Cluster Operator is running in the OpenShift cluster.
Procedure
Get all restart events emitted by the Cluster Operator:
oc -n kafka get events --field-selector reportingController=strimzi.io/cluster-operator
Example showing events returned
LAST SEEN   TYPE     REASON                   OBJECT                        MESSAGE
2m          Normal   CaCertRenewed            pod/strimzi-cluster-kafka-0   CA certificate renewed
58m         Normal   PodForceRestartOnError   pod/strimzi-cluster-kafka-1   Pod needs to be forcibly restarted due to an error
5m47s       Normal   ManualRollingUpdate      pod/strimzi-cluster-kafka-2   Pod was manually annotated to be rolled
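When many events are returned, a count per reason can be easier to read than the raw list. The following sketch tallies the REASON column of the tabular output shown above. It assumes the default column layout (REASON as the third whitespace-separated field in each data row) and runs against a copy of the sample output instead of a live cluster. The function names are hypothetical.

```shell
#!/bin/sh
# tally_restart_reasons is a hypothetical helper: it reads the tabular
# output of "oc get events" on stdin and prints "<count> <reason>" lines.
tally_restart_reasons() {
  # Skip the header row; in data rows the reason is the third field.
  awk 'NR > 1 { count[$3]++ } END { for (r in count) print count[r], r }' |
    sort -k1,1nr -k2,2
}

# Copy of the example output, used here in place of a live cluster.
sample_output() {
cat <<'EOF'
LAST SEEN   TYPE     REASON                   OBJECT                        MESSAGE
2m          Normal   CaCertRenewed            pod/strimzi-cluster-kafka-0   CA certificate renewed
58m         Normal   PodForceRestartOnError   pod/strimzi-cluster-kafka-1   Pod needs to be forcibly restarted due to an error
5m47s       Normal   ManualRollingUpdate      pod/strimzi-cluster-kafka-2   Pod was manually annotated to be rolled
EOF
}

sample_output | tally_restart_reasons
```

Against a cluster, the output of the oc command in the previous step could be piped into tally_restart_reasons in place of sample_output.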
You can also specify a reason or other field-selector options to constrain the events returned.
Here, a specific reason is added:
oc -n kafka get events --field-selector reportingController=strimzi.io/cluster-operator,reason=PodForceRestartOnError
Use an output format, such as YAML, to return more detailed information about one or more events.
oc -n kafka get events --field-selector reportingController=strimzi.io/cluster-operator,reason=PodForceRestartOnError -o yaml
Example showing detailed events output
apiVersion: v1
items:
- action: StrimziInitiatedPodRestart
  apiVersion: v1
  eventTime: "2022-05-13T00:22:34.168086Z"
  firstTimestamp: null
  involvedObject:
    kind: Pod
    name: strimzi-cluster-kafka-1
    namespace: kafka
  kind: Event
  lastTimestamp: null
  message: Pod needs to be forcibly restarted due to an error
  metadata:
    creationTimestamp: "2022-05-13T00:22:34Z"
    generateName: strimzi-event
    name: strimzi-eventwppk6
    namespace: kafka
    resourceVersion: "432961"
    uid: 29fcdb9e-f2cf-4c95-a165-a5efcd48edfc
  reason: PodForceRestartOnError
  reportingController: strimzi.io/cluster-operator
  reportingInstance: strimzi-cluster-operator-6458cfb4c6-6bpdp
  source: {}
  type: Normal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
The following fields are deprecated, so they are not populated for these events:
- firstTimestamp
- lastTimestamp
- source