Chapter 1. Troubleshooting Red Hat Edge Manager
When working with devices in Red Hat Edge Manager, troubleshooting begins with interpreting the structured status messages provided by the device. By identifying the specific phase and component where a failure occurred, you can quickly determine whether an issue is caused by local resource constraints, network connectivity, or configuration errors.
1.1. Troubleshooting device error codes
To improve security and performance, Red Hat Edge Manager uses structured error codes in device status responses. These codes replace verbose system logs with categorized, actionable summaries, ensuring sensitive data (like credentials) is never exposed in the API or UI.
1.1.1. Error message anatomy
Every error message follows a standardized format, limited to 250 characters, that helps you quickly pinpoint the phase, component, and specific cause of a failure.
The error message format is as follows:
[timestamp] While <Phase>, <Component> failed [for "<Element>"]: <Category> issue - <STATUS_CODE>
| Field | Description | Examples |
|---|---|---|
| Phase | The stage of the operation where the error occurred. | |
| Component | The specific system area affected. | |
| Element | The specific resource (file, service, or image). | |
| Category | The functional area of the failure. | |
| Status Code | The standardized gRPC-based error code. | |
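The message layout above can be split into its fields with a regular expression. The following is a minimal sketch, not the product's own parser: the function name `parse_status_message` is hypothetical, and the sample values used with it (timestamp format, phase, component, element) are invented for illustration. Messages that were truncated at the 250-character limit will not match and yield `None`.

```python
import re
from typing import Optional

# Pattern mirroring the documented layout:
# [timestamp] While <Phase>, <Component> failed [for "<Element>"]: <Category> issue - <STATUS_CODE>
STATUS_MESSAGE = re.compile(
    r'^\[(?P<timestamp>[^\]]+)\] '
    r'While (?P<phase>.+?), '
    r'(?P<component>.+?) failed'
    r'(?: for "(?P<element>[^"]*)")?'   # the element part is optional
    r': (?P<category>.+?) issue - (?P<status_code>[A-Z_]+)$'
)


def parse_status_message(message: str) -> Optional[dict]:
    """Return the message fields as a dict, or None if the layout does not match."""
    match = STATUS_MESSAGE.match(message.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Hypothetical message: every value except the category and status code is invented.
    sample = ('[2025-01-01T00:00:30Z] While ExamplePhase, example-component failed '
              'for "example.service": System issue - INTERNAL')
    print(parse_status_message(sample))
```

When the optional element is absent, the `element` key is present with the value `None`, so callers can treat both forms uniformly.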
1.1.1.1. Error reference & resolution
Use the table below to identify the root cause of a status code and the recommended next steps for resolution.
| Category | Status Code | Common Causes | Recommended Action |
|---|---|---|---|
| Network | | DNS failure, registry unreachable, or connection timeout. Image non-existent or inaccessible due to registry permissions. | Check device internet connectivity and firewall rules for registry access. Verify the image name/tag and registry-level access permissions. |
| Security | | Invalid credentials, expired tokens, or insufficient permissions. | Verify registry credentials and ensure the device identity is valid. |
| Configuration | | Syntax errors in YAML/JSON or missing mandatory fields. Invalid element, token, or path format. | Validate your configuration spec against the schema. |
| Filesystem | | Missing files, directory conflicts, or path errors. | Verify the existence of required local resources or mount points. |
| Resource | | Disk full, Out of Memory (OOM), or CPU throttling. | Check device telemetry for disk usage and memory pressure. |
| System | | Unexpected system faults or unclassified errors. | See Deep dive debugging below to correlate with journal logs. |
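If you triage many devices, the category-to-action mapping in the table can be kept as a small lookup. This is only an illustrative sketch: the names `RECOMMENDED_ACTIONS` and `recommended_action` are hypothetical, and falling back to the System row for an unrecognized category is an assumption of this example, not documented behavior.

```python
# Recommended actions taken from the reference table, keyed by category.
RECOMMENDED_ACTIONS = {
    "Network": "Check device internet connectivity and firewall rules for registry access. "
               "Verify the image name/tag and registry-level access permissions.",
    "Security": "Verify registry credentials and ensure the device identity is valid.",
    "Configuration": "Validate your configuration spec against the schema.",
    "Filesystem": "Verify the existence of required local resources or mount points.",
    "Resource": "Check device telemetry for disk usage and memory pressure.",
    "System": "Correlate the status message with the journal logs on the device.",
}


def recommended_action(category: str) -> str:
    """Look up the next step for a category; unknown categories use the System row (assumption)."""
    return RECOMMENDED_ACTIONS.get(category, RECOMMENDED_ACTIONS["System"])
```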
1.1.1.2. Rollback and failed OS updates
If an OS update fails, the device automatically rolls back to the previous version. The phase may appear as RollingBack; when rollback completes, the update condition reason is Error. The device does not retry the failed version automatically. For how to recognize a rollback and what to do next, see Troubleshooting OS update rollback.
1.1.1.3. Deep dive debugging
While API status responses are sanitized for security, full error details—including stack traces and raw Go error chains—are preserved in the local device journal.
Procedure
If you encounter an UNKNOWN or INTERNAL error, or if the status message is truncated, you can map the status code to the detailed log:
- Retrieve the device status, noting the timestamp and component from the message field:
flightctl get device/<device_name> -o yaml
- Access the device logs. Search the local journal for the corresponding error context to see the unredacted failure:
journalctl -u flightctl-agent | grep "failed to reload systemd daemon"
API responses are limited to 250 characters. For the full diagnostic context—including raw Go error strings and detailed stack traces—refer to the local logs on the device.
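Instead of grepping for a known string, you can narrow the journal to a time window around the timestamp noted from the status message. The sketch below only builds the `journalctl` argument list and does not run it. It assumes the timestamp is in ISO 8601 form (the document does not state the format), and the function name `journal_window_command`, the default unit name, and the 60-second margin are choices made for this example.

```python
from datetime import datetime, timedelta, timezone


def journal_window_command(timestamp, unit="flightctl-agent.service", margin_seconds=60):
    """Build a journalctl command limited to +/- margin_seconds around an ISO 8601 timestamp."""
    # Older Python versions do not accept a trailing "Z", so normalize it to an offset.
    moment = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if moment.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    moment = moment.astimezone(timezone.utc)
    margin = timedelta(seconds=margin_seconds)
    fmt = "%Y-%m-%d %H:%M:%S UTC"  # systemd time specifications accept a "UTC" suffix
    return [
        "journalctl", "-u", unit,
        "--since", (moment - margin).strftime(fmt),
        "--until", (moment + margin).strftime(fmt),
        "--no-pager",
    ]
```

The returned list can be passed to `subprocess.run` on the device, or joined into a single command line and pasted into an SSH session.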
1.2. Troubleshooting OS update rollback
When an OS update fails, Red Hat Edge Manager uses greenboot to automatically roll back the device to the previous working OS version. This section explains how to recognize that a rollback occurred and what to do next.
1.2.1. Recognizing a rollback or failed update
Check the device status to see whether an update failed and the device rolled back:
- Retrieve the device status:
flightctl get device/<device_name> -o yaml
- In the output, check:
  - status.updated.status: After a rollback, the device is typically OutOfDate (the device is running the previous OS version, not the version that was requested).
  - status.conditions: Look for the Updating condition. If the condition’s reason is Error, the update failed and the device has rolled back to the pre-update OS and configuration. If the reason is RollingBack, the agent was in the process of rolling back when it last reported.
  - status.updated.info: This field may contain a short message about the last state transition.
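The checks above can be expressed as a small classifier over the parsed device status. This is a sketch under stated assumptions: the device document has already been loaded into a Python dict (for example, by parsing the YAML output with a YAML library), each entry in `status.conditions` is assumed to carry `type` and `reason` keys, and the function name and return labels are hypothetical.

```python
def rollback_state(device: dict) -> str:
    """Classify a device status dict using the fields described above."""
    status = device.get("status") or {}
    updated = (status.get("updated") or {}).get("status")

    # Find the reason of the Updating condition, if one is reported.
    reason = None
    for condition in status.get("conditions") or []:
        if condition.get("type") == "Updating":
            reason = condition.get("reason")
            break

    if reason == "RollingBack":
        return "rolling-back"      # agent was rolling back when it last reported
    if reason == "Error":
        return "rolled-back"       # update failed and the rollback completed
    if updated == "OutOfDate":
        return "out-of-date"       # behind the requested version, no failure reported
    return "no-rollback-indicated"
```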
1.2.2. Viewing greenboot and rollback logs
When troubleshooting a rollback, the most useful logs are from greenboot itself. On the device, use these commands to view them:
- To view health check output (greenboot health check results), run:
sudo journalctl -o cat -u greenboot-healthcheck.service
The following example shows journal output typical of a failed greenboot health check. Use it to pattern-match what you see on a device:
Running Required Health Check Scripts...
[20_check_flightctl_agent.sh] INFO: === flightctl-agent greenboot health check started ===
[20_check_flightctl_agent.sh] INFO: GRUB boot variables: boot_success=0 boot_counter=2
...
time="..." level=error msg="health: Service check failed: service is not enabled (state: disabled)"
[20_check_flightctl_agent.sh] ERROR: flightctl-agent health check failed
- To view pre-rollback diagnostic output (scripts that run before rollback), run:
sudo journalctl -o cat -u redboot-task-runner.service
- To quickly check whether the last boot was declared successful by greenboot, inspect the GRUB environment on the device:
sudo grub2-editenv - list | grep ^boot_success
A value of boot_success=1 means greenboot declared the boot healthy. A value of 0 means either health checks are still running or the boot was declared failed.
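If you collect this output from several devices, the two checks can be automated on saved text. The sketch below assumes the journal lines keep the `[script] LEVEL: message` shape shown in the sample and that the GRUB environment listing uses `key=value` lines. The function names `failed_checks` and `boot_success` are hypothetical helpers for this example.

```python
import re
from typing import List, Optional, Tuple

# Matches lines such as: [20_check_flightctl_agent.sh] ERROR: flightctl-agent health check failed
FAILED_CHECK = re.compile(r"^\[(?P<script>[^\]]+)\] ERROR: (?P<detail>.*)$")


def failed_checks(journal_text: str) -> List[Tuple[str, str]]:
    """Return (script, detail) pairs for every ERROR line emitted by a health check script."""
    results = []
    for line in journal_text.splitlines():
        match = FAILED_CHECK.match(line.strip())
        if match:
            results.append((match.group("script"), match.group("detail")))
    return results


def boot_success(grubenv_listing: str) -> Optional[bool]:
    """Read boot_success from a GRUB environment listing; None if the variable is absent."""
    for line in grubenv_listing.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key == "boot_success":
            # False covers both "checks still running" and "boot declared failed".
            return value == "1"
    return None
```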
1.2.3. Enabling persistent journal storage
By default, the systemd journal service stores data in the volatile /run/log/journal directory, which does not persist across reboots. To retain greenboot and agent logs for post-rollback analysis, enable persistent storage.
Procedure
- Create the journal configuration directory:
sudo mkdir -p /etc/systemd/journald.conf.d
- Create the configuration file:
cat <<EOF | sudo tee /etc/systemd/journald.conf.d/flightctl.conf &>/dev/null
[Journal]
Storage=persistent
SystemMaxUse=1G
RuntimeMaxUse=1G
EOF
- Edit the configuration file values for your size requirements. For example, adjust SystemMaxUse and RuntimeMaxUse in /etc/systemd/journald.conf.d/flightctl.conf.
- Restart the journal service to apply the configuration:
sudo systemctl restart systemd-journald
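When the size limits differ per device class, it can help to generate the drop-in content rather than edit it by hand. The following sketch renders the same three settings used in the procedure and rejects size values that do not look like a byte count with an optional K, M, G, T, P, or E suffix. The function name `render_journald_dropin` is hypothetical, and writing the result to disk is left to the caller.

```python
import re

# A byte count with an optional journald size suffix, for example 500M or 1G.
SIZE = re.compile(r"\d+[KMGTPE]?")


def render_journald_dropin(system_max_use: str = "1G", runtime_max_use: str = "1G") -> str:
    """Render the [Journal] drop-in used in the procedure with custom size limits."""
    for name, value in (("SystemMaxUse", system_max_use), ("RuntimeMaxUse", runtime_max_use)):
        if not SIZE.fullmatch(value):
            raise ValueError(f"{name}: unexpected size value {value!r}")
    lines = [
        "[Journal]",
        "Storage=persistent",
        f"SystemMaxUse={system_max_use}",
        f"RuntimeMaxUse={runtime_max_use}",
    ]
    return "\n".join(lines) + "\n"
```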
1.2.4. Post-rollback recovery and diagnostics
- Verify the device is running: The device should be online and running the previous OS version. Confirm that status.summary.status is Online or Degraded and that status.os.image matches the previous (working) image.
- Investigate the failure: Use the device status message and, if you have access, the device logs. Prefer the greenboot journal output (see Viewing greenboot and rollback logs); you can also check the agent journal (for example, journalctl -u flightctl-agent.service) to determine why the update failed. Common causes include health check failures after reboot, network or registry issues, or resource constraints. See Troubleshooting device error codes for error categories and recommended actions.
- Fix and try a new version: Address the underlying issue (for example, fix the OS image or configuration, or resolve network or resource problems). When ready, update the device spec to a new OS image version or a corrected image so the agent can attempt an update again.
Note: The agent does not retry a failed version. It marks the failed version and skips it in future reconciliation. Pushing the same OS image again without change will not trigger a retry; you must push a new image version (different digest).
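The first verification step can be scripted against the parsed device status. This is a minimal sketch, assuming the device document is already available as a Python dict and that you know the reference of the previous working image. The function name `recovered_after_rollback` is hypothetical.

```python
def recovered_after_rollback(device: dict, previous_image: str) -> bool:
    """Check that the device is reachable and running the previous (working) OS image."""
    status = device.get("status") or {}
    summary = (status.get("summary") or {}).get("status")
    image = (status.get("os") or {}).get("image")
    # Online or Degraded means the device came back; the image must match the known-good one.
    return summary in {"Online", "Degraded"} and image == previous_image
```

A `False` result does not say which condition failed, so inspect `status.summary.status` and `status.os.image` directly before deciding whether to escalate.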
1.2.5. When to escalate
Consider escalating or opening a support case if:
- The device does not come back online after a rollback.
- Rollbacks happen repeatedly for the same or different OS versions.
- The device status remains in RollingBack or Error for an extended period with no recovery.
- You need to force a retry of a previously failed version and the product does not provide a supported way to do so.
1.3. Generating a device log bundle
Use the integrated flightctl-must-gather script directly on the device to generate a comprehensive bundle of diagnostic logs. This log bundle, in a standard .tar format, provides the necessary data to debug the device agent and assists in efficient troubleshooting and bug reporting.
Run the following command on the device and include the resulting .tar file in the bug report:
sudo flightctl-must-gather
Note: Retrieving the .tar file from the device requires an SSH connection.
1.4. Viewing a device’s effective target configuration
The device manifest returned by the flightctl get device command contains only references to external configuration and secret objects. The service replaces these references with the actual configuration and secret data only when the device agent queries the service.
While this better protects potentially sensitive data, it also makes faulty configurations harder to troubleshoot. For this reason, a user can be authorized to query the effective configuration as the service renders it for the agent.
Procedure
To query the effective configuration, use the following command:
flightctl get device/${device_name} --rendered | jq