Chapter 1. Troubleshooting Red Hat Edge Manager


When working with devices in Red Hat Edge Manager, troubleshooting begins with interpreting the structured status messages provided by the device. By identifying the specific phase and component where a failure occurred, you can quickly determine whether an issue is caused by local resource constraints, network connectivity, or configuration errors.

1.1. Troubleshooting device error codes

To improve security and performance, Red Hat Edge Manager uses structured error codes in device status responses. These codes replace verbose system logs with categorized, actionable summaries, ensuring sensitive data (like credentials) is never exposed in the API or UI.

1.1.1. Error message anatomy

Every error message follows a standardized format, limited to 250 characters, to help you quickly pinpoint the phase, component, and specific cause of a failure.

The error message format is as follows:

[timestamp] While <Phase>, <Component> failed [for "<Element>"]: <Category> issue - <STATUS_CODE>

| Field | Description | Examples |
|---|---|---|
| Phase | The stage of the operation where the error occurred. | Preparing, ApplyingUpdate, Rebooting, RollingBack |
| Component | The specific system area affected. | os, config, applications, systemd |
| Element | The specific resource (file, service, or image). | /etc/app.conf, fleet-agent.service, quay.io/app |
| Category | The functional area of the failure. | Network, Security, Resource |
| Status Code | The standardized gRPC-based error code. | UNAVAILABLE, PERMISSION_DENIED, INTERNAL |
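
Because the format is fixed, a status message can be split into its fields mechanically. The following is a minimal sketch, not part of the product: the function name parse_device_error and the regular expression are assumptions derived from the format line above. The sketch ignores everything before "While", because the exact timestamp layout is not specified here, and it assumes the element contains no semicolon.

```shell
# Hypothetical helper: split a device status message into its fields.
# The pattern is an assumption based on the documented format, not a product API.
DEVICE_ERROR_RE='^.*While ([A-Za-z]+), ([^ ]+) failed( for "([^"]*)")?: ([A-Za-z]+) issue - ([A-Z_]+).*$'

parse_device_error() {
  # $1 is the message text; prints one key=value pair per line.
  parsed=$(printf '%s\n' "$1" |
    sed -nE "s/$DEVICE_ERROR_RE/phase=\1;component=\2;element=\4;category=\5;status_code=\6/p")
  # Empty output means the message did not match the expected format.
  [ -n "$parsed" ] || return 1
  printf '%s\n' "$parsed" | tr ';' '\n'
}

# Example message (illustrative values only)
parse_device_error 'While Preparing, applications failed for "quay.io/app": Network issue - UNAVAILABLE'
```

When the optional element is absent, the element field is printed empty. A message that does not match the format makes the function return a non-zero status, which is a hint to fall back to reading the message manually.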

1.1.1.1. Error reference & resolution

Use the table below to identify the root cause of a status code and the recommended next steps for resolution.

| Category | Status Code | Common Causes | Recommended Action |
|---|---|---|---|
| Network | UNAVAILABLE / DEADLINE_EXCEEDED | DNS failure, registry unreachable, or connection timeout. Image non-existent or inaccessible due to registry permissions. | Check device internet connectivity and firewall rules for registry access. Verify the image name/tag and registry-level access permissions. |
| Security | PERMISSION_DENIED / UNAUTHENTICATED | Invalid credentials, expired tokens, or insufficient permissions. | Verify registry credentials and ensure the device identity is valid. |
| Configuration | INVALID_ARGUMENT / FAILED_PRECONDITION | Syntax errors in YAML/JSON or missing mandatory fields. Invalid element, token, or path format. | Validate your configuration spec against the schema. |
| Filesystem | NOT_FOUND / ALREADY_EXISTS | Missing files, directory conflicts, or path errors. | Verify the existence of required local resources or mount points. |
| Resource | RESOURCE_EXHAUSTED | Disk full, Out of Memory (OOM), or CPU throttling. | Check device telemetry for disk usage and memory pressure. |
| System | INTERNAL / UNKNOWN | Unexpected system faults or unclassified errors. | See Deep dive debugging to correlate with journal logs. |
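
If you triage many devices, the reference table can be condensed into a small lookup. This is a sketch under stated assumptions: the function name recommended_action is hypothetical, and the one-line actions are shortened paraphrases of the table rows, not official product output.

```shell
# Hypothetical lookup: map a status code to its category and first action,
# condensed from the reference table in this section.
recommended_action() {
  case "$1" in
    UNAVAILABLE|DEADLINE_EXCEEDED)
      echo "Network: check connectivity, firewall rules, image name/tag, and registry permissions" ;;
    PERMISSION_DENIED|UNAUTHENTICATED)
      echo "Security: verify registry credentials and device identity" ;;
    INVALID_ARGUMENT|FAILED_PRECONDITION)
      echo "Configuration: validate the configuration spec against the schema" ;;
    NOT_FOUND|ALREADY_EXISTS)
      echo "Filesystem: verify required local resources and mount points" ;;
    RESOURCE_EXHAUSTED)
      echo "Resource: check disk usage and memory pressure in device telemetry" ;;
    INTERNAL|UNKNOWN)
      echo "System: correlate with the device journal" ;;
    *)
      # Codes outside the table are reported on stderr.
      echo "Unrecognized status code: $1" >&2
      return 1 ;;
  esac
}

recommended_action RESOURCE_EXHAUSTED
```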

1.1.1.2. Rollback and failed OS updates

If an OS update fails, the device automatically rolls back to the previous version. The phase may appear as RollingBack; when rollback completes, the update condition reason is Error. The device does not retry the failed version automatically. For how to recognize a rollback and what to do next, see Troubleshooting OS update rollback.

1.1.1.3. Deep dive debugging

While API status responses are sanitized for security, full error details—including stack traces and raw Go error chains—are preserved in the local device journal.

Procedure

If you encounter an UNKNOWN or INTERNAL error, or if the status message is truncated, you can map the status code to the detailed log:

  1. Retrieve the device status, and note the timestamp and component from the message field:

    flightctl get device/<device_name> -o yaml
  2. On the device, search the local journal for the corresponding error context to see the unredacted failure. For example:

    journalctl -u flightctl-agent.service | grep "failed to reload systemd daemon"

API responses are limited to 250 characters. For the full diagnostic context—including raw Go error strings and detailed stack traces—refer to the local logs on the device.
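
When you do not know which string to search for, the timestamp noted from the status message can narrow the journal to a short window instead. The following is a minimal sketch: the function name agent_journal_window and the JOURNALCTL override are assumptions, the default unit name is taken from the agent journal command used elsewhere in this chapter, and the timestamps are illustrative.

```shell
# Hypothetical wrapper: show agent journal entries in a time window.
# JOURNALCTL can be overridden (for example with "echo") for a dry run.
agent_journal_window() {
  # $1 = window start, $2 = window end, $3 = systemd unit (optional)
  ${JOURNALCTL:-journalctl} --no-pager \
    -u "${3:-flightctl-agent.service}" \
    --since "$1" --until "$2"
}

# Dry run: print the journalctl arguments instead of querying the journal
JOURNALCTL=echo agent_journal_window "2025-01-15 10:29:00" "2025-01-15 10:31:00"
```

Without the override, the function runs journalctl on the device for the given window. Choose a window of a minute or two around the reported timestamp and widen it if nothing relevant appears.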

1.2. Troubleshooting OS update rollback

When an OS update fails, Red Hat Edge Manager uses greenboot to automatically roll back the device to the previous working OS version. This section helps you recognize when a rollback occurred and decide what to do next.

1.2.1. Recognizing a rollback or failed update

Check the device status to see whether an update failed and the device rolled back:

  1. Retrieve the device status:

    flightctl get device/<device_name> -o yaml
  2. In the output, check:

    • status.updated.status: After a rollback, the device is typically OutOfDate (the device is running the previous OS version, not the version that was requested).
    • status.conditions: Look for the Updating condition. If the condition’s reason is Error, the update failed and the device has rolled back to the pre-update OS and configuration. If the reason is RollingBack, the agent was in the process of rolling back when it last reported.

The status.updated.info field may contain a short message about the last state transition.
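
The decision rule above can be written down as a small classifier. This is only a sketch: the function name classify_update_state is hypothetical, and how you extract the two values from the device status (for example, with a YAML or JSON query tool) is left open.

```shell
# Hypothetical classifier for the two fields described above.
# $1 = value of status.updated.status
# $2 = reason of the Updating condition in status.conditions
classify_update_state() {
  case "$2" in
    Error)
      echo "update failed; device rolled back (updated status: $1)" ;;
    RollingBack)
      echo "rollback was in progress at the last report (updated status: $1)" ;;
    *)
      # Any other reason gives no sign of a rollback in these fields.
      echo "no rollback indicated (reason: ${2:-none}, updated status: $1)" ;;
  esac
}

classify_update_state OutOfDate Error
```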

1.2.2. Viewing greenboot and rollback logs

When troubleshooting a rollback, the most useful logs are from greenboot itself. On the device, use these commands to view them:

  1. To view the greenboot health check results, run:

    sudo journalctl -o cat -u greenboot-healthcheck.service

    The following example shows journal output typical of a failed greenboot health check. Use it to pattern-match what you see on a device:

    Running Required Health Check Scripts...
    [20_check_flightctl_agent.sh] INFO: === flightctl-agent greenboot health check started ===
    [20_check_flightctl_agent.sh] INFO: GRUB boot variables:
    boot_success=0
    boot_counter=2
    ...
    time="..." level=error msg="health: Service check failed: service is not enabled (state: disabled)"
    [20_check_flightctl_agent.sh] ERROR: flightctl-agent health check failed
  2. To view the output of the diagnostic scripts that run before a rollback, run:

    sudo journalctl -o cat -u redboot-task-runner.service
  3. To quickly check whether the last boot was declared successful by greenboot, inspect the GRUB environment on the device:

    sudo grub2-editenv - list | grep ^boot_success

    A value of boot_success=1 means greenboot declared the boot healthy. A value of 0 means either health checks are still running or the boot was declared failed.
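
For scripted checks, the interpretation of boot_success can be wrapped in a function that reads the GRUB environment listing. The following is a sketch; the function name boot_health and its output strings are assumptions, and the sample input is illustrative.

```shell
# Hypothetical helper: interpret boot_success from "grub2-editenv - list" output.
boot_health() {
  # Reads the GRUB environment listing on stdin.
  value=$(sed -n 's/^boot_success=//p' | head -n 1)
  case "$value" in
    1) echo "healthy: greenboot declared the boot successful" ;;
    0) echo "not healthy yet: checks still running or boot declared failed" ;;
    *) echo "boot_success not found" >&2
       return 1 ;;
  esac
}

# Example with sample input; on a device, pipe in the real listing instead:
#   sudo grub2-editenv - list | boot_health
printf 'boot_success=0\nboot_counter=2\n' | boot_health
```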

1.2.3. Enabling persistent journal storage

By default, the systemd journal service stores data in the volatile /run/log/journal directory, which does not persist across reboots. To retain greenboot and agent logs for post-rollback analysis, enable persistent storage.

Procedure

  1. Create the journal configuration directory:

    sudo mkdir -p /etc/systemd/journald.conf.d
  2. Create the configuration file:

    cat <<EOF | sudo tee /etc/systemd/journald.conf.d/flightctl.conf &>/dev/null
    [Journal]
    Storage=persistent
    SystemMaxUse=1G
    RuntimeMaxUse=1G
    EOF
  3. Edit the configuration file values for your size requirements. For example, adjust SystemMaxUse and RuntimeMaxUse in /etc/systemd/journald.conf.d/flightctl.conf.
  4. Restart the journal service to apply the configuration:

    sudo systemctl restart systemd-journald
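
To confirm that the drop-in file carries the setting you expect, you can read the Storage value back from it. This is a minimal sketch with a hypothetical function name, journald_storage_mode. It only inspects the file you pass in and does not query the running journal service.

```shell
# Hypothetical check: print the last Storage= value set in a journald drop-in file.
journald_storage_mode() {
  # $1 = path to the drop-in file
  sed -n 's/^[[:space:]]*Storage[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}
```

On a device, running journald_storage_mode /etc/systemd/journald.conf.d/flightctl.conf should print persistent. After the next reboot, journalctl --list-boots should list more than one boot if logs are being retained under /var/log/journal.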

1.2.4. Post-rollback recovery and diagnostics

  • Verify the device is running: The device should be online and running the previous OS version. Confirm that status.summary.status is Online or Degraded and that status.os.image matches the previous (working) image.
  • Investigate the failure: Use the device status message and, if you have access, the device logs. Prefer the greenboot journal output (see Viewing greenboot and rollback logs); you can also check the agent journal (for example, journalctl -u flightctl-agent.service) to determine why the update failed. Common causes include health check failures after reboot, network or registry issues, or resource constraints. See Troubleshooting device error codes for error categories and recommended actions.
  • Fix and try a new version: Address the underlying issue (for example, fix the OS image or configuration, or resolve network or resource problems). When ready, update the device spec to a new OS image version or a corrected image so the agent can attempt an update again.

    Note

    The agent does not retry a failed version. It marks the failed version and skips it in future reconciliation. Pushing the same OS image again without change will not trigger a retry; you must push a new image version (different digest).

1.2.5. When to escalate

Consider escalating or opening a support case if:

  • The device does not come back online after a rollback.
  • Rollbacks happen repeatedly for the same or different OS versions.
  • The device status remains in RollingBack or Error for an extended period with no recovery.
  • You need to force a retry of a previously failed version and the product does not provide a supported way to do so.

1.3. Generating a device log bundle

Use the integrated flightctl-must-gather script directly on the device to generate a comprehensive bundle of diagnostic logs. This log bundle, in a standard .tar format, provides the necessary data to debug the device agent and assists in efficient troubleshooting and bug reporting.

  • Run the following command on the device, and include the resulting .tar file in the bug report.

    Retrieving the .tar file from the device requires an SSH connection.

    sudo flightctl-must-gather

1.4. Viewing a device’s effective target configuration

The device manifest returned by the flightctl get device command contains only references to external configuration and secret objects. The service replaces these references with the actual configuration and secret data only when the device agent queries the service.

This protects potentially sensitive data, but it also makes faulty configurations harder to troubleshoot. For this reason, a user can be authorized to query the effective configuration as the service renders it for the agent.

Procedure

  • To query the effective configuration, use the following command:

    flightctl get device/${device_name} --rendered | jq