18.9. Detecting and replacing a broken NVDIMM device


If you find error messages related to Non-Volatile Dual In-line Memory Modules (NVDIMM) reported in your system log or by S.M.A.R.T., it might mean an NVDIMM device is failing.

In that case, it is necessary to:

  1. Detect which NVDIMM device is failing
  2. Back up data stored on it
  3. Physically replace the device

Procedure

  1. Detect the broken device:

    # ndctl list --dimms --regions --health
    {
      "dimms":[
        {
          "dev":"nmem1",
          "id":"8089-a2-1834-00001f13",
          "handle":17,
          "phys_id":32,
          "security":"disabled",
          "health":{
            "health_state":"ok",
            "temperature_celsius":35.0,
            [...]
          }
    [...]
    }
  2. Find the phys_id attribute of the broken NVDIMM:

    # ndctl list --dimms --human

    From the previous example, you know that nmem0 is the broken NVDIMM. Therefore, find the phys_id attribute of nmem0.

    In the following example, the phys_id is 0x10:

    # ndctl list --dimms --human
    
    [
      {
        "dev":"nmem1",
        "id":"XXXX-XX-XXXX-XXXXXXXX",
        "handle":"0x120",
        "phys_id":"0x1c"
      },
      {
        "dev":"nmem0",
        "id":"XXXX-XX-XXXX-XXXXXXXX",
        "handle":"0x20",
        "phys_id":"0x10",
        "flag_failed_flush":true,
        "flag_smart_event":true
      }
    ]
  3. Find the memory slot of the broken NVDIMM:

    # dmidecode

    In the output, find the entry where the Handle identifier matches the phys_id attribute of the broken NVDIMM. The Locator field lists the memory slot used by the broken NVDIMM.

    In the following example, the nmem0 device matches the 0x0010 identifier and uses the DIMM-XXX-YYYY memory slot:

    # dmidecode
    
    ...
    Handle 0x0010, DMI type 17, 40 bytes
    Memory Device
            Array Handle: 0x0004
            Error Information Handle: Not Provided
            Total Width: 72 bits
            Data Width: 64 bits
            Size: 125 GB
            Form Factor: DIMM
            Set: 1
            Locator: DIMM-XXX-YYYY
            Bank Locator: Bank0
            Type: Other
            Type Detail: Non-Volatile Registered (Buffered)
    ...
  4. Back up all data in the namespaces on the NVDIMM. If you do not back up the data before replacing the NVDIMM, the data will be lost when you remove the NVDIMM from your system.

    警告

    In some cases, such as when the NVDIMM is completely broken, the backup might fail.

    To prevent this, regularly monitor your NVDIMM devices using S.M.A.R.T. as described in Monitoring NVDIMM health using S.M.A.R.T. and replace failing NVDIMMs before they break.

  5. List the namespaces on the NVDIMM:

    # ndctl list --namespaces --dimm=DIMM-ID-number

    In the following example, the nmem0 device contains the namespace0.0 and namespace0.2 namespaces, which you need to back up:

    # ndctl list --namespaces --dimm=0
    
    [
      {
        "dev":"namespace0.2",
        "mode":"sector",
        "size":67042312192,
        "uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
        "raw_uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
        "sector_size":4096,
        "blockdev":"pmem0.2s",
        "numa_node":0
      },
      {
        "dev":"namespace0.0",
        "mode":"sector",
        "size":67042312192,
        "uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
        "raw_uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
        "sector_size":4096,
        "blockdev":"pmem0s",
        "numa_node":0
      }
    ]
  6. Replace the broken NVDIMM physically.
Red Hat logoGithubredditYoutubeTwitter

詳細情報

試用、購入および販売

コミュニティー

Red Hat ドキュメントについて

Red Hat をお使いのお客様が、信頼できるコンテンツが含まれている製品やサービスを活用することで、イノベーションを行い、目標を達成できるようにします。 最新の更新を見る.

多様性を受け入れるオープンソースの強化

Red Hat では、コード、ドキュメント、Web プロパティーにおける配慮に欠ける用語の置き換えに取り組んでいます。このような変更は、段階的に実施される予定です。詳細情報: Red Hat ブログ.

会社概要

Red Hat は、企業がコアとなるデータセンターからネットワークエッジに至るまで、各種プラットフォームや環境全体で作業を簡素化できるように、強化されたソリューションを提供しています。

Theme

© 2026 Red Hat
トップに戻る