18.9. Detecting and replacing a broken NVDIMM device


If you find error messages related to Non-Volatile Dual In-line Memory Modules (NVDIMM) reported in your system log or by S.M.A.R.T., it might mean an NVDIMM device is failing.

In that case, it is necessary to:

  1. Detect which NVDIMM device is failing
  2. Back up data stored on it
  3. Physically replace the device

Procedure

  1. Detect the broken device:

    # ndctl list --dimms --regions --health
    {
      "dimms":[
        {
          "dev":"nmem1",
          "id":"8089-a2-1834-00001f13",
          "handle":17,
          "phys_id":32,
          "security":"disabled",
          "health":{
            "health_state":"ok",
            "temperature_celsius":35.0,
            [...]
          }
    [...]
    }
  2. Find the phys_id attribute of the broken NVDIMM:

    # ndctl list --dimms --human

    From the previous example, you know that nmem0 is the broken NVDIMM. Therefore, find the phys_id attribute of nmem0.

    In the following example, the phys_id is 0x10:

    # ndctl list --dimms --human
    
    [
      {
        "dev":"nmem1",
        "id":"XXXX-XX-XXXX-XXXXXXXX",
        "handle":"0x120",
        "phys_id":"0x1c"
      },
      {
        "dev":"nmem0",
        "id":"XXXX-XX-XXXX-XXXXXXXX",
        "handle":"0x20",
        "phys_id":"0x10",
        "flag_failed_flush":true,
        "flag_smart_event":true
      }
    ]
  3. Find the memory slot of the broken NVDIMM:

    # dmidecode

    In the output, find the entry where the Handle identifier matches the phys_id attribute of the broken NVDIMM. The Locator field lists the memory slot used by the broken NVDIMM.

    In the following example, the nmem0 device matches the 0x0010 identifier and uses the DIMM-XXX-YYYY memory slot:

    # dmidecode
    
    ...
    Handle 0x0010, DMI type 17, 40 bytes
    Memory Device
            Array Handle: 0x0004
            Error Information Handle: Not Provided
            Total Width: 72 bits
            Data Width: 64 bits
            Size: 125 GB
            Form Factor: DIMM
            Set: 1
            Locator: DIMM-XXX-YYYY
            Bank Locator: Bank0
            Type: Other
            Type Detail: Non-Volatile Registered (Buffered)
    ...
  4. Back up all data in the namespaces on the NVDIMM. If you do not back up the data before replacing the NVDIMM, the data will be lost when you remove the NVDIMM from your system.

    주의

    In some cases, such as when the NVDIMM is completely broken, the backup might fail.

    To prevent this, regularly monitor your NVDIMM devices using S.M.A.R.T. as described in Monitoring NVDIMM health using S.M.A.R.T. and replace failing NVDIMMs before they break.

  5. List the namespaces on the NVDIMM:

    # ndctl list --namespaces --dimm=DIMM-ID-number

    In the following example, the nmem0 device contains the namespace0.0 and namespace0.2 namespaces, which you need to back up:

    # ndctl list --namespaces --dimm=0
    
    [
      {
        "dev":"namespace0.2",
        "mode":"sector",
        "size":67042312192,
        "uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
        "raw_uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
        "sector_size":4096,
        "blockdev":"pmem0.2s",
        "numa_node":0
      },
      {
        "dev":"namespace0.0",
        "mode":"sector",
        "size":67042312192,
        "uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
        "raw_uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
        "sector_size":4096,
        "blockdev":"pmem0s",
        "numa_node":0
      }
    ]
  6. Replace the broken NVDIMM physically.
Red Hat logoGithubredditYoutubeTwitter

자세한 정보

평가판, 구매 및 판매

커뮤니티

Red Hat 문서 정보

Red Hat을 사용하는 고객은 신뢰할 수 있는 콘텐츠가 포함된 제품과 서비스를 통해 혁신하고 목표를 달성할 수 있습니다. 최신 업데이트를 확인하세요.

보다 포괄적 수용을 위한 오픈 소스 용어 교체

Red Hat은 코드, 문서, 웹 속성에서 문제가 있는 언어를 교체하기 위해 최선을 다하고 있습니다. 자세한 내용은 다음을 참조하세요.Red Hat 블로그.

Red Hat 소개

Red Hat은 기업이 핵심 데이터 센터에서 네트워크 에지에 이르기까지 플랫폼과 환경 전반에서 더 쉽게 작업할 수 있도록 강화된 솔루션을 제공합니다.

Theme

© 2026 Red Hat
맨 위로 이동