18.9. Detecting and replacing a broken NVDIMM device
If you find error messages related to Non-Volatile Dual In-line Memory Modules (NVDIMM) reported in your system log or by S.M.A.R.T., it might mean an NVDIMM device is failing.
In that case, it is necessary to:
- Detect which NVDIMM device is failing
- Back up data stored on it
- Physically replace the device
Procedure
Detect the broken device:
# ndctl list --dimms --regions --health { "dimms":[ { "dev":"nmem1", "id":"8089-a2-1834-00001f13", "handle":17, "phys_id":32, "security":"disabled", "health":{ "health_state":"ok", "temperature_celsius":35.0, [...] } [...] }Find the
phys_idattribute of the broken NVDIMM:# ndctl list --dimms --humanFrom the previous example, you know that
nmem0is the broken NVDIMM. Therefore, find thephys_idattribute ofnmem0.In the following example, the
phys_idis0x10:# ndctl list --dimms --human [ { "dev":"nmem1", "id":"XXXX-XX-XXXX-XXXXXXXX", "handle":"0x120", "phys_id":"0x1c" }, { "dev":"nmem0", "id":"XXXX-XX-XXXX-XXXXXXXX", "handle":"0x20", "phys_id":"0x10", "flag_failed_flush":true, "flag_smart_event":true } ]Find the memory slot of the broken NVDIMM:
# dmidecodeIn the output, find the entry where the Handle identifier matches the
phys_idattribute of the broken NVDIMM. The Locator field lists the memory slot used by the broken NVDIMM.In the following example, the
nmem0device matches the0x0010identifier and uses theDIMM-XXX-YYYYmemory slot:# dmidecode ... Handle 0x0010, DMI type 17, 40 bytes Memory Device Array Handle: 0x0004 Error Information Handle: Not Provided Total Width: 72 bits Data Width: 64 bits Size: 125 GB Form Factor: DIMM Set: 1 Locator: DIMM-XXX-YYYY Bank Locator: Bank0 Type: Other Type Detail: Non-Volatile Registered (Buffered) ...Back up all data in the namespaces on the NVDIMM. If you do not back up the data before replacing the NVDIMM, the data will be lost when you remove the NVDIMM from your system.
주의In some cases, such as when the NVDIMM is completely broken, the backup might fail.
To prevent this, regularly monitor your NVDIMM devices using S.M.A.R.T. as described in Monitoring NVDIMM health using S.M.A.R.T. and replace failing NVDIMMs before they break.
List the namespaces on the NVDIMM:
# ndctl list --namespaces --dimm=DIMM-ID-numberIn the following example, the
nmem0device contains thenamespace0.0andnamespace0.2namespaces, which you need to back up:# ndctl list --namespaces --dimm=0 [ { "dev":"namespace0.2", "mode":"sector", "size":67042312192, "uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX", "raw_uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX", "sector_size":4096, "blockdev":"pmem0.2s", "numa_node":0 }, { "dev":"namespace0.0", "mode":"sector", "size":67042312192, "uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX", "raw_uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX", "sector_size":4096, "blockdev":"pmem0s", "numa_node":0 } ]- Replace the broken NVDIMM physically.