28.5. Troubleshooting NVDIMM
28.5.1. Monitoring NVDIMM Health Using S.M.A.R.T.
Some NVDIMMs support Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) interfaces for retrieving health information.
Monitor NVDIMM health regularly to prevent data loss. If S.M.A.R.T. reports problems with the health status of an NVDIMM, replace it as described in Section 28.5.2, “Detecting and Replacing a Broken NVDIMM”.
Prerequisites
- On some systems, the acpi_ipmi driver must be loaded before health information can be retrieved. To load the driver, use the following command:
# modprobe acpi_ipmi
Procedure
- To access the health information, use the following command:
# ndctl list --dimms --health
...
{
  "dev":"nmem0",
  "id":"802c-01-1513-b3009166",
  "handle":1,
  "phys_id":22,
  "health":{
    "health_state":"ok",
    "temperature_celsius":25.000000,
    "spares_percentage":99,
    "alarm_temperature":false,
    "alarm_spares":false,
    "temperature_threshold":50.000000,
    "spares_threshold":20,
    "life_used_percentage":1,
    "shutdown_state":"clean"
  }
}
...
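Regular monitoring is easier to automate if the JSON output is checked by a script. The following is a minimal sketch, not a supported tool: it assumes the output of ndctl list --dimms --health has been captured as text, and that it is either a single object or a list of objects shaped like the example above. The function name unhealthy_dimms and the choice of which fields to check are illustrative assumptions.

```python
import json
from typing import List

def unhealthy_dimms(ndctl_json: str) -> List[str]:
    """Return the dev names of DIMMs whose health report needs attention."""
    data = json.loads(ndctl_json)
    # Assumption: the output is one object or a list of objects.
    dimms = data if isinstance(data, list) else [data]
    flagged = []
    for dimm in dimms:
        health = dimm.get("health", {})
        # Flag a DIMM if its state is not "ok" or if either alarm is raised.
        # DIMMs that report no health data are not flagged by this sketch.
        if (
            health.get("health_state", "ok") != "ok"
            or health.get("alarm_temperature", False)
            or health.get("alarm_spares", False)
        ):
            flagged.append(dimm["dev"])
    return flagged

# Sample taken from the example output above.
HEALTH_SAMPLE = """
[{"dev":"nmem0", "id":"802c-01-1513-b3009166", "handle":1, "phys_id":22,
  "health":{"health_state":"ok", "temperature_celsius":25.0,
            "spares_percentage":99, "alarm_temperature":false,
            "alarm_spares":false, "temperature_threshold":50.0,
            "spares_threshold":20, "life_used_percentage":1,
            "shutdown_state":"clean"}}]
"""

print(unhealthy_dimms(HEALTH_SAMPLE))
```

For the healthy sample above, the returned list is empty. Any device name that the function returns is a candidate for the procedure in Section 28.5.2.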
28.5.2. Detecting and Replacing a Broken NVDIMM
If error messages related to NVDIMM appear in your system log or are reported by S.M.A.R.T., an NVDIMM device might be failing. In that case, it is necessary to:
- Detect which NVDIMM device is failing,
- Back up data stored on it, and
- Physically replace the device.
Procedure 28.3. Detecting and Replacing a Broken NVDIMM
- To detect the broken NVDIMM, use the following command:
# ndctl list --dimms --regions --health --media-errors --human
The badblocks field shows which NVDIMM is broken. Note its name in the dev field. In the following example, the NVDIMM named nmem0 is broken:
Example 28.1. Health Status of NVDIMM Devices
# ndctl list --dimms --regions --health --media-errors --human
...
  "regions":[
    {
      "dev":"region0",
      "size":"250.00 GiB (268.44 GB)",
      "available_size":0,
      "type":"pmem",
      "numa_node":0,
      "iset_id":"0xXXXXXXXXXXXXXXXX",
      "mappings":[
        {
          "dimm":"nmem1",
          "offset":"0x10000000",
          "length":"0x1f40000000",
          "position":1
        },
        {
          "dimm":"nmem0",
          "offset":"0x10000000",
          "length":"0x1f40000000",
          "position":0
        }
      ],
      "badblock_count":1,
      "badblocks":[
        {
          "offset":65536,
          "length":1,
          "dimms":[
            "nmem0"
          ]
        }
      ],
      "persistence_domain":"memory_controller"
    }
  ]
}
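Reading the badblocks entries by eye becomes error-prone when many regions are listed. The following sketch collects the DIMM names from every badblocks entry. It assumes the complete JSON output of the command above is available as text and that its top level is an object with a regions key, as the example suggests. The function name dimms_with_badblocks is a hypothetical helper, not part of ndctl.

```python
import json
from typing import Dict

def dimms_with_badblocks(ndctl_json: str) -> Dict[str, int]:
    """Map each DIMM named in a badblocks entry to its total bad-block length."""
    data = json.loads(ndctl_json)
    # Assumption: the top level is {"dimms": [...], "regions": [...]};
    # a bare list is treated as a list of regions.
    regions = data.get("regions", []) if isinstance(data, dict) else data
    totals: Dict[str, int] = {}
    for region in regions:
        for bad in region.get("badblocks", []):
            for dimm in bad.get("dimms", []):
                totals[dimm] = totals.get(dimm, 0) + bad.get("length", 0)
    return totals

# Trimmed version of Example 28.1; fields not needed here are left out.
REGIONS_SAMPLE = """
{"regions":[{"dev":"region0", "badblock_count":1,
             "badblocks":[{"offset":65536, "length":1, "dimms":["nmem0"]}]}]}
"""

print(dimms_with_badblocks(REGIONS_SAMPLE))
```

For the trimmed example, the result contains a single entry for nmem0, which matches the device identified in Example 28.1.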
- Use the following command to find the phys_id attribute of the broken NVDIMM:
# ndctl list --dimms --human
From the previous example, you know that nmem0 is the broken NVDIMM. Therefore, find the phys_id attribute of nmem0. In the following example, the phys_id is 0x10:
Example 28.2. The phys_id Attributes of NVDIMMs
# ndctl list --dimms --human
[
  {
    "dev":"nmem1",
    "id":"XXXX-XX-XXXX-XXXXXXXX",
    "handle":"0x120",
    "phys_id":"0x1c"
  },
  {
    "dev":"nmem0",
    "id":"XXXX-XX-XXXX-XXXXXXXX",
    "handle":"0x20",
    "phys_id":"0x10",
    "flag_failed_flush":true,
    "flag_smart_event":true
  }
]
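The phys_id value is printed as a short hexadecimal string with --human and as a plain integer without it, while dmidecode prints handles as four hexadecimal digits. The following sketch normalizes the value so that it can be compared with dmidecode output directly. It assumes the DIMM listing is a JSON list as in Example 28.2; the function name dmi_handle_for is illustrative.

```python
import json
from typing import Optional

def dmi_handle_for(ndctl_json: str, dev: str) -> Optional[str]:
    """Return the phys_id of `dev` formatted like a dmidecode handle."""
    for dimm in json.loads(ndctl_json):
        if dimm.get("dev") != dev:
            continue
        phys_id = dimm["phys_id"]
        # With --human the value is a hex string such as "0x10";
        # without it, the value is an integer such as 16.
        value = int(phys_id, 16) if isinstance(phys_id, str) else phys_id
        return f"0x{value:04X}"
    return None  # no DIMM with that name in the listing

# Sample taken from Example 28.2.
DIMMS_SAMPLE = """
[{"dev":"nmem1", "id":"XXXX-XX-XXXX-XXXXXXXX", "handle":"0x120", "phys_id":"0x1c"},
 {"dev":"nmem0", "id":"XXXX-XX-XXXX-XXXXXXXX", "handle":"0x20", "phys_id":"0x10",
  "flag_failed_flush":true, "flag_smart_event":true}]
"""

print(dmi_handle_for(DIMMS_SAMPLE, "nmem0"))
```

For nmem0 in the sample, the function returns 0x0010, which is the form of the Handle identifier to look for in the next step.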
- Use the following command to find the memory slot of the broken NVDIMM:
# dmidecode
In the output, find the entry where the Handle identifier matches the phys_id attribute of the broken NVDIMM. The Locator field lists the memory slot used by the broken NVDIMM. In the following example, the nmem0 device matches the 0x0010 identifier and uses the DIMM-XXX-YYYY memory slot:
Example 28.3. NVDIMM Memory Slot Listing
# dmidecode
...
Handle 0x0010, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0004
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 125 GB
        Form Factor: DIMM
        Set: 1
        Locator: DIMM-XXX-YYYY
        Bank Locator: Bank0
        Type: Other
        Type Detail: Non-Volatile Registered (Buffered)
...
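Because dmidecode output can run to hundreds of lines, searching it with a script can be more reliable than scanning it manually. The following sketch assumes the text layout shown in Example 28.3: each entry starts with an unindented "Handle 0x...," line, followed by indented "Key: Value" lines. The function name locator_for_handle is hypothetical, and the parser is deliberately simple.

```python
import re
from typing import Optional

# An entry header looks like: "Handle 0x0010, DMI type 17, 40 bytes"
HANDLE_RE = re.compile(r"^Handle (0x[0-9A-Fa-f]{4}),")

def locator_for_handle(dmidecode_text: str, handle: str) -> Optional[str]:
    """Return the Locator value of the entry whose handle matches `handle`."""
    current = None
    for line in dmidecode_text.splitlines():
        header = HANDLE_RE.match(line)
        if header:
            current = header.group(1).lower()
            continue
        key, sep, value = line.strip().partition(":")
        # Match the exact key so that "Bank Locator" is not picked up.
        if sep and key == "Locator" and current == handle.lower():
            return value.strip()
    return None

# Sample taken from Example 28.3.
DMI_SAMPLE = """\
Handle 0x0010, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0004
        Error Information Handle: Not Provided
        Size: 125 GB
        Form Factor: DIMM
        Locator: DIMM-XXX-YYYY
        Bank Locator: Bank0
        Type Detail: Non-Volatile Registered (Buffered)
"""

print(locator_for_handle(DMI_SAMPLE, "0x0010"))
```

For the sample entry, the function returns the placeholder slot name DIMM-XXX-YYYY from Example 28.3. Always confirm the slot against the hardware documentation before removing a module.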
- Back up all data in the namespaces on the NVDIMM. If you do not back up the data before replacing the NVDIMM, the data will be lost when you remove the NVDIMM from your system.
Warning
In some cases, such as when the NVDIMM is completely broken, the backup might fail.
To prevent this, regularly monitor your NVDIMM devices using S.M.A.R.T. as described in Section 28.5.1, “Monitoring NVDIMM Health Using S.M.A.R.T.”, and replace failing NVDIMMs before they break.
Use the following command to list the namespaces on the NVDIMM:
# ndctl list --namespaces --dimm=DIMM-ID-number
In the following example, the nmem0 device contains the namespace0.0 and namespace0.2 namespaces, which you need to back up:
Example 28.4. NVDIMM Namespaces Listing
# ndctl list --namespaces --dimm=0
[
  {
    "dev":"namespace0.2",
    "mode":"sector",
    "size":67042312192,
    "uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "raw_uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "sector_size":4096,
    "blockdev":"pmem0.2s",
    "numa_node":0
  },
  {
    "dev":"namespace0.0",
    "mode":"sector",
    "size":67042312192,
    "uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "raw_uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "sector_size":4096,
    "blockdev":"pmem0s",
    "numa_node":0
  }
]
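To plan the backup, it helps to know which block devices back the listed namespaces and how large they are. The following sketch derives the device paths from the blockdev field. It assumes the namespace listing has been captured as JSON text shaped like Example 28.4, and it skips namespaces that have no blockdev field. The function name namespace_devices is illustrative, and the sketch only lists devices; it does not copy any data.

```python
import json
from typing import List, Tuple

def namespace_devices(ndctl_json: str) -> List[Tuple[str, str, int]]:
    """Return (namespace, device path, size in bytes) for each namespace."""
    data = json.loads(ndctl_json)
    # Assumption: the output is one object or a list of objects.
    namespaces = data if isinstance(data, list) else [data]
    devices = []
    for ns in namespaces:
        blockdev = ns.get("blockdev")
        if blockdev:  # namespaces without a block device are skipped
            devices.append((ns["dev"], "/dev/" + blockdev, ns["size"]))
    return devices

# Sample trimmed from Example 28.4.
NAMESPACES_SAMPLE = """
[{"dev":"namespace0.2", "mode":"sector", "size":67042312192,
  "sector_size":4096, "blockdev":"pmem0.2s", "numa_node":0},
 {"dev":"namespace0.0", "mode":"sector", "size":67042312192,
  "sector_size":4096, "blockdev":"pmem0s", "numa_node":0}]
"""

for name, path, size in namespace_devices(NAMESPACES_SAMPLE):
    print(f"{name}: {path} ({size} bytes)")
```

Each reported path identifies a block device whose contents need to be backed up with your usual backup method before the NVDIMM is removed.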
- Replace the broken NVDIMM physically.