28.5. Troubleshooting NVDIMM
28.5.1. Monitoring NVDIMM Health Using S.M.A.R.T.
Some NVDIMMs support Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) interfaces for retrieving health information.
Monitor NVDIMM health regularly to prevent data loss. If S.M.A.R.T. reports problems with the health status of an NVDIMM, replace it as described in Section 28.5.2, “Detecting and Replacing a Broken NVDIMM”.
Prerequisites
- On some systems, the acpi_ipmi driver must be loaded before health information can be retrieved. To load it, use the following command:
# modprobe acpi_ipmi
Procedure
- To access the health information, use the following command:
# ndctl list --dimms --health
...
    {
      "dev":"nmem0",
      "id":"802c-01-1513-b3009166",
      "handle":1,
      "phys_id":22,
      "health":
      {
        "health_state":"ok",
        "temperature_celsius":25.000000,
        "spares_percentage":99,
        "alarm_temperature":false,
        "alarm_spares":false,
        "temperature_threshold":50.000000,
        "spares_threshold":20,
        "life_used_percentage":1,
        "shutdown_state":"clean"
      }
    }
...
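Regular monitoring is easier to automate if the JSON output is checked by a script. The following is a minimal sketch, not an official tool: it flags any DIMM whose health_state is not "ok" or whose alarm flags are set, using only the fields shown in the output above. The function names and the sample data are illustrative, and the assumption that the output is either a bare list of DIMM objects or an object with a "dimms" key should be verified against your ndctl version.

```python
import json
import subprocess

# Sample input built from the nmem0 record shown above.
HEALTH_SAMPLE = """
[
  {
    "dev":"nmem0",
    "id":"802c-01-1513-b3009166",
    "handle":1,
    "phys_id":22,
    "health":{
      "health_state":"ok",
      "temperature_celsius":25.0,
      "spares_percentage":99,
      "alarm_temperature":false,
      "alarm_spares":false,
      "temperature_threshold":50.0,
      "spares_threshold":20,
      "life_used_percentage":1,
      "shutdown_state":"clean"
    }
  }
]
"""


def read_ndctl_health():
    """Run the command from the procedure and return its raw JSON text.

    Requires root privileges and a system with NVDIMMs; not called by default.
    """
    result = subprocess.run(
        ["ndctl", "list", "--dimms", "--health"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def unhealthy_dimms(ndctl_json):
    """Return (dev, reasons) pairs for DIMMs whose health record looks suspect."""
    data = json.loads(ndctl_json)
    # Assumption: a bare list of DIMMs, or an object with a "dimms" key.
    dimms = data.get("dimms", []) if isinstance(data, dict) else data
    findings = []
    for dimm in dimms:
        health = dimm.get("health")
        if health is None:
            continue  # this DIMM exposes no health record
        reasons = []
        if health.get("health_state") != "ok":
            reasons.append("health_state=%s" % health.get("health_state"))
        if health.get("alarm_temperature"):
            reasons.append("temperature alarm")
        if health.get("alarm_spares"):
            reasons.append("spares alarm")
        if reasons:
            findings.append((dimm["dev"], reasons))
    return findings


if __name__ == "__main__":
    for dev, reasons in unhealthy_dimms(HEALTH_SAMPLE):
        print(dev, ", ".join(reasons))
```

On a real system, pass the result of read_ndctl_health() to unhealthy_dimms() instead of the sample text, for example from a periodic job that sends an alert when the returned list is not empty.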
28.5.2. Detecting and Replacing a Broken NVDIMM
If you find error messages related to NVDIMM reported in your system log or by S.M.A.R.T., it might mean an NVDIMM device is failing. In that case, it is necessary to:
- Detect which NVDIMM device is failing,
- Back up data stored on it, and
- Physically replace the device.
Procedure 28.3. Detecting and Replacing a Broken NVDIMM
- To detect the broken DIMM, use the following command:
# ndctl list --dimms --regions --health --media-errors --human
The badblocks field shows which NVDIMM is broken. Note its name in the dev field. In the following example, the NVDIMM named nmem0 is broken:
Example 28.1. Health Status of NVDIMM Devices
# ndctl list --dimms --regions --health --media-errors --human
...
  "regions":[
    {
      "dev":"region0",
      "size":"250.00 GiB (268.44 GB)",
      "available_size":0,
      "type":"pmem",
      "numa_node":0,
      "iset_id":"0xXXXXXXXXXXXXXXXX",
      "mappings":[
        {
          "dimm":"nmem1",
          "offset":"0x10000000",
          "length":"0x1f40000000",
          "position":1
        },
        {
          "dimm":"nmem0",
          "offset":"0x10000000",
          "length":"0x1f40000000",
          "position":0
        }
      ],
      "badblock_count":1,
      "badblocks":[
        {
          "offset":65536,
          "length":1,
          "dimms":[
            "nmem0"
          ]
        }
      ],
      "persistence_domain":"memory_controller"
    }
  ]
}
- Use the following command to find the phys_id attribute of the broken NVDIMM:
# ndctl list --dimms --human
From the previous example, you know that nmem0 is the broken NVDIMM. Therefore, find the phys_id attribute of nmem0. In the following example, the phys_id is 0x10:
Example 28.2. The phys_id Attributes of NVDIMMs
# ndctl list --dimms --human
[
  {
    "dev":"nmem1",
    "id":"XXXX-XX-XXXX-XXXXXXXX",
    "handle":"0x120",
    "phys_id":"0x1c"
  },
  {
    "dev":"nmem0",
    "id":"XXXX-XX-XXXX-XXXXXXXX",
    "handle":"0x20",
    "phys_id":"0x10",
    "flag_failed_flush":true,
    "flag_smart_event":true
  }
]
- Use the following command to find the memory slot of the broken NVDIMM:
# dmidecode
In the output, find the entry where the Handle identifier matches the phys_id attribute of the broken NVDIMM. The Locator field lists the memory slot used by the broken NVDIMM. In the following example, the nmem0 device matches the 0x0010 identifier and uses the DIMM-XXX-YYYY memory slot:
Example 28.3. NVDIMM Memory Slot Listing
# dmidecode
...
Handle 0x0010, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0004
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 125 GB
        Form Factor: DIMM
        Set: 1
        Locator: DIMM-XXX-YYYY
        Bank Locator: Bank0
        Type: Other
        Type Detail: Non-Volatile Registered (Buffered)
...
- Back up all data in the namespaces on the NVDIMM. If you do not back up the data before replacing the NVDIMM, the data will be lost when you remove the NVDIMM from your system.
Warning
In some cases, such as when the NVDIMM is completely broken, the backup might fail.
To prevent this, regularly monitor your NVDIMM devices using S.M.A.R.T. as described in Section 28.5.1, “Monitoring NVDIMM Health Using S.M.A.R.T.” and replace failing NVDIMMs before they break.
Use the following command to list the namespaces on the NVDIMM:
# ndctl list --namespaces --dimm=DIMM-ID-number
In the following example, the nmem0 device contains the namespace0.0 and namespace0.2 namespaces, which you need to back up:
Example 28.4. NVDIMM Namespaces Listing
# ndctl list --namespaces --dimm=0
[
  {
    "dev":"namespace0.2",
    "mode":"sector",
    "size":67042312192,
    "uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "raw_uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "sector_size":4096,
    "blockdev":"pmem0.2s",
    "numa_node":0
  },
  {
    "dev":"namespace0.0",
    "mode":"sector",
    "size":67042312192,
    "uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "raw_uuid":"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "sector_size":4096,
    "blockdev":"pmem0s",
    "numa_node":0
  }
]
- Replace the broken NVDIMM physically.
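The lookup in the first three steps of this procedure (badblocks entry, then dev name, then phys_id, then dmidecode Handle) can be sketched in a short script. This is an illustration under stated assumptions, not a supported utility: the sample input is assembled from Example 28.1 and Example 28.2 on the assumption that listing DIMMs and regions together yields one object with "dimms" and "regions" keys, the function names are hypothetical, and the four-digit uppercase hexadecimal Handle format is inferred from Example 28.3.

```python
import json

# Sample assembled from Example 28.1 (regions) and Example 28.2 (dimms).
COMBINED_SAMPLE = """
{
  "dimms":[
    {"dev":"nmem1", "handle":"0x120", "phys_id":"0x1c"},
    {"dev":"nmem0", "handle":"0x20", "phys_id":"0x10",
     "flag_failed_flush":true, "flag_smart_event":true}
  ],
  "regions":[
    {"dev":"region0",
     "badblock_count":1,
     "badblocks":[{"offset":65536, "length":1, "dimms":["nmem0"]}]}
  ]
}
"""


def dimms_with_badblocks(data):
    """Collect the names of all DIMMs referenced by any badblocks entry."""
    names = set()
    for region in data.get("regions", []):
        for block in region.get("badblocks", []):
            names.update(block.get("dimms", []))
    return names


def dmidecode_handles(ndctl_json):
    """Map each DIMM that has bad blocks to the Handle value to search for."""
    data = json.loads(ndctl_json)
    broken = dimms_with_badblocks(data)
    handles = {}
    for dimm in data.get("dimms", []):
        if dimm.get("dev") in broken and "phys_id" in dimm:
            phys_id = dimm["phys_id"]
            # With --human the value is a hex string; otherwise an integer.
            if isinstance(phys_id, str):
                phys_id = int(phys_id, 16)
            # Assumption: dmidecode prints handles as 0x plus four hex digits.
            handles[dimm["dev"]] = "0x%04X" % phys_id
    return handles


if __name__ == "__main__":
    for dev, handle in sorted(dmidecode_handles(COMBINED_SAMPLE).items()):
        print("%s: look for 'Handle %s' in the dmidecode output" % (dev, handle))
```

The script only identifies which Handle entry to read. Backing up the namespaces and replacing the device remain manual steps.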