21.6. Checking for Hardware Errors
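Red Hat Enterprise Linux 7 introduced the new hardware event report mechanism (HERM). This mechanism gathers system-reported memory errors, as well as errors reported by the error detection and correction (EDAC) mechanism for dual in-line memory modules (DIMMs), and reports them to user space. The user-space daemon rasdaemon catches and handles all reliability, availability, and serviceability (RAS) error events that come from the kernel tracing mechanism, and logs them. The functions previously provided by edac-utils are now replaced by rasdaemon.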
To install rasdaemon, enter the following command as root:
~]# yum install rasdaemon
Start the service as follows:
~]# systemctl start rasdaemon
To make the service run at system start, enter the following command:
~]# systemctl enable rasdaemon
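To confirm that the service is both running and enabled, the standard systemctl is-active and is-enabled subcommands can be queried. The following is a minimal sketch only; the function name rasdaemon_state is hypothetical and is not part of rasdaemon.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with rasdaemon): print the runtime and
# boot-time state of the rasdaemon service on one line.
rasdaemon_state() {
    # "systemctl is-active" prints a state such as "active" or "inactive";
    # "systemctl is-enabled" prints a state such as "enabled" or "disabled".
    # Both exit non-zero for the negative states, hence "|| true".
    active=$(systemctl is-active rasdaemon 2>/dev/null || true)
    enabled=$(systemctl is-enabled rasdaemon 2>/dev/null || true)
    # Fall back to "unknown" if systemctl printed nothing at all.
    printf 'rasdaemon: %s, %s\n' "${active:-unknown}" "${enabled:-unknown}"
}
```

After sourcing the snippet, calling rasdaemon_state on a system where both of the commands above succeeded would be expected to print "rasdaemon: active, enabled".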
The ras-mc-ctl utility provides a means to work with EDAC drivers. Enter the following command to see a list of command options:
~]$ ras-mc-ctl --help
Usage: ras-mc-ctl [OPTIONS...]
--quiet Quiet operation.
--mainboard Print mainboard vendor and model for this hardware.
--status Print status of EDAC drivers.
output truncated
To view a summary of memory controller events, run as root:
~]# ras-mc-ctl --summary
Memory controller events summary:
    Corrected on DIMM Label(s): 'CPU_SrcID#0_Ha#0_Chan#0_DIMM#0' location: 0:0:0:-1 errors: 1
No PCIe AER errors.
No Extlog errors.
MCE records summary:
    1 MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error errors
    2 No Error errors
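For monitoring scripts it can be useful to reduce the summary to one line per affected DIMM. The sketch below is an illustration only: the function name corrected_per_dimm is hypothetical, and the parsing assumes the "Corrected on DIMM Label(s): '...' ... errors: N" layout seen in the sample output, which may differ between rasdaemon versions.

```shell
#!/bin/sh
# Hypothetical helper: read "ras-mc-ctl --summary" output on stdin and print
# "<DIMM label> <corrected error count>" for each corrected-error record.
# Assumes the record layout shown in the sample output of this section.
corrected_per_dimm() {
    # Use the single quote as field separator so that $2 is the DIMM label.
    awk -F"'" '/Corrected on DIMM Label\(s\):/ {
        # "errors: " is 8 characters long; the digits follow it.
        if (match($0, /errors: [0-9]+/))
            print $2, substr($0, RSTART + 8, RLENGTH - 8)
    }'
}
```

A possible invocation, run as root after sourcing the snippet, is: ras-mc-ctl --summary | corrected_per_dimm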
To view a list of errors reported by the memory controller, run as root:
~]# ras-mc-ctl --errors
Memory controller events:
1 3172-02-17 00:47:01 -0500 1 Corrected error(s): memory read error at CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 location: 0:0:0:-1, addr 65928, grain 7, syndrome 0 area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0
No PCIe AER errors.
No Extlog errors.
MCE events:
1 3171-11-09 06:20:21 -0500 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error, n_errors=1, mcgcap=0x01000c16, status=0x8c00004000010090, addr=0x1018893000, misc=0x15020a086, walltime=0x57e96780, cpuid=0x00050663, bank=0x00000007
2 3205-06-22 00:13:41 -0400 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x01000c16, status=0x9400000000000000, addr=0x0000abcd, walltime=0x57e967ea, cpuid=0x00050663, bank=0x00000001
3 3205-06-22 00:13:41 -0400 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x01000c16, status=0x9400000000000000, addr=0x00001234, walltime=0x57e967ea, cpu=0x00000001, cpuid=0x00050663, apicid=0x00000002, bank=0x00000002
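The MCE event records carry most of their detail as comma-separated key=value pairs (for example status, addr, and bank). The following sketch pulls a single field out of every record that contains it. The function name mce_field is hypothetical, and the parsing assumes the comma-separated layout of the sample output; it is not a documented, stable format.

```shell
#!/bin/sh
# Hypothetical helper: read "ras-mc-ctl --errors" output on stdin and print
# the value of one key=value field (for example "bank") for every record
# that contains it. Usage: mce_field KEY
mce_field() {
    key=$1
    awk -v key="$key" '{
        # Split each record on commas followed by optional spaces.
        n = split($0, parts, /, */)
        for (i = 1; i <= n; i++) {
            eq = index(parts[i], "=")
            if (eq > 0 && substr(parts[i], 1, eq - 1) == key) {
                value = substr(parts[i], eq + 1)
                sub(/[ \t].*$/, "", value)   # drop anything after the value
                print value
            }
        }
    }'
}
```

A possible invocation, run as root after sourcing the snippet, is: ras-mc-ctl --errors | mce_field bank

Only exact key matches are printed, so fields whose text before the equals sign contains a space (such as "mcg mcgstatus=0") are not matched by the key mcgstatus in this sketch.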
These commands are described in the ras-mc-ctl(8) manual page.