Injecting memory UE/CE faults with EINJ (代码007, unauthorized)

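Abbreviations

Abbreviation	Full name and description
SRAO	Software Recoverable Action Optional. Reported through MCE or CMCI. An SRAO error means that some data in the system is corrupted but has not yet been consumed. Software recovery is optional, and a recovery strategy can be chosen based on MCACOD.
SRAR	Software Recoverable Action Required. Reported through MCE. An SRAR error means that some data in the system is corrupted and is being consumed. Software must take a recovery action before the task on this CPU is scheduled again (usually killing the process currently running on that CPU, but not limited to that). If recovery is impossible, for example because the address or task information cannot be obtained, the kernel should panic.
EDAC	Error Detection And Correction
CE	Corrected Error
UCE	Uncorrected Error
UCR	Uncorrected Recoverable Error (the hardware cannot repair the error itself, but software can take action to recover from it)
MCA	Machine Check Architecture
MCE	Machine Check Exception (an exception)
CMCI	Corrected Machine Check Error Interrupt (an interrupt). It indicates that a correctable error has occurred in memory, usually caused by a software or hardware fault. Such an error does not crash the system but may degrade performance.
UCNA	Uncorrected No Action Required. Reported through CMCI. A UCNA error means that some data in the system is corrupted but has not been consumed (that is, not read), the processor state is still valid, and execution can continue on the processor.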

https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools/

If the platform supports EINJ, the boot log contains an EINJ entry:
ACPI: EINJ 0x0000000049D553E0 000150 (v01 ALASKA A M I 00000001 INTL 00000001)

Before loading the einj.ko.xz driver, enable the BIOS option WHEA Error Injection Support.
1. Driver path: /lib/modules/xx/kernel/drivers/acpi/apei/einj.ko.xz
2. Load the driver: insmod /lib/modules/xx/kernel/drivers/acpi/apei/einj.ko.xz
3. List the einj nodes: ll /sys/kernel/debug/apei/einj/

--w------- 1 root root 0 Sep 11 10:37 error_inject
-rw------- 1 root root 0 Sep 11 10:37 error_type
-rw------- 1 root root 0 Sep 11 10:37 flags
-rw------- 1 root root 0 Sep 11 10:37 notrigger
-rw------- 1 root root 0 Sep 11 10:37 param1
-rw------- 1 root root 0 Sep 11 10:37 param2
-rw------- 1 root root 0 Sep 11 10:37 param3
-rw------- 1 root root 0 Sep 11 10:37 param4
-r-------- 1 root root 0 Sep 11 10:37 vendor
-rw------- 1 root root 0 Sep 11 10:37 vendor_flags
-r-------- 1 root root 0 Sep 11 10:37 available_error_type

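$ insmod einj.ko.xz    # the driver is built as a kernel module once the relevant kernel options are enabled
After the driver loads successfully, the kernel log shows the following message: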

EINJ: Error INJection is initialized.

Documentation/acpi/apei/einj.txt

			APEI Error INJection
			~~~~~~~~~~~~~~~~~~~~

EINJ provides a hardware error injection mechanism. It is very useful
for debugging and testing APEI and RAS features in general.

You need to check whether your BIOS supports EINJ first. For that, look
for early boot messages similar to this one:

ACPI: EINJ 0x000000007370A000 000150 (v01 INTEL           00000001 INTL 00000001)

which shows that the BIOS is exposing an EINJ table - it is the
mechanism through which the injection is done.

Alternatively, look in /sys/firmware/acpi/tables for an "EINJ" file,
which is a different representation of the same thing.
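
The two checks described above can be scripted. The following is a minimal sketch, not part of the kernel documentation: the function names are hypothetical, and the regular expression assumes the boot line keeps the shape shown in the sample above.

```python
import os
import re

# Shape of the early boot line, e.g.
# "ACPI: EINJ 0x000000007370A000 000150 (v01 INTEL ... )"
EINJ_BOOT_LINE = re.compile(r"ACPI: EINJ 0x[0-9A-Fa-f]+ [0-9A-Fa-f]+ \(")


def einj_in_boot_log(log_text):
    """Return True if any line of the boot log announces an EINJ table."""
    return any(EINJ_BOOT_LINE.search(line) for line in log_text.splitlines())


def einj_table_file_exists(tables_dir="/sys/firmware/acpi/tables"):
    """Return True if the firmware exposes a file named EINJ in tables_dir."""
    return os.path.isfile(os.path.join(tables_dir, "EINJ"))
```

Feeding `einj_in_boot_log` the output of dmesg and calling `einj_table_file_exists()` with its default path covers both representations. A negative result from both still does not rule out EINJ support, for the reasons given in the next paragraph.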

It doesn't necessarily mean that EINJ is not supported if those above
don't exist: before you give up, go into BIOS setup to see if the BIOS
has an option to enable error injection. Look for something called WHEA
or similar. Often, you need to enable an ACPI5 support option prior, in
order to see the APEI,EINJ,... functionality supported and exposed by
the BIOS menu.

To use EINJ, make sure the following options are enabled in your kernel
configuration:

CONFIG_DEBUG_FS
CONFIG_ACPI_APEI
CONFIG_ACPI_APEI_EINJ
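
A small sketch of checking these three options in a kernel config file. The function name is hypothetical, and where the config text comes from (for example a file such as /boot/config-<version>) is an assumption about the distribution, not something the text above specifies.

```python
REQUIRED_OPTIONS = ("CONFIG_DEBUG_FS", "CONFIG_ACPI_APEI", "CONFIG_ACPI_APEI_EINJ")


def missing_einj_options(config_text):
    """Return the required options that are neither built in (=y) nor modular (=m)."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Lines such as "# CONFIG_FOO is not set" are comments and are skipped.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return [opt for opt in REQUIRED_OPTIONS if opt not in enabled]
```

An empty list means all three options are enabled; anything else names what is missing.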

The EINJ user interface is in <debugfs mount point>/apei/einj.

The following files belong to it:

- available_error_type

  This file shows which error types are supported:

  Error Type Value	Error Description
  ================	=================
  0x00000001		Processor Correctable
  0x00000002		Processor Uncorrectable non-fatal
  0x00000004		Processor Uncorrectable fatal
  0x00000008		Memory Correctable
  0x00000010		Memory Uncorrectable non-fatal
  0x00000020		Memory Uncorrectable fatal
  0x00000040		PCI Express Correctable
  0x00000080		PCI Express Uncorrectable non-fatal
  0x00000100		PCI Express Uncorrectable fatal
  0x00000200		Platform Correctable
  0x00000400		Platform Uncorrectable non-fatal
  0x00000800		Platform Uncorrectable fatal

  The format of the file contents is as above, except that only the
  available error types are present.

- error_type

  Set the value of the error type being injected. Possible error types
  are defined in the file available_error_type above.

- error_inject

  Write any integer to this file to trigger the error injection. Make
  sure you have specified all necessary error parameters, i.e. this
  write should be the last step when injecting errors.

- flags

  Present for kernel versions 3.13 and above. Used to specify which
  of param{1..4} are valid and should be used by the firmware during
  injection. Value is a bitmask as specified in ACPI5.0 spec for the
  SET_ERROR_TYPE_WITH_ADDRESS data structure:

	Bit 0 - Processor APIC field valid (see param3 below).
	Bit 1 - Memory address and mask valid (param1 and param2).
	Bit 2 - PCIe (seg,bus,dev,fn) valid (see param4 below).

  If set to zero, legacy behavior is mimicked where the type of
  injection specifies just one bit set, and param1 is multiplexed.

- param1

  This file is used to set the first error parameter value. Its effect
  depends on the error type specified in error_type. For example, if
  error type is memory related type, the param1 should be a valid
  physical memory address. [Unless "flags" is set - see above]

- param2

  Same use as param1 above. For example, if error type is of memory
  related type, then param2 should be a physical memory address mask.
  Linux requires page or narrower granularity, say, 0xfffffffffffff000.

- param3

  Used when the 0x1 bit is set in "flags" to specify the APIC id.

- param4

  Used when the 0x4 bit is set in "flags" to specify target PCIe device.

- notrigger

  The error injection mechanism is a two-step process. First inject the
  error, then perform some actions to trigger it. Setting "notrigger"
  to 1 skips the trigger phase, which *may* allow the user to cause the
  error in some other context by a simple access to the CPU, memory
  location, or device that is the target of the error injection. Whether
  this actually works depends on what operations the BIOS actually
  includes in the trigger phase.
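
Since available_error_type lists only what the platform supports, a script can read it before writing error_type. The sketch below parses text in the "value, description" format shown for that file; the function name is hypothetical.

```python
def parse_available_error_types(text):
    """Map each advertised error type value to its description."""
    types = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # the value, then the rest of the line
        if len(parts) == 2 and parts[0].lower().startswith("0x"):
            types[int(parts[0], 16)] = parts[1].strip()
    return types
```

A caller would check that the intended value (for example 0x8) is a key of the returned dictionary before going any further.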

BIOS versions based on the ACPI 4.0 specification have limited options
in controlling where the errors are injected. Your BIOS may support an
extension (enabled with the param_extension=1 module parameter, or boot
command line einj.param_extension=1). This allows the address and mask
for memory injections to be specified by the param1 and param2 files in
apei/einj.

BIOS versions based on the ACPI 5.0 specification have more control over
the target of the injection. For processor-related errors (type 0x1, 0x2
and 0x4), you can set flags to 0x3 (param3 for bit 0, and param1 and
param2 for bit 1) so that you have more information added to the error
signature being injected. The actual data passed is this:

	memory_address = param1;
	memory_address_range = param2;
	apicid = param3;
	pcie_sbdf = param4;
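
As an illustration of how the flags bitmask relates to the four parameters, here is a minimal sketch that derives flags from whichever fields are supplied. The constant and function names are hypothetical; only the bit meanings come from the text above. It only computes the values to write and leaves the actual writes, including the final one to error_inject, to the caller.

```python
FLAG_APIC = 0x1  # bit 0: processor APIC field valid (param3)
FLAG_MEM = 0x2   # bit 1: memory address and mask valid (param1, param2)
FLAG_PCIE = 0x4  # bit 2: PCIe (seg,bus,dev,fn) valid (param4)


def einj_settings(error_type, address=None, mask=None, apic_id=None, pcie_sbdf=None):
    """Return {file name: hex string} with flags derived from the given fields."""
    flags = 0
    values = {"error_type": error_type}
    if apic_id is not None:
        flags |= FLAG_APIC
        values["param3"] = apic_id
    if address is not None:
        flags |= FLAG_MEM
        values["param1"] = address
        # A missing mask is written as 0 here (an assumption of this sketch).
        values["param2"] = mask if mask is not None else 0
    if pcie_sbdf is not None:
        flags |= FLAG_PCIE
        values["param4"] = pcie_sbdf
    values["flags"] = flags
    return {name: hex(value) for name, value in values.items()}
```

For a processor error with both an APIC id and a memory address, `einj_settings(0x2, address=0x12345000, mask=0xfffffffffffff000, apic_id=4)` yields flags of 0x3, matching the case described above.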

For memory errors (type 0x8, 0x10 and 0x20) the address is set using
param1 with a mask in param2 (0x0 is equivalent to all ones). For PCI
express errors (type 0x40, 0x80 and 0x100) the segment, bus, device and
function are specified using param1:

         31     24 23    16 15    11 10      8  7        0
	+-------------------------------------------------+
	| segment |   bus  | device | function | reserved |
	+-------------------------------------------------+
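
The bit layout in the diagram can be packed and unpacked as follows. This is a sketch with hypothetical function names; the field positions are taken directly from the diagram (segment 31..24, bus 23..16, device 15..11, function 10..8, low byte reserved).

```python
def encode_sbdf(segment, bus, device, function):
    """Pack segment/bus/device/function into the 32-bit layout shown above."""
    if not (0 <= segment <= 0xFF and 0 <= bus <= 0xFF
            and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("field out of range for the 8/8/5/3-bit layout")
    return (segment << 24) | (bus << 16) | (device << 11) | (function << 8)


def decode_sbdf(value):
    """Unpack a 32-bit value into (segment, bus, device, function)."""
    return ((value >> 24) & 0xFF, (value >> 16) & 0xFF,
            (value >> 11) & 0x1F, (value >> 8) & 0x7)
```

For example, device 0000:3b:00.1 encodes to 0x003b0100.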

Anyway, you get the idea, if there's doubt just take a look at the code
in drivers/acpi/apei/einj.c.

An ACPI 5.0 BIOS may also allow vendor-specific errors to be injected.
In this case a file named vendor will contain identifying information
from the BIOS that hopefully will allow an application wishing to use
the vendor-specific extension to tell that they are running on a BIOS
that supports it. All vendor extensions have the 0x80000000 bit set in
error_type. A file vendor_flags controls the interpretation of param1
and param2 (1 = PROCESSOR, 2 = MEMORY, 4 = PCI). See your BIOS vendor
documentation for details (and expect changes to this API if vendors'
creativity in using this feature expands beyond our expectations).


An error injection example:

# cd /sys/kernel/debug/apei/einj
# cat available_error_type		# See which errors can be injected
0x00000002	Processor Uncorrectable non-fatal
0x00000008	Memory Correctable
0x00000010	Memory Uncorrectable non-fatal
# echo 0x12345000 > param1		# Set memory address for injection
# echo $((-1 << 12)) > param2		# Mask 0xfffffffffffff000 - anywhere in this page
# echo 0x8 > error_type			# Choose correctable memory error
# echo 1 > error_inject			# Inject now
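
The same sequence can be driven from a script. The sketch below mirrors the shell example step by step; the function names are hypothetical, the directory is a parameter so the logic can be exercised against a scratch directory, and running it against the real debugfs path needs root and a platform with EINJ enabled. Values are written in hexadecimal, on the assumption that the files accept the same 0x-prefixed form used in the shell example.

```python
import os

EINJ_DIR = "/sys/kernel/debug/apei/einj"


def page_mask(page_shift=12):
    """64-bit mask selecting one page: 0xfffffffffffff000 for 4 KiB pages."""
    return (-1 << page_shift) & 0xFFFFFFFFFFFFFFFF


def inject_memory_error(address, error_type=0x8, einj_dir=EINJ_DIR):
    """Write param1, param2 and error_type, then error_inject as the last step."""
    steps = [
        ("param1", hex(address)),        # target physical address
        ("param2", hex(page_mask())),    # anywhere within that page
        ("error_type", hex(error_type)), # 0x8 = correctable memory error
        ("error_inject", "1"),           # must come last: triggers the injection
    ]
    for name, value in steps:
        with open(os.path.join(einj_dir, name), "w") as f:
            f.write(value + "\n")
    return steps
```

`inject_memory_error(0x12345000)` corresponds to the four echo commands above.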

You should see something like this in dmesg:

[22715.830801] EDAC sbridge MC3: HANDLING MCE MEMORY ERROR
[22715.834759] EDAC sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
[22715.834759] EDAC sbridge MC3: TSC 0
[22715.834759] EDAC sbridge MC3: ADDR 12345000 EDAC sbridge MC3: MISC 144780c86
[22715.834759] EDAC sbridge MC3: PROCESSOR 0:306e7 TIME 1422553404 SOCKET 0 APIC 0
[22716.616173] EDAC MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
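
To confirm that the reported error is the injected one, the page and offset in the final EDAC line can be turned back into a physical address and compared with param1. This is a sketch with a hypothetical function name; it assumes 4 KiB pages and that the EDAC summary line keeps the format shown above.

```python
import re

EDAC_EVENT = re.compile(
    r"EDAC MC(\d+): (\d+) (CE|UE) .*?page:(0x[0-9a-fA-F]+) offset:(0x[0-9a-fA-F]+)"
)


def parse_edac_event(line, page_shift=12):
    """Extract controller, count, kind and physical address from an EDAC line."""
    m = EDAC_EVENT.search(line)
    if m is None:
        return None
    page = int(m.group(4), 16)
    offset = int(m.group(5), 16)
    return {
        "mc": int(m.group(1)),
        "count": int(m.group(2)),
        "kind": m.group(3),
        "address": (page << page_shift) + offset,
    }
```

For the last log line above, the result carries kind "CE" and address 0x12345000, the value written to param1.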

For more information about EINJ, please refer to ACPI specification
version 4.0, section 17.5 and ACPI 5.0, section 18.6.

./einj_mem_uc: invalid option -- '-'
Usage: ./einj_mem_uc [-a][-c count][-d delay][-f][-i][j][k] [-m runup:size:align][testname]
  Testname Fatal Description
  single   no    Single read in pipeline to target address, generates SRAR machine check
  double   no    Double read in pipeline to target address, generates SRAR machine check
  split    YES   Unaligned read crosses cacheline from good to bad. Probably fatal
  THP      no    Try to inject in transparent huge page, generates SRAR machine check
  hugetlb  no    Try to inject in hugetlb page, generates SRAR machine check
  store    no    Write to target address. Should generate a UCNA/CMCI
  prefetch no    Prefetch data into L1 cache. Should generate CMCI
  memcpy   YES   Streaming read from target address. Probably fatal
  instr    no    Instruction fetch. Generates SRAR that OS should transparently fix
  patrol   no    Patrol scrubber, generates SRAO machine check
  thread   no    Single read by two threads to target address at the same time, generates SRAR machine check
  share    no    Share memory is read by two tasks to target address, generates SRAR machine check
  overflow YES   Read to two target addresses at the same time, Probably fatal
  llc      no    Cache write-back, generates SRAO machine check
  copyin   YES   Kernel copies data from user. Probably fatal
  copyout  YES   Kernel copies data to user. Probably fatal
  copy-on-write YES   Kernel copies user page. Probably fatal
  futex    YES   Kernel access to futex(2). Probably fatal
  mlock    no    mlock target page then inject/read to generates SRAR machine check
  core_ce  no    Core corrected error
  core_non_fatal no    Core deferred error
  core_fatal YES   Core uncorrected error. Should fatal
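
Because several of these tests are marked as probably fatal, a harness may want to select only the ones marked "no" before running anything. The sketch below parses the usage table printed above; the function names are hypothetical and it does not invoke einj_mem_uc itself.

```python
import re

# A table row: test name, then "no" or "YES", then the description.
ROW = re.compile(r"^\s*(\S+)\s+(no|YES)\s+(.+)$")


def parse_tests(usage_text):
    """Return {test name: (is_fatal, description)} from the usage table."""
    tests = {}
    for line in usage_text.splitlines():
        m = ROW.match(line)
        if m:
            tests[m.group(1)] = (m.group(2) == "YES", m.group(3).strip())
    return tests


def non_fatal_tests(usage_text):
    """Names of the tests whose Fatal column reads "no", in table order."""
    return [name for name, (fatal, _) in parse_tests(usage_text).items() if not fatal]
```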

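Term	Explanation
Single read in pipeline	My understanding is that `pipeline` here refers to the CPU's pipeline.
split	An unaligned read crosses a cache line boundary from good to bad data; probably fatal.
THP	Transparent Huge Pages are huge pages allocated dynamically at run time, whereas standard HugePages are reserved in advance, typically at system startup, rather than allocated on demand.
hugetlb	hugetlb is in effect the manager of huge pages: this module is responsible for allocating and freeing them.
prefetch	Prefetches data (stored in memory) into the L1 cache.
patrol	Patrol scrubbing (background inspection of memory).
futex	The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space.
mlock	The mlock family of system calls lets a program lock part or all of its address space into physical memory. This prevents Linux from moving those pages out to swap space, even if the program has not accessed them for some time.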

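The cache figure reported by free is made up of file pages: each of them corresponds to a file or to data in the system. A page with no corresponding file is called an anonymous page. When memory runs short, file-backed pages can be written back to their file on disk (page-out) to free memory, and the data is read from disk again when needed. For example, echo 3 > /proc/sys/vm/drop_caches releases most of the cache.

The heap, stack, data segments and other application pages that have no file behind them are called anonymous pages. They do not exist as files, so they cannot be exchanged with a file on disk, but they can be swapped out to a dedicated swap partition or to a swap file (swap-out). Anonymous pages live and die with the user process: they are freed when the process exits, whereas the page cache can stay cached after the process has exited.

Memory pages holding data that an application has modified but that has not yet been written to disk are called dirty pages. Such pages must be written to disk before they can be freed. Dirty pages generally reach the disk in one of two ways: the application calls fsync to flush them, or the work is left to the system, where a kernel writeback thread (pdflush in older kernels) flushes them.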
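
The first of those two ways, an explicit fsync from the application, looks roughly like this. It is a minimal sketch with a hypothetical function name, using only the standard os module.

```python
import os


def write_durably(path, data):
    """Write bytes to path and return its size once fsync has completed."""
    with open(path, "wb") as f:
        f.write(data)         # data sits in Python's buffer, then in dirty pages
        f.flush()             # hand Python's user-space buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to write this file's dirty pages out
    return os.path.getsize(path)
```

Without the fsync call the write still succeeds, but the pages stay dirty until the kernel's background writeback gets to them.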

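To reduce the probability of TLB misses, Linux introduced the Hugepages mechanism, which allows a page size of 2MB or 1GB. With 2MB hugepages, the number of page table entries needed for 256GB of memory drops to 256GB/2MB=131072, which takes only about 1MB at 8 bytes per entry, so the hugepage page tables can be held entirely in the CPU cache. sysctl -w vm.nr_hugepages=1024 sets the number of hugepages to 1024, for a total size of 2GB. Note that configuring hugepages requests contiguous 2MB blocks from the system and reserves them (they can no longer be used for normal memory allocations); once the system has been running for a while and memory has become fragmented, requesting more hugepages may fail.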
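
The arithmetic behind these figures can be checked with a few lines. This sketch assumes 8-byte page table entries, as on x86-64, and counts only the last-level entries; the function names are hypothetical.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30
PTE_SIZE = 8  # bytes per entry; an assumption that holds for x86-64


def page_table_entries(memory_bytes, page_bytes):
    """Number of last-level entries needed to map memory_bytes."""
    return memory_bytes // page_bytes


def page_table_bytes(memory_bytes, page_bytes, pte_size=PTE_SIZE):
    """Space taken by those last-level entries."""
    return page_table_entries(memory_bytes, page_bytes) * pte_size


def reserved_bytes(nr_hugepages, page_bytes=2 * MIB):
    """Memory set aside by vm.nr_hugepages for a given hugepage size."""
    return nr_hugepages * page_bytes
```

Under these assumptions, 256 GiB mapped with 4 KiB pages needs 67108864 entries (512 MiB), while 2 MiB pages need 131072 entries (1 MiB), and 1024 hugepages of 2 MiB reserve 2 GiB.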

vDSO stands for virtual dynamic shared object: the mapping actually contains an ELF shared object, commonly known as a .so.

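Type	Description
CE	An error occurred while the server was running, but it can be corrected by ECC (Error Checking and Correcting), so CE errors are sometimes also called ECC errors. Sporadic address/command errors, multi-bit errors within a single device on x4 memory, and single-bit errors within a single device on x8 memory can all cause ECC errors. CE errors have no impact on the system.
CE Storm	The BIOS records a timestamp each time it handles an SMI. Whenever two consecutive SMIs are less than 1 minute apart the count keeps increasing, and when the count reaches 10 a CE storm is declared.
CE Overflow	Correctable memory errors are counted per rank, with a configurable threshold. When the correctable errors within a rank reach the threshold and overflow, an SMI is triggered. To take time into account, a hardware leaky bucket is added: each time a given interval passes without a CE, the error counter is automatically decremented by 1.
UCE	An error occurred while the server was running and cannot be corrected by ECC. Multi-bit errors on x8 memory, multi-bit errors across multiple devices on x4 memory, persistent address/command errors, and a damaged register chip can all cause UCE.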
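
The CE Storm and CE Overflow rules are simple counters, and a small model makes the behaviour concrete. This is only a sketch of the logic as described in the table, not BIOS code: the class names are hypothetical, the first SMI is assumed to start the count at 1, a gap of a minute or more is assumed to restart it, and what happens to the rank counter after an overflow is left open.

```python
class CeStormDetector:
    """Counts SMIs that follow the previous one within `window` seconds."""

    def __init__(self, window=60.0, threshold=10):
        self.window = window
        self.threshold = threshold
        self.last = None
        self.count = 0

    def on_smi(self, timestamp):
        """Record one SMI; return True once the run reaches the threshold."""
        if self.last is not None and timestamp - self.last < self.window:
            self.count += 1
        else:
            self.count = 1  # assumption: a longer gap restarts the run
        self.last = timestamp
        return self.count >= self.threshold


class RankCeCounter:
    """Per-rank CE counter with a leaky bucket."""

    def __init__(self, threshold, leak_interval):
        self.threshold = threshold
        self.leak_interval = leak_interval
        self.count = 0
        self.last = 0.0  # time of the last CE or the last leak step

    def _leak(self, now):
        # One decrement for every full interval that passed without a CE.
        steps = int((now - self.last) // self.leak_interval)
        if steps > 0:
            self.count = max(0, self.count - steps)
            self.last += steps * self.leak_interval

    def on_ce(self, now):
        """Record one CE; return True when the threshold is reached (SMI)."""
        self._leak(now)
        self.count += 1
        self.last = now
        return self.count >= self.threshold
```

With the defaults, ten SMIs arriving 30 seconds apart trip the storm detector on the tenth, while a single gap of more than a minute restarts the count.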


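Memory read/write error (Corrected read/write Error)	While the server is running and exchanging data for its workload, a memory fault corrupts the data; the Intel CPU detects this during the transfer and raises an alarm.
Memory patrol error	While the server is running, the Intel CPU patrols the memory. If it finds a memory UCE fault it raises an alarm to the OS. In many cases, however, the memory has not actually failed: a latent bug in the data verification mechanism causes a false alarm, and the error can be downgraded and handled as a CE.
Corrected Patrol Scrub Error	Memory contents are read while the system is idle. If the data read contains a correctable error, the corrected data is written back to memory (an unrecoverable error is reported as a Downgraded Uncorrected Patrol Scrubbing Error).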