linux/include
Breno Leitao 3fa805c37d vmcoreinfo: track and log recoverable hardware errors
Introduce a generic infrastructure for tracking recoverable hardware
errors (HW errors that are visible to the OS but does not cause a panic)
and record them for vmcore consumption.  This aids post-mortem crash
analysis tools by preserving a count and timestamp for the last occurrence
of such errors.  On the other side, correctable errors, which the OS
typically remains unaware of because the underlying hardware handles them
transparently, are less relevant for crash dump and therefore are NOT
tracked in this infrastructure.

Add centralized logging for sources of recoverable hardware errors based
on the subsystem it has been notified.

hwerror_data is write-only at kernel runtime, and it is meant to be read
from vmcore using tools like crash/drgn.  For example, this is how it
looks like when opening the crashdump from drgn.

	>>> prog['hwerror_data']
	(struct hwerror_info[1]){
		{
			.count = (int)844,
			.timestamp = (time64_t)1752852018,
		},
		...

This helps fleet operators quickly triage whether a crash may be
influenced by hardware recoverable errors (which executes a uncommon code
path in the kernel), especially when recoverable errors occurred shortly
before a panic, such as the bug fixed by commit ee62ce7a1d ("page_pool:
Track DMA-mapped pages and unmap them when destroying the pool")

This is not intended to replace full hardware diagnostics but provides a
fast way to correlate hardware events with kernel panics quickly.

Rare machine check exceptions—like those indicated by mce_flags.p5 or
mce_flags.winchip—are not accounted for in this method, as they fall
outside the intended usage scope for this feature's user base.

[leitao@debian.org: add hw-recoverable-errors to toctree]
  Link: https://lkml.kernel.org/r/20251127-vmcoreinfo_fix-v1-1-26f5b1c43da9@debian.org
Link: https://lkml.kernel.org/r/20251010-vmcore_hw_error-v5-1-636ede3efe44@debian.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Suggested-by: Tony Luck <tony.luck@intel.com>
Suggested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>	[APEI]
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Bob Moore <robert.moore@intel.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Konrad Rzessutek Wilk <konrad.wilk@oracle.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27 14:24:44 -08:00
..
acpi More power management updates for 6.18-rc1 2025-10-07 09:39:51 -07:00
asm-generic kbuild: align modinfo section for Secureboot Authenticode EDK2 compat 2025-10-27 16:21:24 -07:00
clocksource clocksource/drivers/arm_arch_timer_mmio: Switch over to standalone driver 2025-09-23 12:31:50 +02:00
crypto This update includes the following changes: 2025-10-04 14:59:29 -07:00
cxl
drm kbuild: Let kernel-doc.py use PYTHON3 override 2025-11-08 19:42:22 -07:00
dt-bindings There's a bunch of patches here across drivers/clk/ to migrate drivers to use 2025-10-07 09:28:37 -07:00
hyperv hyperv: Remove the spurious null directive line 2025-10-02 21:21:24 +00:00
keys KEYS: trusted_tpm1: Move private functionality out of public header 2025-09-27 21:05:06 +03:00
kunit linux_kselftest-kunit-6.18-rc1 2025-10-01 19:15:11 -07:00
kvm KVM: arm64: Kill leftovers of ad-hoc timer userspace access 2025-10-13 14:42:41 +01:00
linux vmcoreinfo: track and log recoverable hardware errors 2025-11-27 14:24:44 -08:00
math-emu
media
memory
misc
net hardening fixes for v6.18-rc5 2025-11-06 11:54:59 -08:00
pcmcia
ras
rdma
rv kernel-6.18-rc1.clone3 2025-09-29 10:36:50 -07:00
scsi scsi: core: Fix the unit attention counter implementation 2025-10-21 21:09:36 -04:00
soc There's a bunch of patches here across drivers/clk/ to migrate drivers to use 2025-10-07 09:28:37 -07:00
sound ASoC: tas2781: Support more newly-released amplifiers tas58xx in the driver 2025-10-13 11:08:09 +01:00
target
trace trace: tcp: add three metrics to trace_tcp_rcvbuf_grow() 2025-10-29 17:30:18 -07:00
uapi vmcoreinfo: track and log recoverable hardware errors 2025-11-27 14:24:44 -08:00
ufs scsi: ufs: core: Add a quirk to suppress link_startup_again 2025-10-29 23:20:19 -04:00
vdso Updates for the VDSO subsystem: 2025-09-30 16:58:21 -07:00
video
xen
Kbuild