linux/kernel
Shakeel Butt 3309b63a22 cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
On x86-64, this_cpu_cmpxchg() uses CMPXCHG without LOCK prefix which
means it is only safe for the local CPU and not for multiple CPUs.
Recently the commit 36df6e3dbd ("cgroup: make css_rstat_updated nmi
safe") make css_rstat_updated lockless and uses lockless list to allow
reentrancy. Since css_rstat_updated can invoked from process context,
IRQ and NMI, it uses this_cpu_cmpxchg() to select the winner which will
inset the lockless lnode into the global per-cpu lockless list.

However the commit missed one case where lockless node of a cgroup can
be accessed and modified by another CPU doing the flushing. Basically
llist_del_first_init() in css_process_update_tree().

On a cursory look, it can be questioned how css_process_update_tree()
can see a lockless node in global lockless list where the updater is at
this_cpu_cmpxchg() and before llist_add() call in css_rstat_updated().
This can indeed happen in the presence of IRQs/NMI.

Consider this scenario: Updater for cgroup stat C on CPU A in process
context is after llist_on_list() check and before this_cpu_cmpxchg() in
css_rstat_updated() where it get interrupted by IRQ/NMI. In the IRQ/NMI
context, a new updater calls css_rstat_updated() for same cgroup C and
successfully inserts rstatc_pcpu->lnode.

Now concurrently CPU B is running the flusher and it calls
llist_del_first_init() for CPU A and got rstatc_pcpu->lnode of cgroup C
which was added by the IRQ/NMI updater.

Now imagine CPU B calling init_llist_node() on cgroup C's
rstatc_pcpu->lnode of CPU A and on CPU A, the process context updater
calling this_cpu_cmpxchg(rstatc_pcpu->lnode) concurrently.

The CMPXCNG without LOCK on CPU A is not safe and thus we need LOCK
prefix.

In Meta's fleet running the kernel with the commit 36df6e3dbd, we are
observing on some machines the memcg stats are getting skewed by more
than the actual memory on the system. On close inspection, we noticed
that lockless node for a workload for specific CPU was in the bad state
and thus all the updates on that CPU for that cgroup was being lost.

To confirm if this skew was indeed due to this CMPXCHG without LOCK in
css_rstat_updated(), we created a repro (using AI) at [1] which shows
that CMPXCHG without LOCK creates almost the same lnode corruption as
seem in Meta's fleet and with LOCK CMPXCHG the issue does not
reproduces.

Link: http://lore.kernel.org/efiagdwmzfwpdzps74fvcwq3n4cs36q33ij7eebcpssactv3zu@se4hqiwxcfxq [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: stable@vger.kernel.org # v6.17+
Fixes: 36df6e3dbd ("cgroup: make css_rstat_updated nmi safe")
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08 08:26:56 -10:00
..
bpf Some filesystems use a kinda-sorta controlled dentry refcount leak to pin 2025-12-05 14:36:21 -08:00
cgroup cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated 2025-12-08 08:26:56 -10:00
configs hung_task: panic when there are more than N hung tasks at the same time 2025-11-12 10:00:14 -08:00
debug kdb: Adapt kdb_msg_write to work with NBCON consoles 2025-10-24 12:56:20 +02:00
dma dma-mapping updates for Linux 6.19: 2025-12-06 09:25:05 -08:00
entry rseq: Switch to TIF_RSEQ if supported 2025-11-04 08:35:37 +01:00
events Performance events changes for v6.19: 2025-12-01 20:42:01 -08:00
futex Scoped user mode access and related changes: 2025-12-02 08:01:39 -08:00
gcov gcov: add support for GCC 15 2025-11-09 21:19:44 -08:00
irq soc: driver updates for 6.19 2025-12-05 17:29:04 -08:00
kcsan Kernel Concurrency Sanitizer (KCSAN) updates for v6.18 2025-10-02 08:31:44 -07:00
livepatch livepatching changes for 6.19 2025-12-03 13:46:48 -08:00
liveupdate kho: fix restoring of contiguous ranges of order-0 pages 2025-11-27 14:24:44 -08:00
locking RCU pull request for v6.19 2025-12-03 12:18:07 -08:00
module Significant patch series in this pull request: 2025-12-06 14:01:20 -08:00
power Merge branches 'pm-qos' and 'pm-tools' 2025-11-28 16:50:45 +01:00
printk soc: driver updates for 6.19 2025-12-05 17:29:04 -08:00
rcu RCU pull request for v6.19 2025-12-03 12:18:07 -08:00
sched Miscellaneous scheduler fixes/cleanups: 2025-12-06 12:31:21 -08:00
time soc: driver updates for 6.19 2025-12-05 17:29:04 -08:00
trace tracing fixes for v6.19: 2025-12-06 13:49:40 -08:00
unwind unwind_user/x86: Teach FP unwind about start of function 2025-10-29 10:29:58 +01:00
.gitignore kheaders: rebuild kheaders_data.tar.xz when a file is modified within a minute 2025-06-24 20:30:37 +09:00
acct.c act: use credential guards in acct_write_process() 2025-11-04 12:36:49 +01:00
async.c
audit.c audit: fix skb leak when audit rate limit is exceeded 2025-09-10 19:55:00 -04:00
audit.h audit: fix comment misindentation in audit.h 2025-10-22 19:28:06 -04:00
audit_fsnotify.c VFS/audit: introduce kern_path_parent() for audit 2025-09-23 12:37:35 +02:00
audit_tree.c mount-related stuff for this cycle 2025-10-03 10:19:44 -07:00
audit_watch.c VFS/audit: introduce kern_path_parent() for audit 2025-09-23 12:37:35 +02:00
auditfilter.c audit: Use kzalloc() instead of kmalloc()/memset() in audit_krule_to_data() 2025-11-07 16:38:34 -05:00
auditsc.c audit: merge loops in __audit_inode_child() 2025-11-07 16:50:42 -05:00
backtracetest.c
bounds.c x86/asm: Remove ANNOTATE_DATA_SPECIAL usage 2025-12-03 16:53:19 +01:00
capability.c capability: Remove unused has_capability 2025-03-07 22:03:09 -06:00
cfi.c cfi: Move BPF CFI types and helpers to generic code 2025-07-31 18:23:53 -07:00
compat.c
configs.c
context_tracking.c context_tracking: Make RCU watch ct_kernel_exit_state() warning 2025-03-04 18:44:29 -08:00
cpu.c cpumask: Cache num_possible_cpus() 2025-11-25 19:45:40 +01:00
cpu_pm.c syscore: Pass context data to callbacks 2025-11-14 10:01:52 +01:00
crash_core.c crash: fix crashkernel resource shrink 2025-11-15 10:52:01 -08:00
crash_core_test.c crash: add KUnit tests for crash_exclude_mem_range 2025-09-13 17:32:55 -07:00
crash_dump_dm_crypt.c crash_dump: retrieve dm crypt keys in kdump kernel 2025-05-21 10:48:21 -07:00
crash_reserve.c crash: let architecture decide crash memory export to iomem_resource 2025-11-12 10:00:15 -08:00
cred.c kernel-6.19-rc1.cred 2025-12-01 13:45:41 -08:00
delayacct.c delayacct: remove redundant code and adjust indentation 2025-05-27 19:40:33 -07:00
dma.c
elfcorehdr.c
exec_domain.c
exit.c Significant patch series in this pull request: 2025-12-06 14:01:20 -08:00
exit.h
extable.c
fail_function.c
fork.c Significant patch series in this pull request: 2025-12-06 14:01:20 -08:00
freezer.c freezer: Clarify that only cgroup1 freezer uses PM freezer 2025-10-30 20:10:27 +01:00
gen_kheaders.sh kheaders: make it possible to override TAR 2025-08-06 10:23:36 +09:00
groups.c
hung_task.c hung_task: add hung_task_sys_info sysctl to dump sys info on task-hung 2025-11-20 14:03:43 -08:00
iomem.c mm/memremap: Pass down MEMREMAP_* flags to arch_memremap_wb() 2025-02-21 15:05:38 +01:00
irq_work.c kasan: make kasan_record_aux_stack_noalloc() the default behaviour 2025-01-13 22:40:36 -08:00
jump_label.c jump_label: Use RCU in all users of __module_text_address(). 2025-03-10 11:54:46 +01:00
kallsyms.c kallsyms: Fix wrong "big" kernel symbol type read from procfs 2025-11-24 16:56:24 +01:00
kallsyms_internal.h
kallsyms_selftest.c kallsyms: use kmalloc_array() instead of kmalloc() 2025-09-28 11:36:14 -07:00
kallsyms_selftest.h
kcmp.c kcmp: improve performance adding an unlikely hint to task comparisons 2025-02-21 10:25:33 +01:00
Kconfig.freezer
Kconfig.hz kernel: Fix "select" wording on HZ_250 description 2025-02-21 09:20:30 +01:00
Kconfig.kexec liveupdate: kho: move to kernel/liveupdate 2025-11-27 14:24:33 -08:00
Kconfig.locks
Kconfig.preempt softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT 2025-09-17 16:25:41 +02:00
kcov.c kcov: use write memory barrier after memcpy() in kcov_move_area() 2025-09-13 17:32:44 -07:00
kexec.c kexec: enable CMA based contiguous allocation 2025-08-02 12:01:38 -07:00
kexec_core.c kexec: move sysfs entries to /sys/kernel/kexec 2025-11-27 14:24:42 -08:00
kexec_elf.c kexec: initialize ELF lowest address to ULONG_MAX 2025-03-16 22:30:47 -07:00
kexec_file.c x86/kexec: carry forward the boot DTB on kexec 2025-09-13 17:32:43 -07:00
kexec_internal.h kexec: enable CMA based contiguous allocation 2025-08-02 12:01:38 -07:00
kheaders.c kheaders: Simplify attribute through __BIN_ATTR_SIMPLE_RO() 2024-12-24 09:46:49 +01:00
kprobes.c kprobes: Add missing kerneldoc for __get_insn_slot 2025-07-15 18:45:34 +09:00
kstack_erase.c sysctl: remove __user qualifier from stack_erasing_sysctl buffer argument 2025-11-27 15:44:53 +01:00
ksyms_common.c
ksysfs.c kexec: move sysfs entries to /sys/kernel/kexec 2025-11-27 14:24:42 -08:00
kthread.c sched: Rename do_set_cpus_allowed() 2025-10-16 11:13:53 +02:00
latencytop.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
Makefile liveupdate: kho: move to kernel/liveupdate 2025-11-27 14:24:33 -08:00
module_signature.c
notifier.c
nscommon.c ns: rename is_initial_namespace() 2025-11-11 10:01:31 +01:00
nsproxy.c nsproxy: fix free_nsproxy() and simplify create_new_namespaces() 2025-11-14 13:10:38 +01:00
nstree.c nstree: fix kernel-doc comments for internal functions 2025-11-14 13:10:38 +01:00
padata.c padata: remove __padata_list_init() 2025-11-14 18:15:49 +08:00
panic.c Significant patch series in this pull request: 2025-12-06 14:01:20 -08:00
params.c params: Replace deprecated strcpy() with strscpy() and memcpy() 2025-08-16 21:47:25 +02:00
pid.c ns: drop custom reference count initialization for initial namespaces 2025-11-11 10:01:32 +01:00
pid_namespace.c pid: rely on common reference count behavior 2025-11-11 10:01:32 +01:00
pid_sysctl.h treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
profile.c
ptrace.c rseq: Introduce struct rseq_data 2025-11-04 08:30:50 +01:00
range.c
reboot.c - The 7 patch series "powerpc/crash: use generic crashkernel 2025-04-01 10:06:52 -07:00
regset.c
relay.c relay: update relay to use mmap_prepare 2025-11-16 17:28:11 -08:00
resource.c Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()" 2025-11-27 14:24:45 -08:00
resource_kunit.c
rseq.c rseq: Switch to fast path processing on exit to user 2025-11-04 08:34:39 +01:00
scftorture.c scftorture: Handle NULL argument passed to scf_add_to_free_list(). 2024-11-14 16:09:51 -08:00
scs.c scs: fix a wrong parameter in __scs_magic 2025-11-12 10:00:13 -08:00
seccomp.c Performance events updates for v6.18: 2025-09-30 11:11:21 -07:00
signal.c signal: Move MMCID exit out of sighand lock 2025-11-25 19:45:40 +01:00
smp.c smp: Introduce a helper function to check for pending IPIs 2025-11-19 18:06:50 +01:00
smpboot.c sched/smp: Use the SMP version of idle_thread_set_boot_cpu() 2025-06-13 08:47:20 +02:00
smpboot.h
softirq.c softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT 2025-09-17 16:25:41 +02:00
stacktrace.c
static_call.c
static_call_inline.c Modules changes for 6.15-rc1 2025-03-30 15:44:36 -07:00
stop_machine.c sched/core: Fix migrate_swap() vs. hotplug 2025-07-01 15:02:03 +02:00
sys.c Patch series in this pull request: 2025-10-02 18:44:54 -07:00
sys_ni.c uprobes/x86: Add uprobe syscall to speed up uprobe 2025-08-21 20:09:20 +02:00
sysctl-test.c sysctl: move u8 register test to lib/test_sysctl.c 2025-04-14 14:13:41 +02:00
sysctl.c sysctl: Wrap do_proc_douintvec with the public function proc_douintvec_conv 2025-11-27 15:45:38 +01:00
task_work.c task_work: Fix NMI race condition 2025-10-29 10:29:54 +01:00
taskstats.c
torture.c torture: Delay CPU-hotplug operations until boot completes 2025-08-14 15:26:30 -07:00
tracepoint.c tracepoint: Print the function symbol when tracepoint_debug is set 2025-03-21 15:30:10 -04:00
tsacct.c pid: change bacct_add_tsk() to use task_ppid_nr_ns() 2025-08-19 13:38:20 +02:00
ucount.c ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below() 2025-08-02 12:01:38 -07:00
uid16.c
uid16.h
umh.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
up.c
user-return-notifier.c
user.c ns: drop custom reference count initialization for initial namespaces 2025-11-11 10:01:32 +01:00
user_namespace.c ns: move ns type into struct ns_common 2025-09-25 09:23:54 +02:00
utsname.c namespace-6.18-rc1 2025-09-29 11:20:29 -07:00
utsname_sysctl.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
vhost_task.c vhost: Take a reference on the task in struct vhost_task. 2025-09-21 17:44:20 -04:00
vmcore_info.c vmcoreinfo: track and log recoverable hardware errors 2025-11-27 14:24:44 -08:00
watch_queue.c watch_queue: Use local kmap in post_one_notification() 2025-11-19 12:17:28 +01:00
watchdog.c Significant patch series in this pull request: 2025-12-06 14:01:20 -08:00
watchdog_buddy.c watchdog: fix opencoded cpumask_next_wrap() in watchdog_next_cpu() 2025-07-31 11:28:03 -04:00
watchdog_perf.c watchdog: skip checks when panic is in progress 2025-09-13 17:32:53 -07:00
workqueue.c workqueue: Don't rely on wq->rescuer to stop rescuer 2025-11-21 09:45:36 -10:00
workqueue_internal.h