linux

mirror of https://github.com/torvalds/linux.git synced 2026-03-14 02:06:15 +01:00

Author	SHA1	Message	Date
Anton Protopopov	7feff23cdf	bpf: force BPF_F_RDONLY_PROG on insn array creation The original implementation added a hack to check_mem_access() to prevent programs from writing into insn arrays. To get rid of this hack, enforce BPF_F_RDONLY_PROG on map creation. Also fix the corresponding selftest, as the error message changes with this patch. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20251128063224.1305482-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-28 15:15:43 -08:00
Edward Adam Davis	688b745401	bpf: Fix exclusive map memory leak When excl_prog_hash is 0 and excl_prog_hash_size is non-zero, the map also needs to be freed. Otherwise, the map memory will not be reclaimed, just like the memory leak problem reported by syzbot [1]. syzbot reported: BUG: memory leak backtrace (crc 7b9fb9b4): map_create+0x322/0x11e0 kernel/bpf/syscall.c:1512 __sys_bpf+0x3556/0x3610 kernel/bpf/syscall.c:6131 Fixes: `baefdbdf68` ("bpf: Implement exclusive map creation") Reported-by: syzbot+cf08c551fecea9fd1320@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=cf08c551fecea9fd1320 Tested-by: syzbot+cf08c551fecea9fd1320@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/tencent_3F226F882CE56DCC94ACE90EED1ECCFC780A@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-26 11:23:27 -08:00
Leon Hwang	8f6ddc0587	bpf: Introduce internal bpf_map_check_op_flags helper function It is to unify map flags checking for lookup_elem, update_elem, lookup_batch and update_batch APIs. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20251125145857.98134-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-25 15:27:48 -08:00
Menglong Dong	402e44b31e	bpf: implement "jmp" mode for trampoline Implement the "jmp" mode for the bpf trampoline. For the ftrace_managed case, we need only to set the FTRACE_OPS_FL_JMP on the tr->fops if "jmp" is needed. For the bpf poke case, we will check the origin poke type with the "origin_flags", and current poke type with "tr->flags". The function bpf_trampoline_update_fentry() is introduced to do the job. The "jmp" mode will only be enabled with CONFIG_DYNAMIC_FTRACE_WITH_JMP enabled and BPF_TRAMP_F_SHARE_IPMODIFY is not set. With BPF_TRAMP_F_SHARE_IPMODIFY, we need to get the origin call ip from the stack, so we can't use the "jmp" mode. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20251118123639.688444-7-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-24 09:47:04 -08:00
Menglong Dong	ae4a3160d1	bpf: specify the old and new poke_type for bpf_arch_text_poke In the origin logic, the bpf_arch_text_poke() assume that the old and new instructions have the same opcode. However, they can have different opcode if we want to replace a "call" insn with a "jmp" insn. Therefore, add the new function parameter "old_t" along with the "new_t", which are used to indicate the old and new poke type. Meanwhile, adjust the implement of bpf_arch_text_poke() for all the archs. "BPF_MOD_NOP" is added to make the code more readable. In bpf_arch_text_poke(), we still check if the new and old address is NULL to determine if nop insn should be used, which I think is more safe. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20251118123639.688444-6-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-24 09:47:03 -08:00
Menglong Dong	25e4e3565d	ftrace: Introduce FTRACE_OPS_FL_JMP For now, the "nop" will be replaced with a "call" instruction when a function is hooked by the ftrace. However, sometimes the "call" can break the RSB and introduce extra overhead. Therefore, introduce the flag FTRACE_OPS_FL_JMP, which indicate that the ftrace_ops should be called with a "jmp" instead of "call". For now, it is only used by the direct call case. When a direct ftrace_ops is marked with FTRACE_OPS_FL_JMP, the last bit of the ops->direct_call will be set to 1. Therefore, we can tell if we should use "jmp" for the callback in ftrace_call_replace(). Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20251118123639.688444-2-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-24 09:46:24 -08:00
Anton Protopopov	fad804002e	bpf: cleanup aux->used_maps after jit In commit `b4ce5923e7` ("bpf, x86: add new map type: instructions array") env->used_map was copied to func[i]->aux->used_maps before jitting. Clear these fields out after jitting such that pointer to freed memory (env->used_maps is freed later) are not kept in a live data structure. The reason why the copies were initially added is explained in https://lore.kernel.org/bpf/20251105090410.1250500-1-a.s.protopopov@gmail.com Suggested-by: Alexei Starovoitov <ast@kernel.org> Fixes: `b4ce5923e7` ("bpf, x86: add new map type: instructions array") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20251124151515.2543403-1-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-24 09:39:55 -08:00
Puranjay Mohan	4167096cb9	bpf: support nested rcu critical sections Currently, nested rcu critical sections are rejected by the verifier and rcu_lock state is managed by a boolean variable. Add support for nested rcu critical sections by make active_rcu_locks a counter similar to active_preempt_locks. bpf_rcu_read_lock() increments this counter and bpf_rcu_read_unlock() decrements it, MEM_RCU -> PTR_UNTRUSTED transition happens when active_rcu_locks drops to 0. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251117200411.25563-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-21 18:34:59 -08:00
Eduard Zingerman	e40f5a6bf8	bpf: correct stack liveness for tail calls This updates bpf_insn_successors() reflecting that control flow might jump over the instructions between tail call and function exit, verifier might assume that some writes to parent stack always happen, which is not the case. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Martin Teichmann <martin.teichmann@xfel.eu> Link: https://lore.kernel.org/r/20251119160355.1160932-4-martin.teichmann@xfel.eu Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-21 17:45:30 -08:00
Martin Teichmann	e3245f8990	bpf: properly verify tail call behavior A successful ebpf tail call does not return to the caller, but to the caller-of-the-caller, often just finishing the ebpf program altogether. Any restrictions that the verifier needs to take into account - notably the fact that the tail call might have modified packet pointers - are to be checked on the caller-of-the-caller. Checking it on the caller made the verifier refuse perfectly fine programs that would use the packet pointers after a tail call, which is no problem as this code is only executed if the tail call was unsuccessful, i.e. nothing happened. This patch simulates the behavior of a tail call in the verifier. A conditional jump to the code after the tail call is added for the case of an unsucessful tail call, and a return to the caller is simulated for a successful tail call. For the successful case we assume that the tail call returns an int, as tail calls are currently only allowed in functions that return and int. We always assume that the tail call modified the packet pointers, as we do not know what the tail call did. For the unsuccessful case we know nothing happened, so we do not need to add new constraints. This approach also allows to check other problems that may occur with tail calls, namely we are now able to check that precision is properly propagated into subprograms using tail calls, as well as checking the live slots in such a subprogram. Fixes: `1a4607ffba` ("bpf: consider that tail calls invalidate packet pointers") Link: https://lore.kernel.org/bpf/20251029105828.1488347-1-martin.teichmann@xfel.eu/ Signed-off-by: Martin Teichmann <martin.teichmann@xfel.eu> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251119160355.1160932-2-martin.teichmann@xfel.eu Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-21 17:45:30 -08:00
Anton Protopopov	4dd3a48d13	bpf: Add a check to make static analysers happy In [1] Dan Carpenter reported that the following code makes the Smatch static analyser unhappy: 17904 value = map->ops->map_lookup_elem(map, &i); 17905 if (!value) 17906 return -EINVAL; --> 17907 items[i - start] = value->xlated_off; The analyser assumes that the `value` variable may contain an error and thus it should be properly checked before the dereference. On practice this will never happen as array maps do not return error values in map_lookup_elem, but to make the Smatch and other possible analysers happy this patch adds a formal check. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/bpf/aR2BN1Ix--8tmVrN@stanley.mountain/ [1] Fixes: `493d9e0d60` ("bpf, x86: add support for indirect jumps") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20251119112517.1091793-1-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-21 17:01:14 -08:00
Amery Hung	f484f4a3e0	bpf: Replace bpf memory allocator with kmalloc_nolock() in local storage Replace bpf memory allocator with kmalloc_nolock() to reduce memory wastage due to preallocation. In bpf_selem_free(), an selem now needs to wait for a RCU grace period before being freed when reuse_now == true. Therefore, rcu_barrier() should be always be called in bpf_local_storage_map_free(). In bpf_local_storage_free(), since smap->storage_ma is no longer needed to return the memory, the function is now independent from smap. Remove the outdated comment in bpf_local_storage_alloc(). We already free selem after an RCU grace period in bpf_local_storage_update() when bpf_local_storage_alloc() failed the cmpxchg since commit `c0d63f3091` ("bpf: Add bpf_selem_free()"). Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-5-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-18 16:20:25 -08:00
Amery Hung	39a460c425	bpf: Save memory alloction info in bpf_local_storage Save the memory allocation method used for bpf_local_storage in the struct explicitly so that we don't need to go through the hassle to find out the info. When a later patch replaces BPF memory allocator with kmalloc_noloc(), bpf_local_storage_free() will no longer need smap->storage_ma to return the memory and completely remove the dependency on smap in bpf_local_storage_free(). Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-4-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-18 16:20:25 -08:00
Amery Hung	e76a33e1c7	bpf: Remove smap argument from bpf_selem_free() Since selem already saves a pointer to smap, use it instead of an additional argument in bpf_selem_free(). This requires moving the SDATA(selem)->smap assignment from bpf_selem_link_map() to bpf_selem_alloc() since bpf_selem_free() may be called without the selem being linked to smap in bpf_local_storage_update(). Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-3-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-18 16:20:25 -08:00
Amery Hung	0e854e5535	bpf: Always charge/uncharge memory when allocating/unlinking storage elements Since commit `a96a44aba5` ("bpf: bpf_sk_storage: Fix invalid wait context lockdep report"), {charge,uncharge}_mem are always true when allocating a bpf_local_storage_elem or unlinking a bpf_local_storage_elem from local storage, so drop these arguments. No functional change. Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-18 16:20:25 -08:00
Pu Lehui	7dc211c115	bpf: Fix invalid prog->stats access when update_effective_progs fails Syzkaller triggers an invalid memory access issue following fault injection in update_effective_progs. The issue can be described as follows: __cgroup_bpf_detach update_effective_progs compute_effective_progs bpf_prog_array_alloc <-- fault inject purge_effective_progs /* change to dummy_bpf_prog / array->items[index] = &dummy_bpf_prog.prog ---softirq start--- __do_softirq ... __cgroup_bpf_run_filter_skb __bpf_prog_run_save_cb bpf_prog_run stats = this_cpu_ptr(prog->stats) / invalid memory access */ flags = u64_stats_update_begin_irqsave(&stats->syncp) ---softirq end--- static_branch_dec(&cgroup_bpf_enabled_key[atype]) The reason is that fault injection caused update_effective_progs to fail and then changed the original prog into dummy_bpf_prog.prog in purge_effective_progs. Then a softirq came, and accessing the members of dummy_bpf_prog.prog in the softirq triggers invalid mem access. To fix it, skip updating stats when stats is NULL. Fixes: `492ecee892` ("bpf: enable program stats") Signed-off-by: Pu Lehui <pulehui@huawei.com> Link: https://lore.kernel.org/r/20251115102343.2200727-1-pulehui@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-17 22:35:51 -08:00
Altgelt, Max (Nextron)	4722981cca	bpf: don't skip other information if xlated_prog_insns is skipped If xlated_prog_insns should not be exposed, other information (such as func_info) still can and should be filled in. Therefore, instead of directly terminating in this case, continue with the normal flow. Signed-off-by: Max Altgelt <max.altgelt@nextron-systems.com> Link: https://lore.kernel.org/r/efd00fcec5e3e247af551632726e2a90c105fbd8.camel@nextron-systems.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-14 18:55:06 -08:00
Puranjay Mohan	4f7bc83b98	bpf: verifier: Move desc->imm setup to sort_kfunc_descs_by_imm_off() Metadata about a kfunc call is added to the kfunc_tab in add_kfunc_call() but the call instruction itself could get removed by opt_remove_dead_code() later if it is not reachable. If the call instruction is removed, specialize_kfunc() is never called for it and the desc->imm in the kfunc_tab is never initialized for this kfunc call. In this case, sort_kfunc_descs_by_imm_off(env->prog); in do_misc_fixups() doesn't sort the table correctly. This is a problem for s390 as its JIT uses this table to find the addresses for kfuncs, and if this table is not sorted properly, JIT may fail to find addresses for valid kfunc calls. This was exposed by: commit `d869d56ca8` ("bpf: verifier: refactor kfunc specialization") as before this commit, desc->imm was initialised in add_kfunc_call() which happens before dead code elimination. Move desc->imm setup down to sort_kfunc_descs_by_imm_off(), this fixes the problem and also saves us from having the same logic in add_kfunc_call() and specialize_kfunc(). Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114154023.12801-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-14 17:55:18 -08:00
Alexei Starovoitov	e47b68bda4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.18-rc5+ Cross-merge BPF and other fixes after downstream PR. Minor conflict in kernel/bpf/helpers.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-14 17:43:41 -08:00
Linus Torvalds	cbba5d1b53	bpf-fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmkXpZUACgkQ6rmadz2v bTrGCw//UCx+KBXbzvv7m0A1QGOUL3oHL/Qd+OJA3RW3B+saVbYYzn9jjl0SRgFP X0q/DwbDOjFtOSORV9oFgJkrucn7+BM/yxPaC4sE1SQZJAjDFA/CSaF0r8duuGsM Mvat9TTiwwetOMAkNB9WZ1e6AKGovBLguLFGAWZc6vLeQZopcER5+pFwS44a9RrK dq0Th8O/oY3VmUDgSKJ2KyY51KxpJU7k2ipifiIbu1M1MWZ7s2vERkMEkzJ/lB8/ nldMsTZUdknGFzVH/W6Rc9ScFYlH+h/x1gkOHwTibMsqDBm92mWVo6O7hvuUbsEO NlPDgMtkhBp7PDSx9SA0UBcriMs1M6ovNBOpj/cI4AL1k8WNubf/FHZtrBwoy8C9 3HaM+8lkA2uiHVPUvT5dImzWqshweN0GXoXAoa9xPSQPchJ38UdzCHqYRAg/kWFZ 5jUK2j4e5+yyII44pD7Xti0PrfoP81giliqmTbGFV8+Y89dQnk+WK12vnbv34ER7 unLwId8HLtq0ZN7FVG4F6s/4qNdEMKqXbAkve0WWFXn4vKZMCju4ol6NYVGisRAg zcn7Yk+weSuY3UOzC+/4SxhfTEAD0Kg6fUoG/1JdflgNsm8XhLBja0DZaAlIVO0p xz5UaljwcNvjAKGGMYbCGrf3XN2tOmGpVyJkMj17Vcq88y3bJBU= =JJui -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix interaction between livepatch and BPF fexit programs (Song Liu) With Steven and Masami acks. - Fix stack ORC unwind from BPF kprobe_multi (Jiri Olsa) With Steven and Masami acks. - Fix out of bounds access in widen_imprecise_scalars() in the verifier (Eduard Zingerman) - Fix conflicts between MPTCP and BPF sockmap (Jiayuan Chen) - Fix net_sched storage collision with BPF data_meta/data_end (Eric Dumazet) - Add _impl suffix to BPF kfuncs with implicit args to avoid breaking them in bpf-next when KF_IMPLICIT_ARGS is added (Mykyta Yatsenko) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Test widen_imprecise_scalars() with different stack depth bpf: account for current allocated stack depth in widen_imprecise_scalars() bpf: Add bpf_prog_run_data_pointers() selftests/bpf: Add mptcp test with sockmap mptcp: Fix proto fallback detection with BPF mptcp: Disallow MPTCP subflows from sockmap selftests/bpf: Add stacktrace ips test for raw_tp selftests/bpf: Add stacktrace ips test for kprobe_multi/kretprobe_multi x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()" bpf: add _impl suffix for bpf_stream_vprintk() kfunc bpf:add _impl suffix for bpf_task_work_schedule* kfuncs selftests/bpf: Add tests for livepatch + bpf trampoline ftrace: bpf: Fix IPMODIFY + DIRECT in modify_ftrace_direct() ftrace: Fix BPF fexit with livepatch	2025-11-14 15:39:39 -08:00
Menglong Dong	fea3f5e83c	bpf: Handle return value of ftrace_set_filter_ip in register_fentry The error that returned by ftrace_set_filter_ip() in register_fentry() is not handled properly. Just fix it. Fixes: `00963a2e75` ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20251110120705.1553694-1-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-14 13:31:30 -08:00
Eduard Zingerman	e5d2e34e72	bpf: Add missing checks to avoid verbose verifier log There are a few places where log level is not checked before calling "verbose()". This forces programs working only at BPF_LOG_LEVEL_STATS (e.g. veristat) to allocate unnecessarily large log buffers. Add missing checks. Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114200542.912386-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-14 13:29:12 -08:00
Sahil Chandna	c1da3df719	bpf: Prevent nesting overflow in bpf_try_get_buffers bpf_try_get_buffers() returns one of multiple per-CPU buffers based on a per-CPU nesting counter. This mechanism expects that buffers are not endlessly acquired before being returned. migrate_disable() ensures that a task remains on the same CPU, but it does not prevent the task from being preempted by another task on that CPU. Without disabled preemption, a task may be preempted while holding a buffer, allowing another task to run on same CPU and acquire an additional buffer. Several such preemptions can cause the per-CPU nest counter to exceed MAX_BPRINTF_NEST_LEVEL and trigger the warning in bpf_try_get_buffers(). Adding preempt_disable()/preempt_enable() around buffer acquisition and release prevents this task preemption and preserves the intended bounded nesting behavior. Reported-by: syzbot+b0cff308140f79a9c4cb@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68f6a4c8.050a0220.1be48.0011.GAE@google.com/ Fixes: `4223bf833c` ("bpf: Remove preempt_disable in bpf_try_get_buffers") Suggested-by: Yonghong Song <yonghong.song@linux.dev> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com> Link: https://lore.kernel.org/r/20251114064922.11650-1-chandna.sahil@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-14 13:06:47 -08:00
Eduard Zingerman	b0c8e6d3d8	bpf: account for current allocated stack depth in widen_imprecise_scalars() The usage pattern for widen_imprecise_scalars() looks as follows: prev_st = find_prev_entry(env, ...); queued_st = push_stack(...); widen_imprecise_scalars(env, prev_st, queued_st); Where prev_st is an ancestor of the queued_st in the explored states tree. This ancestor is not guaranteed to have same allocated stack depth as queued_st. E.g. in the following case: def main(): for i in 1..2: foo(i) // same callsite, differnt param def foo(i): if i == 1: use 128 bytes of stack iterator based loop Here, for a second 'foo' call prev_st->allocated_stack is 128, while queued_st->allocated_stack is much smaller. widen_imprecise_scalars() needs to take this into account and avoid accessing bpf_verifier_state->frame[*]->stack out of bounds. Fixes: `2793a8b015` ("bpf: exact states comparison for iterator convergence checks") Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114025730.772723-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-14 09:26:05 -08:00
Linus Torvalds	aecba2e013	Power management fixes for 6.18-rc6 - Fix issues related to using inadequate data types and incorrect use of atomic variables in the compressed hibernation images handling code that were introduced during the 6.9 development cycle (Mario Limonciello) - Move a X86_FEATURE_IDA check from turbo_is_disabled() to the places where a new value for MSR_IA32_PERF_CTL is computed in intel_pstate to address a regression preventing users from enabling turbo frequencies post-boot (Srinivas Pandruvada) -----BEGIN PGP SIGNATURE----- iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmkWOnoSHHJqd0Byand5 c29ja2kubmV0AAoJEO5fvZ0v1OO14gkH/08TYjtNEvC6acD2r5gJTpacdnExWopL NFmRyhuZM21Ja2gd1q0xtPvcTdAz3rkhB4vqn9KQ0oLkfXj08/+zpRyOP3PzVVSp bvE/Am28s/VChjDg/MFcP7o/fLSNoL73wK6er+i721KIV1uscK4FydkPNs6gpBHw 03FkUJX8jRjil0Cp6km2O0Zo5SEgm/U6wDjR5Azpdru8VKbI1RaxCMsR0/HnlA9Y pUAph9NX1UBBjdMFFdn8++Vna8XJX4qe9CiYT7KwGbGx5jUpVBaT9d/hPm0O/mJt VvNe3Dl5soFM/3yibsvV4sTcZHNPTsIKjuKIqwL4F0TCGug9kxwjFBk= =r3Fq -----END PGP SIGNATURE----- Merge tag 'pm-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix issues related to the handling of compressed hibernation images and a recent intel_pstate driver regression: - Fix issues related to using inadequate data types and incorrect use of atomic variables in the compressed hibernation images handling code that were introduced during the 6.9 development cycle (Mario Limonciello) - Move a X86_FEATURE_IDA check from turbo_is_disabled() to the places where a new value for MSR_IA32_PERF_CTL is computed in intel_pstate to address a regression preventing users from enabling turbo frequencies post-boot (Srinivas Pandruvada)" * tag 'pm-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: intel_pstate: Check IDA only before MSR_IA32_PERF_CTL writes PM: hibernate: Fix style issues in save_compressed_image() PM: hibernate: Use atomic64_t for compressed_size variable PM: hibernate: Emit an error when image writing fails	2025-11-13 16:31:07 -08:00
Rafael J. Wysocki	161284b26f	Merge branch 'pm-sleep' Merge fixes for issues related to the handling of compressed hibernation images that were introduced during the 6.9 development cycle. * pm-sleep: PM: hibernate: Fix style issues in save_compressed_image() PM: hibernate: Use atomic64_t for compressed_size variable PM: hibernate: Emit an error when image writing fails	2025-11-13 21:05:46 +01:00
Leon Hwang	6af6e49a76	bpf: Free special fields when update [lru_,]percpu_hash maps As [lru_,]percpu_hash maps support BPF_KPTR_{REF,PERCPU}, missing calls to 'bpf_obj_free_fields()' in 'pcpu_copy_value()' could cause the memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the map gets freed. Fix this by calling 'bpf_obj_free_fields()' after 'copy_map_value[,_long]()' in 'pcpu_copy_value()'. Fixes: `65334e64a4` ("bpf: Support kptrs in percpu hashmap and percpu LRU hashmap") Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251105151407.12723-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-13 09:14:15 -08:00
Kumar Kartikeya Dwivedi	3249e8a17e	bpf: Adjust return value for queue destruction in rqspinlock Return -ETIMEDOUT whenever non-head waiters are signalled by head, and fix oversight in commit `7bd6e5ce5b` ("rqspinlock: Disable queue destruction for deadlocks"). We no longer signal on deadlocks. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Reviewed-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20251111013827.1853484-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-12 11:17:39 -08:00
Pratyush Yadav	b05addf6f0	kho: warn and exit when unpreserved page wasn't preserved Calling __kho_unpreserve() on a pair of (pfn, end_pfn) that wasn't preserved is a bug. Currently, if that is done, the physxa or bits can be NULL. This results in a soft lockup since a NULL physxa or bits results in redoing the loop without ever making any progress. Return when physxa or bits are not found, but WARN first to loudly indicate invalid behaviour. Link: https://lkml.kernel.org/r/20251103180235.71409-3-pratyush@kernel.org Fixes: `fc33e4b44b` ("kexec: enable KHO support for memory preservation") Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-09 21:19:47 -08:00
Pratyush Yadav	7ecd2e439d	kho: fix unpreservation of higher-order vmalloc preservations kho_vmalloc_unpreserve_chunk() calls __kho_unpreserve() with end_pfn as pfn + 1. This happens to work for 0-order pages, but leaks higher order pages. For example, say order 2 pages back the allocation. During preservation, they get preserved in the order 2 bitmaps, but kho_vmalloc_unpreserve_chunk() would try to unpreserve them from the order 0 bitmaps, which should not have these bits set anyway, leaving the order 2 bitmaps untouched. This results in the pages being carried over to the next kernel. Nothing will free those pages in the next boot, leaking them. Fix this by taking the order into account when calculating the end PFN for __kho_unpreserve(). Link: https://lkml.kernel.org/r/20251103180235.71409-2-pratyush@kernel.org Fixes: `a667300bd5` ("kho: add support for preserving vmalloc allocations") Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-09 21:19:47 -08:00
Pratyush Yadav	0b07092d09	kho: fix out-of-bounds access of vmalloc chunk The list of pages in a vmalloc chunk is NULL-terminated. So when looping through the pages in a vmalloc chunk, both kho_restore_vmalloc() and kho_vmalloc_unpreserve_chunk() rightly make sure to stop when encountering a NULL page. But when the chunk is full, the loops do not stop and go past the bounds of chunk->phys, resulting in out-of-bounds memory access, and possibly the restoration or unpreservation of an invalid page. Fix this by making sure the processing of chunk stops at the end of the array. Link: https://lkml.kernel.org/r/20251103110159.8399-1-pratyush@kernel.org Fixes: `a667300bd5` ("kho: add support for preserving vmalloc allocations") Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-09 21:19:47 -08:00
Peter Oberparleiter	ec4d11fc4b	gcov: add support for GCC 15 Using gcov on kernels compiled with GCC 15 results in truncated 16-byte long .gcda files with no usable data. To fix this, update GCOV_COUNTERS to match the value defined by GCC 15. Tested with GCC 14.3.0 and GCC 15.2.0. Link: https://lkml.kernel.org/r/20251028115125.1319410-1-oberpar@linux.ibm.com Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reported-by: Matthieu Baerts <matttbe@kernel.org> Closes: https://github.com/linux-test-project/lcov/issues/445 Tested-by: Matthieu Baerts <matttbe@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-09 21:19:44 -08:00
Pasha Tatashin	fa759cd75b	kho: allocate metadata directly from the buddy allocator KHO allocates metadata for its preserved memory map using the slab allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory. A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock. When booting via KHO, the memblock allocator is restricted to a "scratch area", forcing the KFENCE pool to be allocated within it. This creates a conflict, as the scratch area is expected to be ephemeral and overwriteable by a subsequent kexec. If KHO metadata is placed in this KFENCE pool, it leads to memory corruption when the next kernel is loaded. To fix this, modify KHO to allocate its metadata directly from the buddy allocator instead of slab. Link: https://lkml.kernel.org/r/20251021000852.2924827-4-pasha.tatashin@soleen.com Fixes: `fc33e4b44b` ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Matlack <dmatlack@google.com> Cc: Alexander Graf <graf@amazon.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-09 21:19:42 -08:00
Pasha Tatashin	a2fff99f92	kho: increase metadata bitmap size to PAGE_SIZE KHO memory preservation metadata is preserved in 512 byte chunks which requires their allocation from slab allocator. Slabs are not safe to be used with KHO because of kfence, and because partial slabs may lead leaks to the next kernel. Change the size to be PAGE_SIZE. The kfence specifically may cause memory corruption, where it randomly provides slab objects that can be within the scratch area. The reason for that is that kfence allocates its objects prior to KHO scratch is marked as CMA region. While this change could potentially increase metadata overhead on systems with sparsely preserved memory, this is being mitigated by ongoing work to reduce sparseness during preservation via 1G guest pages. Furthermore, this change aligns with future work on a stateless KHO, which will also use page-sized bitmaps for its radix tree metadata. Link: https://lkml.kernel.org/r/20251021000852.2924827-3-pasha.tatashin@soleen.com Fixes: `fc33e4b44b` ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-09 21:19:41 -08:00
Pasha Tatashin	e38f65d317	kho: warn and fail on metadata or preserved memory in scratch area Patch series "KHO: kfence + KHO memory corruption fix", v3. This series fixes a memory corruption bug in KHO that occurs when KFENCE is enabled. The root cause is that KHO metadata, allocated via kzalloc(), can be randomly serviced by kfence_alloc(). When a kernel boots via KHO, the early memblock allocator is restricted to a "scratch area". This forces the KFENCE pool to be allocated within this scratch area, creating a conflict. If KHO metadata is subsequently placed in this pool, it gets corrupted during the next kexec operation. Google is using KHO and have had obscure crashes due to this memory corruption, with stacks all over the place. I would prefer this fix to be properly backported to stable so we can also automatically consume it once we switch to the upstream KHO. Patch 1/3 introduces a debug-only feature (CONFIG_KEXEC_HANDOVER_DEBUG) that adds checks to detect and fail any operation that attempts to place KHO metadata or preserved memory within the scratch area. This serves as a validation and diagnostic tool to confirm the problem without affecting production builds. Patch 2/3 Increases bitmap to PAGE_SIZE, so buddy allocator can be used. Patch 3/3 Provides the fix by modifying KHO to allocate its metadata directly from the buddy allocator instead of slab. This bypasses the KFENCE interception entirely. This patch (of 3): It is invalid for KHO metadata or preserved memory regions to be located within the KHO scratch area, as this area is overwritten when the next kernel is loaded, and used early in boot by the next kernel. This can lead to memory corruption. Add checks to kho_preserve_* and KHO's internal metadata allocators (xa_load_or_alloc, new_chunk) to verify that the physical address of the memory does not overlap with any defined scratch region. If an overlap is detected, the operation will fail and a WARN_ON is triggered. To avoid performance overhead in production kernels, these checks are enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected. [rppt@kernel.org: fix KEXEC_HANDOVER_DEBUG Kconfig dependency] Link: https://lkml.kernel.org/r/aQHUyyFtiNZhx8jo@kernel.org [pasha.tatashin@soleen.com: build fix] Link: https://lkml.kernel.org/r/CA+CK2bBnorfsTymKtv4rKvqGBHs=y=MjEMMRg_tE-RME6n-zUw@mail.gmail.com Link: https://lkml.kernel.org/r/20251021000852.2924827-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20251021000852.2924827-2-pasha.tatashin@soleen.com Fixes: `fc33e4b44b` ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Mike Rapoport <rppt@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-09 21:19:41 -08:00
Linus Torvalds	b5c0946029	Fix a group-throttling bug in the fair scheduler. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmkPQRgRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jtjQ//UM9YfdauTXacEE2dPG4b0QwxHPgMsFmS GMT7c6H7ApCQoiBnGwv3K0d0bu4FZwnWBMulUv/yhe49vHXdIPdZAlMkJrM3IA7X 80IVCfLHtRkbMaAUIDc3MGwkzneCUIuzKPPH1iXqn9/R0zBB8S7qxt4XHyBHzlX7 uCNbFkUFhrh5s3sWond2ogAlCvGiZ5Qo7/oTfNrpOmYGvXfNIh4T1zDOWpPrsKLX Md6rucBs0bcV1vlyKwNrobqOuaS0mdSxjt+SKDuI1CdCj6mNbYvjLPinnAi2n3zZ CLaI8+rBe3JpLOH+kXuOf+CUatXDBjF4GO1k6XXwvcsK4ARqmKcbk9Xs3i/Tn/bm Ls7IdLkCrekXaGU2MlYLVg0twe+O5oUgwBpa4Ap/IObbI+fIKP/Pj2blZlpT1RlY J455LrldsMUy1NWaqVd13gCGOPGzR6SrD+ruOJS4BAK3JFyRw8rdDW4zJ7qEiCRj yejZfiFcCAoD7cqFJoCon6rt+WC3T5I1/Sc40JCmfKH/GhLzTAGt/8cRaic4ntRX Yv8T/lJVgjhGqfDWmcRYZVF/SiyZq8IP+wqrpr3ETAfRhqB5ZhQHPUGZPPDDeeqH QHEYSrWdbmxKqYpBK3nwgwAIz9dSJcdQUqWLvP85rjRTeLr08/reXha4P80RJ3wz XmLzR7KQ9mI= =zLw8 -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a group-throttling bug in the fair scheduler" * tag 'sched-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining	2025-11-08 08:59:05 -08:00
Linus Torvalds	133262cae9	Fix a system hang caused by cpu-clock events. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmkPQBYRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1j3tQ//f63vT4+Wn87Ukm+5sOdNcaCaYuYdW7Pa lMEegZgPxmtOsrN79qEyeaZzA5v3KH/ijE1ENKb4GvDdUBKFrxiVZ/LxxMs0+4PC O40pHPFKBBT7Aolu+TSp8LJgo0hvFKMpBFV5nLPsf/iwr4otgI7UFaMfWtdZQ4J7 e1iPuEt2hVrlcbgDn0HdT6YQfrWJNHyWLu2a16TMsklryuJRoA3lJJusDMVPry1o REacolLCH4c+zvlLcCGx33LQl9k560RZqQVnZwkSRlvv2pkf7pa8XG8f2nNAsxFf DbXRLNfXrVMOciLkgDUeJ6Vb9feMNDF0+pNYWOX4hZ6iOX4bmg5CEScmDwTq6xpU XMefVvZMhyTKxUHynGzvVgTyTgjNbawvLqLwXjNYcSrl9+WSOnnBgLn/YgrCF8lY W/wh2jAGzmKk3wm9r65pyjLW+GJwVT8zEKD2J6UzpRd52ITbHMuZ+StRdPSOEb5+ 1fgD5FTXerYYV7FwC7SgcWpJ+BtBptfngi2PzPVCQ+VexvGJPqdCFQ0EcoXPpznc XsjmuLgef2mRJlNjKSVwLFjXOwfUe4Dsj54cAurUEh0xDINwQelS1NRykmXQfgod i6a/nCCmSdVDBydMnM7AhH/hqRUz9CErNmIPUJS5VLjyKlW12qm0KsD9a0K0oei8 67GKLQhj/FI= =SulU -----END PGP SIGNATURE----- Merge tag 'perf-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event fix from Ingo Molnar: "Fix a system hang caused by cpu-clock events deadlock" * tag 'perf-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix system hang caused by cpu-clock usage	2025-11-08 08:54:13 -08:00
Linus Torvalds	e6f55fe790	Fix (well, cut in half) a futex performance regression on PowerPC. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmkPP2YRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1hTlg//WBWA4717W46TetZJAYauc1rDeunhwoXQ nVQntZ4/jWrDroWhOC5iElaQGQlO974pEqt+z99IbnIFJ273vwgf03u6txcEOydu 2zmL067IC1TPNyrsnZ+yimeMEkpFKXYg00RhhSLr9HK6YPEbLh/SI8nYqSUvLcaJ eY60Ck8xgGf6oc4atiEPyUz6oX3EcEMVYUMXopsv8KksP6MUzbB6bbRMWklQn03s qFoDfrOF0aBlsMbjS5GsqeTMUTlYs/Py1L2IWMoKbSKhhVsT8eOazJZvcAKPCyhK OWrTZpW0vFcxS0dmLArGOejv8SyZsqQ35yOs5GfTa1qYOiQJ/trfi660tkglvddm DPXC2hPbR3XqLFmTKu1u2oJ46rfHYITWPilE3PfbaKOCVW1QEC2Q8p+cGPswvlFi CH6lU2UiiyiRIkvqGevEjAd9oVGyCQg01Pxce7WDRYmLYoFf3V3xrVwvC3IqoUM8 kLfUAxAGZN0CQFAwaIwi1hWSmCmEaHjIETZ321OsJYGSs/VbkM9TgXTs3J2fkz4i nY4FLCoxAwTvOrk9aW8fBKdGs51pUL/zWA3dGcrz3pJunYWnEiZ2Cvg+fEn6n46m fYT6KjOhqbap6V5Y7W7ny5qKiD3x7rEGX++nWHvL02TsPCe4fC/paob5RL/AmqNA D4S+6mpLwlQ= =ehm1 -----END PGP SIGNATURE----- Merge tag 'locking-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "Fix (well, cut in half) a futex performance regression on PowerPC" * tag 'locking-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Optimize per-cpu reference counting	2025-11-08 08:51:22 -08:00
Linus Torvalds	5b95a50001	Fixes for tracing: - Check for reader catching up in ring_buffer_map_get_reader() If the reader catches up to the writer in the memory mapped ring buffer then calling rb_get_reader_page() will return NULL as there's no pages left. But this isn't checked for before calling rb_get_reader_page() and the return of NULL causes a warning. If it is detected that the reader caught up to the writer, then simply exit the routine. - Fix memory leak in histogram create_field_var() The couple of the error paths in create_field_var() did not properly clean up what was allocated. Make sure everything is freed properly on error. - Fix help message of tools latency_collector The help message incorrectly stated that "-t" was the same as "--threads" whereas "--threads" is actually represented by "-e". -----BEGIN PGP SIGNATURE----- iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaQ3wOxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qrYvAP9zLYz/pCTTCY64/Yx2gMimFt7g9XhO b5xL+mZWoiYJigD+Ma7IpRC1QVyAk5YgxkWJqpEyHrxE84fBIBevoTRBTQE= =+x8m -----END PGP SIGNATURE----- Merge tag 'trace-v6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Check for reader catching up in ring_buffer_map_get_reader() If the reader catches up to the writer in the memory mapped ring buffer then calling rb_get_reader_page() will return NULL as there's no pages left. But this isn't checked for before calling rb_get_reader_page() and the return of NULL causes a warning. If it is detected that the reader caught up to the writer, then simply exit the routine - Fix memory leak in histogram create_field_var() The couple of the error paths in create_field_var() did not properly clean up what was allocated. Make sure everything is freed properly on error - Fix help message of tools latency_collector The help message incorrectly stated that "-t" was the same as "--threads" whereas "--threads" is actually represented by "-e" * tag 'trace-v6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/tools: Fix incorrcet short option in usage text for --threads tracing: Fix memory leaks in create_field_var() ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches up	2025-11-07 08:07:11 -08:00
Mario Limonciello (AMD)	0b6c10cb84	PM: hibernate: Fix style issues in save_compressed_image() Address two issues indicated by checkpatch: - Trailing statements should be on next line. - Prefer 'unsigned int' to bare use of 'unsigned'. Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> [ rjw: Changelog edits ] Link: https://patch.msgid.link/20251106045158.3198061-4-superm1@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-11-07 16:53:57 +01:00
Mario Limonciello (AMD)	66ededc694	PM: hibernate: Use atomic64_t for compressed_size variable `compressed_size` can overflow, showing nonsensical values. Change from `atomic_t` to `atomic64_t` to prevent overflow. Fixes: `a06c6f5d3c` ("PM: hibernate: Move to crypto APIs for LZO compression") Reported-by: Askar Safin <safinaskar@gmail.com> Closes: https://lore.kernel.org/linux-pm/20251105180506.137448-1-safinaskar@gmail.com/ Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: Askar Safin <safinaskar@gmail.com> Cc: 6.9+ <stable@vger.kernel.org> # 6.9+ Link: https://patch.msgid.link/20251106045158.3198061-3-superm1@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-11-07 16:53:56 +01:00
Mario Limonciello (AMD)	62b9ca1706	PM: hibernate: Emit an error when image writing fails If image writing fails, a return code is passed up to the caller, but none of the callers log anything to the log and so the only record of it is the return code that userspace gets. Adjust the logging so that the image size and speed of writing is only emitted on success and if there is an error, it's saved to the logs. Fixes: `a06c6f5d3c` ("PM: hibernate: Move to crypto APIs for LZO compression") Reported-by: Askar Safin <safinaskar@gmail.com> Closes: https://lore.kernel.org/linux-pm/20251105180506.137448-1-safinaskar@gmail.com/ Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: Askar Safin <safinaskar@gmail.com> Cc: 6.9+ <stable@vger.kernel.org> # 6.9+ [ rjw: Added missing braces after "else", changelog edits ] Link: https://patch.msgid.link/20251106045158.3198061-2-superm1@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-11-07 16:53:56 +01:00
Zilin Guan	80f0d631dc	tracing: Fix memory leaks in create_field_var() The function create_field_var() allocates memory for 'val' through create_hist_field() inside parse_atom(), and for 'var' through create_var(), which in turn allocates var->type and var->var.name internally. Simply calling kfree() to release these structures will result in memory leaks. Use destroy_hist_field() to properly free 'val', and explicitly release the memory of var->type and var->var.name before freeing 'var' itself. Link: https://patch.msgid.link/20251106120132.3639920-1-zilin@seu.edu.cn Fixes: `02205a6752` ("tracing: Add support for 'field variables'") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-06 19:51:33 -05:00
Steven Rostedt	aa997d2d2a	ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches up The function ring_buffer_map_get_reader() is a bit more strict than the other get reader functions, and except for certain situations the rb_get_reader_page() should not return NULL. If it does, it triggers a warning. This warning was triggering but after looking at why, it was because another acceptable situation was happening and it wasn't checked for. If the reader catches up to the writer and there's still data to be read on the reader page, then the rb_get_reader_page() will return NULL as there's no new page to get. In this situation, the reader page should not be updated and no warning should trigger. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Reported-by: syzbot+92a3745cea5ec6360309@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/690babec.050a0220.baf87.0064.GAE@google.com/ Link: https://lore.kernel.org/20251016132848.1b11bb37@gandalf.local.home Fixes: `117c39200d` ("ring-buffer: Introducing ring-buffer mapping functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-06 19:38:54 -05:00
Puranjay Mohan	f8c67d8550	bpf: Use kmalloc_nolock() in range tree The range tree uses bpf_mem_alloc() that is safe to be called from all contexts and uses a pre-allocated pool of memory to serve these allocations. Replace bpf_mem_alloc() with kmalloc_nolock() as it can be called safely from all contexts and is more scalable than bpf_mem_alloc(). Remove the migrate_disable/enable pairs as they were only needed for bpf_mem_alloc() as it does per-cpu operations, kmalloc_nolock() doesn't need this. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251106170608.4800-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-06 15:55:19 -08:00
Masami Hiramatsu (Google)	c91afa7610	tracing: tprobe-events: Fix to put tracepoint_user when disable the tprobe __unregister_trace_fprobe() checks tf->tuser to put it when removing tprobe. However, disable_trace_fprobe() does not use it and only calls unregister_fprobe(). Thus it forgets to disable tracepoint_user. If the trace_fprobe has tuser, put it for unregistering the tracepoint callbacks when disabling tprobe correctly. Link: https://lore.kernel.org/all/176244794466.155515.3971904050506100243.stgit@devnote2/ Fixes: `2867495dea` ("tracing: tprobe-events: Register tracepoint when enable tprobe event") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Beau Belgrave <beaub@linux.microsoft.com> Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>	2025-11-07 07:36:20 +09:00
Masami Hiramatsu (Google)	10d9dda426	tracing: tprobe-events: Fix to register tracepoint correctly Since __tracepoint_user_init() calls tracepoint_user_register() without initializing tuser->tpoint with given tracpoint, it does not register tracepoint stub function as callback correctly, and tprobe does not work. Initializing tuser->tpoint correctly before tracepoint_user_register() so that it sets up tracepoint callback. I confirmed below example works fine again. echo "t sched_switch preempt prev_pid=prev->pid next_pid=next->pid" > /sys/kernel/tracing/dynamic_events echo 1 > /sys/kernel/tracing/events/tracepoints/sched_switch/enable cat /sys/kernel/tracing/trace_pipe Link: https://lore.kernel.org/all/176244793514.155515.6466348656998627773.stgit@devnote2/ Fixes: `2867495dea` ("tracing: tprobe-events: Register tracepoint when enable tprobe event") Reported-by: Beau Belgrave <beaub@linux.microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Beau Belgrave <beaub@linux.microsoft.com> Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>	2025-11-07 07:32:55 +09:00
Peter Zijlstra	4cb5ac2626	futex: Optimize per-cpu reference counting Shrikanth noted that the per-cpu reference counter was still some 10% slower than the old immutable option (which removes the reference counting entirely). Further optimize the per-cpu reference counter by: - switching from RCU to preempt; - using __this_cpu_() since we now have preempt disabled; - switching from smp_load_acquire() to READ_ONCE(). This is all safe because disabling preemption inhibits the RCU grace period exactly like rcu_read_lock(). Having preemption disabled allows using __this_cpu_() provided the only access to the variable is in task context -- which is the case here. Furthermore, since we know changing fph->state to FR_ATOMIC demands a full RCU grace period we can rely on the implied smp_mb() from that to replace the acquire barrier(). This is very similar to the percpu_down_read_internal() fast-path. The reason this is significant for PowerPC is that it uses the generic this_cpu_() implementation which relies on local_irq_disable() (the x86 implementation relies on it being a single memop instruction to be IRQ-safe). Switching to preempt_disable() and __this_cpu() avoids this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE barrier, not having to use explicit barriers safes a bunch. Combined this reduces the performance gap by half, down to some 5%. Fixes: `760e6f7bef` ("futex: Remove support for IMMUTABLE") Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20251106092929.GR4067720@noisy.programming.kicks-ass.net	2025-11-06 12:30:54 +01:00
Aaron Lu	956dfda6a7	sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining When a cfs_rq is to be throttled, its limbo list should be empty and that's why there is a warn in tg_throttle_down() for non empty cfs_rq->throttled_limbo_list. When running a test with the following hierarchy: root / \ A* ... / \| \ ... B / \ C* where both A and C have quota settings, that warn on non empty limbo list is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu part of the cfs_rq for the sake of simpler representation). Debug showed it happened like this: Task group C is created and quota is set, so in tg_set_cfs_bandwidth(), cfs_rq_c is initialized with runtime_enabled set, runtime_remaining equals to 0 and unthrottled. Before any tasks are enqueued to cfs_rq_c, multiple throttled tasks can migrate to cfs_rq_c (e.g., due to task group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled), these throttled tasks are directly placed into cfs_rq_c's limbo list by enqueue_throttled_task(). Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these tasks. The first enqueue triggers check_enqueue_throttle(), and with zero runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it can't get more runtime and enters tg_throttle_down(), where the warning is hit due to remaining tasks in the limbo list. I think it's a chaos to trigger throttle on unthrottle path, the status of a being unthrottled cfs_rq can be in a mixed state in the end, so fix this by granting 1ns to cfs_rq in tg_set_cfs_bandwidth(). This ensures cfs_rq_c has a positive runtime_remaining when initialized as unthrottled and cannot enter tg_unthrottle_up() with zero runtime_remaining. Also, update outdated comments in tg_throttle_down() since unthrottle_cfs_rq() is no longer called with zero runtime_remaining. While at it, remove a redundant assignment to se in tg_throttle_down(). Fixes: `e1fad12dcb` ("sched/fair: Switch to task based throttle model") Reviewed-By: Benjamin Segall <bsegall@google.com> Suggested-by: Benjamin Segall <bsegall@google.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Hao Jia <jiahao1@lixiang.com> Link: https://patch.msgid.link/20251030032755.560-1-ziqianlu@bytedance.com	2025-11-06 12:30:52 +01:00
Anton Protopopov	bc414d3583	bpf: disasm: add support for BPF_JMP\|BPF_JA\|BPF_X Add support for indirect jump instruction. Example output from bpftool: 0: (79) r3 = (u64 )(r1 +0) 1: (25) if r3 > 0x4 goto pc+666 2: (67) r3 <<= 3 3: (18) r1 = 0xffffbeefspameggs 5: (0f) r1 += r3 6: (79) r1 = (u64 )(r1 +0) 7: (0d) gotox r1 Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251105090410.1250500-10-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-11-05 17:53:23 -08:00

1 2 3 4 5 ...

49659 commits