Commit graph

49659 commits

Author SHA1 Message Date
Anton Protopopov
7feff23cdf bpf: force BPF_F_RDONLY_PROG on insn array creation
The original implementation added a hack to check_mem_access()
to prevent programs from writing into insn arrays. To get rid
of this hack, enforce BPF_F_RDONLY_PROG on map creation.

Also fix the corresponding selftest, as the error message changes
with this patch.
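A minimal sketch of the enforcement shape at map creation (whether the
flag is required from userspace or set implicitly is an implementation
detail; the name BPF_MAP_TYPE_INSN_ARRAY follows the series but is an
assumption here):

    /* in map_create(): insn array maps must be read-only for programs */
    if (attr->map_type == BPF_MAP_TYPE_INSN_ARRAY &&
        !(attr->map_flags & BPF_F_RDONLY_PROG))
            return -EINVAL;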

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251128063224.1305482-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-28 15:15:43 -08:00
Edward Adam Davis
688b745401 bpf: Fix exclusive map memory leak
When excl_prog_hash is 0 and excl_prog_hash_size is non-zero, the map
also needs to be freed. Otherwise, the map memory will not be reclaimed,
as in the memory leak reported by syzbot [1].

syzbot reported:
BUG: memory leak
  backtrace (crc 7b9fb9b4):
    map_create+0x322/0x11e0 kernel/bpf/syscall.c:1512
    __sys_bpf+0x3556/0x3610 kernel/bpf/syscall.c:6131
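The shape of the fix in map_create() is roughly (a sketch; only the
goto-vs-return distinction matters):

    if (!attr->excl_prog_hash && attr->excl_prog_hash_size) {
            err = -EINVAL;
            goto free_map;  /* previously returned directly, leaking map */
    }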

Fixes: baefdbdf68 ("bpf: Implement exclusive map creation")
Reported-by: syzbot+cf08c551fecea9fd1320@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=cf08c551fecea9fd1320
Tested-by: syzbot+cf08c551fecea9fd1320@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/tencent_3F226F882CE56DCC94ACE90EED1ECCFC780A@qq.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-26 11:23:27 -08:00
Leon Hwang
8f6ddc0587 bpf: Introduce internal bpf_map_check_op_flags helper function
Add this helper to unify map flags checking for the lookup_elem,
update_elem, lookup_batch and update_batch APIs.
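A sketch of what such a unified helper might look like (the
allowed_flags plumbing is illustrative, not the actual patch):

    static int bpf_map_check_op_flags(struct bpf_map *map, u64 flags,
                                      u64 allowed_flags)
    {
            if (flags & ~allowed_flags)
                    return -EINVAL;
            /* BPF_F_LOCK needs a map with a spin lock field */
            if ((flags & BPF_F_LOCK) &&
                !btf_record_has_field(map->record, BPF_SPIN_LOCK))
                    return -EINVAL;
            return 0;
    }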

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20251125145857.98134-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-25 15:27:48 -08:00
Menglong Dong
402e44b31e bpf: implement "jmp" mode for trampoline
Implement the "jmp" mode for the bpf trampoline. For the ftrace_managed
case, we only need to set FTRACE_OPS_FL_JMP on the tr->fops if "jmp"
is needed.

For the bpf poke case, we check the original poke type with
"origin_flags", and the current poke type with "tr->flags". The function
bpf_trampoline_update_fentry() is introduced to do the job.

The "jmp" mode is only enabled when CONFIG_DYNAMIC_FTRACE_WITH_JMP is
enabled and BPF_TRAMP_F_SHARE_IPMODIFY is not set. With
BPF_TRAMP_F_SHARE_IPMODIFY, we need to get the original call ip from the
stack, so we can't use the "jmp" mode.
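For the ftrace-managed case, the gist is a conditional flag set, roughly
(a sketch based on the description above):

    if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_JMP) &&
        !(tr->flags & BPF_TRAMP_F_SHARE_IPMODIFY))
            tr->fops->flags |= FTRACE_OPS_FL_JMP;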

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20251118123639.688444-7-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-24 09:47:04 -08:00
Menglong Dong
ae4a3160d1 bpf: specify the old and new poke_type for bpf_arch_text_poke
In the original logic, bpf_arch_text_poke() assumes that the old and new
instructions have the same opcode. However, they can have different
opcodes if we want to replace a "call" insn with a "jmp" insn.

Therefore, add the new function parameter "old_t" along with "new_t",
which indicate the old and new poke types. Meanwhile, adjust the
implementation of bpf_arch_text_poke() for all the archs.

"BPF_MOD_NOP" is added to make the code more readable. In
bpf_arch_text_poke(), we still check whether the new and old addresses
are NULL to determine if a nop insn should be used, which I think is
safer.
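The extended prototype might look roughly like this (the parameter order
is an assumption):

    enum bpf_text_poke_type { BPF_MOD_CALL, BPF_MOD_JUMP, BPF_MOD_NOP };

    int bpf_arch_text_poke(void *ip,
                           enum bpf_text_poke_type old_t, void *old_addr,
                           enum bpf_text_poke_type new_t, void *new_addr);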

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20251118123639.688444-6-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-24 09:47:03 -08:00
Menglong Dong
25e4e3565d ftrace: Introduce FTRACE_OPS_FL_JMP
Currently, the "nop" is replaced with a "call" instruction when a
function is hooked by ftrace. However, sometimes the "call" can break
the RSB and introduce extra overhead. Therefore, introduce the flag
FTRACE_OPS_FL_JMP, which indicates that the ftrace_ops should be called
with a "jmp" instead of a "call". For now, it is only used for the
direct call case.

When a direct ftrace_ops is marked with FTRACE_OPS_FL_JMP, the last bit
of ops->direct_call will be set to 1. Therefore, we can tell whether we
should use "jmp" for the callback in ftrace_call_replace().

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20251118123639.688444-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-24 09:46:24 -08:00
Anton Protopopov
fad804002e bpf: cleanup aux->used_maps after jit
In commit b4ce5923e7 ("bpf, x86: add new map type: instructions array")
env->used_maps was copied to func[i]->aux->used_maps before jitting.
Clear these fields out after jitting so that pointers to freed memory
(env->used_maps is freed later) are not kept in a live data structure.

The reason why the copies were initially added is explained in
https://lore.kernel.org/bpf/20251105090410.1250500-1-a.s.protopopov@gmail.com
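A sketch of the cleanup after JIT, per the description above:

    for (i = 0; i < env->subprog_cnt; i++) {
            func[i]->aux->used_maps = NULL;
            func[i]->aux->used_map_cnt = 0;
    }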

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Fixes: b4ce5923e7 ("bpf, x86: add new map type: instructions array")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251124151515.2543403-1-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-24 09:39:55 -08:00
Puranjay Mohan
4167096cb9 bpf: support nested rcu critical sections
Currently, nested rcu critical sections are rejected by the verifier and
the rcu_lock state is managed by a boolean variable. Add support for
nested rcu critical sections by making active_rcu_locks a counter
similar to active_preempt_locks. bpf_rcu_read_lock() increments this
counter and bpf_rcu_read_unlock() decrements it; the MEM_RCU ->
PTR_UNTRUSTED transition happens when active_rcu_locks drops to 0.
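Conceptually, the verifier bookkeeping becomes (a sketch; the helper
name is illustrative):

    /* bpf_rcu_read_lock() */
    env->cur_state->active_rcu_locks++;

    /* bpf_rcu_read_unlock() */
    if (--env->cur_state->active_rcu_locks == 0)
            mark_rcu_ptrs_untrusted(env);  /* MEM_RCU -> PTR_UNTRUSTED */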

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251117200411.25563-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-21 18:34:59 -08:00
Eduard Zingerman
e40f5a6bf8 bpf: correct stack liveness for tail calls
This updates bpf_insn_successors() to reflect that control flow might
jump over the instructions between a tail call and the function exit.
Otherwise, the verifier might assume that some writes to the parent
stack always happen, which is not the case.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin Teichmann <martin.teichmann@xfel.eu>
Link: https://lore.kernel.org/r/20251119160355.1160932-4-martin.teichmann@xfel.eu
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-21 17:45:30 -08:00
Martin Teichmann
e3245f8990 bpf: properly verify tail call behavior
A successful ebpf tail call does not return to the caller, but to the
caller-of-the-caller, often just finishing the ebpf program altogether.

Any restrictions that the verifier needs to take into account - notably
the fact that the tail call might have modified packet pointers - are to
be checked on the caller-of-the-caller. Checking them on the caller made
the verifier refuse perfectly fine programs that would use the packet
pointers after a tail call, which is no problem as this code is only
executed if the tail call was unsuccessful, i.e. nothing happened.

This patch simulates the behavior of a tail call in the verifier. A
conditional jump to the code after the tail call is added for the case
of an unsuccessful tail call, and a return to the caller is simulated
for a successful tail call.

For the successful case we assume that the tail call returns an int,
as tail calls are currently only allowed in functions that return an
int. We always assume that the tail call modified the packet pointers,
as we do not know what the tail call did.

For the unsuccessful case we know nothing happened, so we do not need to
add new constraints.

This approach also allows checking other problems that may occur with
tail calls: we are now able to check that precision is properly
propagated into subprograms using tail calls, as well as to check the
live slots in such a subprogram.
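A small example of a program shape this change now accepts (a sketch;
jmp_table is an assumed prog array map):

    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    bpf_tail_call(skb, &jmp_table, 0); /* on success, never returns here */

    /* only reached if the tail call failed, so nothing was modified */
    if (data + 1 > data_end)
            return 0;
    return *(u8 *)data;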

Fixes: 1a4607ffba ("bpf: consider that tail calls invalidate packet pointers")
Link: https://lore.kernel.org/bpf/20251029105828.1488347-1-martin.teichmann@xfel.eu/
Signed-off-by: Martin Teichmann <martin.teichmann@xfel.eu>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251119160355.1160932-2-martin.teichmann@xfel.eu
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-21 17:45:30 -08:00
Anton Protopopov
4dd3a48d13 bpf: Add a check to make static analysers happy
In [1] Dan Carpenter reported that the following code makes the
Smatch static analyser unhappy:

        17904       value = map->ops->map_lookup_elem(map, &i);
        17905       if (!value)
        17906               return -EINVAL;
    --> 17907       items[i - start] = value->xlated_off;

The analyser assumes that the `value` variable may contain an error
and thus should be properly checked before the dereference.
In practice this will never happen, as array maps do not return
error values from map_lookup_elem, but to make Smatch and other
possible analysers happy this patch adds a formal check.
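The formal check is essentially (a sketch; the exact predicate may
differ):

    value = map->ops->map_lookup_elem(map, &i);
    if (IS_ERR_OR_NULL(value))
            return -EINVAL;
    items[i - start] = value->xlated_off;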

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/bpf/aR2BN1Ix--8tmVrN@stanley.mountain/ [1]
Fixes: 493d9e0d60 ("bpf, x86: add support for indirect jumps")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251119112517.1091793-1-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-21 17:01:14 -08:00
Amery Hung
f484f4a3e0 bpf: Replace bpf memory allocator with kmalloc_nolock() in local storage
Replace bpf memory allocator with kmalloc_nolock() to reduce memory
wastage due to preallocation.

In bpf_selem_free(), an selem now needs to wait for an RCU grace period
before being freed when reuse_now == true. Therefore, rcu_barrier()
should always be called in bpf_local_storage_map_free().

In bpf_local_storage_free(), since smap->storage_ma is no longer needed
to return the memory, the function is now independent from smap.

Remove the outdated comment in bpf_local_storage_alloc(). We already
free selem after an RCU grace period in bpf_local_storage_update() when
bpf_local_storage_alloc() failed the cmpxchg since commit c0d63f3091
("bpf: Add bpf_selem_free()").

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20251114201329.3275875-5-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-18 16:20:25 -08:00
Amery Hung
39a460c425 bpf: Save memory alloction info in bpf_local_storage
Save the memory allocation method used for bpf_local_storage in the
struct explicitly so that we don't need to go through the hassle of
finding out the info. When a later patch replaces the BPF memory
allocator with kmalloc_nolock(), bpf_local_storage_free() will no longer
need smap->storage_ma to return the memory, completely removing the
dependency on smap in bpf_local_storage_free().

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20251114201329.3275875-4-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-18 16:20:25 -08:00
Amery Hung
e76a33e1c7 bpf: Remove smap argument from bpf_selem_free()
Since selem already saves a pointer to smap, use it instead of an
additional argument in bpf_selem_free(). This requires moving the
SDATA(selem)->smap assignment from bpf_selem_link_map() to
bpf_selem_alloc() since bpf_selem_free() may be called without the
selem being linked to smap in bpf_local_storage_update().

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20251114201329.3275875-3-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-18 16:20:25 -08:00
Amery Hung
0e854e5535 bpf: Always charge/uncharge memory when allocating/unlinking storage elements
Since commit a96a44aba5 ("bpf: bpf_sk_storage: Fix invalid wait
context lockdep report"), {charge,uncharge}_mem are always true when
allocating a bpf_local_storage_elem or unlinking a bpf_local_storage_elem
from local storage, so drop these arguments. No functional change.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20251114201329.3275875-2-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-18 16:20:25 -08:00
Pu Lehui
7dc211c115 bpf: Fix invalid prog->stats access when update_effective_progs fails
Syzkaller triggers an invalid memory access issue following fault
injection in update_effective_progs. The issue can be described as
follows:

__cgroup_bpf_detach
  update_effective_progs
    compute_effective_progs
      bpf_prog_array_alloc <-- fault inject
  purge_effective_progs
    /* change to dummy_bpf_prog */
    array->items[index] = &dummy_bpf_prog.prog

---softirq start---
__do_softirq
  ...
    __cgroup_bpf_run_filter_skb
      __bpf_prog_run_save_cb
        bpf_prog_run
          stats = this_cpu_ptr(prog->stats)
          /* invalid memory access */
          flags = u64_stats_update_begin_irqsave(&stats->syncp)
---softirq end---

  static_branch_dec(&cgroup_bpf_enabled_key[atype])

The reason is that fault injection caused update_effective_progs to
fail, after which purge_effective_progs changed the original prog into
dummy_bpf_prog.prog. Then a softirq came, and accessing the members of
dummy_bpf_prog.prog in the softirq triggered the invalid memory access.

To fix it, skip updating stats when stats is NULL.
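The shape of the guard in the stats update path (a sketch; 'duration'
stands for the runtime computed by the caller):

    if (prog->stats) {  /* dummy_bpf_prog carries no stats */
            struct bpf_prog_stats *stats = this_cpu_ptr(prog->stats);
            unsigned long flags;

            flags = u64_stats_update_begin_irqsave(&stats->syncp);
            u64_stats_inc(&stats->cnt);
            u64_stats_add(&stats->nsecs, duration);
            u64_stats_update_end_irqrestore(&stats->syncp, flags);
    }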

Fixes: 492ecee892 ("bpf: enable program stats")
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Link: https://lore.kernel.org/r/20251115102343.2200727-1-pulehui@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-17 22:35:51 -08:00
Altgelt, Max (Nextron)
4722981cca bpf: don't skip other information if xlated_prog_insns is skipped
If xlated_prog_insns should not be exposed, other information
(such as func_info) still can and should be filled in.
Therefore, instead of directly terminating in this case,
continue with the normal flow.

Signed-off-by: Max Altgelt <max.altgelt@nextron-systems.com>
Link: https://lore.kernel.org/r/efd00fcec5e3e247af551632726e2a90c105fbd8.camel@nextron-systems.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14 18:55:06 -08:00
Puranjay Mohan
4f7bc83b98 bpf: verifier: Move desc->imm setup to sort_kfunc_descs_by_imm_off()
Metadata about a kfunc call is added to the kfunc_tab in
add_kfunc_call() but the call instruction itself could get removed by
opt_remove_dead_code() later if it is not reachable.

If the call instruction is removed, specialize_kfunc() is never called
for it and the desc->imm in the kfunc_tab is never initialized for this
kfunc call. In this case, the sort_kfunc_descs_by_imm_off(env->prog)
call in do_misc_fixups() doesn't sort the table correctly.
This is a problem for s390 as its JIT uses this table to find the
addresses for kfuncs, and if this table is not sorted properly, the JIT
may fail to find addresses for valid kfunc calls.

This was exposed by:

commit d869d56ca8 ("bpf: verifier: refactor kfunc specialization")

as before this commit, desc->imm was initialised in add_kfunc_call()
which happens before dead code elimination.

Move the desc->imm setup down to sort_kfunc_descs_by_imm_off(); this
fixes the problem and also saves us from having the same logic in
add_kfunc_call() and specialize_kfunc().
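A sketch of the consolidated setup (the comparison function name is an
assumption; exact code may differ):

    for (i = 0; i < tab->nr_descs; i++)
            tab->descs[i].imm = BPF_CALL_IMM(tab->descs[i].addr);
    sort(tab->descs, tab->nr_descs, sizeof(tab->descs[0]),
         kfunc_desc_cmp_by_imm_off, NULL);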

Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251114154023.12801-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14 17:55:18 -08:00
Alexei Starovoitov
e47b68bda4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.18-rc5+
Cross-merge BPF and other fixes after downstream PR.

Minor conflict in kernel/bpf/helpers.c

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14 17:43:41 -08:00
Linus Torvalds
cbba5d1b53 bpf-fixes

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix interaction between livepatch and BPF fexit programs (Song Liu)
   With Steven and Masami acks.

 - Fix stack ORC unwind from BPF kprobe_multi (Jiri Olsa)
   With Steven and Masami acks.

 - Fix out of bounds access in widen_imprecise_scalars() in the verifier
   (Eduard Zingerman)

 - Fix conflicts between MPTCP and BPF sockmap (Jiayuan Chen)

 - Fix net_sched storage collision with BPF data_meta/data_end (Eric
   Dumazet)

 - Add _impl suffix to BPF kfuncs with implicit args to avoid breaking
   them in bpf-next when KF_IMPLICIT_ARGS is added (Mykyta Yatsenko)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Test widen_imprecise_scalars() with different stack depth
  bpf: account for current allocated stack depth in widen_imprecise_scalars()
  bpf: Add bpf_prog_run_data_pointers()
  selftests/bpf: Add mptcp test with sockmap
  mptcp: Fix proto fallback detection with BPF
  mptcp: Disallow MPTCP subflows from sockmap
  selftests/bpf: Add stacktrace ips test for raw_tp
  selftests/bpf: Add stacktrace ips test for kprobe_multi/kretprobe_multi
  x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe
  Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"
  bpf: add _impl suffix for bpf_stream_vprintk() kfunc
  bpf:add _impl suffix for bpf_task_work_schedule* kfuncs
  selftests/bpf: Add tests for livepatch + bpf trampoline
  ftrace: bpf: Fix IPMODIFY + DIRECT in modify_ftrace_direct()
  ftrace: Fix BPF fexit with livepatch
2025-11-14 15:39:39 -08:00
Menglong Dong
fea3f5e83c bpf: Handle return value of ftrace_set_filter_ip in register_fentry
The error returned by ftrace_set_filter_ip() in register_fentry() is
not handled properly. Fix it.
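The shape of the fix (a sketch):

    ret = ftrace_set_filter_ip(tr->fops, (unsigned long)ip, 0, 1);
    if (ret)
            return ret;  /* previously ignored */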

Fixes: 00963a2e75 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20251110120705.1553694-1-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14 13:31:30 -08:00
Eduard Zingerman
e5d2e34e72 bpf: Add missing checks to avoid verbose verifier log
There are a few places where log level is not checked before calling
"verbose()". This forces programs working only at
BPF_LOG_LEVEL_STATS (e.g. veristat) to allocate unnecessarily large
log buffers. Add missing checks.
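The pattern being added is the usual guarded verbose() call (a sketch):

    if (env->log.level & BPF_LOG_LEVEL2)
            verbose(env, "...");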

Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251114200542.912386-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14 13:29:12 -08:00
Sahil Chandna
c1da3df719 bpf: Prevent nesting overflow in bpf_try_get_buffers
bpf_try_get_buffers() returns one of multiple per-CPU buffers based on a
per-CPU nesting counter. This mechanism expects that buffers are not
endlessly acquired before being returned. migrate_disable() ensures that a
task remains on the same CPU, but it does not prevent the task from being
preempted by another task on that CPU.

Without disabled preemption, a task may be preempted while holding a
buffer, allowing another task to run on the same CPU and acquire an
additional buffer. Several such preemptions can cause the per-CPU
nest counter to exceed MAX_BPRINTF_NEST_LEVEL and trigger the warning in
bpf_try_get_buffers(). Adding preempt_disable()/preempt_enable() around
buffer acquisition and release prevents this task preemption and
preserves the intended bounded nesting behavior.
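Conceptually (a sketch; the per-CPU counter name follows the existing
bprintf code but is an assumption here):

    preempt_disable();
    nest = this_cpu_inc_return(bpf_bprintf_nest_level);
    if (nest > MAX_BPRINTF_NEST_LEVEL) {
            this_cpu_dec(bpf_bprintf_nest_level);
            preempt_enable();
            return -EBUSY;
    }
    /* hand out buffers[nest - 1]; the put path mirrors the
     * decrement and calls preempt_enable() */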

Reported-by: syzbot+b0cff308140f79a9c4cb@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68f6a4c8.050a0220.1be48.0011.GAE@google.com/
Fixes: 4223bf833c ("bpf: Remove preempt_disable in bpf_try_get_buffers")
Suggested-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com>
Link: https://lore.kernel.org/r/20251114064922.11650-1-chandna.sahil@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14 13:06:47 -08:00
Eduard Zingerman
b0c8e6d3d8 bpf: account for current allocated stack depth in widen_imprecise_scalars()
The usage pattern for widen_imprecise_scalars() looks as follows:

    prev_st = find_prev_entry(env, ...);
    queued_st = push_stack(...);
    widen_imprecise_scalars(env, prev_st, queued_st);

Where prev_st is an ancestor of the queued_st in the explored states
tree. This ancestor is not guaranteed to have same allocated stack
depth as queued_st. E.g. in the following case:

    def main():
      for i in 1..2:
        foo(i)        // same callsite, differnt param

    def foo(i):
      if i == 1:
        use 128 bytes of stack
      iterator based loop

Here, for a second 'foo' call prev_st->allocated_stack is 128,
while queued_st->allocated_stack is much smaller.
widen_imprecise_scalars() needs to take this into account and avoid
accessing bpf_verifier_state->frame[*]->stack out of bounds.
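The fix amounts to bounding the scan by both states' allocated stacks,
roughly (the helper name is illustrative):

    int slots = min(queued_st->frame[fr]->allocated_stack,
                    prev_st->frame[fr]->allocated_stack) / BPF_REG_SIZE;

    for (i = 0; i < slots; i++)
            maybe_widen_stack_slot(env, fr, i);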

Fixes: 2793a8b015 ("bpf: exact states comparison for iterator convergence checks")
Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251114025730.772723-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14 09:26:05 -08:00
Linus Torvalds
aecba2e013 Power management fixes for 6.18-rc6

Merge tag 'pm-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix issues related to the handling of compressed hibernation
  images and a recent intel_pstate driver regression:

   - Fix issues related to using inadequate data types and incorrect use
     of atomic variables in the compressed hibernation images handling
     code that were introduced during the 6.9 development cycle (Mario
     Limonciello)

   - Move a X86_FEATURE_IDA check from turbo_is_disabled() to the places
     where a new value for MSR_IA32_PERF_CTL is computed in intel_pstate
     to address a regression preventing users from enabling turbo
     frequencies post-boot (Srinivas Pandruvada)"

* tag 'pm-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Check IDA only before MSR_IA32_PERF_CTL writes
  PM: hibernate: Fix style issues in save_compressed_image()
  PM: hibernate: Use atomic64_t for compressed_size variable
  PM: hibernate: Emit an error when image writing fails
2025-11-13 16:31:07 -08:00
Rafael J. Wysocki
161284b26f Merge branch 'pm-sleep'
Merge fixes for issues related to the handling of compressed hibernation
images that were introduced during the 6.9 development cycle.

* pm-sleep:
  PM: hibernate: Fix style issues in save_compressed_image()
  PM: hibernate: Use atomic64_t for compressed_size variable
  PM: hibernate: Emit an error when image writing fails
2025-11-13 21:05:46 +01:00
Leon Hwang
6af6e49a76 bpf: Free special fields when update [lru_,]percpu_hash maps
As [lru_,]percpu_hash maps support BPF_KPTR_{REF,PERCPU}, missing
calls to 'bpf_obj_free_fields()' in 'pcpu_copy_value()' could cause the
memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the
map gets freed.

Fix this by calling 'bpf_obj_free_fields()' after
'copy_map_value[,_long]()' in 'pcpu_copy_value()'.
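The shape of the fix in pcpu_copy_value(), per the description above
(a sketch):

    copy_map_value(&htab->map, this_cpu_ptr(pptr), value);
    bpf_obj_free_fields(htab->map.record, this_cpu_ptr(pptr));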

Fixes: 65334e64a4 ("bpf: Support kptrs in percpu hashmap and percpu LRU hashmap")
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20251105151407.12723-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-13 09:14:15 -08:00
Kumar Kartikeya Dwivedi
3249e8a17e bpf: Adjust return value for queue destruction in rqspinlock
Return -ETIMEDOUT whenever non-head waiters are signalled by the head,
and fix an oversight in commit 7bd6e5ce5b ("rqspinlock: Disable queue
destruction for deadlocks"). We no longer signal on deadlocks.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20251111013827.1853484-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-12 11:17:39 -08:00
Pratyush Yadav
b05addf6f0 kho: warn and exit when unpreserved page wasn't preserved
Calling __kho_unpreserve() on a pair of (pfn, end_pfn) that wasn't
preserved is a bug.  Currently, if that is done, the physxa or bits can be
NULL.  This leads to a soft lockup, since a NULL physxa or bits results
in redoing the loop without ever making any progress.

Return when physxa or bits are not found, but WARN first to loudly
indicate invalid behaviour.
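The shape of the fix (a sketch using the names from the description
above):

    physxa = xa_load(&track->orders, order);
    if (WARN_ON_ONCE(!physxa))
            return;
    /* ... same WARN-and-return treatment when the bits lookup fails */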

Link: https://lkml.kernel.org/r/20251103180235.71409-3-pratyush@kernel.org
Fixes: fc33e4b44b ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 21:19:47 -08:00
Pratyush Yadav
7ecd2e439d kho: fix unpreservation of higher-order vmalloc preservations
kho_vmalloc_unpreserve_chunk() calls __kho_unpreserve() with end_pfn as
pfn + 1.  This happens to work for 0-order pages, but leaks higher order
pages.

For example, say order 2 pages back the allocation.  During preservation,
they get preserved in the order 2 bitmaps, but
kho_vmalloc_unpreserve_chunk() would try to unpreserve them from the order
0 bitmaps, which should not have these bits set anyway, leaving the order
2 bitmaps untouched.  This results in the pages being carried over to the
next kernel.  Nothing will free those pages in the next boot, leaking
them.

Fix this by taking the order into account when calculating the end PFN for
__kho_unpreserve().
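The fix is a one-liner in spirit (a sketch):

    __kho_unpreserve(track, pfn, pfn + (1 << order));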

Link: https://lkml.kernel.org/r/20251103180235.71409-2-pratyush@kernel.org
Fixes: a667300bd5 ("kho: add support for preserving vmalloc allocations")
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 21:19:47 -08:00
Pratyush Yadav
0b07092d09 kho: fix out-of-bounds access of vmalloc chunk
The list of pages in a vmalloc chunk is NULL-terminated.  So when looping
through the pages in a vmalloc chunk, both kho_restore_vmalloc() and
kho_vmalloc_unpreserve_chunk() rightly make sure to stop when encountering
a NULL page.  But when the chunk is full, the loops do not stop and go
past the bounds of chunk->phys, resulting in out-of-bounds memory access,
and possibly the restoration or unpreservation of an invalid page.

Fix this by making sure the processing of chunk stops at the end of the
array.
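The loop condition gains an explicit upper bound (a sketch):

    for (i = 0; i < ARRAY_SIZE(chunk->phys) && chunk->phys[i]; i++)
            /* restore or unpreserve chunk->phys[i] */;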

Link: https://lkml.kernel.org/r/20251103110159.8399-1-pratyush@kernel.org
Fixes: a667300bd5 ("kho: add support for preserving vmalloc allocations")
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 21:19:47 -08:00
Peter Oberparleiter
ec4d11fc4b gcov: add support for GCC 15
Using gcov on kernels compiled with GCC 15 results in truncated 16-byte
long .gcda files with no usable data.  To fix this, update GCOV_COUNTERS
to match the value defined by GCC 15.

Tested with GCC 14.3.0 and GCC 15.2.0.

Link: https://lkml.kernel.org/r/20251028115125.1319410-1-oberpar@linux.ibm.com
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reported-by: Matthieu Baerts <matttbe@kernel.org>
Closes: https://github.com/linux-test-project/lcov/issues/445
Tested-by: Matthieu Baerts <matttbe@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 21:19:44 -08:00
Pasha Tatashin
fa759cd75b kho: allocate metadata directly from the buddy allocator
KHO allocates metadata for its preserved memory map using the slab
allocator via kzalloc().  This metadata is temporary and is used by the
next kernel during early boot to find preserved memory.

A problem arises when KFENCE is enabled.  kzalloc() calls can be randomly
intercepted by kfence_alloc(), which services the allocation from a
dedicated KFENCE memory pool.  This pool is allocated early in boot via
memblock.

When booting via KHO, the memblock allocator is restricted to a "scratch
area", forcing the KFENCE pool to be allocated within it.  This creates a
conflict, as the scratch area is expected to be ephemeral and
overwriteable by a subsequent kexec.  If KHO metadata is placed in this
KFENCE pool, it leads to memory corruption when the next kernel is loaded.

To fix this, modify KHO to allocate its metadata directly from the buddy
allocator instead of slab.
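The allocation swap, in spirit (a sketch):

    /* was: elm = kzalloc(PAGE_SIZE, GFP_KERNEL); KFENCE may intercept */
    elm = (void *)get_zeroed_page(GFP_KERNEL); /* buddy, never KFENCE */

    /* and on teardown: */
    free_page((unsigned long)elm);             /* instead of kfree() */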

Link: https://lkml.kernel.org/r/20251021000852.2924827-4-pasha.tatashin@soleen.com
Fixes: fc33e4b44b ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: David Matlack <dmatlack@google.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 21:19:42 -08:00
Pasha Tatashin
a2fff99f92 kho: increase metadata bitmap size to PAGE_SIZE
KHO memory preservation metadata is preserved in 512 byte chunks, which
requires allocating it from the slab allocator. Slabs are not safe to
use with KHO because of kfence, and because partial slabs may leak
into the next kernel. Change the size to PAGE_SIZE.

The kfence specifically may cause memory corruption, where it randomly
provides slab objects that can be within the scratch area. The reason
for that is that kfence allocates its objects before the KHO scratch
area is marked as a CMA region.

While this change could potentially increase metadata overhead on systems
with sparsely preserved memory, this is being mitigated by ongoing work to
reduce sparseness during preservation via 1G guest pages.  Furthermore,
this change aligns with future work on a stateless KHO, which will also
use page-sized bitmaps for its radix tree metadata.

Link: https://lkml.kernel.org/r/20251021000852.2924827-3-pasha.tatashin@soleen.com
Fixes: fc33e4b44b ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 21:19:41 -08:00
Pasha Tatashin
e38f65d317 kho: warn and fail on metadata or preserved memory in scratch area
Patch series "KHO: kfence + KHO memory corruption fix", v3.

This series fixes a memory corruption bug in KHO that occurs when KFENCE
is enabled.

The root cause is that KHO metadata, allocated via kzalloc(), can be
randomly serviced by kfence_alloc().  When a kernel boots via KHO, the
early memblock allocator is restricted to a "scratch area".  This forces
the KFENCE pool to be allocated within this scratch area, creating a
conflict.  If KHO metadata is subsequently placed in this pool, it gets
corrupted during the next kexec operation.

Google is using KHO and has had obscure crashes due to this memory
corruption, with stacks all over the place.  I would prefer this fix to
be properly backported to stable so we can also automatically consume it
once we switch to the upstream KHO.

Patch 1/3 introduces a debug-only feature (CONFIG_KEXEC_HANDOVER_DEBUG)
that adds checks to detect and fail any operation that attempts to place
KHO metadata or preserved memory within the scratch area.  This serves as
a validation and diagnostic tool to confirm the problem without affecting
production builds.

Patch 2/3 Increases bitmap to PAGE_SIZE, so buddy allocator can be used.

Patch 3/3 Provides the fix by modifying KHO to allocate its metadata
directly from the buddy allocator instead of slab.  This bypasses the
KFENCE interception entirely.


This patch (of 3):

It is invalid for KHO metadata or preserved memory regions to be located
within the KHO scratch area, as this area is overwritten when the next
kernel is loaded, and used early in boot by the next kernel.  This can
lead to memory corruption.

Add checks to kho_preserve_* and KHO's internal metadata allocators
(xa_load_or_alloc, new_chunk) to verify that the physical address of the
memory does not overlap with any defined scratch region.  If an overlap is
detected, the operation will fail and a WARN_ON is triggered.  To avoid
performance overhead in production kernels, these checks are enabled only
when CONFIG_KEXEC_HANDOVER_DEBUG is selected.

[rppt@kernel.org: fix KEXEC_HANDOVER_DEBUG Kconfig dependency]
  Link: https://lkml.kernel.org/r/aQHUyyFtiNZhx8jo@kernel.org
[pasha.tatashin@soleen.com: build fix]
  Link: https://lkml.kernel.org/r/CA+CK2bBnorfsTymKtv4rKvqGBHs=y=MjEMMRg_tE-RME6n-zUw@mail.gmail.com
Link: https://lkml.kernel.org/r/20251021000852.2924827-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20251021000852.2924827-2-pasha.tatashin@soleen.com
Fixes: fc33e4b44b ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09 21:19:41 -08:00
Linus Torvalds
b5c0946029 Fix a group-throttling bug in the fair scheduler.
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'sched-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Fix a group-throttling bug in the fair scheduler"

* tag 'sched-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
2025-11-08 08:59:05 -08:00
Linus Torvalds
133262cae9 Fix a system hang caused by cpu-clock events.
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'perf-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event fix from Ingo Molnar:
 "Fix a system hang caused by cpu-clock events deadlock"

* tag 'perf-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix system hang caused by cpu-clock usage
2025-11-08 08:54:13 -08:00
Linus Torvalds
e6f55fe790 Fix (well, cut in half) a futex performance regression on PowerPC.
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'locking-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Ingo Molnar:
 "Fix (well, cut in half) a futex performance regression on PowerPC"

* tag 'locking-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Optimize per-cpu reference counting
2025-11-08 08:51:22 -08:00
Linus Torvalds
5b95a50001 Fixes for tracing:

Merge tag 'trace-v6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Check for reader catching up in ring_buffer_map_get_reader()

   If the reader catches up to the writer in the memory mapped ring
   buffer then calling rb_get_reader_page() will return NULL as there's
   no pages left. But this isn't checked for before calling
   rb_get_reader_page() and the return of NULL causes a warning.

   If it is detected that the reader caught up to the writer, then
   simply exit the routine

 - Fix memory leak in histogram create_field_var()

   A couple of the error paths in create_field_var() did not properly
   clean up what was allocated. Make sure everything is freed properly
   on error

 - Fix help message of tools latency_collector

   The help message incorrectly stated that "-t" was the same as
   "--threads" whereas "--threads" is actually represented by "-e"

* tag 'trace-v6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/tools: Fix incorrcet short option in usage text for --threads
  tracing: Fix memory leaks in create_field_var()
  ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches up
2025-11-07 08:07:11 -08:00
Mario Limonciello (AMD)
0b6c10cb84 PM: hibernate: Fix style issues in save_compressed_image()
Address two issues indicated by checkpatch:

 - Trailing statements should be on next line.
 - Prefer 'unsigned int' to bare use of 'unsigned'.

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
[ rjw: Changelog edits ]
Link: https://patch.msgid.link/20251106045158.3198061-4-superm1@kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-07 16:53:57 +01:00
Mario Limonciello (AMD)
66ededc694 PM: hibernate: Use atomic64_t for compressed_size variable
`compressed_size` can overflow, showing nonsensical values.

Change from `atomic_t` to `atomic64_t` to prevent overflow.

Fixes: a06c6f5d3c ("PM: hibernate: Move to crypto APIs for LZO compression")
Reported-by: Askar Safin <safinaskar@gmail.com>
Closes: https://lore.kernel.org/linux-pm/20251105180506.137448-1-safinaskar@gmail.com/
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: Askar Safin <safinaskar@gmail.com>
Cc: 6.9+ <stable@vger.kernel.org> # 6.9+
Link: https://patch.msgid.link/20251106045158.3198061-3-superm1@kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-07 16:53:56 +01:00
Mario Limonciello (AMD)
62b9ca1706 PM: hibernate: Emit an error when image writing fails
If image writing fails, a return code is passed up to the caller, but
none of the callers log anything to the log and so the only record
of it is the return code that userspace gets.

Adjust the logging so that the image size and speed of writing is
only emitted on success and if there is an error, it's saved to the
logs.

Fixes: a06c6f5d3c ("PM: hibernate: Move to crypto APIs for LZO compression")
Reported-by: Askar Safin <safinaskar@gmail.com>
Closes: https://lore.kernel.org/linux-pm/20251105180506.137448-1-safinaskar@gmail.com/
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: Askar Safin <safinaskar@gmail.com>
Cc: 6.9+ <stable@vger.kernel.org> # 6.9+
[ rjw: Added missing braces after "else", changelog edits ]
Link: https://patch.msgid.link/20251106045158.3198061-2-superm1@kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-07 16:53:56 +01:00
Zilin Guan
80f0d631dc tracing: Fix memory leaks in create_field_var()
The function create_field_var() allocates memory for 'val' through
create_hist_field() inside parse_atom(), and for 'var' through
create_var(), which in turn allocates var->type and var->var.name
internally. Simply calling kfree() to release these structures will
result in memory leaks.

Use destroy_hist_field() to properly free 'val', and explicitly release
the memory of var->type and var->var.name before freeing 'var' itself.
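The corrected error path, per the description above (a sketch):

    destroy_hist_field(val, 0);
    kfree(var->var.name);
    kfree(var->type);
    kfree(var);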

Link: https://patch.msgid.link/20251106120132.3639920-1-zilin@seu.edu.cn
Fixes: 02205a6752 ("tracing: Add support for 'field variables'")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-06 19:51:33 -05:00
Steven Rostedt
aa997d2d2a ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches up
The function ring_buffer_map_get_reader() is a bit more strict than the
other get reader functions, and except for certain situations the
rb_get_reader_page() should not return NULL. If it does, it triggers a
warning.

This warning was triggering, and after looking at why, it turned out
that another acceptable situation was happening that wasn't checked for.

If the reader catches up to the writer and there's still data to be read
on the reader page, then the rb_get_reader_page() will return NULL as
there's no new page to get.

In this situation, the reader page should not be updated and no warning
should trigger.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Reported-by: syzbot+92a3745cea5ec6360309@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/690babec.050a0220.baf87.0064.GAE@google.com/
Link: https://lore.kernel.org/20251016132848.1b11bb37@gandalf.local.home
Fixes: 117c39200d ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-06 19:38:54 -05:00
Puranjay Mohan
f8c67d8550 bpf: Use kmalloc_nolock() in range tree
The range tree uses bpf_mem_alloc() that is safe to be called from all
contexts and uses a pre-allocated pool of memory to serve these
allocations.

Replace bpf_mem_alloc() with kmalloc_nolock() as it can be called safely
from all contexts and is more scalable than bpf_mem_alloc().

Remove the migrate_disable/enable pairs, as they were only needed for
bpf_mem_alloc() because it does per-cpu operations; kmalloc_nolock()
doesn't need this.
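The swap, in spirit (a sketch; the gfp/node arguments shown are
assumptions):

    /* was: rn = bpf_mem_alloc(&bpf_global_ma, sizeof(struct range_node)); */
    rn = kmalloc_nolock(sizeof(struct range_node), 0, NUMA_NO_NODE);
    if (!rn)
            return -ENOMEM;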

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251106170608.4800-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-06 15:55:19 -08:00
Masami Hiramatsu (Google)
c91afa7610 tracing: tprobe-events: Fix to put tracepoint_user when disable the tprobe
__unregister_trace_fprobe() checks tf->tuser and puts it when removing
a tprobe. However, disable_trace_fprobe() does not do this and only
calls unregister_fprobe(), so it forgets to disable the tracepoint_user.

If the trace_fprobe has a tuser, put it when disabling the tprobe so
that the tracepoint callbacks are unregistered correctly.

Link: https://lore.kernel.org/all/176244794466.155515.3971904050506100243.stgit@devnote2/

Fixes: 2867495dea ("tracing: tprobe-events: Register tracepoint when enable tprobe event")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Tested-by: Beau Belgrave <beaub@linux.microsoft.com>
Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>
2025-11-07 07:36:20 +09:00
Masami Hiramatsu (Google)
10d9dda426 tracing: tprobe-events: Fix to register tracepoint correctly
Since __tracepoint_user_init() calls tracepoint_user_register() without
initializing tuser->tpoint with the given tracepoint, it does not
register the tracepoint stub function as a callback correctly, and the
tprobe does not work.

Initialize tuser->tpoint correctly before tracepoint_user_register() so
that it sets up the tracepoint callback.

I confirmed below example works fine again.

echo "t sched_switch preempt prev_pid=prev->pid next_pid=next->pid" > /sys/kernel/tracing/dynamic_events
echo 1 > /sys/kernel/tracing/events/tracepoints/sched_switch/enable
cat /sys/kernel/tracing/trace_pipe

Link: https://lore.kernel.org/all/176244793514.155515.6466348656998627773.stgit@devnote2/

Fixes: 2867495dea ("tracing: tprobe-events: Register tracepoint when enable tprobe event")
Reported-by: Beau Belgrave <beaub@linux.microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Tested-by: Beau Belgrave <beaub@linux.microsoft.com>
Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>
2025-11-07 07:32:55 +09:00
Peter Zijlstra
4cb5ac2626 futex: Optimize per-cpu reference counting
Shrikanth noted that the per-cpu reference counter was still some 10%
slower than the old immutable option (which removes the reference
counting entirely).

Further optimize the per-cpu reference counter by:

 - switching from RCU to preempt;
 - using __this_cpu_*() since we now have preempt disabled;
 - switching from smp_load_acquire() to READ_ONCE().

This is all safe because disabling preemption inhibits the RCU grace
period exactly like rcu_read_lock().

Having preemption disabled allows using __this_cpu_*() provided the
only access to the variable is in task context -- which is the case
here.

Furthermore, since we know changing fph->state to FR_ATOMIC demands a
full RCU grace period we can rely on the implied smp_mb() from that to
replace the acquire barrier().

This is very similar to the percpu_down_read_internal() fast-path.
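The resulting fast path is shaped like this (field names are
illustrative except fph->state and FR_ATOMIC, which appear above):

    preempt_disable();                       /* pins the grace period */
    if (READ_ONCE(fph->state) != FR_ATOMIC)  /* no acquire needed */
            __this_cpu_inc(*fph->percpu_users);   /* fast path */
    else
            atomic_long_inc(&fph->atomic_users);  /* atomic mode */
    preempt_enable();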

The reason this is significant for PowerPC is that it uses the generic
this_cpu_*() implementation, which relies on local_irq_disable() (the
x86 implementation relies on it being a single memop instruction to be
IRQ-safe). Switching to preempt_disable() and __this_cpu_*() avoids
this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
barrier, so not having to use explicit barriers saves a bunch.

Combined this reduces the performance gap by half, down to some 5%.

Fixes: 760e6f7bef ("futex: Remove support for IMMUTABLE")
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20251106092929.GR4067720@noisy.programming.kicks-ass.net
2025-11-06 12:30:54 +01:00
Aaron Lu
956dfda6a7 sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
When a cfs_rq is to be throttled, its limbo list should be empty, and
that's why there is a warn in tg_throttle_down() for a non-empty
cfs_rq->throttled_limbo_list.

When running a test with the following hierarchy:

          root
        /      \
        A*     ...
     /  |  \   ...
        B
       /  \
      C*

where both A and C have quota settings, that warn on a non-empty limbo
list is triggered for a cfs_rq of C; let's call it cfs_rq_c (and ignore
the cpu part of the cfs_rq for the sake of simpler representation).

Debug showed it happened like this:
Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
equals to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
*multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
these throttled tasks are directly placed into cfs_rq_c's limbo list by
enqueue_throttled_task().

Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
can't get more runtime and enters tg_throttle_down(), where the warning
is hit due to remaining tasks in the limbo list.

I think it's chaotic to trigger a throttle on the unthrottle path, as
the status of a cfs_rq being unthrottled can end up mixed. So fix this
by granting 1ns to the cfs_rq in tg_set_cfs_bandwidth(). This ensures
cfs_rq_c has a positive runtime_remaining when initialized as
unthrottled and cannot enter tg_unthrottle_up() with zero
runtime_remaining.

Also, update outdated comments in tg_throttle_down() since
unthrottle_cfs_rq() is no longer called with zero runtime_remaining.
While at it, remove a redundant assignment to se in tg_throttle_down().
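The core of the fix (a sketch):

    /* in tg_set_cfs_bandwidth(), when initializing each cfs_rq */
    cfs_rq->runtime_remaining = 1;  /* 1ns, never unthrottled at zero */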

Fixes: e1fad12dcb ("sched/fair: Switch to task based throttle model")
Reviewed-By: Benjamin Segall <bsegall@google.com>
Suggested-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Hao Jia <jiahao1@lixiang.com>
Link: https://patch.msgid.link/20251030032755.560-1-ziqianlu@bytedance.com
2025-11-06 12:30:52 +01:00
Anton Protopopov
bc414d3583 bpf: disasm: add support for BPF_JMP|BPF_JA|BPF_X
Add support for indirect jump instruction.

Example output from bpftool:

   0: (79) r3 = *(u64 *)(r1 +0)
   1: (25) if r3 > 0x4 goto pc+666
   2: (67) r3 <<= 3
   3: (18) r1 = 0xffffbeefspameggs
   5: (0f) r1 += r3
   6: (79) r1 = *(u64 *)(r1 +0)
   7: (0d) gotox r1

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251105090410.1250500-10-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-05 17:53:23 -08:00