mirror of
https://github.com/torvalds/linux.git
synced 2026-03-08 04:24:31 +01:00
50929 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
cb36eabcaf |
Miscellaneous fixes:
- Fix lock ordering bug found by lockdep in perf_event_wakeup() - Fix uncore counter enumeration on Granite Rapids and Sierra Forest - Fix perf_mmap() refcount bug found by Syzkaller - Fix __perf_event_overflow() vs. perf_remove_from_context() race Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmj/7MRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gXPg//V/Qrbnc9jYyyA9ZT9hGg4Oz36HSLLuRe zcpb0Fndmyjt6Fq0vwqN59UcRM2coJjZ6V3TUyhQJjstnzkmMBsPE3frx+VjUqA6 rfUukgSr0mhlT/OtlBx0hUKBaiPvNe9khnKXLo1mO5aEVkIHryPbPcU/VLG45jE1 sF+dFP1cFVgyNqac8Ai4oNLsoRNQqWAvD0UrYHijJpFE6GqW8rBm2ReASbk/RMjv s5CqpdLiRFmOoQ1Vwu9iG/wej0OMVJWUEpbmcqysT7UAMdoYtEm79HYON1Ez+ZEx x6JEV8y3bv01MZc+HmP4mvKDgo5w1zxzNk3Smsx2sscUsZYVcv4zvG9C7UlkSsJ4 uWI6wwAPc1euBAmduTMDEyQr5CkjS3Rdb83s9+I2LtZXCP73+FPEhekbMx9mIhJi Qw+H6QFeacpFso74vjfK4nGEEz0GbjWaT+VLBSJkwhOd/+/fkWyHsQoU8DPfC8nH ETMaYGXpW80XRB5ttz/MoJfmXi2ovJsVpyvd06zJE0JzKdiydJsC0d6xGHY9JBCg 07bg0ux8/hX8grNVDWusvw2S15rso3RUOq9uajsTzlr728+hbCVZba87UYlgUNHL +uA7IyX1WrY9DApXKmOWi9MTRkvdAQz6r43QMk+xkDd6b8JrOOMAJFqYuF8xD5Da mXy3HKkIKag= =0JyK -----END PGP SIGNATURE----- Merge tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events fixes from Ingo Molnar: - Fix lock ordering bug found by lockdep in perf_event_wakeup() - Fix uncore counter enumeration on Granite Rapids and Sierra Forest - Fix perf_mmap() refcount bug found by Syzkaller - Fix __perf_event_overflow() vs perf_remove_from_context() race * tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix __perf_event_overflow() vs perf_remove_from_context() race perf/core: Fix refcount bug and potential UAF in perf_mmap perf/x86/intel/uncore: Add per-scheduler IMC CAS count events perf/core: Fix invalid wait context in ctx_sched_in() |
||
|
|
eb71ab2bf7 |
bpf-fixes
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmmjmDIACgkQ6rmadz2v bTq3gg//QQLOT/FxP2/dDurliDTXvQRr1tUxmIw6s3P6hnz9j/LLEVKpLRVkqd8t XEwbubPd1TXDRsJ4f26Ew01YUtf9xi6ZQoMe/BL1okxi0ZwQGGRVMkiKOQgRT+rj qYSN5JMfPzA2AuM6FjBF/hhw24yVRdgKRYBam6D7XLfFf3s8TOhHHjJ925PqEo0t uJOy4ddDYB9BcGmfoeyiFgUtpPqcYrKIUCLBdwFvT2fnPJvrFFoCF3t7NS9UJu/O wd6ZPuGWSOl9A7vSheldP6cJUDX8L/5WEGO4/LjN7plkySF0HNv8uq/b1T3kKqoY Y3unXerLGJUAA9D5wpYAekx9YmvRTPQ/o39oTbquEB4SSJVU/SPUpvFw7m2Moq10 51yuyXLcPlI3xtk0Bd8c/CESSmkRenjWzsuZQhDGhsR0I9mIaALrhf9LaatHtXI5 f5ct73e+beK7Fc0Ze+b0JxDeFvzA3CKfAF0/fvGt0r9VZjBaMD+a3NnscBlyKztW UCXazcfndMhNfUUWanktbT5YhYPmY7hzVQEOl7HAMGn4yG6XbXXmzzY6BqEXIucM etueW2msZJHGBHQGe2RK3lxtmiB7/FglJHd86xebkIU2gCzqt8fGUha8AIuJ4rLS 7wxC33DycCofRGWdseVu7PsTasdhSGsHKbXz2fOFOFESOczYRw8= =fj3P -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix alignment of arm64 JIT buffer to prevent atomic tearing (Fuad Tabba) - Fix invariant violation for single value tnums in the verifier (Harishankar Vishwanathan, Paul Chaignon) - Fix a bunch of issues found by ASAN in selftests/bpf (Ihor Solodrai) - Fix race in devmpa and cpumap on PREEMPT_RT (Jiayuan Chen) - Fix show_fdinfo of kprobe_multi when cookies are not present (Jiri Olsa) - Fix race in freeing special fields in BPF maps to prevent memory leaks (Kumar Kartikeya Dwivedi) - Fix OOB read in dmabuf_collector (T.J. Mercier) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (36 commits) selftests/bpf: Avoid simplification of crafted bounds test selftests/bpf: Test refinement of single-value tnum bpf: Improve bounds when tnum has a single possible value bpf: Introduce tnum_step to step through tnum's members bpf: Fix race in devmap on PREEMPT_RT bpf: Fix race in cpumap on PREEMPT_RT selftests/bpf: Add tests for special fields races bpf: Retire rcu_trace_implies_rcu_gp() from local storage bpf: Delay freeing fields in local storage bpf: Lose const-ness of map in map_check_btf() bpf: Register dtor for freeing special fields selftests/bpf: Fix OOB read in dmabuf_collector selftests/bpf: Fix a memory leak in xdp_flowtable test bpf: Fix stack-out-of-bounds write in devmap bpf: Fix kprobe_multi cookies access in show_fdinfo callback bpf, arm64: Force 8-byte alignment for JIT buffer to prevent atomic tearing selftests/bpf: Don't override SIGSEGV handler with ASAN selftests/bpf: Check BPFTOOL env var in detect_bpftool_path() selftests/bpf: Fix out-of-bounds array access bugs reported by ASAN selftests/bpf: Fix array bounds warning in jit_disasm_helpers ... |
||
|
|
efc11a6678 |
bpf: Improve bounds when tnum has a single possible value
We're hitting an invariant violation in Cilium that sometimes leads to
BPF programs being rejected and Cilium failing to start [1]. The
following extract from verifier logs shows what's happening:
from 201 to 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0
236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0
; if (magic == MARK_MAGIC_HOST || magic == MARK_MAGIC_OVERLAY || magic == MARK_MAGIC_ENCRYPT) @ bpf_host.c:1337
236: (16) if w9 == 0xe00 goto pc+45 ; R9=scalar(smin=umin=smin32=umin32=3585,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100))
237: (16) if w9 == 0xf00 goto pc+1
verifier bug: REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0xe01, 0xe00] s64=[0xe01, 0xe00] u32=[0xe01, 0xe00] s32=[0xe01, 0xe00] var_off=(0xe00, 0x0)
We reach instruction 236 with two possible values for R9, 0xe00 and
0xf00. This is perfectly reflected in the tnum, but of course the ranges
are less accurate and cover [0xe00; 0xf00]. Taking the fallthrough path
at instruction 236 allows the verifier to reduce the range to
[0xe01; 0xf00]. The tnum is however not updated.
With these ranges, at instruction 237, the verifier is not able to
deduce that R9 is always equal to 0xf00. Hence the fallthrough pass is
explored first, the verifier refines the bounds using the assumption
that R9 != 0xf00, and ends up with an invariant violation.
This pattern of impossible branch + bounds refinement is common to all
invariant violations seen so far. The long-term solution is likely to
rely on the refinement + invariant violation check to detect dead
branches, as started by Eduard. To fix the current issue, we need
something with less refactoring that we can backport.
This patch uses the tnum_step helper introduced in the previous patch to
detect the above situation. In particular, three cases are now detected
in the bounds refinement:
1. The u64 range and the tnum only overlap in umin.
u64: ---[xxxxxx]-----
tnum: --xx----------x-
2. The u64 range and the tnum only overlap in the maximum value
represented by the tnum, called tmax.
u64: ---[xxxxxx]-----
tnum: xx-----x--------
3. The u64 range and the tnum only overlap in between umin (excluded)
and umax.
u64: ---[xxxxxx]-----
tnum: xx----x-------x-
To detect these three cases, we call tnum_step(tnum, umin), which
returns the smallest member of the tnum greater than umin, called
tnum_next here. We're in case (1) if umin is part of the tnum and
tnum_next is greater than umax. We're in case (2) if umin is not part of
the tnum and tnum_next is equal to tmax. Finally, we're in case (3) if
umin is not part of the tnum, tnum_next is inferior or equal to umax,
and calling tnum_step a second time gives us a value past umax.
This change implements these three cases. With it, the above bytecode
looks as follows:
0: (85) call bpf_get_prandom_u32#7 ; R0=scalar()
1: (47) r0 |= 3584 ; R0=scalar(smin=0x8000000000000e00,umin=umin32=3584,smin32=0x80000e00,var_off=(0xe00; 0xfffffffffffff1ff))
2: (57) r0 &= 3840 ; R0=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100))
3: (15) if r0 == 0xe00 goto pc+2 ; R0=3840
4: (15) if r0 == 0xf00 goto pc+1
4: R0=3840
6: (95) exit
In addition to the new selftests, this change was also verified with
Agni [3]. For the record, the raw SMT is available at [4]. The property
it verifies is that: If a concrete value x is contained in all input
abstract values, after __update_reg_bounds, it will continue to be
contained in all output abstract values.
Link: https://github.com/cilium/cilium/issues/44216 [1]
Link: https://pchaigno.github.io/test-verifier-complexity.html [2]
Link: https://github.com/bpfverif/agni [3]
Link: https://pastebin.com/raw/naCfaqNx [4]
Fixes:
|
||
|
|
76e954155b |
bpf: Introduce tnum_step to step through tnum's members
This commit introduces tnum_step(), a function that, when given t, and a
number z returns the smallest member of t larger than z. The number z
must be greater or equal to the smallest member of t and less than the
largest member of t.
The first step is to compute j, a number that keeps all of t's known
bits, and matches all unknown bits to z's bits. Since j is a member of
the t, it is already a candidate for result. However, we want our result
to be (minimally) greater than z.
There are only two possible cases:
(1) Case j <= z. In this case, we want to increase the value of j and
make it > z.
(2) Case j > z. In this case, we want to decrease the value of j while
keeping it > z.
(Case 1) j <= z
t = xx11x0x0
z = 10111101 (189)
j = 10111000 (184)
^
k
(Case 1.1) Let's first consider the case where j < z. We will address j
== z later.
Since z > j, there had to be a bit position that was 1 in z and a 0 in
j, beyond which all positions of higher significance are equal in j and
z. Further, this position could not have been unknown in a, because the
unknown positions of a match z. This position had to be a 1 in z and
known 0 in t.
Let k be position of the most significant 1-to-0 flip. In our example, k
= 3 (starting the count at 1 at the least significant bit). Setting (to
1) the unknown bits of t in positions of significance smaller than
k will not produce a result > z. Hence, we must set/unset the unknown
bits at positions of significance higher than k. Specifically, we look
for the next larger combination of 1s and 0s to place in those
positions, relative to the combination that exists in z. We can achieve
this by concatenating bits at unknown positions of t into an integer,
adding 1, and writing the bits of that result back into the
corresponding bit positions previously extracted from z.
>From our example, considering only positions of significance greater
than k:
t = xx..x
z = 10..1
+ 1
-----
11..0
This is the exact combination 1s and 0s we need at the unknown bits of t
in positions of significance greater than k. Further, our result must
only increase the value minimally above z. Hence, unknown bits in
positions of significance smaller than k should remain 0. We finally
have,
result = 11110000 (240)
(Case 1.2) Now consider the case when j = z, for example
t = 1x1x0xxx
z = 10110100 (180)
j = 10110100 (180)
Matching the unknown bits of the t to the bits of z yielded exactly z.
To produce a number greater than z, we must set/unset the unknown bits
in t, and *all* the unknown bits of t candidates for being set/unset. We
can do this similar to Case 1.1, by adding 1 to the bits extracted from
the masked bit positions of z. Essentially, this case is equivalent to
Case 1.1, with k = 0.
t = 1x1x0xxx
z = .0.1.100
+ 1
---------
.0.1.101
This is the exact combination of bits needed in the unknown positions of
t. After recalling the known positions of t, we get
result = 10110101 (181)
(Case 2) j > z
t = x00010x1
z = 10000010 (130)
j = 10001011 (139)
^
k
Since j > z, there had to be a bit position which was 0 in z, and a 1 in
j, beyond which all positions of higher significance are equal in j and
z. This position had to be a 0 in z and known 1 in t. Let k be the
position of the most significant 0-to-1 flip. In our example, k = 4.
Because of the 0-to-1 flip at position k, a member of t can become
greater than z if the bits in positions greater than k are themselves >=
to z. To make that member *minimally* greater than z, the bits in
positions greater than k must be exactly = z. Hence, we simply match all
of t's unknown bits in positions more significant than k to z's bits. In
positions less significant than k, we set all t's unknown bits to 0
to retain minimality.
In our example, in positions of greater significance than k (=4),
t=x000. These positions are matched with z (1000) to produce 1000. In
positions of lower significance than k, t=10x1. All unknown bits are set
to 0 to produce 1001. The final result is:
result = 10001001 (137)
This concludes the computation for a result > z that is a member of t.
The procedure for tnum_step() in this commit implements the idea
described above. As a proof of correctness, we verified the algorithm
against a logical specification of tnum_step. The specification asserts
the following about the inputs t, z and output res that:
1. res is a member of t, and
2. res is strictly greater than z, and
3. there does not exist another value res2 such that
3a. res2 is also a member of t, and
3b. res2 is greater than z
3c. res2 is smaller than res
We checked the implementation against this logical specification using
an SMT solver. The verification formula in SMTLIB format is available
at [1]. The verification returned an "unsat": indicating that no input
assignment exists for which the implementation and the specification
produce different outputs.
In addition, we also automatically generated the logical encoding of the
C implementation using Agni [2] and verified it against the same
specification. This verification also returned an "unsat", confirming
that the implementation is equivalent to the specification. The formula
for this check is also available at [3].
Link: https://pastebin.com/raw/2eRWbiit [1]
Link: https://github.com/bpfverif/agni [2]
Link: https://pastebin.com/raw/EztVbBJ2 [3]
Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Link: https://lore.kernel.org/r/93fdf71910411c0f19e282ba6d03b4c65f9c5d73.1772225741.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
||
|
|
1872e75375 |
bpf: Fix race in devmap on PREEMPT_RT
On PREEMPT_RT kernels, the per-CPU xdp_dev_bulk_queue (bq) can be
accessed concurrently by multiple preemptible tasks on the same CPU.
The original code assumes bq_enqueue() and __dev_flush() run atomically
with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() (when
PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption, which allows CFS scheduling to preempt a task during
bq_xmit_all(), enabling another task on the same CPU to enter
bq_enqueue() and operate on the same per-CPU bq concurrently.
This leads to several races:
1. Double-free / use-after-free on bq->q[]: bq_xmit_all() snapshots
cnt = bq->count, then iterates bq->q[0..cnt-1] to transmit frames.
If preempted after the snapshot, a second task can call bq_enqueue()
-> bq_xmit_all() on the same bq, transmitting (and freeing) the
same frames. When the first task resumes, it operates on stale
pointers in bq->q[], causing use-after-free.
2. bq->count and bq->q[] corruption: concurrent bq_enqueue() modifying
bq->count and bq->q[] while bq_xmit_all() is reading them.
3. dev_rx/xdp_prog teardown race: __dev_flush() clears bq->dev_rx and
bq->xdp_prog after bq_xmit_all(). If preempted between
bq_xmit_all() return and bq->dev_rx = NULL, a preempting
bq_enqueue() sees dev_rx still set (non-NULL), skips adding bq to
the flush_list, and enqueues a frame. When __dev_flush() resumes,
it clears dev_rx and removes bq from the flush_list, orphaning the
newly enqueued frame.
4. __list_del_clearprev() on flush_node: similar to the cpumap race,
both tasks can call __list_del_clearprev() on the same flush_node,
the second dereferences the prev pointer already set to NULL.
The race between task A (__dev_flush -> bq_xmit_all) and task B
(bq_enqueue -> bq_xmit_all) on the same CPU:
Task A (xdp_do_flush) Task B (ndo_xdp_xmit redirect)
---------------------- --------------------------------
__dev_flush(flush_list)
bq_xmit_all(bq)
cnt = bq->count /* e.g. 16 */
/* start iterating bq->q[] */
<-- CFS preempts Task A -->
bq_enqueue(dev, xdpf)
bq->count == DEV_MAP_BULK_SIZE
bq_xmit_all(bq, 0)
cnt = bq->count /* same 16! */
ndo_xdp_xmit(bq->q[])
/* frames freed by driver */
bq->count = 0
<-- Task A resumes -->
ndo_xdp_xmit(bq->q[])
/* use-after-free: frames already freed! */
Fix this by adding a local_lock_t to xdp_dev_bulk_queue and acquiring
it in bq_enqueue() and __dev_flush(). These paths already run under
local_bh_disable(), so use local_lock_nested_bh() which on non-RT is
a pure annotation with no overhead, and on PREEMPT_RT provides a
per-CPU sleeping lock that serializes access to the bq.
Fixes:
|
||
|
|
869c63d597 |
bpf: Fix race in cpumap on PREEMPT_RT
On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed
concurrently by multiple preemptible tasks on the same CPU.
The original code assumes bq_enqueue() and __cpu_map_flush() run
atomically with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() (when
PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption, which allows CFS scheduling to preempt a task during
bq_flush_to_queue(), enabling another task on the same CPU to enter
bq_enqueue() and operate on the same per-CPU bq concurrently.
This leads to several races:
1. Double __list_del_clearprev(): after bq->count is reset in
bq_flush_to_queue(), a preempting task can call bq_enqueue() ->
bq_flush_to_queue() on the same bq when bq->count reaches
CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev()
on the same bq->flush_node, the second call dereferences the
prev pointer that was already set to NULL by the first.
2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt
the packet queue while bq_flush_to_queue() is processing it.
The race between task A (__cpu_map_flush -> bq_flush_to_queue) and
task B (bq_enqueue -> bq_flush_to_queue) on the same CPU:
Task A (xdp_do_flush) Task B (cpu_map_enqueue)
---------------------- ------------------------
bq_flush_to_queue(bq)
spin_lock(&q->producer_lock)
/* flush bq->q[] to ptr_ring */
bq->count = 0
spin_unlock(&q->producer_lock)
bq_enqueue(rcpu, xdpf)
<-- CFS preempts Task A --> bq->q[bq->count++] = xdpf
/* ... more enqueues until full ... */
bq_flush_to_queue(bq)
spin_lock(&q->producer_lock)
/* flush to ptr_ring */
spin_unlock(&q->producer_lock)
__list_del_clearprev(flush_node)
/* sets flush_node.prev = NULL */
<-- Task A resumes -->
__list_del_clearprev(flush_node)
flush_node.prev->next = ...
/* prev is NULL -> kernel oops */
Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
in bq_enqueue() and __cpu_map_flush(). These paths already run under
local_bh_disable(), so use local_lock_nested_bh() which on non-RT is
a pure annotation with no overhead, and on PREEMPT_RT provides a
per-CPU sleeping lock that serializes access to the bq.
To reproduce, insert an mdelay(100) between bq->count = 0 and
__list_del_clearprev() in bq_flush_to_queue(), then run reproducer
provided by syzkaller.
Fixes:
|
||
|
|
baa35b3cb6 |
bpf: Retire rcu_trace_implies_rcu_gp() from local storage
This assumption will always hold going forward, hence just remove the various checks and assume it is true with a comment for the uninformed reader. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
f41deee082 |
bpf: Delay freeing fields in local storage
Currently, when use_kmalloc_nolock is false, the freeing of fields for a
local storage selem is done eagerly before waiting for the RCU or RCU
tasks trace grace period to elapse. This opens up a window where the
program which has access to the selem can recreate the fields after the
freeing of fields is done eagerly, causing memory leaks when the element
is finally freed and returned to the kernel.
Make a few changes to address this. First, delay the freeing of fields
until after the grace periods have expired using a __bpf_selem_free_rcu
wrapper which is eventually invoked after transitioning through the
necessary number of grace period waits. Replace usage of the kfree_rcu
with call_rcu to be able to take a custom callback. Finally, care needs
to be taken to extend the rcu barriers for all cases, and not just when
use_kmalloc_nolock is true, as RCU and RCU tasks trace callbacks can be
in flight for either case and access the smap field, which is used to
obtain the BTF record to walk over special fields in the map value.
While we're at it, drop migrate_disable() from bpf_selem_free_rcu, since
migration should be disabled for RCU callbacks already.
Fixes:
|
||
|
|
ae51772b1e |
bpf: Lose const-ness of map in map_check_btf()
BPF hash map may now use the map_check_btf() callback to decide whether to set a dtor on its bpf_mem_alloc or not. Unlike C++ where members can opt out of const-ness using mutable, we must lose the const qualifier on the callback such that we can avoid the ugly cast. Make the change and adjust all existing users, and lose the comment in hashtab.c. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
1df97a7453 |
bpf: Register dtor for freeing special fields
There is a race window where BPF hash map elements can leak special fields if the program with access to the map value recreates these special fields between the check_and_free_fields done on the map value and its eventual return to the memory allocator. Several ways were explored prior to this patch, most notably [0] tried to use a poison value to reject attempts to recreate special fields for map values that have been logically deleted but still accessible to BPF programs (either while sitting in the free list or when reused). While this approach works well for task work, timers, wq, etc., it is harder to apply the idea to kptrs, which have a similar race and failure mode. Instead, we change bpf_mem_alloc to allow registering destructor for allocated elements, such that when they are returned to the allocator, any special fields created while they were accessible to programs in the mean time will be freed. If these values get reused, we do not free the fields again before handing the element back. The special fields thus may remain initialized while the map value sits in a free list. When bpf_mem_alloc is retired in the future, a similar concept can be introduced to kmalloc_nolock-backed kmem_cache, paired with the existing idea of a constructor. Note that the destructor registration happens in map_check_btf, after the BTF record is populated and (at that point) avaiable for inspection and duplication. Duplication is necessary since the freeing of embedded bpf_mem_alloc can be decoupled from actual map lifetime due to logic introduced to reduce the cost of rcu_barrier()s in mem alloc free path in |
||
|
|
69062f234a |
12 hotfixes. 7 are cc:stable. 8 are for MM.
All are singletons - please see the changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaaDF4AAKCRDdBJ7gKXxA jhv5AQDv+B9rPkFJ0dlSS/hXqsDGqy3dGj/grJM0dw7LhkPHzgEAi/bV6D1jx0k3 k0hcP3JUxE54+a7liLadPDLIObOMLgo= =R1ap -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2026-02-26-14-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "12 hotfixes. 7 are cc:stable. 8 are for MM. All are singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-02-26-14-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: MAINTAINERS: update Yosry Ahmed's email address mailmap: add entry for Daniele Alessandrelli mm: fix NULL NODE_DATA dereference for memoryless nodes on boot mm/tracing: rss_stat: ensure curr is false from kthread context mm/kfence: fix KASAN hardware tag faults during late enablement mm/damon/core: disallow non-power of two min_region_sz Squashfs: check metadata block offset is within range MAINTAINERS, mailmap: update e-mail address for Vlastimil Babka liveupdate: luo_file: remember retrieve() status mm: thp: deny THP for files on anonymous inodes mm: change vma_alloc_folio_noprof() macro to inline function mm/kfence: disable KFENCE upon KASAN HW tags enablement |
||
|
|
944e15f200 |
dma-mapping fixes for Linux 7.0
A set of two fixes for DMA-mapping subsystem for the recently merged API rework (Jiri Pirko and Stian Halseth). -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaaDJnAAKCRCJp1EFxbsS RBBTAP0WhFXOPeheDasVySviHbXxEdP2uc9/XgbnW40/HwWGRQD+KLkYSJDgGegv v+fbLYNkZPgtOKAztp70imOOBM0fGQY= =Q/dC -----END PGP SIGNATURE----- Merge tag 'dma-mapping-7.0-2026-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: "Two DMA-mapping fixes for the recently merged API rework (Jiri Pirko and Stian Halseth)" * tag 'dma-mapping-7.0-2026-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: sparc: Fix page alignment in dma mapping dma-mapping: avoid random addr value print out on error path |
||
|
|
b7bf516c3e |
bpf: Fix stack-out-of-bounds write in devmap
get_upper_ifindexes() iterates over all upper devices and writes their
indices into an array without checking bounds.
Also the callers assume that the max number of upper devices is
MAX_NEST_DEV and allocate excluded_devices[1+MAX_NEST_DEV] on the stack,
but that assumption is not correct and the number of upper devices could
be larger than MAX_NEST_DEV (e.g., many macvlans), causing a
stack-out-of-bounds write.
Add a max parameter to get_upper_ifindexes() to avoid the issue.
When there are too many upper devices, return -EOVERFLOW and abort the
redirect.
To reproduce, create more than MAX_NEST_DEV(8) macvlans on a device with
an XDP program attached using BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS.
Then send a packet to the device to trigger the XDP redirect path.
Reported-by: syzbot+10cc7f13760b31bd2e61@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/698c4ce3.050a0220.340abe.000b.GAE@google.com/T/
Fixes:
|
||
|
|
ad6fface76 |
bpf: Fix kprobe_multi cookies access in show_fdinfo callback
We don't check if cookies are available on the kprobe_multi link
before accessing them in show_fdinfo callback, we should.
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
3f4a08e644 |
kmalloc_obj fixes for v7.0-rc2
- Fix pointer-to-array allocation types for ubd and kcsan - Force size overflow helpers to __always_inline - Bump __builtin_counted_by_ref to Clang 22.1 from 22.0 (Nathan Chancellor) -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCaaCJawAKCRA2KwveOeQk u2LdAQD/wZYe1YCCrUR+jM6EvAoI2CzUEQUQMyzLtjIyeSR6VAD8CSdZu0htnwwy Ca76IzSF8Aj0D7Hytxkk8HDAD4WuxA0= =QucI -----END PGP SIGNATURE----- Merge tag 'kmalloc_obj-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kmalloc_obj fixes from Kees Cook: - Fix pointer-to-array allocation types for ubd and kcsan - Force size overflow helpers to __always_inline - Bump __builtin_counted_by_ref to Clang 22.1 from 22.0 (Nathan Chancellor) * tag 'kmalloc_obj-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kcsan: test: Adjust "expect" allocation type for kmalloc_obj overflow: Make sure size helpers are always inlined init/Kconfig: Adjust fixed clang version for __builtin_counted_by_ref ubd: Use pointer-to-pointers for io_thread_req arrays |
||
|
|
795469820c |
kcsan: test: Adjust "expect" allocation type for kmalloc_obj
The call to kmalloc_obj(observed.lines) returns "char (*)[3][512]",
a pointer to the whole 2D array. But "expect" wants to be "char (*)[512]",
the decayed pointer type, as if it were observed.lines itself (though
without the "3" bounds). This produces the following build error:
../kernel/kcsan/kcsan_test.c: In function '__report_matches':
../kernel/kcsan/kcsan_test.c:171:16: error: assignment to 'char (*)[512]' from incompatible pointer type 'char (*)[3][512]'
[-Wincompatible-pointer-types]
171 | expect = kmalloc_obj(observed.lines);
| ^
Instead of changing the "expect" type to "char (*)[3][512]" and
requiring a dereference at each use (e.g. "(expect*)[0]"), just
explicitly cast the return to the desired type.
Note that I'm intentionally not switching back to byte-based "kmalloc"
here because I cannot find a way for the Coccinelle script (which will
be used going forward to catch future conversions) to exclude this case.
Tested with:
$ ./tools/testing/kunit/kunit.py run \
--kconfig_add CONFIG_DEBUG_KERNEL=y \
--kconfig_add CONFIG_KCSAN=y \
--kconfig_add CONFIG_KCSAN_KUNIT_TEST=y \
--arch=x86_64 --qemu_args '-smp 2' kcsan
Reported-by: Nathan Chancellor <nathan@kernel.org>
Fixes:
|
||
|
|
0e335a7745 |
vfs-7.0-rc2.fixes
Please consider pulling these changes from the signed vfs-7.0-rc2.fixes tag.
Thanks!
Christian
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaZ7xWAAKCRCRxhvAZXjc
onpeAP4qOrTURIAX9M/NGCHywvjI91ZJt20J6vm0X6KbVV/ebQD/eoJ21xzPhG9M
gN7oRcZ9SW3e/AdtdnlqB0PEP+cyGwM=
=9Ji+
-----END PGP SIGNATURE-----
Merge tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix an uninitialized variable in file_getattr().
The flags_valid field wasn't initialized before calling
vfs_fileattr_get(), triggering KMSAN uninit-value reports in fuse
- Fix writeback wakeup and logging timeouts when DETECT_HUNG_TASK is
not enabled.
sysctl_hung_task_timeout_secs is 0 in that case causing spurious
"waiting for writeback completion for more than 1 seconds" warnings
- Fix a null-ptr-deref in do_statmount() when the mount is internal
- Add missing kernel-doc description for the @private parameter in
iomap_readahead()
- Fix mount namespace creation to hold namespace_sem across the mount
copy in create_new_namespace().
The previous drop-and-reacquire pattern was fragile and failed to
clean up mount propagation links if the real rootfs was a shared or
dependent mount
- Fix /proc mount iteration where m->index wasn't updated when
m->show() overflows, causing a restart to repeatedly show the same
mount entry in a rapidly expanding mount table
- Return EFSCORRUPTED instead of ENOSPC in minix_new_inode() when the
inode number is out of range
- Fix unshare(2) when CLONE_NEWNS is set and current->fs isn't shared.
copy_mnt_ns() received the live fs_struct so if a subsequent
namespace creation failed the rollback would leave pwd and root
pointing to detached mounts. Always allocate a new fs_struct when
CLONE_NEWNS is requested
- fserror bug fixes:
- Remove the unused fsnotify_sb_error() helper now that all callers
have been converted to fserror_report_metadata
- Fix a lockdep splat in fserror_report() where igrab() takes
inode::i_lock which can be held in IRQ context.
Replace igrab() with a direct i_count bump since filesystems
should not report inodes that are about to be freed or not yet
exposed
- Handle error pointer in procfs for try_lookup_noperm()
- Fix an integer overflow in ep_loop_check_proc() where recursive calls
returning INT_MAX would overflow when +1 is added, breaking the
recursion depth check
- Fix a misleading break in pidfs
* tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: avoid misleading break
eventpoll: Fix integer overflow in ep_loop_check_proc()
proc: Fix pointer error dereference
fserror: fix lockdep complaint when igrabbing inode
fsnotify: drop unused helper
unshare: fix unshare_fs() handling
minix: Correct errno in minix_new_inode
namespace: fix proc mount iteration
mount: hold namespace_sem across copy in create_new_namespace()
iomap: Describe @private in iomap_readahead()
statmount: Fix the null-ptr-deref in do_statmount()
writeback: Fix wakeup and logging timeouts for !DETECT_HUNG_TASK
fs: init flags_valid before calling vfs_fileattr_get
|
||
|
|
c9bc1753b3 |
perf: Fix __perf_event_overflow() vs perf_remove_from_context() race
Make sure that __perf_event_overflow() runs with IRQs disabled for all
possible callchains. Specifically the software events can end up running
it with only preemption disabled.
This opens up a race vs perf_event_exit_event() and friends that will go
and free various things the overflow path expects to be present, like
the BPF program.
Fixes:
|
||
|
|
f85b1c6af5 |
liveupdate: luo_file: remember retrieve() status
LUO keeps track of successful retrieve attempts on a LUO file. It does so
to avoid multiple retrievals of the same file. Multiple retrievals cause
problems because once the file is retrieved, the serialized data
structures are likely freed and the file is likely in a very different
state from what the code expects.
The retrieve boolean in struct luo_file keeps track of this, and is passed
to the finish callback so it knows what work was already done and what it
has left to do.
All this works well when retrieve succeeds. When it fails,
luo_retrieve_file() returns the error immediately, without ever storing
anywhere that a retrieve was attempted or what its error code was. This
results in an errored LIVEUPDATE_SESSION_RETRIEVE_FD ioctl to userspace,
but nothing prevents it from trying this again.
The retry is problematic for much of the same reasons listed above. The
file is likely in a very different state than what the retrieve logic
normally expects, and it might even have freed some serialization data
structures. Attempting to access them or free them again is going to
break things.
For example, if memfd managed to restore 8 of its 10 folios, but fails on
the 9th, a subsequent retrieve attempt will try to call
kho_restore_folio() on the first folio again, and that will fail with a
warning since it is an invalid operation.
Apart from the retry, finish() also breaks. Since on failure the
retrieved bool in luo_file is never touched, the finish() call on session
close will tell the file handler that retrieve was never attempted, and it
will try to access or free the data structures that might not exist, much
in the same way as the retry attempt.
There is no sane way of attempting the retrieve again. Remember the error
retrieve returned and directly return it on a retry. Also pass this
status code to finish() so it can make the right decision on the work it
needs to do.
This is done by changing the bool to an integer. A value of 0 means
retrieve was never attempted, a positive value means it succeeded, and a
negative value means it failed and the error code is the value.
Link: https://lkml.kernel.org/r/20260216132221.987987-1-pratyush@kernel.org
Fixes:
|
||
|
|
7dff99b354 |
Remove WARN_ALL_UNSEEDED_RANDOM kernel config option
This config option goes way back - it used to be an internal debug
option to random.c (at that point called DEBUG_RANDOM_BOOT), then was
renamed and exposed as a config option as CONFIG_WARN_UNSEEDED_RANDOM,
and then further renamed to the current CONFIG_WARN_ALL_UNSEEDED_RANDOM.
It was all done with the best of intentions: the more limited
rate-limited reports were reporting some cases, but if you wanted to see
all the gory details, you'd enable this "ALL" option.
However, it turns out - perhaps not surprisingly - that when people
don't care about and fix the first rate-limited cases, they most
certainly don't care about any others either, and so warning about all
of them isn't actually helping anything.
And the non-ratelimited reporting causes problems, where well-meaning
people enable debug options, but the excessive flood of messages that
nobody cares about will hide actual real information when things go
wrong.
I just got a kernel bug report (which had nothing to do with randomness)
where two thirds of the the truncated dmesg was just variations of
random: get_random_u32 called from __get_random_u32_below+0x10/0x70 with crng_init=0
and in the process early boot messages had been lost (in addition to
making the messages that _hadn't_ been lost harder to read).
The proper way to find these things for the hypothetical developer that
cares - if such a person exists - is almost certainly with boot time
tracing. That gives you the option to get call graphs etc too, which is
likely a requirement for fixing any problems anyway.
See Documentation/trace/boottime-trace.rst for that option.
And if we for some reason do want to re-introduce actual printing of
these things, it will need to have some uniqueness filtering rather than
this "just print it all" model.
Fixes:
|
||
|
|
77de62ad3d |
perf/core: Fix refcount bug and potential UAF in perf_mmap
Syzkaller reported a refcount_t: addition on 0; use-after-free warning in perf_mmap. The issue is caused by a race condition between a failing mmap() setup and a concurrent mmap() on a dependent event (e.g., using output redirection). In perf_mmap(), the ring_buffer (rb) is allocated and assigned to event->rb with the mmap_mutex held. The mutex is then released to perform map_range(). If map_range() fails, perf_mmap_close() is called to clean up. However, since the mutex was dropped, another thread attaching to this event (via inherited events or output redirection) can acquire the mutex, observe the valid event->rb pointer, and attempt to increment its reference count. If the cleanup path has already dropped the reference count to zero, this results in a use-after-free or refcount saturation warning. Fix this by extending the scope of mmap_mutex to cover the map_range() call. This ensures that the ring buffer initialization and mapping (or cleanup on failure) happens atomically effectively, preventing other threads from accessing a half-initialized or dying ring buffer. Closes: https://lore.kernel.org/oe-kbuild-all/202602020208.m7KIjdzW-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Haocheng Yu <yuhaocheng035@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260202162057.7237-1-yuhaocheng035@gmail.com |
||
|
|
486ff5ad49 |
perf/core: Fix invalid wait context in ctx_sched_in()
Lockdep found a bug in the event scheduling when a pinned event was
failed and wakes up the threads in the ring buffer like below.
It seems it should not grab a wait-queue lock under perf-context lock.
Let's do it with irq_work.
[ 39.913691] =============================
[ 39.914157] [ BUG: Invalid wait context ]
[ 39.914623] 6.15.0-next-20250530-next-2025053 #1 Not tainted
[ 39.915271] -----------------------------
[ 39.915731] repro/837 is trying to lock:
[ 39.916191] ffff88801acfabd8 (&event->waitq){....}-{3:3}, at: __wake_up+0x26/0x60
[ 39.917182] other info that might help us debug this:
[ 39.917761] context-{5:5}
[ 39.918079] 4 locks held by repro/837:
[ 39.918530] #0: ffffffff8725cd00 (rcu_read_lock){....}-{1:3}, at: __perf_event_task_sched_in+0xd1/0xbc0
[ 39.919612] #1: ffff88806ca3c6f8 (&cpuctx_lock){....}-{2:2}, at: __perf_event_task_sched_in+0x1a7/0xbc0
[ 39.920748] #2: ffff88800d91fc18 (&ctx->lock){....}-{2:2}, at: __perf_event_task_sched_in+0x1f9/0xbc0
[ 39.921819] #3: ffffffff8725cd00 (rcu_read_lock){....}-{1:3}, at: perf_event_wakeup+0x6c/0x470
Fixes:
|
||
|
|
47322c469d |
dma-mapping: avoid random addr value print out on error path
dma_addr is unitialized in dma_direct_map_phys() when swiotlb is forced
and DMA_ATTR_MMIO is set which leads to random value print out in
warning. Fix that by just returning DMA_MAPPING_ERROR.
Fixes:
|
||
|
|
189f164e57 |
Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses
Conversion performed via this Coccinelle script:
// SPDX-License-Identifier: GPL-2.0-only
// Options: --include-headers-for-types --all-includes --include-headers --keep-comments
virtual patch
@gfp depends on patch && !(file in "tools") && !(file in "samples")@
identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
kzalloc_obj,kzalloc_objs,kzalloc_flex,
kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
@@
ALLOC(...
- , GFP_KERNEL
)
$ make coccicheck MODE=patch COCCI=gfp.cocci
Build and boot tested x86_64 with Fedora 42's GCC and Clang:
Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
32a92f8c89 |
Convert more 'alloc_obj' cases to default GFP_KERNEL arguments
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
323bbfcf1e |
Convert 'alloc_flex' family to use the new default GFP_KERNEL argument
This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the *alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs*'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bf4afc53b7 |
Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
69050f8d6d |
treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org> |
||
|
|
68010e7b3d |
Tracing fixes for 7.0:
- Fix possible dereference of uninitialized pointer When validating the persistent ring buffer on boot up, if the first validation fails, a reference to "head_page" is performed in the error path, but it skips over the initialization of that variable. Move the initialization before the first validation check. - Fix use of event length in validation of persistent ring buffer On boot up, the persistent ring buffer is checked to see if it is valid by several methods. One being to walk all the events in the memory location to make sure they are all valid. The length of the event is used to move to the next event. This length is determined by the data in the buffer. If that length is corrupted, it could possibly make the next event to check located at a bad memory location. Validate the length field of the event when doing the event walk. - Fix function graph on archs that do not support use of ftrace_ops When an architecture defines HAVE_DYNAMIC_FTRACE_WITH_ARGS, it means that its function graph tracer uses the ftrace_ops of the function tracer to call its callbacks. This allows a single registered callback to be called directly instead of checking the callback's meta data's hash entries against the function being traced. For architectures that do not support this feature, it must always call the loop function that tests each registered callback (even if there's only one). The loop function tests each callback's meta data against its hash of functions and will call its callback if the function being traced is in its hash map. The issue was that there was no check against this and the direct function was being called even if the architecture didn't support it. This meant that if function tracing was enabled at the same time as a callback was registered with the function graph tracer, its callback would be called for every function that the function tracer also traced, even if the callback's meta data only wanted to be called back for a small subset of functions. Prevent the direct calling for those architectures that do not support it. - Fix references to trace_event_file for hist files The hist files used event_file_data() to get a reference to the associated trace_event_file the histogram was attached to. This would return a pointer even if the trace_event_file is about to be freed (via RCU). Instead it should use the event_file_file() helper that returns NULL if the trace_event_file is marked to be freed so that no new references are added to it. - Wake up hist poll readers when an event is being freed When polling on a hist file, the task is only awoken when a hist trigger is triggered. This means that if an event is being freed while there's a task waiting on its hist file, it will need to wait until the hist trigger occurs to wake it up and allow the freeing to happen. Note, the event will not be completely freed until all references are removed, and a hist poller keeps a reference. But it should still be woken when the event is being freed. -----BEGIN PGP SIGNATURE----- iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaZd4ExQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qqX5AP4powfnNnRfLSKH9idhTp+ltmJ+9roy L7kWTr/z20S2VQEAk331PNZ32uZu+/ZUETYpgtEx4SbRGZFehTBv1ddjfw4= =TWj5 -----END PGP SIGNATURE----- Merge tag 'trace-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix possible dereference of uninitialized pointer When validating the persistent ring buffer on boot up, if the first validation fails, a reference to "head_page" is performed in the error path, but it skips over the initialization of that variable. Move the initialization before the first validation check. - Fix use of event length in validation of persistent ring buffer On boot up, the persistent ring buffer is checked to see if it is valid by several methods. One being to walk all the events in the memory location to make sure they are all valid. The length of the event is used to move to the next event. This length is determined by the data in the buffer. If that length is corrupted, it could possibly make the next event to check located at a bad memory location. Validate the length field of the event when doing the event walk. - Fix function graph on archs that do not support use of ftrace_ops When an architecture defines HAVE_DYNAMIC_FTRACE_WITH_ARGS, it means that its function graph tracer uses the ftrace_ops of the function tracer to call its callbacks. This allows a single registered callback to be called directly instead of checking the callback's meta data's hash entries against the function being traced. For architectures that do not support this feature, it must always call the loop function that tests each registered callback (even if there's only one). The loop function tests each callback's meta data against its hash of functions and will call its callback if the function being traced is in its hash map. The issue was that there was no check against this and the direct function was being called even if the architecture didn't support it. This meant that if function tracing was enabled at the same time as a callback was registered with the function graph tracer, its callback would be called for every function that the function tracer also traced, even if the callback's meta data only wanted to be called back for a small subset of functions. Prevent the direct calling for those architectures that do not support it. - Fix references to trace_event_file for hist files The hist files used event_file_data() to get a reference to the associated trace_event_file the histogram was attached to. This would return a pointer even if the trace_event_file is about to be freed (via RCU). Instead it should use the event_file_file() helper that returns NULL if the trace_event_file is marked to be freed so that no new references are added to it. - Wake up hist poll readers when an event is being freed When polling on a hist file, the task is only awoken when a hist trigger is triggered. This means that if an event is being freed while there's a task waiting on its hist file, it will need to wait until the hist trigger occurs to wake it up and allow the freeing to happen. Note, the event will not be completely freed until all references are removed, and a hist poller keeps a reference. But it should still be woken when the event is being freed. * tag 'trace-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Wake up poll waiters for hist files when removing an event tracing: Fix checking of freed trace_event_file for hist files fgraph: Do not call handlers direct when not using ftrace_ops tracing: ring-buffer: Fix to check event length before using ring-buffer: Fix possible dereference of uninitialized pointer |
||
|
|
9678e53179 |
tracing: Wake up poll waiters for hist files when removing an event
The event_hist_poll() function attempts to verify whether an event file is
being removed, but this check may not occur or could be unnecessarily
delayed. This happens because hist_poll_wakeup() is currently invoked only
from event_hist_trigger() when a hist command is triggered. If the event
file is being removed, no associated hist command will be triggered and a
waiter will be woken up only after an unrelated hist command is triggered.
Fix the issue by adding a call to hist_poll_wakeup() in
remove_event_file_dir() after setting the EVENT_FILE_FL_FREED flag. This
ensures that a task polling on a hist file is woken up and receives
EPOLLERR.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-3-petr.pavlu@suse.com
Fixes:
|
||
|
|
f0a0da1f90 |
tracing: Fix checking of freed trace_event_file for hist files
The event_hist_open() and event_hist_poll() functions currently retrieve
a trace_event_file pointer from a file struct by invoking
event_file_data(), which simply returns file->f_inode->i_private. The
functions then check if the pointer is NULL to determine whether the event
is still valid. This approach is flawed because i_private is assigned when
an eventfs inode is allocated and remains set throughout its lifetime.
Instead, the code should call event_file_file(), which checks for
EVENT_FILE_FL_FREED. Using the incorrect access function may result in the
code potentially opening a hist file for an event that is being removed or
becoming stuck while polling on this file.
Correct the access method to event_file_file() in both functions.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-2-petr.pavlu@suse.com
Fixes:
|
||
|
|
f4ff9f646a |
fgraph: Do not call handlers direct when not using ftrace_ops
The function graph tracer was modified to us the ftrace_ops of the
function tracer. This simplified the code as well as allowed more features
of the function graph tracer.
Not all architectures were converted over as it required the
implementation of HAVE_DYNAMIC_FTRACE_WITH_ARGS to implement. For those
architectures, it still did it the old way where the function graph tracer
handle was called by the function tracer trampoline. The handler then had
to check the hash to see if the registered handlers wanted to be called by
that function or not.
In order to speed up the function graph tracer that used ftrace_ops, if
only one callback was registered with function graph, it would call its
function directly via a static call.
Now, if the architecture does not support the use of using ftrace_ops and
still has the ftrace function trampoline calling the function graph
handler, then by doing a direct call it removes the check against the
handler's hash (list of functions it wants callbacks to), and it may call
that handler for functions that the handler did not request calls for.
On 32bit x86, which does not support the ftrace_ops use with function
graph tracer, it shows the issue:
~# trace-cmd start -p function -l schedule
~# trace-cmd show
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) * 11898.94 us | schedule();
3) # 1783.041 us | schedule();
1) | schedule() {
------------------------------------------
1) bash-8369 => kworker-7669
------------------------------------------
1) | schedule() {
------------------------------------------
1) kworker-7669 => bash-8369
------------------------------------------
1) + 97.004 us | }
1) | schedule() {
[..]
Now by starting the function tracer is another instance:
~# trace-cmd start -B foo -p function
This causes the function graph tracer to trace all functions (because the
function trace calls the function graph tracer for each on, and the
function graph trace is doing a direct call):
~# trace-cmd show
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
1) 1.669 us | } /* preempt_count_sub */
1) + 10.443 us | } /* _raw_spin_unlock_irqrestore */
1) | tick_program_event() {
1) | clockevents_program_event() {
1) 1.044 us | ktime_get();
1) 6.481 us | lapic_next_event();
1) + 10.114 us | }
1) + 11.790 us | }
1) ! 181.223 us | } /* hrtimer_interrupt */
1) ! 184.624 us | } /* __sysvec_apic_timer_interrupt */
1) | irq_exit_rcu() {
1) 0.678 us | preempt_count_sub();
When it should still only be tracing the schedule() function.
To fix this, add a macro FGRAPH_NO_DIRECT to be set to 0 when the
architecture does not support function graph use of ftrace_ops, and set to
1 otherwise. Then use this macro to know to allow function graph tracer to
call the handlers directly or not.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://patch.msgid.link/20260218104244.5f14dade@gandalf.local.home
Fixes:
|
||
|
|
912b0ee248 |
tracing: ring-buffer: Fix to check event length before using
Check the event length before adding it for accessing next index in
rb_read_data_buffer(). Since this function is used for validating
possibly broken ring buffers, the length of the event could be broken.
In that case, the new event (e + len) can point a wrong address.
To avoid invalid memory access at boot, check whether the length of
each event is in the possible range before using it.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes:
|
||
|
|
f154777940 |
ring-buffer: Fix possible dereference of uninitialized pointer
There is a pointer head_page in rb_meta_validate_events() which is not
initialized at the beginning of a function. This pointer can be dereferenced
if there is a failure during reader page validation. In this case the control
is passed to "invalid" label where the pointer is dereferenced in a loop.
To fix the issue initialize orig_head and head_page before calling
rb_validate_buffer.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Cc: stable@vger.kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260213100130.2013839-1-d.dulov@aladdin.ru
Closes: https://lore.kernel.org/r/202406130130.JtTGRf7W-lkp@intel.com/
Fixes:
|
||
|
|
4f13d0dabc |
bpf-fixes
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmmXR6QACgkQ6rmadz2v bTqVjg/+PZPMKGBMfF5uWk74LWYQIt01ePfHH2QeA4DsOwNK+9Q1+jmCLRPa/diL Ds//ZIEMatmtdd1eO5aHyGXE1sBsJ02LfKOhsPukQyzD/FtZ4BmQpzpG2mK5o1M5 NAH6wxY+6Tr8UlXQtoTF1FFXSa6Y0vQmkyXofOoUgSBAxTPMGVQnWw4bq7mUAX9A G6/TnPDgGbNLejPCmu8mERCkqRjIGAgjBUItVeiHbdxymtzjHcrH7nwxuP59djR9 1AhMrJnyV+s7iEMkAKGkE6NOID73R/YQEqmvD1eX0AWvqdR8+4lOHT0KPU039JqT RQV5JgXSfeEkdUtyvqQJZdiJinjFLOwp4CGcX+DKcvUpAKmLx8q3ihPiuWk8+JOV fnosXQIeQ7B9EuTvoNoNTfvU/MuV8vWd3/1kQc+KGXhzk944Ypb29zywGEoGZarU eb7YRtUIsXBo7H2K1juqTvj72jyhG83cbZxE5+pR2gv87yGgUvt0r0u0FvzkQf2c Fq671n6UaOd+0ZYey7YG9bIc+WMTbsjqVY0f7+3Rcl3Q68we0HHdqiPoF637/a1r lS6TZLRDJmykDPp03db97UcQml6RKJiyTnTNQVUF4Uk+VF1KTreXK/D8fD401gxZ GWuLt/bjq8l/EVJOhzpaO0JDejmZVRaOQUX8t9DwstS+FFbSyBs= =mb+y -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix invalid write loop logic in libbpf's bpf_linker__add_buf() (Amery Hung) - Fix a potential use-after-free of BTF object (Anton Protopopov) - Add feature detection to libbpf and avoid moving arena global variables on older kernels (Emil Tsalapatis) - Remove extern declaration of bpf_stream_vprintk() from libbpf headers (Ihor Solodrai) - Fix truncated netlink dumps in bpftool (Jakub Kicinski) - Fix map_kptr grace period wait in bpf selftests (Kumar Kartikeya Dwivedi) - Remove hexdump dependency while building bpf selftests (Matthieu Baerts) - Complete fsession support in BPF trampolines on riscv (Menglong Dong) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Remove hexdump dependency libbpf: Remove extern declaration of bpf_stream_vprintk() selftests/bpf: Use vmlinux.h in test_xdp_meta bpftool: Fix truncated netlink dumps libbpf: Delay feature gate check until object prepare time libbpf: Do not use PROG_TYPE_TRACEPOINT program for feature gating bpf: Add a map/btf from a fd array more consistently selftests/bpf: Fix map_kptr grace period wait selftests/bpf: enable fsession_test on riscv64 selftests/bpf: Adjust selftest due to function rename bpf, riscv: add fsession support for trampolines bpf: Fix a potential use-after-free of BTF object bpf, riscv: introduce emit_store_stack_imm64() for trampoline libbpf: Fix invalid write loop logic in bpf_linker__add_buf() libbpf: Add gating for arena globals relocation feature |
||
|
|
2b7a25df82 |
mm.git review status for linus..mm-nonmm-stable
Total patches: 7 Reviews/patch: 0.57 Reviewed rate: 42% - The 2 patch series "two fixes in kho_populate()" from Ran Xiaokai fixes a couple of not-major issues in the kexec handover code. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaZaKBAAKCRDdBJ7gKXxA jpB1AP9UpNzT63aGDnB6G8pgekSdK/I2gypZI3cS7MpBPorRUgEAhcClc2//zWGK 0Wz1rxh3sWIE/pzd/yOEsv+7oQHeDQA= =oUp2 -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2026-02-18-19-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more non-MM updates from Andrew Morton: - "two fixes in kho_populate()" fixes a couple of not-major issues in the kexec handover code (Ran Xiaokai) - misc singletons * tag 'mm-nonmm-stable-2026-02-18-19-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: lib/group_cpus: handle const qualifier from clusters allocation type kho: remove unnecessary WARN_ON(err) in kho_populate() kho: fix missing early_memunmap() call in kho_populate() scripts/gdb: implement x86_page_ops in mm.py objpool: fix the overestimation of object pooling metadata size selftests/memfd: use IPC semaphore instead of SIGSTOP/SIGCONT delayacct: fix build regression on accounting tool |
||
|
|
eeccf287a2 |
mm.git review status for linus..mm-stable
Total patches: 36 Reviews/patch: 1.77 Reviewed rate: 83% - The 2 patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion" from Bing Jiao fixes a couple of issues in the demotion code - pages were failed demotion and were finding themselves demoted into disallowed nodes. - The 11 patch series "Remove XA_ZERO from error recovery of dup_mmap()" from Liam Howlett fixes a rare mapledtree race and performs a number of cleanups. - The 13 patch series "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use them" from Lorenzo Stoakes implements a lot of cleanups following on from the conversion of the VMA flags into a bitmap. - The 5 patch series "support batch checking of references and unmapping for large folios" from Baolin Wang implements batching to greatly improve the performance of reclaiming clean file-backed large folios. - The 3 patch series "selftests/mm: add memory failure selftests" from Miaohe Lin does as claimed. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaZaIEQAKCRDdBJ7gKXxA jj73AQCQDwLoipDiQRGyjB5BDYydymWuDoiB1tlDPHfYAP3b/QD/UQtVlOEXqwM3 naOKs3NQ1pwnfhDaQMirGw2eAnJ1SQY= =6Iif -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "mm/vmscan: fix demotion targets checks in reclaim/demotion" fixes a couple of issues in the demotion code - pages were failed demotion and were finding themselves demoted into disallowed nodes (Bing Jiao) - "Remove XA_ZERO from error recovery of dup_mmap()" fixes a rare mapledtree race and performs a number of cleanups (Liam Howlett) - "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use them" implements a lot of cleanups following on from the conversion of the VMA flags into a bitmap (Lorenzo Stoakes) - "support batch checking of references and unmapping for large folios" implements batching to greatly improve the performance of reclaiming clean file-backed large folios (Baolin Wang) - "selftests/mm: add memory failure selftests" does as claimed (Miaohe Lin) * tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (36 commits) mm/page_alloc: clear page->private in free_pages_prepare() selftests/mm: add memory failure dirty pagecache test selftests/mm: add memory failure clean pagecache test selftests/mm: add memory failure anonymous page test mm: rmap: support batched unmapping for file large folios arm64: mm: implement the architecture-specific clear_flush_young_ptes() arm64: mm: support batch clearing of the young flag for large folios arm64: mm: factor out the address and ptep alignment into a new helper mm: rmap: support batched checks of the references for large folios tools/testing/vma: add VMA userland tests for VMA flag functions tools/testing/vma: separate out vma_internal.h into logical headers tools/testing/vma: separate VMA userland tests into separate files mm: make vm_area_desc utilise vma_flags_t only mm: update all remaining mmap_prepare users to use vma_flags_t mm: update shmem_[kernel]_file_*() functions to use vma_flags_t mm: update secretmem to use VMA flags on mmap_prepare mm: update hugetlbfs to use VMA flags on mmap_prepare mm: add basic VMA flag operation helper functions tools: bitmap: add missing bitmap_[subset(), andnot()] mm: add mk_vma_flags() bitmap flag macro helper ... |
||
|
|
23b0f90ba8 |
Summary
* Removed macros from proc handler converters Replace the proc converter macros with "regular" functions. Though it is more verbose than the macro version, it helps when debugging and better aligns with coding-style.rst. * General cleanup Remove superfluous ctl_table forward declarations. Const qualify the memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays. Add missing kernel doc to proc_dointvec_conv. * Testing This series was run through sysctl selftests/kunit test suite in x86_64. And went into linux-next after rc4, giving it a good 3 weeks of testing -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmmUabYACgkQupfNUreW QU8y2Qv/d2y35uQPRDh0HKWKWXJy41C2RJzd/rFCWJPCwo150whTSHIHkWYnu76g 10QblBXQmXi9TVqFnJ7Il7PWgqkMPjzA13tfT9eXNWU8j2OB/mcVKNl9X4wm/jWi QxtGmBsIQ/nxb2pUzMCykzgfc5mLi2NQ8qhZ5bOnq7UW3zdYmzEqx+tRdvIacyIk adComi5v8xUDqyEbVFaBovuX2WHQkPyBMnD64nwWG93JpNG/+9PxGzv/DNUXY11Y epVOfSoKdJbSLjYoHEPEhT0aHjSydq3QHru7uF6wzKOFTfHej/XkXXbUnFXPO2Pn c5J0u/HziYG5eN2QTqGfrhECZYuCFPemtUozltbcgGebkl1wKH+k9K5vsCaz/mhk ihUC3mui++W/n9B9HJRYh1XeEpk6C1pWERCOx27XFZ25fSek2YO6ZWkT0q+gceC0 t4+eIFSGJ3OzheJgHNK9XhTMWiQPmHyA6brXYGx4WeRvJFLpVddPF7k3Z89zIAu/ Fut7FGTH =0Z+I -----END PGP SIGNATURE----- Merge tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl updates from Joel Granados: - Remove macros from proc handler converters Replace the proc converter macros with "regular" functions. Though it is more verbose than the macro version, it helps when debugging and better aligns with coding-style.rst. - General cleanup Remove superfluous ctl_table forward declarations. Const qualify the memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays. Add missing kernel doc to proc_dointvec_conv. - Testing This series was run through sysctl selftests/kunit test suite in x86_64. And went into linux-next after rc4, giving it a good 3 weeks of testing * tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions sysctl: Replace unidirectional INT converter macros with functions sysctl: Add kernel doc to proc_douintvec_conv sysctl: Replace UINT converter macros with functions sysctl: Add CONFIG_PROC_SYSCTL guards for converter macros sysctl: clarify proc_douintvec_minmax doc sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n sysctl: Remove unused ctl_table forward declarations loadpin: Implement custom proc_handler for enforce alloc_tag: move memory_allocation_profiling_sysctls into .rodata sysctl: Add missing kernel-doc for proc_dointvec_conv |
||
|
|
6c4b2243cb
|
unshare: fix unshare_fs() handling
There's an unpleasant corner case in unshare(2), when we have a
CLONE_NEWNS in flags and current->fs hadn't been shared at all; in that
case copy_mnt_ns() gets passed current->fs instead of a private copy,
which causes interesting warts in proof of correctness]
> I guess if private means fs->users == 1, the condition could still be true.
Unfortunately, it's worse than just a convoluted proof of correctness.
Consider the case when we have CLONE_NEWCGROUP in addition to CLONE_NEWNS
(and current->fs->users == 1).
We pass current->fs to copy_mnt_ns(), all right. Suppose it succeeds and
flips current->fs->{pwd,root} to corresponding locations in the new namespace.
Now we proceed to copy_cgroup_ns(), which fails (e.g. with -ENOMEM).
We call put_mnt_ns() on the namespace created by copy_mnt_ns(), it's
destroyed and its mount tree is dissolved, but... current->fs->root and
current->fs->pwd are both left pointing to now detached mounts.
They are pinning those, so it's not a UAF, but it leaves the calling
process with unshare(2) failing with -ENOMEM _and_ leaving it with
pwd and root on detached isolated mounts. The last part is clearly a bug.
There is other fun related to that mess (races with pivot_root(), including
the one between pivot_root() and fork(), of all things), but this one
is easy to isolate and fix - treat CLONE_NEWNS as "allocate a new
fs_struct even if it hadn't been shared in the first place". Sure, we could
go for something like "if both CLONE_NEWNS *and* one of the things that might
end up failing after copy_mnt_ns() call in create_new_namespaces() are set,
force allocation of new fs_struct", but let's keep it simple - the cost
of copy_fs_struct() is trivial.
Another benefit is that copy_mnt_ns() with CLONE_NEWNS *always* gets
a freshly allocated fs_struct, yet to be attached to anything. That
seriously simplifies the analysis...
FWIW, that bug had been there since the introduction of unshare(2) ;-/
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://patch.msgid.link/20260207082524.GE3183987@ZenIV
Tested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
||
|
|
d295082ea6 |
SPDX updates for 7.0-rc1
Here are two small changes that add some missing SPDX license lines to some core kernel files. These are: - adding SPDX license lines to kdb files - adding SPDX license lines to the remaining kernel/ files Both of these have been in linux-next for a while with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCaZR2VA8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+yne8ACfYf6Mo8A+c4ZVduiKhAywn7wtX3YAn3Z29idk Cq3AiQM/bjJvM0/KPOwW =k232 -----END PGP SIGNATURE----- Merge tag 'spdx-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx Pull SPDX updates from Greg KH: "Here are two small changes that add some missing SPDX license lines to some core kernel files. These are: - adding SPDX license lines to kdb files - adding SPDX license lines to the remaining kernel/ files Both of these have been in linux-next for a while with no reported issues" * tag 'spdx-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: kernel: debug: Add SPDX license ids to kdb files kernel: add SPDX-License-Identifier lines |
||
|
|
99dfe2d4da |
block-7.0-20260216
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmTrNEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjsOEACpUk78nFmLbEgJ5UH8+Z6daDzgoasb5YRT Mj4g+cM2J9Xc9JxgX8QR3F2EfolweTo/H6xlhnlPDcnpB+b3qj4WHuijR/wghphj MBKKqNXTEC+j0ra9uk8h3RmIKaK79xcUup7XfTcuWdYpSsMyYE/m/rck3thw6yNL OAjmWLTP4IwYzXip2AB+J7JbDDOV/qWK0aOYdWHCdbn9X8bBel/HDOITWPdybnSR DNKBeoi/Yv8KwA+axogqP213ifc3Xr6ejRDkqDOf1bgXsKkELkIxcfog6MhfHhxq 3Cqlj1pBuIBxGVU7wmBTDqL+aHrVb983tcA5x1NGZIzJao64b026o5DUhNPprwrZ HveU1MZ2jarAjAz85gE3S4oUY+6d47ooytfvO548Zp/1LY+fOxnjYqq5ksh8BBLk WyjfkJScgr17Z4SVOK8a9GboWO2WKiQJRg+hZ/TWX5fyvu5g9sbRasdwxnp1sl52 EayzkhYFq/Rdd8slwTIaccVUPl/xeEDeRG+jTJ+4Fj54TihKiJzXVsxDkSWKf46V CWmzDx+n6MlGPm9mShSERZ7HJh3VcSp4No/HAjf93u9/UXwubK/SKiV71nhpgJMf 9bWS2G3wPx/5LoME95YkF+CSgs0e/ROUusfGd8X6nIz9EBGzeabCG/mjqd5adC09 OZahOuqrIg== =PVoY -----END PGP SIGNATURE----- Merge tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull more block updates from Jens Axboe: - Fix partial IOVA mapping cleanup in error handling - Minor prep series ignoring discard return value, as the inline value is always known - Ensure BLK_FEAT_STABLE_WRITES is set for drbd - Fix leak of folio in bio_iov_iter_bounce_read() - Allow IOC_PR_READ_* for read-only open - Another debugfs deadlock fix - A few doc updates * tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: blk-mq: use NOIO context to prevent deadlock during debugfs creation blk-stat: convert struct blk_stat_callback to kernel-doc block: fix enum descriptions kernel-doc block: update docs for bio and bvec_iter block: change return type to void nvmet: ignore discard return value md: ignore discard return value block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova block: fix folio leak in bio_iov_iter_bounce_read() block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ drbd: always set BLK_FEAT_STABLE_WRITES |
||
|
|
543b9b6339 |
kernel-7.0-rc1.misc
Please consider pulling these changes from the signed kernel-7.0-rc1.misc tag. Thanks! Christian -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaZL+JwAKCRCRxhvAZXjc ovU/AP4xgVxEegnNYrXZ+TpdCXbCtQZ54JqowFX73MBtaBHY1QD/YkDaIzl6K70v d9P2Fe8Y6wOnIHxcjE4MIdMansphjAM= =TN3q -----END PGP SIGNATURE----- Merge tag 'kernel-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: - pid: introduce task_ppid_vnr() helper - pidfs: convert rb-tree to rhashtable Mateusz reported performance penalties during task creation because pidfs uses pidmap_lock to add elements into the rbtree. Switch to an rhashtable to have separate fine-grained locking and to decouple from pidmap_lock moving all heavy manipulations outside of it Also move inode allocation outside of pidmap_lock. With this there's nothing happening for pidfs under pidmap_lock - pid: reorder fields in pid_namespace to reduce false sharing - Revert "pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers" - ipc: Add SPDX license id to mqueue.c * tag 'kernel-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pid: introduce task_ppid_vnr() helper pidfs: implement ino allocation without the pidmap lock Revert "pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers" pid: reorder fields in pid_namespace to reduce false sharing pidfs: convert rb-tree to rhashtable ipc: Add SPDX license id to mqueue.c |
||
|
|
dfe48ea179 |
blk-mq: use NOIO context to prevent deadlock during debugfs creation
Creating debugfs entries can trigger fs reclaim, which can enter back into the block layer request_queue. This can cause deadlock if the queue is frozen. Previously, a WARN_ON_ONCE check was used in debugfs_create_files() to detect this condition, but it was racy since the queue can be frozen from another context at any time. Introduce blk_debugfs_lock()/blk_debugfs_unlock() helpers that combine the debugfs_mutex with memalloc_noio_save()/restore() to prevent fs reclaim from triggering block I/O. Also add blk_debugfs_lock_nomemsave() and blk_debugfs_unlock_nomemrestore() variants for callers that don't need NOIO protection (e.g., debugfs removal or read-only operations). Replace all raw debugfs_mutex lock/unlock pairs with these helpers, using the _nomemsave/_nomemrestore variants where appropriate. Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs9gNKEYAPagD9JADfO5UH+OiCr4P7OO2wjpfOYeM-RV=A@mail.gmail.com/ Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/all/aYWQR7CtYdk3K39g@shinmob/ Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
2d10a48871 |
Probes for v7.0
- kprobes: Use a dedicated kernel thread to optimize the kprobes instead of using workqueue thread. Since the kprobe optimizer waits a long time for synchronize_rcu_task(), it can block other workers in the same queue if it uses a workqueue. - kprobe-events: Returns immediately if no new probe events are specified on the kernel command line at boot time. This shorten the kernel boot time. - kprobes: When a kprobe is fully removed from the kernel code, retry optimizing another kprobe which is blocked by that kprobe. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmmPtX4bHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bgoUH/0oZ9/c+wkn9AKxFlOAw AIbtTfKiQHXPirhamYuEbJJWTUH2O/Wr0SQrIkbwJzToou5CRTuT5UQBpmOk2GKZ uZ5bO63nOMn/1ECbDpc+lFGV5J4Dn5Pw2GFjUTR6EPwms+CN/d6f2LlvuAVfJX7Y klL2X9/QLwV5QT8La5zNEka2zBfrZlZ9a5DGoLtgdwX2zosuef9aDnisS2jkLenH wugZLryx0Qvlj9DXimP7oVZpYDHr0hZEI3CNQPge0gMQDj1XajGawYfkhrfaeFMX 4eXjFCN+NQG6fBUpDugcOThOypWmjlJhoYxxROJu0eFcLQjsuWjp6jLixI9Nr7P+ tQk= =syAj -----END PGP SIGNATURE----- Merge tag 'probes-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull kprobes updates from Masami Hiramatsu: - Use a dedicated kernel thread to optimize the kprobes instead of using workqueue thread. Since the kprobe optimizer waits a long time for synchronize_rcu_task(), it can block other workers in the same queue if it uses a workqueue. - kprobe-events: return immediately if no new probe events are specified on the kernel command line at boot time. This shortens the kernel boot time. - When a kprobe is fully removed from the kernel code, retry optimizing another kprobe which is blocked by that kprobe. * tag 'probes-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: kprobes: Use dedicated kthread for kprobe optimizer tracing: kprobe-event: Return directly when trace kprobes is empty kprobes: retry blocked optprobe in do_free_cleaned_kprobes |
||
|
|
8e9bf8b9e8 |
printk, vt, fbcon: Remove console_conditional_schedule()
do_con_write(), fbcon_redraw.*() invoke console_conditional_schedule() which is a conditional scheduling point based on printk's internal variables console_may_schedule. It may only be used if the console lock is acquired for instance via console_lock() or console_trylock(). Prinkt sets the internal variable to 1 (and allows to schedule) if the console lock has been acquired via console_lock(). The trylock does not allow it. The console_conditional_schedule() invocation in do_con_write() is invoked shortly before console_unlock(). The console_conditional_schedule() invocation in fbcon_redraw.*() original from fbcon_scroll() / vt's con_scroll() which originate from a line feed. In console_unlock() the variable is set to 0 (forbids to schedule) and it tries to schedule while making progress printing. This is brand new compared to when console_conditional_schedule() was added in v2.4.9.11. In v2.6.38-rc3, console_unlock() (started its existence) iterated over all consoles and flushed them with disabled interrupts. A scheduling attempt here was not possible, it relied that a long print scheduled before console_unlock(). Since commit |
||
|
|
3c6e577d5a |
tracing updates for 7.0:
User visible changes:
- Add an entry into MAINTAINERS file for RUST versions of code
There's now RUST code for tracing and static branches. To differentiate
that code from the C code, add entries in for the RUST version (with "[RUST]"
around it) so that the right maintainers get notified on changes.
- New bitmask-list option added to tracefs
When this is set, bitmasks in trace event are not displayed as hex
numbers, but instead as lists: e.g. 0-5,7,9 instead of 0000015f
- New show_event_filters file in tracefs
Instead of having to search all events/*/*/filter for any active filters
enabled in the trace instance, the file show_event_filters will list them
so that there's only one file that needs to be examined to see if any
filters are active.
- New show_event_triggers file in tracefs
Instead of having to search all events/*/*/trigger for any active triggers
enabled in the trace instance, the file show_event_triggers will list them
so that there's only one file that needs to be examined to see if any
triggers are active.
- Have traceoff_on_warning disable trace pintk buffer too
Recently recording of trace_printk() could go to other trace instances
instead of the top level instance. But if traceoff_on_warning triggers, it
doesn't stop the buffer with trace_printk() and that data can easily be
lost by being overwritten. Have traceoff_on_warning also disable the
instance that has trace_printk() being written to it.
- Update the hist_debug file to show what function the field uses
When CONFIG_HIST_TRIGGERS_DEBUG is enabled, a hist_debug file exists for
every event. This displays the internal data of any histogram enabled for
that event. But it is lacking the function that is called to process one
of its fields. This is very useful information that was missing when
debugging histograms.
- Up the histogram stack size from 16 to 31
Stack traces can be used as keys for event histograms. Currently the size
of the stack that is stored is limited to just 16 entries. But the storage
space in the histogram is 256 bytes, meaning that it can store up to 31
entries (plus one for the count of entries). Instead of letting that space
go to waste, up the limit from 16 to 31. This makes the keys much more
useful.
- Fix permissions of per CPU file buffer_size_kb
The per CPU file of buffer_size_kb was incorrectly set to read only in a
previous cleanup. It should be writable.
- Reset "last_boot_info" if the persistent buffer is cleared
The last_boot_info shows address information of a persistent ring buffer
if it contains data from a previous boot. It is cleared when recording
starts again, but it is not cleared when the buffer is reset. The data is
useless after a reset so clear it on reset too.
Internal changes:
- A change was made to allow tracepoint callbacks to have preemption
enabled, and instead be protected by SRCU. This required some updates to
the callbacks for perf and BPF.
perf needed to disable preemption directly in its callback because it
expects preemption disabled in the later code.
BPF needed to disable migration, as its code expects to run completely on
the same CPU.
- Have irq_work wake up other CPU if current CPU is "isolated"
When there's a waiter waiting on ring buffer data and a new event happens,
an irq work is triggered to wake up that waiter. This is noisy on isolated
CPUs (running NO_HZ_FULL). Trigger an IPI to a house keeping CPU instead.
- Use proper free of trigger_data instead of open coding it in.
- Remove redundant call of event_trigger_reset_filter()
It was called immediately in a function that was called right after it.
- Workqueue cleanups
- Report errors if tracing_update_buffers() were to fail.
- Make the enum update workqueue generic for other parts of tracing
On boot up, a work queue is created to convert enum names into their
numbers in the trace event format files. This work queue can also be used
for other aspects of tracing that takes some time and shouldn't be called
by the init call code.
The blk_trace initialization takes a bit of time. Have the initialization
code moved to the new tracing generic work queue function.
- Skip kprobe boot event creation call if there's no kprobes defined on cmdline
The kprobe initialization to set up kprobes if they are defined on the
cmdline requires taking the event_mutex lock. This can be held by other
tracing code doing initialization for a long time. Since kprobes added to
the kernel command line need to be setup immediately, as they may be
tracing early initialization code, they cannot be postponed in a work
queue and must be setup in the initcall code.
If there's no kprobe on the kernel cmdline, there's no reason to take the
mutex and slow down the boot up code waiting to get the lock only to find
out there's nothing to do. Simply exit out early if there's no kprobes on
the kernel cmdline.
If there are kprobes on the cmdline, then someone cares more about tracing
over the speed of boot up.
- Clean up the trigger code a bit
- Move code out of trace.c and into their own files
trace.c is now over 11,000 lines of code and has become more difficult to
maintain. Start splitting it up so that related code is in their own
files.
Move all the trace_printk() related code into trace_printk.c.
Move the __always_inline stack functions into trace.h.
Move the pid filtering code into a new trace_pid.c file.
- Better define the max latency and snapshot code
The latency tracers have a "max latency" buffer that is a copy of the main
buffer and gets swapped with it when a new high latency is detected. This
keeps the trace up to the highest latency around where this max_latency
buffer is never written to. It is only used to save the last max latency
trace.
A while ago a snapshot feature was added to tracefs to allow user space to
perform the same logic. It could also enable events to trigger a
"snapshot" if one of their fields hit a new high. This was built on top of
the latency max_latency buffer logic.
Because snapshots came later, they were dependent on the latency tracers
to be enabled. In reality, the latency tracers depend on the snapshot code
and not the other way around. It was just that they came first.
Restructure the code and the kconfigs to have the latency tracers depend
on snapshot code instead. This actually simplifies the logic a bit and
allows to disable more when the latency tracers are not defined and the
snapshot code is.
- Fix a "false sharing" in the hwlat tracer code
The loop to search for latency in hardware was using a variable that could
be changed by user space for each sample. If the user change this
variable, it could cause a bus contention, and reading that variable can
show up as a large latency in the trace causing a false positive. Read
this variable at the start of the sample with a READ_ONCE() into a local
variable and keep the code from sharing cache lines with readers.
- Fix function graph tracer static branch optimization code
When only one tracer is defined for function graph tracing, it uses a
static branch to call that tracer directly. When another tracer is added,
it goes into loop logic to call all the registered callbacks.
The code was incorrect when going back to one tracer and never re-enabled
the static branch again to do the optimization code.
- And other small fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaY9P3BQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qou3AQCCrzdrIglLNABGPyny9sqWLDz6vyyw
nWAK9xg1VFxwRQD+LyJvVMWbpGeRBS/PsAK19RgldbgkQFWNv8gNhRKRgw0=
=U/kg
-----END PGP SIGNATURE-----
Merge tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
"User visible changes:
- Add an entry into MAINTAINERS file for RUST versions of code
There's now RUST code for tracing and static branches. To
differentiate that code from the C code, add entries in for the
RUST version (with "[RUST]" around it) so that the right
maintainers get notified on changes.
- New bitmask-list option added to tracefs
When this is set, bitmasks in trace event are not displayed as hex
numbers, but instead as lists: e.g. 0-5,7,9 instead of 0000015f
- New show_event_filters file in tracefs
Instead of having to search all events/*/*/filter for any active
filters enabled in the trace instance, the file show_event_filters
will list them so that there's only one file that needs to be
examined to see if any filters are active.
- New show_event_triggers file in tracefs
Instead of having to search all events/*/*/trigger for any active
triggers enabled in the trace instance, the file
show_event_triggers will list them so that there's only one file
that needs to be examined to see if any triggers are active.
- Have traceoff_on_warning disable trace pintk buffer too
Recently recording of trace_printk() could go to other trace
instances instead of the top level instance. But if
traceoff_on_warning triggers, it doesn't stop the buffer with
trace_printk() and that data can easily be lost by being
overwritten. Have traceoff_on_warning also disable the instance
that has trace_printk() being written to it.
- Update the hist_debug file to show what function the field uses
When CONFIG_HIST_TRIGGERS_DEBUG is enabled, a hist_debug file
exists for every event. This displays the internal data of any
histogram enabled for that event. But it is lacking the function
that is called to process one of its fields. This is very useful
information that was missing when debugging histograms.
- Up the histogram stack size from 16 to 31
Stack traces can be used as keys for event histograms. Currently
the size of the stack that is stored is limited to just 16 entries.
But the storage space in the histogram is 256 bytes, meaning that
it can store up to 31 entries (plus one for the count of entries).
Instead of letting that space go to waste, up the limit from 16 to
31. This makes the keys much more useful.
- Fix permissions of per CPU file buffer_size_kb
The per CPU file of buffer_size_kb was incorrectly set to read only
in a previous cleanup. It should be writable.
- Reset "last_boot_info" if the persistent buffer is cleared
The last_boot_info shows address information of a persistent ring
buffer if it contains data from a previous boot. It is cleared when
recording starts again, but it is not cleared when the buffer is
reset. The data is useless after a reset so clear it on reset too.
Internal changes:
- A change was made to allow tracepoint callbacks to have preemption
enabled, and instead be protected by SRCU. This required some
updates to the callbacks for perf and BPF.
perf needed to disable preemption directly in its callback because
it expects preemption disabled in the later code.
BPF needed to disable migration, as its code expects to run
completely on the same CPU.
- Have irq_work wake up other CPU if current CPU is "isolated"
When there's a waiter waiting on ring buffer data and a new event
happens, an irq work is triggered to wake up that waiter. This is
noisy on isolated CPUs (running NO_HZ_FULL). Trigger an IPI to a
house keeping CPU instead.
- Use proper free of trigger_data instead of open coding it in.
- Remove redundant call of event_trigger_reset_filter()
It was called immediately in a function that was called right after
it.
- Workqueue cleanups
- Report errors if tracing_update_buffers() were to fail.
- Make the enum update workqueue generic for other parts of tracing
On boot up, a work queue is created to convert enum names into
their numbers in the trace event format files. This work queue can
also be used for other aspects of tracing that takes some time and
shouldn't be called by the init call code.
The blk_trace initialization takes a bit of time. Have the
initialization code moved to the new tracing generic work queue
function.
- Skip kprobe boot event creation call if there's no kprobes defined
on cmdline
The kprobe initialization to set up kprobes if they are defined on
the cmdline requires taking the event_mutex lock. This can be held
by other tracing code doing initialization for a long time. Since
kprobes added to the kernel command line need to be setup
immediately, as they may be tracing early initialization code, they
cannot be postponed in a work queue and must be setup in the
initcall code.
If there's no kprobe on the kernel cmdline, there's no reason to
take the mutex and slow down the boot up code waiting to get the
lock only to find out there's nothing to do. Simply exit out early
if there's no kprobes on the kernel cmdline.
If there are kprobes on the cmdline, then someone cares more about
tracing over the speed of boot up.
- Clean up the trigger code a bit
- Move code out of trace.c and into their own files
trace.c is now over 11,000 lines of code and has become more
difficult to maintain. Start splitting it up so that related code
is in their own files.
Move all the trace_printk() related code into trace_printk.c.
Move the __always_inline stack functions into trace.h.
Move the pid filtering code into a new trace_pid.c file.
- Better define the max latency and snapshot code
The latency tracers have a "max latency" buffer that is a copy of
the main buffer and gets swapped with it when a new high latency is
detected. This keeps the trace up to the highest latency around
where this max_latency buffer is never written to. It is only used
to save the last max latency trace.
A while ago a snapshot feature was added to tracefs to allow user
space to perform the same logic. It could also enable events to
trigger a "snapshot" if one of their fields hit a new high. This
was built on top of the latency max_latency buffer logic.
Because snapshots came later, they were dependent on the latency
tracers to be enabled. In reality, the latency tracers depend on
the snapshot code and not the other way around. It was just that
they came first.
Restructure the code and the kconfigs to have the latency tracers
depend on snapshot code instead. This actually simplifies the logic
a bit and allows to disable more when the latency tracers are not
defined and the snapshot code is.
- Fix a "false sharing" in the hwlat tracer code
The loop to search for latency in hardware was using a variable
that could be changed by user space for each sample. If the user
change this variable, it could cause a bus contention, and reading
that variable can show up as a large latency in the trace causing a
false positive. Read this variable at the start of the sample with
a READ_ONCE() into a local variable and keep the code from sharing
cache lines with readers.
- Fix function graph tracer static branch optimization code
When only one tracer is defined for function graph tracing, it uses
a static branch to call that tracer directly. When another tracer
is added, it goes into loop logic to call all the registered
callbacks.
The code was incorrect when going back to one tracer and never
re-enabled the static branch again to do the optimization code.
- And other small fixes and cleanups"
* tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (46 commits)
function_graph: Restore direct mode when callbacks drop to one
tracing: Fix indentation of return statement in print_trace_fmt()
tracing: Reset last_boot_info if ring buffer is reset
tracing: Fix to set write permission to per-cpu buffer_size_kb
tracing: Fix false sharing in hwlat get_sample()
tracing: Move d_max_latency out of CONFIG_FSNOTIFY protection
tracing: Better separate SNAPSHOT and MAX_TRACE options
tracing: Add tracer_uses_snapshot() helper to remove #ifdefs
tracing: Rename trace_array field max_buffer to snapshot_buffer
tracing: Move pid filtering into trace_pid.c
tracing: Move trace_printk functions out of trace.c and into trace_printk.c
tracing: Use system_state in trace_printk_init_buffers()
tracing: Have trace_printk functions use flags instead of using global_trace
tracing: Make tracing_update_buffers() take NULL for global_trace
tracing: Make printk_trace global for tracing system
tracing: Move ftrace_trace_stack() out of trace.c and into trace.h
tracing: Move __trace_buffer_{un}lock_*() functions to trace.h
tracing: Make tracing_selftest_running global to the tracing subsystem
tracing: Make tracing_disabled global for tracing system
tracing: Clean up use of trace_create_maxlat_file()
...
|
||
|
|
72f05009d8 |
dma-mapping update for Linux 7.0
A small code cleanup for DMA-mapping subsystem: - removal of unused hooks (Robin Murphy) -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaY80tQAKCRCJp1EFxbsS RHQUAP9yrD7PczbUfOCcYoxvmyqYZ76gQLwnx9GB4Gi0pctxGAEA1hxsu+LisN5x uMc3XW/zluzQj/gn+jWCD5v5DzdEWA4= =uf46 -----END PGP SIGNATURE----- Merge tag 'dma-mapping-7.0-2026-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping update from Marek Szyprowski: "A small code cleanup for the DMA-mapping subsystem: removal of unused hooks (Robin Murphy)" * tag 'dma-mapping-7.0-2026-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: Remove dma_mark_clean (again) |
||
|
|
b0b1a8583d |
bpf: Add a map/btf from a fd array more consistently
The add_fd_from_fd_array() function takes a file descriptor as a
parameter and tries to add either map or btf to the corresponding
list of used objects. As was reported by Dan Carpenter, since the
commit c81e4322acf0 ("bpf: Fix a potential use-after-free of BTF
object"), the fdget() is called twice on the file descriptor, and
thus userspace, potentially, can replace the file pointed to by the
file descriptor in between the two calls. On practice, this shouldn't
break anything on the kernel side, but for consistency fix the code
such that only one fdget() is executed.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/aY689z7gHNv8rgVO@stanley.mountain/
Fixes:
|
||
|
|
ccd2d799ed |
bpf: Fix a potential use-after-free of BTF object
Refcounting in the check_pseudo_btf_id() function is incorrect:
the __check_pseudo_btf_id() function might get called with a zero
refcounted btf. Fix this, and patch related code accordingly.
v3: rephrase a comment (AI)
v2: fix a refcount leak introduced in v1 (AI)
Reported-by: syzbot+5a0f1995634f7c1dadbf@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5a0f1995634f7c1dadbf
Fixes:
|
||
|
|
a353e7260b |
virtio,vhost,vdpa: features, fixes
- in order support in virtio core - multiple address space support in vduse - fixes, cleanups all over the place, notably - dma alignment fixes for non cache coherent systems Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- iQFDBAABCgAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmmO9rYPHG1zdEByZWRo YXQuY29tAAoJECgfDbjSjVRpBzYH/2wUPo3T8/CKGFjF7QSPzgL/UI2NhnP8iSm4 btg1zVnrWmJK6vVIwnf5UsG8dFKsMcp/BEGCewTmIddNM2wEeSul0kKDXtIzrK/U jdA9bJrUKLMeU7IFKne1Fip/yE+5nkWJttWXXyVRJtOJrYxZlkWfqSns3qYcPvsG g7HXvF6tmici5uoKdRCLqHtQCWsvpnvTD5A7qoZAlEUjlQCDKKmuukpN9oK5UYLl 9uUOgPQAJaxIwx1C4uP7L+AwbLUcN/+MtrvQRNz+sFpP3sN9oXeDJKBpNQp109NB JGk1sUsINL+54Cmdd5RwZ9T1vBJyRDrdWRDy1yHj95LildaPfh0= =pnob -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: - in-order support in virtio core - multiple address space support in vduse - fixes, cleanups all over the place, notably dma alignment fixes for non-cache-coherent systems * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (59 commits) vduse: avoid adding implicit padding vhost: fix caching attributes of MMIO regions by setting them explicitly vdpa/mlx5: update MAC address handling in mlx5_vdpa_set_attr() vdpa/mlx5: reuse common function for MAC address updates vdpa/mlx5: update mlx_features with driver state check crypto: virtio: Replace package id with numa node id crypto: virtio: Remove duplicated virtqueue_kick in virtio_crypto_skcipher_crypt_req crypto: virtio: Add spinlock protection with virtqueue notification Documentation: Add documentation for VDUSE Address Space IDs vduse: bump version number vduse: add vq group asid support vduse: merge tree search logic of IOTLB_GET_FD and IOTLB_GET_INFO ioctls vduse: take out allocations from vduse_dev_alloc_coherent vduse: remove unused vaddr parameter of vduse_domain_free_coherent vduse: refactor vdpa_dev_add for goto err handling vhost: forbid change vq groups ASID if DRIVER_OK is set vdpa: document set_group_asid thread safety vduse: return internal vq group struct as map token vduse: add vq group support vduse: add v1 API definition ... |