Commit graph

7055 commits

Author SHA1 Message Date
Steven Rostedt
cc337974cd ftrace: Disable preemption in the tracepoint callbacks handling filtered pids
When function trace PID filtering is enabled, the function tracer will
attach a callback to the fork tracepoint as well as the exit tracepoint
that will add the forked child PID to the PID filtering list as well as
remove the PID that is exiting.

Commit a46023d561 ("tracing: Guard __DECLARE_TRACE() use of
__DO_TRACE_CALL() with SRCU-fast") removed the disabling of preemption
when calling tracepoint callbacks.

The callbacks used for the PID filtering accounting depended on preemption
being disabled, and now the trigger a "suspicious RCU usage" warning message.

Make them explicitly disable preemption.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260302213546.156e3e4f@gandalf.local.home
Fixes: a46023d561 ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-03-03 22:25:31 -05:00
Huiwen He
0a663b764d tracing: Fix syscall events activation by ensuring refcount hits zero
When multiple syscall events are specified in the kernel command line
(e.g., trace_event=syscalls:sys_enter_openat,syscalls:sys_enter_close),
they are often not captured after boot, even though they appear enabled
in the tracing/set_event file.

The issue stems from how syscall events are initialized. Syscall
tracepoints require the global reference count (sys_tracepoint_refcount)
to transition from 0 to 1 to trigger the registration of the syscall
work (TIF_SYSCALL_TRACEPOINT) for tasks, including the init process (pid 1).

The current implementation of early_enable_events() with disable_first=true
used an interleaved sequence of "Disable A -> Enable A -> Disable B -> Enable B".
If multiple syscalls are enabled, the refcount never drops to zero,
preventing the 0->1 transition that triggers actual registration.

Fix this by splitting early_enable_events() into two distinct phases:
1. Disable all events specified in the buffer.
2. Enable all events specified in the buffer.

This ensures the refcount hits zero before re-enabling, allowing syscall
events to be properly activated during early boot.

The code is also refactored to use a helper function to avoid logic
duplication between the disable and enable phases.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260224023544.1250787-1-hehuiwen@kylinos.cn
Fixes: ce1039bd3a ("tracing: Fix enabling of syscall events on the command line")
Signed-off-by: Huiwen He <hehuiwen@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-03 22:15:02 -05:00
Shengming Hu
b96d0c59cd fgraph: Fix thresh_return nosleeptime double-adjust
trace_graph_thresh_return() called handle_nosleeptime() and then delegated
to trace_graph_return(), which calls handle_nosleeptime() again. When
sleep-time accounting is disabled this double-adjusts calltime and can
produce bogus durations (including underflow).

Fix this by computing rettime once, applying handle_nosleeptime() only
once, using the adjusted calltime for threshold comparison, and writing
the return event directly via __trace_graph_return() when the threshold is
met.

Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260221113314048jE4VRwIyZEALiYByGK0My@zte.com.cn
Fixes: 3c9880f3ab ("ftrace: Use a running sleeptime instead of saving on shadow stack")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-03 22:11:20 -05:00
Shengming Hu
6ca8379b5d fgraph: Fix thresh_return clear per-task notrace
When tracing_thresh is enabled, function graph tracing uses
trace_graph_thresh_return() as the return handler. Unlike
trace_graph_return(), it did not clear the per-task TRACE_GRAPH_NOTRACE
flag set by the entry handler for set_graph_notrace addresses. This could
leave the task permanently in "notrace" state and effectively disable
function graph tracing for that task.

Mirror trace_graph_return()'s per-task notrace handling by clearing
TRACE_GRAPH_NOTRACE and returning early when set.

Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260221113007819YgrZsMGABff4Rc-O_fZxL@zte.com.cn
Fixes: b84214890a ("function_graph: Move graph notrace bit to shadow stack global var")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-03 22:10:37 -05:00
Jiri Olsa
ad6fface76 bpf: Fix kprobe_multi cookies access in show_fdinfo callback
We don't check if cookies are available on the kprobe_multi link
before accessing them in show_fdinfo callback, we should.

Cc: stable@vger.kernel.org
Fixes: da7e9c0a7f ("bpf: Add show_fdinfo for kprobe_multi")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260225111249.186230-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-26 11:23:57 -08:00
Kees Cook
189f164e57 Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses
Conversion performed via this Coccinelle script:

  // SPDX-License-Identifier: GPL-2.0-only
  // Options: --include-headers-for-types --all-includes --include-headers --keep-comments
  virtual patch

  @gfp depends on patch && !(file in "tools") && !(file in "samples")@
  identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
 		    kzalloc_obj,kzalloc_objs,kzalloc_flex,
		    kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
		    kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
  @@

  	ALLOC(...
  -		, GFP_KERNEL
  	)

  $ make coccicheck MODE=patch COCCI=gfp.cocci

Build and boot tested x86_64 with Fedora 42's GCC and Clang:

Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01

Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22 08:26:33 -08:00
Linus Torvalds
32a92f8c89 Convert more 'alloc_obj' cases to default GFP_KERNEL arguments
This converts some of the visually simpler cases that have been split
over multiple lines.  I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.

Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script.  I probably had made it a bit _too_ trivial.

So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.

The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 20:03:00 -08:00
Linus Torvalds
323bbfcf1e Convert 'alloc_flex' family to use the new default GFP_KERNEL argument
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.

As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Linus Torvalds
68010e7b3d Tracing fixes for 7.0:
- Fix possible dereference of uninitialized pointer
 
   When validating the persistent ring buffer on boot up, if the first
   validation fails, a reference to "head_page" is performed in the
   error path, but it skips over the initialization of that variable.
   Move the initialization before the first validation check.
 
 - Fix use of event length in validation of persistent ring buffer
 
   On boot up, the persistent ring buffer is checked to see if it is
   valid by several methods. One being to walk all the events in the
   memory location to make sure they are all valid. The length of the
   event is used to move to the next event. This length is determined
   by the data in the buffer. If that length is corrupted, it could
   possibly make the next event to check located at a bad memory location.
 
   Validate the length field of the event when doing the event walk.
 
 - Fix function graph on archs that do not support use of ftrace_ops
 
   When an architecture defines HAVE_DYNAMIC_FTRACE_WITH_ARGS, it means
   that its function graph tracer uses the ftrace_ops of the function
   tracer to call its callbacks. This allows a single registered callback
   to be called directly instead of checking the callback's meta data's
   hash entries against the function being traced.
 
   For architectures that do not support this feature, it must always
   call the loop function that tests each registered callback (even if
   there's only one). The loop function tests each callback's meta data
   against its hash of functions and will call its callback if the
   function being traced is in its hash map.
 
   The issue was that there was no check against this and the direct
   function was being called even if the architecture didn't support it.
   This meant that if function tracing was enabled at the same time
   as a callback was registered with the function graph tracer, its
   callback would be called for every function that the function tracer
   also traced, even if the callback's meta data only wanted to be
   called back for a small subset of functions.
 
   Prevent the direct calling for those architectures that do not support
   it.
 
 - Fix references to trace_event_file for hist files
 
   The hist files used event_file_data() to get a reference to the
   associated trace_event_file the histogram was attached to. This
   would return a pointer even if the trace_event_file is about to
   be freed (via RCU). Instead it should use the event_file_file()
   helper that returns NULL if the trace_event_file is marked to be
   freed so that no new references are added to it.
 
 - Wake up hist poll readers when an event is being freed
 
   When polling on a hist file, the task is only awoken when a hist
   trigger is triggered. This means that if an event is being freed
   while there's a task waiting on its hist file, it will need to wait
   until the hist trigger occurs to wake it up and allow the freeing
   to happen. Note, the event will not be completely freed until all
   references are removed, and a hist poller keeps a reference. But
   it should still be woken when the event is being freed.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaZd4ExQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqX5AP4powfnNnRfLSKH9idhTp+ltmJ+9roy
 L7kWTr/z20S2VQEAk331PNZ32uZu+/ZUETYpgtEx4SbRGZFehTBv1ddjfw4=
 =TWj5
 -----END PGP SIGNATURE-----

Merge tag 'trace-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix possible dereference of uninitialized pointer

   When validating the persistent ring buffer on boot up, if the first
   validation fails, a reference to "head_page" is performed in the
   error path, but it skips over the initialization of that variable.
   Move the initialization before the first validation check.

 - Fix use of event length in validation of persistent ring buffer

   On boot up, the persistent ring buffer is checked to see if it is
   valid by several methods. One being to walk all the events in the
   memory location to make sure they are all valid. The length of the
   event is used to move to the next event. This length is determined by
   the data in the buffer. If that length is corrupted, it could
   possibly make the next event to check located at a bad memory
   location.

   Validate the length field of the event when doing the event walk.

 - Fix function graph on archs that do not support use of ftrace_ops

   When an architecture defines HAVE_DYNAMIC_FTRACE_WITH_ARGS, it means
   that its function graph tracer uses the ftrace_ops of the function
   tracer to call its callbacks. This allows a single registered
   callback to be called directly instead of checking the callback's
   meta data's hash entries against the function being traced.

   For architectures that do not support this feature, it must always
   call the loop function that tests each registered callback (even if
   there's only one). The loop function tests each callback's meta data
   against its hash of functions and will call its callback if the
   function being traced is in its hash map.

   The issue was that there was no check against this and the direct
   function was being called even if the architecture didn't support it.
   This meant that if function tracing was enabled at the same time as a
   callback was registered with the function graph tracer, its callback
   would be called for every function that the function tracer also
   traced, even if the callback's meta data only wanted to be called
   back for a small subset of functions.

   Prevent the direct calling for those architectures that do not
   support it.

 - Fix references to trace_event_file for hist files

   The hist files used event_file_data() to get a reference to the
   associated trace_event_file the histogram was attached to. This would
   return a pointer even if the trace_event_file is about to be freed
   (via RCU). Instead it should use the event_file_file() helper that
   returns NULL if the trace_event_file is marked to be freed so that no
   new references are added to it.

 - Wake up hist poll readers when an event is being freed

   When polling on a hist file, the task is only awoken when a hist
   trigger is triggered. This means that if an event is being freed
   while there's a task waiting on its hist file, it will need to wait
   until the hist trigger occurs to wake it up and allow the freeing to
   happen. Note, the event will not be completely freed until all
   references are removed, and a hist poller keeps a reference. But it
   should still be woken when the event is being freed.

* tag 'trace-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Wake up poll waiters for hist files when removing an event
  tracing: Fix checking of freed trace_event_file for hist files
  fgraph: Do not call handlers direct when not using ftrace_ops
  tracing: ring-buffer: Fix to check event length before using
  ring-buffer: Fix possible dereference of uninitialized pointer
2026-02-20 15:05:26 -08:00
Petr Pavlu
9678e53179 tracing: Wake up poll waiters for hist files when removing an event
The event_hist_poll() function attempts to verify whether an event file is
being removed, but this check may not occur or could be unnecessarily
delayed. This happens because hist_poll_wakeup() is currently invoked only
from event_hist_trigger() when a hist command is triggered. If the event
file is being removed, no associated hist command will be triggered and a
waiter will be woken up only after an unrelated hist command is triggered.

Fix the issue by adding a call to hist_poll_wakeup() in
remove_event_file_dir() after setting the EVENT_FILE_FL_FREED flag. This
ensures that a task polling on a hist file is woken up and receives
EPOLLERR.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-3-petr.pavlu@suse.com
Fixes: 1bd13edbbe ("tracing/hist: Add poll(POLLIN) support on hist file")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19 15:25:11 -05:00
Petr Pavlu
f0a0da1f90 tracing: Fix checking of freed trace_event_file for hist files
The event_hist_open() and event_hist_poll() functions currently retrieve
a trace_event_file pointer from a file struct by invoking
event_file_data(), which simply returns file->f_inode->i_private. The
functions then check if the pointer is NULL to determine whether the event
is still valid. This approach is flawed because i_private is assigned when
an eventfs inode is allocated and remains set throughout its lifetime.
Instead, the code should call event_file_file(), which checks for
EVENT_FILE_FL_FREED. Using the incorrect access function may result in the
code potentially opening a hist file for an event that is being removed or
becoming stuck while polling on this file.

Correct the access method to event_file_file() in both functions.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-2-petr.pavlu@suse.com
Fixes: 1bd13edbbe ("tracing/hist: Add poll(POLLIN) support on hist file")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19 15:23:49 -05:00
Steven Rostedt
f4ff9f646a fgraph: Do not call handlers direct when not using ftrace_ops
The function graph tracer was modified to us the ftrace_ops of the
function tracer. This simplified the code as well as allowed more features
of the function graph tracer.

Not all architectures were converted over as it required the
implementation of HAVE_DYNAMIC_FTRACE_WITH_ARGS to implement. For those
architectures, it still did it the old way where the function graph tracer
handle was called by the function tracer trampoline. The handler then had
to check the hash to see if the registered handlers wanted to be called by
that function or not.

In order to speed up the function graph tracer that used ftrace_ops, if
only one callback was registered with function graph, it would call its
function directly via a static call.

Now, if the architecture does not support the use of using ftrace_ops and
still has the ftrace function trampoline calling the function graph
handler, then by doing a direct call it removes the check against the
handler's hash (list of functions it wants callbacks to), and it may call
that handler for functions that the handler did not request calls for.

On 32bit x86, which does not support the ftrace_ops use with function
graph tracer, it shows the issue:

 ~# trace-cmd start -p function -l schedule
 ~# trace-cmd show
 # tracer: function_graph
 #
 # CPU  DURATION                  FUNCTION CALLS
 # |     |   |                     |   |   |   |
  2) * 11898.94 us |  schedule();
  3) # 1783.041 us |  schedule();
  1)               |  schedule() {
  ------------------------------------------
  1)   bash-8369    =>  kworker-7669
  ------------------------------------------
  1)               |        schedule() {
  ------------------------------------------
  1)  kworker-7669  =>   bash-8369
  ------------------------------------------
  1) + 97.004 us   |  }
  1)               |  schedule() {
 [..]

Now by starting the function tracer is another instance:

 ~# trace-cmd start -B foo -p function

This causes the function graph tracer to trace all functions (because the
function trace calls the function graph tracer for each on, and the
function graph trace is doing a direct call):

 ~# trace-cmd show
 # tracer: function_graph
 #
 # CPU  DURATION                  FUNCTION CALLS
 # |     |   |                     |   |   |   |
  1)   1.669 us    |          } /* preempt_count_sub */
  1) + 10.443 us   |        } /* _raw_spin_unlock_irqrestore */
  1)               |        tick_program_event() {
  1)               |          clockevents_program_event() {
  1)   1.044 us    |            ktime_get();
  1)   6.481 us    |            lapic_next_event();
  1) + 10.114 us   |          }
  1) + 11.790 us   |        }
  1) ! 181.223 us  |      } /* hrtimer_interrupt */
  1) ! 184.624 us  |    } /* __sysvec_apic_timer_interrupt */
  1)               |    irq_exit_rcu() {
  1)   0.678 us    |      preempt_count_sub();

When it should still only be tracing the schedule() function.

To fix this, add a macro FGRAPH_NO_DIRECT to be set to 0 when the
architecture does not support function graph use of ftrace_ops, and set to
1 otherwise. Then use this macro to know to allow function graph tracer to
call the handlers directly or not.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://patch.msgid.link/20260218104244.5f14dade@gandalf.local.home
Fixes: cc60ee813b ("function_graph: Use static_call and branch to optimize entry function")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19 15:21:22 -05:00
Masami Hiramatsu (Google)
912b0ee248 tracing: ring-buffer: Fix to check event length before using
Check the event length before adding it for accessing next index in
rb_read_data_buffer(). Since this function is used for validating
possibly broken ring buffers, the length of the event could be broken.
In that case, the new event (e + len) can point a wrong address.
To avoid invalid memory access at boot, check whether the length of
each event is in the possible range before using it.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Link: https://patch.msgid.link/177123421541.142205.9414352170164678966.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19 15:21:12 -05:00
Daniil Dulov
f154777940 ring-buffer: Fix possible dereference of uninitialized pointer
There is a pointer head_page in rb_meta_validate_events() which is not
initialized at the beginning of a function. This pointer can be dereferenced
if there is a failure during reader page validation. In this case the control
is passed to "invalid" label where the pointer is dereferenced in a loop.

To fix the issue initialize orig_head and head_page before calling
rb_validate_buffer.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Cc: stable@vger.kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260213100130.2013839-1-d.dulov@aladdin.ru
Closes: https://lore.kernel.org/r/202406130130.JtTGRf7W-lkp@intel.com/
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19 15:20:41 -05:00
Linus Torvalds
99dfe2d4da block-7.0-20260216
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmTrNEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjsOEACpUk78nFmLbEgJ5UH8+Z6daDzgoasb5YRT
 Mj4g+cM2J9Xc9JxgX8QR3F2EfolweTo/H6xlhnlPDcnpB+b3qj4WHuijR/wghphj
 MBKKqNXTEC+j0ra9uk8h3RmIKaK79xcUup7XfTcuWdYpSsMyYE/m/rck3thw6yNL
 OAjmWLTP4IwYzXip2AB+J7JbDDOV/qWK0aOYdWHCdbn9X8bBel/HDOITWPdybnSR
 DNKBeoi/Yv8KwA+axogqP213ifc3Xr6ejRDkqDOf1bgXsKkELkIxcfog6MhfHhxq
 3Cqlj1pBuIBxGVU7wmBTDqL+aHrVb983tcA5x1NGZIzJao64b026o5DUhNPprwrZ
 HveU1MZ2jarAjAz85gE3S4oUY+6d47ooytfvO548Zp/1LY+fOxnjYqq5ksh8BBLk
 WyjfkJScgr17Z4SVOK8a9GboWO2WKiQJRg+hZ/TWX5fyvu5g9sbRasdwxnp1sl52
 EayzkhYFq/Rdd8slwTIaccVUPl/xeEDeRG+jTJ+4Fj54TihKiJzXVsxDkSWKf46V
 CWmzDx+n6MlGPm9mShSERZ7HJh3VcSp4No/HAjf93u9/UXwubK/SKiV71nhpgJMf
 9bWS2G3wPx/5LoME95YkF+CSgs0e/ROUusfGd8X6nIz9EBGzeabCG/mjqd5adC09
 OZahOuqrIg==
 =PVoY
 -----END PGP SIGNATURE-----

Merge tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull more block updates from Jens Axboe:

 - Fix partial IOVA mapping cleanup in error handling

 - Minor prep series ignoring discard return value, as
   the inline value is always known

 - Ensure BLK_FEAT_STABLE_WRITES is set for drbd

 - Fix leak of folio in bio_iov_iter_bounce_read()

 - Allow IOC_PR_READ_* for read-only open

 - Another debugfs deadlock fix

 - A few doc updates

* tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  blk-mq: use NOIO context to prevent deadlock during debugfs creation
  blk-stat: convert struct blk_stat_callback to kernel-doc
  block: fix enum descriptions kernel-doc
  block: update docs for bio and bvec_iter
  block: change return type to void
  nvmet: ignore discard return value
  md: ignore discard return value
  block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova
  block: fix folio leak in bio_iov_iter_bounce_read()
  block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ
  drbd: always set BLK_FEAT_STABLE_WRITES
2026-02-17 08:48:45 -08:00
Yu Kuai
dfe48ea179 blk-mq: use NOIO context to prevent deadlock during debugfs creation
Creating debugfs entries can trigger fs reclaim, which can enter back
into the block layer request_queue. This can cause deadlock if the
queue is frozen.

Previously, a WARN_ON_ONCE check was used in debugfs_create_files()
to detect this condition, but it was racy since the queue can be frozen
from another context at any time.

Introduce blk_debugfs_lock()/blk_debugfs_unlock() helpers that combine
the debugfs_mutex with memalloc_noio_save()/restore() to prevent fs
reclaim from triggering block I/O. Also add blk_debugfs_lock_nomemsave()
and blk_debugfs_unlock_nomemrestore() variants for callers that don't
need NOIO protection (e.g., debugfs removal or read-only operations).

Replace all raw debugfs_mutex lock/unlock pairs with these helpers,
using the _nomemsave/_nomemrestore variants where appropriate.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs9gNKEYAPagD9JADfO5UH+OiCr4P7OO2wjpfOYeM-RV=A@mail.gmail.com/
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/all/aYWQR7CtYdk3K39g@shinmob/
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-16 10:47:25 -07:00
Linus Torvalds
2d10a48871 Probes for v7.0
- kprobes: Use a dedicated kernel thread to optimize the kprobes
   instead of using workqueue thread. Since the kprobe optimizer waits
   a long time for synchronize_rcu_task(), it can block other workers
   in the same queue if it uses a workqueue.
 
 - kprobe-events: Returns immediately if no new probe events are
   specified on the kernel command line at boot time. This shorten the
   kernel boot time.
 
 - kprobes: When a kprobe is fully removed from the kernel code,
   retry optimizing another kprobe which is blocked by that kprobe.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmmPtX4bHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bgoUH/0oZ9/c+wkn9AKxFlOAw
 AIbtTfKiQHXPirhamYuEbJJWTUH2O/Wr0SQrIkbwJzToou5CRTuT5UQBpmOk2GKZ
 uZ5bO63nOMn/1ECbDpc+lFGV5J4Dn5Pw2GFjUTR6EPwms+CN/d6f2LlvuAVfJX7Y
 klL2X9/QLwV5QT8La5zNEka2zBfrZlZ9a5DGoLtgdwX2zosuef9aDnisS2jkLenH
 wugZLryx0Qvlj9DXimP7oVZpYDHr0hZEI3CNQPge0gMQDj1XajGawYfkhrfaeFMX
 4eXjFCN+NQG6fBUpDugcOThOypWmjlJhoYxxROJu0eFcLQjsuWjp6jLixI9Nr7P+
 tQk=
 =syAj
 -----END PGP SIGNATURE-----

Merge tag 'probes-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull kprobes updates from Masami Hiramatsu:

 - Use a dedicated kernel thread to optimize the kprobes instead of
   using workqueue thread. Since the kprobe optimizer waits a long time
   for synchronize_rcu_task(), it can block other workers in the same
   queue if it uses a workqueue.

 - kprobe-events: return immediately if no new probe events are
   specified on the kernel command line at boot time. This shortens
   the kernel boot time.

 - When a kprobe is fully removed from the kernel code, retry optimizing
   another kprobe which is blocked by that kprobe.

* tag 'probes-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  kprobes: Use dedicated kthread for kprobe optimizer
  tracing: kprobe-event: Return directly when trace kprobes is empty
  kprobes: retry blocked optprobe in do_free_cleaned_kprobes
2026-02-16 07:04:01 -08:00
Linus Torvalds
3c6e577d5a tracing updates for 7.0:
User visible changes:
 
 - Add an entry into MAINTAINERS file for RUST versions of code
 
   There's now RUST code for tracing and static branches. To differentiate
   that code from the C code, add entries in for the RUST version (with "[RUST]"
   around it) so that the right maintainers get notified on changes.
 
 - New bitmask-list option added to tracefs
 
   When this is set, bitmasks in trace event are not displayed as hex
   numbers, but instead as lists: e.g. 0-5,7,9 instead of 0000015f
 
 - New show_event_filters file in tracefs
 
   Instead of having to search all events/*/*/filter for any active filters
   enabled in the trace instance, the file show_event_filters will list them
   so that there's only one file that needs to be examined to see if any
   filters are active.
 
 - New show_event_triggers file in tracefs
 
   Instead of having to search all events/*/*/trigger for any active triggers
   enabled in the trace instance, the file show_event_triggers will list them
   so that there's only one file that needs to be examined to see if any
   triggers are active.
 
 - Have traceoff_on_warning disable trace pintk buffer too
 
   Recently recording of trace_printk() could go to other trace instances
   instead of the top level instance. But if traceoff_on_warning triggers, it
   doesn't stop the buffer with trace_printk() and that data can easily be
   lost by being overwritten. Have traceoff_on_warning also disable the
   instance that has trace_printk() being written to it.
 
 - Update the hist_debug file to show what function the field uses
 
   When CONFIG_HIST_TRIGGERS_DEBUG is enabled, a hist_debug file exists for
   every event. This displays the internal data of any histogram enabled for
   that event. But it is lacking the function that is called to process one
   of its fields. This is very useful information that was missing when
   debugging histograms.
 
 - Up the histogram stack size from 16 to 31
 
   Stack traces can be used as keys for event histograms. Currently the size
   of the stack that is stored is limited to just 16 entries. But the storage
   space in the histogram is 256 bytes, meaning that it can store up to 31
   entries (plus one for the count of entries). Instead of letting that space
   go to waste, up the limit from 16 to 31. This makes the keys much more
   useful.
 
 - Fix permissions of per CPU file buffer_size_kb
 
   The per CPU file of buffer_size_kb was incorrectly set to read only in a
   previous cleanup. It should be writable.
 
 - Reset "last_boot_info" if the persistent buffer is cleared
 
   The last_boot_info shows address information of a persistent ring buffer
   if it contains data from a previous boot. It is cleared when recording
   starts again, but it is not cleared when the buffer is reset. The data is
   useless after a reset so clear it on reset too.
 
 Internal changes:
 
 - A change was made to allow tracepoint callbacks to have preemption
   enabled, and instead be protected by SRCU. This required some updates to
   the callbacks for perf and BPF.
 
   perf needed to disable preemption directly in its callback because it
   expects preemption disabled in the later code.
 
   BPF needed to disable migration, as its code expects to run completely on
   the same CPU.
 
 - Have irq_work wake up other CPU if current CPU is "isolated"
 
   When there's a waiter waiting on ring buffer data and a new event happens,
   an irq work is triggered to wake up that waiter. This is noisy on isolated
   CPUs (running NO_HZ_FULL). Trigger an IPI to a house keeping CPU instead.
 
 - Use proper free of trigger_data instead of open coding it in.
 
 - Remove redundant call of event_trigger_reset_filter()
 
   It was called immediately in a function that was called right after it.
 
 - Workqueue cleanups
 
 - Report errors if tracing_update_buffers() were to fail.
 
 - Make the enum update workqueue generic for other parts of tracing
 
   On boot up, a work queue is created to convert enum names into their
   numbers in the trace event format files. This work queue can also be used
   for other aspects of tracing that takes some time and shouldn't be called
   by the init call code.
 
   The blk_trace initialization takes a bit of time. Have the initialization
   code moved to the new tracing generic work queue function.
 
 - Skip kprobe boot event creation call if there's no kprobes defined on cmdline
 
   The kprobe initialization to set up kprobes if they are defined on the
   cmdline requires taking the event_mutex lock. This can be held by other
   tracing code doing initialization for a long time. Since kprobes added to
   the kernel command line need to be setup immediately, as they may be
   tracing early initialization code, they cannot be postponed in a work
   queue and must be setup in the initcall code.
 
   If there's no kprobe on the kernel cmdline, there's no reason to take the
   mutex and slow down the boot up code waiting to get the lock only to find
   out there's nothing to do. Simply exit out early if there's no kprobes on
   the kernel cmdline.
 
   If there are kprobes on the cmdline, then someone cares more about tracing
   over the speed of boot up.
 
 - Clean up the trigger code a bit
 
 - Move code out of trace.c and into their own files
 
   trace.c is now over 11,000 lines of code and has become more difficult to
   maintain. Start splitting it up so that related code is in their own
   files.
 
   Move all the trace_printk() related code into trace_printk.c.
 
   Move the __always_inline stack functions into trace.h.
 
   Move the pid filtering code into a new trace_pid.c file.
 
 - Better define the max latency and snapshot code
 
   The latency tracers have a "max latency" buffer that is a copy of the main
   buffer and gets swapped with it when a new high latency is detected. This
   keeps the trace up to the highest latency around where this max_latency
   buffer is never written to. It is only used to save the last max latency
   trace.
 
   A while ago a snapshot feature was added to tracefs to allow user space to
   perform the same logic. It could also enable events to trigger a
   "snapshot" if one of their fields hit a new high. This was built on top of
   the latency max_latency buffer logic.
 
   Because snapshots came later, they were dependent on the latency tracers
   to be enabled. In reality, the latency tracers depend on the snapshot code
   and not the other way around. It was just that they came first.
 
   Restructure the code and the kconfigs to have the latency tracers depend
   on snapshot code instead. This actually simplifies the logic a bit and
   allows to disable more when the latency tracers are not defined and the
   snapshot code is.
 
 - Fix a "false sharing" in the hwlat tracer code
 
   The loop to search for latency in hardware was using a variable that could
   be changed by user space for each sample. If the user change this
   variable, it could cause a bus contention, and reading that variable can
   show up as a large latency in the trace causing a false positive. Read
   this variable at the start of the sample with a READ_ONCE() into a local
   variable and keep the code from sharing cache lines with readers.
 
 - Fix function graph tracer static branch optimization code
 
   When only one tracer is defined for function graph tracing, it uses a
   static branch to call that tracer directly. When another tracer is added,
   it goes into loop logic to call all the registered callbacks.
 
   The code was incorrect when going back to one tracer and never re-enabled
   the static branch again to do the optimization code.
 
 - And other small fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaY9P3BQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qou3AQCCrzdrIglLNABGPyny9sqWLDz6vyyw
 nWAK9xg1VFxwRQD+LyJvVMWbpGeRBS/PsAK19RgldbgkQFWNv8gNhRKRgw0=
 =U/kg
 -----END PGP SIGNATURE-----

Merge tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:
 "User visible changes:

   - Add an entry into MAINTAINERS file for RUST versions of code

     There's now RUST code for tracing and static branches. To
     differentiate that code from the C code, add entries in for the
     RUST version (with "[RUST]" around it) so that the right
     maintainers get notified on changes.

   - New bitmask-list option added to tracefs

     When this is set, bitmasks in trace event are not displayed as hex
     numbers, but instead as lists: e.g. 0-5,7,9 instead of 0000015f

   - New show_event_filters file in tracefs

     Instead of having to search all events/*/*/filter for any active
     filters enabled in the trace instance, the file show_event_filters
     will list them so that there's only one file that needs to be
     examined to see if any filters are active.

   - New show_event_triggers file in tracefs

     Instead of having to search all events/*/*/trigger for any active
     triggers enabled in the trace instance, the file
     show_event_triggers will list them so that there's only one file
     that needs to be examined to see if any triggers are active.

   - Have traceoff_on_warning disable trace pintk buffer too

     Recently recording of trace_printk() could go to other trace
     instances instead of the top level instance. But if
     traceoff_on_warning triggers, it doesn't stop the buffer with
     trace_printk() and that data can easily be lost by being
     overwritten. Have traceoff_on_warning also disable the instance
     that has trace_printk() being written to it.

   - Update the hist_debug file to show what function the field uses

     When CONFIG_HIST_TRIGGERS_DEBUG is enabled, a hist_debug file
     exists for every event. This displays the internal data of any
     histogram enabled for that event. But it is lacking the function
     that is called to process one of its fields. This is very useful
     information that was missing when debugging histograms.

   - Up the histogram stack size from 16 to 31

     Stack traces can be used as keys for event histograms. Currently
     the size of the stack that is stored is limited to just 16 entries.
     But the storage space in the histogram is 256 bytes, meaning that
     it can store up to 31 entries (plus one for the count of entries).
     Instead of letting that space go to waste, up the limit from 16 to
     31. This makes the keys much more useful.

   - Fix permissions of per CPU file buffer_size_kb

     The per CPU file of buffer_size_kb was incorrectly set to read only
     in a previous cleanup. It should be writable.

   - Reset "last_boot_info" if the persistent buffer is cleared

     The last_boot_info shows address information of a persistent ring
     buffer if it contains data from a previous boot. It is cleared when
     recording starts again, but it is not cleared when the buffer is
     reset. The data is useless after a reset so clear it on reset too.

  Internal changes:

   - A change was made to allow tracepoint callbacks to have preemption
     enabled, and instead be protected by SRCU. This required some
     updates to the callbacks for perf and BPF.

     perf needed to disable preemption directly in its callback because
     it expects preemption disabled in the later code.

     BPF needed to disable migration, as its code expects to run
     completely on the same CPU.

   - Have irq_work wake up other CPU if current CPU is "isolated"

     When there's a waiter waiting on ring buffer data and a new event
     happens, an irq work is triggered to wake up that waiter. This is
     noisy on isolated CPUs (running NO_HZ_FULL). Trigger an IPI to a
     house keeping CPU instead.

   - Use proper free of trigger_data instead of open coding it in.

   - Remove redundant call of event_trigger_reset_filter()

     It was called immediately in a function that was called right after
     it.

   - Workqueue cleanups

   - Report errors if tracing_update_buffers() were to fail.

   - Make the enum update workqueue generic for other parts of tracing

     On boot up, a work queue is created to convert enum names into
     their numbers in the trace event format files. This work queue can
     also be used for other aspects of tracing that takes some time and
     shouldn't be called by the init call code.

     The blk_trace initialization takes a bit of time. Have the
     initialization code moved to the new tracing generic work queue
     function.

   - Skip kprobe boot event creation call if there's no kprobes defined
     on cmdline

     The kprobe initialization to set up kprobes if they are defined on
     the cmdline requires taking the event_mutex lock. This can be held
     by other tracing code doing initialization for a long time. Since
     kprobes added to the kernel command line need to be setup
     immediately, as they may be tracing early initialization code, they
     cannot be postponed in a work queue and must be setup in the
     initcall code.

     If there's no kprobe on the kernel cmdline, there's no reason to
     take the mutex and slow down the boot up code waiting to get the
     lock only to find out there's nothing to do. Simply exit out early
     if there's no kprobes on the kernel cmdline.

     If there are kprobes on the cmdline, then someone cares more about
     tracing over the speed of boot up.

   - Clean up the trigger code a bit

   - Move code out of trace.c and into their own files

     trace.c is now over 11,000 lines of code and has become more
     difficult to maintain. Start splitting it up so that related code
     is in their own files.

     Move all the trace_printk() related code into trace_printk.c.

     Move the __always_inline stack functions into trace.h.

     Move the pid filtering code into a new trace_pid.c file.

   - Better define the max latency and snapshot code

     The latency tracers have a "max latency" buffer that is a copy of
     the main buffer and gets swapped with it when a new high latency is
     detected. This keeps the trace up to the highest latency around
     where this max_latency buffer is never written to. It is only used
     to save the last max latency trace.

     A while ago a snapshot feature was added to tracefs to allow user
     space to perform the same logic. It could also enable events to
     trigger a "snapshot" if one of their fields hit a new high. This
     was built on top of the latency max_latency buffer logic.

     Because snapshots came later, they were dependent on the latency
     tracers to be enabled. In reality, the latency tracers depend on
     the snapshot code and not the other way around. It was just that
     they came first.

     Restructure the code and the kconfigs to have the latency tracers
     depend on snapshot code instead. This actually simplifies the logic
     a bit and allows to disable more when the latency tracers are not
     defined and the snapshot code is.

   - Fix a "false sharing" in the hwlat tracer code

     The loop to search for latency in hardware was using a variable
     that could be changed by user space for each sample. If the user
     change this variable, it could cause a bus contention, and reading
     that variable can show up as a large latency in the trace causing a
     false positive. Read this variable at the start of the sample with
     a READ_ONCE() into a local variable and keep the code from sharing
     cache lines with readers.

   - Fix function graph tracer static branch optimization code

     When only one tracer is defined for function graph tracing, it uses
     a static branch to call that tracer directly. When another tracer
     is added, it goes into loop logic to call all the registered
     callbacks.

     The code was incorrect when going back to one tracer and never
     re-enabled the static branch again to do the optimization code.

   - And other small fixes and cleanups"

* tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (46 commits)
  function_graph: Restore direct mode when callbacks drop to one
  tracing: Fix indentation of return statement in print_trace_fmt()
  tracing: Reset last_boot_info if ring buffer is reset
  tracing: Fix to set write permission to per-cpu buffer_size_kb
  tracing: Fix false sharing in hwlat get_sample()
  tracing: Move d_max_latency out of CONFIG_FSNOTIFY protection
  tracing: Better separate SNAPSHOT and MAX_TRACE options
  tracing: Add tracer_uses_snapshot() helper to remove #ifdefs
  tracing: Rename trace_array field max_buffer to snapshot_buffer
  tracing: Move pid filtering into trace_pid.c
  tracing: Move trace_printk functions out of trace.c and into trace_printk.c
  tracing: Use system_state in trace_printk_init_buffers()
  tracing: Have trace_printk functions use flags instead of using global_trace
  tracing: Make tracing_update_buffers() take NULL for global_trace
  tracing: Make printk_trace global for tracing system
  tracing: Move ftrace_trace_stack() out of trace.c and into trace.h
  tracing: Move __trace_buffer_{un}lock_*() functions to trace.h
  tracing: Make tracing_selftest_running global to the tracing subsystem
  tracing: Make tracing_disabled global for tracing system
  tracing: Clean up use of trace_create_maxlat_file()
  ...
2026-02-13 19:25:16 -08:00
Shengming Hu
53b2fae90f function_graph: Restore direct mode when callbacks drop to one
When registering a second fgraph callback, direct path is disabled and
array loop is used instead.  When ftrace_graph_active falls back to one,
we try to re-enable direct mode via ftrace_graph_enable_direct(true, ...).
But ftrace_graph_enable_direct() incorrectly disables the static key
rather than enabling it.  This leaves fgraph_do_direct permanently off
after first multi-callback transition, so direct fast mode is never
restored.

Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260213142932519cuWSpEXeS4-UnCvNXnK2P@zte.com.cn
Fixes: cc60ee813b ("function_graph: Use static_call and branch to optimize entry function")
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-13 09:33:14 -05:00
Linus Torvalds
f75c03a761 runtime verifier changes for 7.0:
- Refactor da_monitor to minimise macros
 
   Complete refactor of da_monitor.h to reduce reliance on macros
   generating functions. Use generic static functions and uses the
   preprocessor only when strictly necessary (e.g. for tracepoint
   handlers).
 
   The change essentially relies on functions with generic names (e.g.
   da_handle) instead of monitor-specific as well adding the need to
   define constant (e.g. MONITOR_NAME, MONITOR_TYPE) before including the
   header rather than calling macros that would define functions.
   Also adapt monitors and documentation accordingly.
 
 - Cleanup DA code generation scripts
 
   Clean up functions in dot2c removing reimplementations of trivial
   library functions (__buff_to_string) and removing some other unused
   intermediate steps.
 
 - Annotate functions with types in the rvgen python scripts
 
 - Remove superfluous assignments and cleanup generated code
 
   The rvgen scripts generate a superfluous assignment to 0 for enum
   variables and don't add commas to the last elements, which is against
   the kernel coding standards. Change the generation process for a
   better compliance and slightly simpler logic.
 
 - Remove superfluous declarations from generated code
 
   The monitor container source files contained a declaration and a
   definition for the rv_monitor variable. The former is superfluous and
   was removed.
 
 - Fix reference to outdated documentation
 
   s/da_monitor_synthesis.rst/monitor_synthesis.rst in comment in
   da_monitor.h
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaY3j9xQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qup5AP0afGb2gr+SOIOu0dIAXjGLg3cbz0C4
 J4bDx6H0cdgXywD+Maassm0nKIgDnOBLwtuwL0kEUAxllqn7RCZ2+ARwbAU=
 =dfCl
 -----END PGP SIGNATURE-----

Merge tag 'trace-rv-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verifier updates from Steven Rostedt:

 - Refactor da_monitor to minimize macros

   Complete refactor of da_monitor.h to reduce reliance on macros
   generating functions. Use generic static functions and uses the
   preprocessor only when strictly necessary (e.g. for tracepoint
   handlers).

   The change essentially relies on functions with generic names (e.g.
   da_handle) instead of monitor-specific as well adding the need to
   define constant (e.g. MONITOR_NAME, MONITOR_TYPE) before including
   the header rather than calling macros that would define functions.
   Also adapt monitors and documentation accordingly.

 - Cleanup DA code generation scripts

   Clean up functions in dot2c removing reimplementations of trivial
   library functions (__buff_to_string) and removing some other unused
   intermediate steps.

 - Annotate functions with types in the rvgen python scripts

 - Remove superfluous assignments and cleanup generated code

   The rvgen scripts generate a superfluous assignment to 0 for enum
   variables and don't add commas to the last elements, which is against
   the kernel coding standards. Change the generation process for a
   better compliance and slightly simpler logic.

 - Remove superfluous declarations from generated code

   The monitor container source files contained a declaration and a
   definition for the rv_monitor variable. The former is superfluous and
   was removed.

 - Fix reference to outdated documentation

   s/da_monitor_synthesis.rst/monitor_synthesis.rst in comment in
   da_monitor.h

* tag 'trace-rv-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rv: Fix documentation reference in da_monitor.h
  verification/rvgen: Remove unused variable declaration from containers
  verification/dot2c: Remove superfluous enum assignment and add last comma
  verification/dot2c: Remove __buff_to_string() and cleanup
  verification/rvgen: Annotate DA functions with types
  verification/rvgen: Adapt dot2k and templates after refactoring da_monitor.h
  Documentation/rv: Adapt documentation after da_monitor refactoring
  rv: Cleanup da_monitor after refactor
  rv: Refactor da_monitor to minimise macros
2026-02-12 14:08:49 -08:00
Linus Torvalds
136114e0ab mm.git review status for linus..mm-nonmm-stable
Total patches:       107
 Reviews/patch:       1.07
 Reviewed rate:       67%
 
 - The 2 patch series "ocfs2: give ocfs2 the ability to reclaim
   suballocator free bg" from Heming Zhao saves disk space by teaching
   ocfs2 to reclaim suballocator block group space.
 
 - The 4 patch series "Add ARRAY_END(), and use it to fix off-by-one
   bugs" from Alejandro Colomar adds the ARRAY_END() macro and uses it in
   various places.
 
 - The 2 patch series "vmcoreinfo: support VMCOREINFO_BYTES larger than
   PAGE_SIZE" from Pnina Feder makes the vmcore code future-safe, if
   VMCOREINFO_BYTES ever exceeds the page size.
 
 - The 7 patch series "kallsyms: Prevent invalid access when showing
   module buildid" from Petr Mladek cleans up kallsyms code related to
   module buildid and fixes an invalid access crash when printing
   backtraces.
 
 - The 3 patch series "Address page fault in
   ima_restore_measurement_list()" from Harshit Mogalapalli fixes a
   kexec-related crash that can occur when booting the second-stage kernel
   on x86.
 
 - The 6 patch series "kho: ABI headers and Documentation updates" from
   Mike Rapoport updates the kexec handover ABI documentation.
 
 - The 4 patch series "Align atomic storage" from Finn Thain adds the
   __aligned attribute to atomic_t and atomic64_t definitions to get
   natural alignment of both types on csky, m68k, microblaze, nios2,
   openrisc and sh.
 
 - The 2 patch series "kho: clean up page initialization logic" from
   Pratyush Yadav simplifies the page initialization logic in
   kho_restore_page().
 
 - The 6 patch series "Unload linux/kernel.h" from Yury Norov moves
   several things out of kernel.h and into more appropriate places.
 
 - The 7 patch series "don't abuse task_struct.group_leader" from Oleg
   Nesterov removes the usage of ->group_leader when it is "obviously
   unnecessary".
 
 - The 5 patch series "list private v2 & luo flb" from Pasha Tatashin
   adds some infrastructure improvements to the live update orchestrator.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY4giAAKCRDdBJ7gKXxA
 jgusAQDnKkP8UWTqXPC1jI+OrDJGU5ciAx8lzLeBVqMKzoYk9AD/TlhT2Nlx+Ef6
 0HCUHUD0FMvAw/7/Dfc6ZKxwBEIxyww=
 =mmsH
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves
   disk space by teaching ocfs2 to reclaim suballocator block group
   space (Heming Zhao)

 - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the
   ARRAY_END() macro and uses it in various places (Alejandro Colomar)

 - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes
   the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the
   page size (Pnina Feder)

 - "kallsyms: Prevent invalid access when showing module buildid" cleans
   up kallsyms code related to module buildid and fixes an invalid
   access crash when printing backtraces (Petr Mladek)

 - "Address page fault in ima_restore_measurement_list()" fixes a
   kexec-related crash that can occur when booting the second-stage
   kernel on x86 (Harshit Mogalapalli)

 - "kho: ABI headers and Documentation updates" updates the kexec
   handover ABI documentation (Mike Rapoport)

 - "Align atomic storage" adds the __aligned attribute to atomic_t and
   atomic64_t definitions to get natural alignment of both types on
   csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain)

 - "kho: clean up page initialization logic" simplifies the page
   initialization logic in kho_restore_page() (Pratyush Yadav)

 - "Unload linux/kernel.h" moves several things out of kernel.h and into
   more appropriate places (Yury Norov)

 - "don't abuse task_struct.group_leader" removes the usage of
   ->group_leader when it is "obviously unnecessary" (Oleg Nesterov)

 - "list private v2 & luo flb" adds some infrastructure improvements to
   the live update orchestrator (Pasha Tatashin)

* tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits)
  watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency
  procfs: fix missing RCU protection when reading real_parent in do_task_stat()
  watchdog/softlockup: fix sample ring index wrap in need_counting_irqs()
  kcsan, compiler_types: avoid duplicate type issues in BPF Type Format
  kho: fix doc for kho_restore_pages()
  tests/liveupdate: add in-kernel liveupdate test
  liveupdate: luo_flb: introduce File-Lifecycle-Bound global state
  liveupdate: luo_file: Use private list
  list: add kunit test for private list primitives
  list: add primitives for private list manipulations
  delayacct: fix uapi timespec64 definition
  panic: add panic_force_cpu= parameter to redirect panic to a specific CPU
  netclassid: use thread_group_leader(p) in update_classid_task()
  RDMA/umem: don't abuse current->group_leader
  drm/pan*: don't abuse current->group_leader
  drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
  drm/amdgpu: don't abuse current->group_leader
  android/binder: use same_thread_group(proc->tsk, current) in binder_mmap()
  android/binder: don't abuse current->group_leader
  kho: skip memoryless NUMA nodes when reserving scratch areas
  ...
2026-02-12 12:13:01 -08:00
Haoyang LIU
fa4820b893 tracing: Fix indentation of return statement in print_trace_fmt()
The return statement inside the nested if block in print_trace_fmt()
is not properly indented, making the code structure unclear. This was
flagged by smatch as a warning.

Add proper indentation to the return statement to match the kernel
coding style and improve readability.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260210153903.8041-1-tttturtleruss@gmail.com
Signed-off-by: Haoyang LIU <tttturtleruss@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-11 21:58:21 -05:00
Masami Hiramatsu (Google)
804c4a2209 tracing: Reset last_boot_info if ring buffer is reset
Commit 32dc004252 ("tracing: Reset last-boot buffers when reading
out all cpu buffers") resets the last_boot_info when user read out
all data via trace_pipe* files. But it is not reset when user
resets the buffer from other files. (e.g. write `trace` file)

Reset it when the corresponding ring buffer is reset too.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/177071302364.2293046.17895165659153977720.stgit@mhiramat.tok.corp.google.com
Fixes: 32dc004252 ("tracing: Reset last-boot buffers when reading out all cpu buffers")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-11 10:49:48 -05:00
Masami Hiramatsu (Google)
f844282dee tracing: Fix to set write permission to per-cpu buffer_size_kb
Since the per-cpu buffer_size_kb file is writable for changing
per-cpu ring buffer size, the file should have the write access
permission.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/177071301597.2293046.11683339475076917920.stgit@mhiramat.tok.corp.google.com
Fixes: 21ccc9cd72 ("tracing: Disable "other" permission bits in the tracefs files")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-11 10:48:44 -05:00
Linus Torvalds
f17b474e36 bpf-next-7.0
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmmGmrgACgkQ6rmadz2v
 bTq6NxAAkCHosxzGn9GYYBV8xhrBJoJJDCyEbQ4nR0XNY+zaWnuykmiPP9w1aOAM
 zm/po3mQB2pZjetvlrPrgG5RLgBCAUHzqVGy0r+phUvD3vbohKlmSlMm2kiXOb9N
 T01BgLWsyqN2ZcNFvORdSsftqIJUHcXxU6RdupGD60sO5XM9ty5cwyewLX8GBOas
 UN2bOhbK2DpqYWUvtv+3Q3ykxoStMSkXZvDRurwLKl4RHeLjXZXPo8NjnfBlk/F2
 vdFo/F4NO4TmhOave6UPXvKb4yo9IlBRmiPAl0RmNKBxenY8j9XuV/xZxU6YgzDn
 +SQfDK+CKQ4IYIygE+fqd4e5CaQrnjmPPcIw12AB2CF0LimY9Xxyyk6FSAhMN7wm
 GTVh5K2C3Dk3OiRQk4G58EvQ5QcxzX98IeeCpcckMUkPsFWHRvF402WMUcv9SWpD
 DsxxPkfENY/6N67EvH0qcSe/ikdUorQKFl4QjXKwsMCd5WhToeP4Z7Ck1gVSNkAh
 9CX++mLzg333Lpsc4SSIuk9bEPpFa5cUIKUY7GCsCiuOXciPeMDP3cGSd5LioqxN
 qWljs4Z88QDM2LJpAh8g4m3sA7bMhES3nPmdlI5CfgBcVyLW8D8CqQq4GEZ1McwL
 Ky084+lEosugoVjRejrdMMEOsqAfcbkTr2b8jpuAZdwJKm6p/bw=
 =cBdK
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Support associating BPF program with struct_ops (Amery Hung)

 - Switch BPF local storage to rqspinlock and remove recursion detection
   counters which were causing false positives (Amery Hung)

 - Fix live registers marking for indirect jumps (Anton Protopopov)

 - Introduce execution context detection BPF helpers (Changwoo Min)

 - Improve verifier precision for 32bit sign extension pattern
   (Cupertino Miranda)

 - Optimize BTF type lookup by sorting vmlinux BTF and doing binary
   search (Donglin Peng)

 - Allow states pruning for misc/invalid slots in iterator loops (Eduard
   Zingerman)

 - In preparation for ASAN support in BPF arenas teach libbpf to move
   global BPF variables to the end of the region and enable arena kfuncs
   while holding locks (Emil Tsalapatis)

 - Introduce support for implicit arguments in kfuncs and migrate a
   number of them to new API. This is a prerequisite for cgroup
   sub-schedulers in sched-ext (Ihor Solodrai)

 - Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen)

 - Fix ORC stack unwind from kprobe_multi (Jiri Olsa)

 - Speed up fentry attach by using single ftrace direct ops in BPF
   trampolines (Jiri Olsa)

 - Require frozen map for calculating map hash (KP Singh)

 - Fix lock entry creation in TAS fallback in rqspinlock (Kumar
   Kartikeya Dwivedi)

 - Allow user space to select cpu in lookup/update operations on per-cpu
   array and hash maps (Leon Hwang)

 - Make kfuncs return trusted pointers by default (Matt Bobrowski)

 - Introduce "fsession" support where single BPF program is executed
   upon entry and exit from traced kernel function (Menglong Dong)

 - Allow bpf_timer and bpf_wq use in all programs types (Mykyta
   Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei
   Starovoitov)

 - Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their
   definition across the tree (Puranjay Mohan)

 - Allow BPF arena calls from non-sleepable context (Puranjay Mohan)

 - Improve register id comparison logic in the verifier and extend
   linked registers with negative offsets (Puranjay Mohan)

 - In preparation for BPF-OOM introduce kfuncs to access memcg events
   (Roman Gushchin)

 - Use CFI compatible destructor kfunc type (Sami Tolvanen)

 - Add bitwise tracking for BPF_END in the verifier (Tianci Cao)

 - Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou
   Tang)

 - Make BPF selftests work with 64k page size (Yonghong Song)

* tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits)
  selftests/bpf: Fix outdated test on storage->smap
  selftests/bpf: Choose another percpu variable in bpf for btf_dump test
  selftests/bpf: Remove test_task_storage_map_stress_lookup
  selftests/bpf: Update task_local_storage/task_storage_nodeadlock test
  selftests/bpf: Update task_local_storage/recursion test
  selftests/bpf: Update sk_storage_omem_uncharge test
  bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy}
  bpf: Support lockless unlink when freeing map or local storage
  bpf: Prepare for bpf_selem_unlink_nofail()
  bpf: Remove unused percpu counter from bpf_local_storage_map_free
  bpf: Remove cgroup local storage percpu counter
  bpf: Remove task local storage percpu counter
  bpf: Change local_storage->lock and b->lock to rqspinlock
  bpf: Convert bpf_selem_unlink to failable
  bpf: Convert bpf_selem_link_map to failable
  bpf: Convert bpf_selem_unlink_map to failable
  bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage
  selftests/xsk: fix number of Tx frags in invalid packet
  selftests/xsk: properly handle batch ending in the middle of a packet
  bpf: Prevent reentrance into call_rcu_tasks_trace()
  ...
2026-02-10 11:26:21 -08:00
Colin Lord
f743435f98 tracing: Fix false sharing in hwlat get_sample()
The get_sample() function in the hwlat tracer assumes the caller holds
hwlat_data.lock, but this is not actually happening. The result is
unprotected data access to hwlat_data, and in per-cpu mode can result in
false sharing which may show up as false positive latency events.

The specific case of false sharing observed was primarily between
hwlat_data.sample_width and hwlat_data.count. These are separated by
just 8B and are therefore likely to share a cache line. When one thread
modifies count, the cache line is in a modified state so when other
threads read sample_width in the main latency detection loop, they fetch
the modified cache line. On some systems, the fetch itself may be slow
enough to count as a latency event, which could set up a self
reinforcing cycle of latency events as each event increments count which
then causes more latency events, continuing the cycle.

The other result of the unprotected data access is that hwlat_data.count
can end up with duplicate or missed values, which was observed on some
systems in testing.

Convert hwlat_data.count to atomic64_t so it can be safely modified
without locking, and prevent false sharing by pulling sample_width into
a local variable.

One system this was tested on was a dual socket server with 32 CPUs on
each numa node. With settings of 1us threshold, 1000us width, and
2000us window, this change reduced the number of latency events from
500 per second down to approximately 1 event per minute. Some machines
tested did not exhibit measurable latency from the false sharing.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260210074810.6328-1-clord@mykolab.com
Signed-off-by: Colin Lord <clord@mykolab.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-10 03:36:39 -05:00
Linus Torvalds
0c00ed308d for-7.0/block-20260206
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmGLwcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv+TD/48S2HTnMhmW6AtFYWErQ+sEKXpHrxbYe7S
 +qR8/g/T+QSfhfqPwZEuagndFKtIP3LJfaXGSP1Lk1RfP9NLQy91v33Ibe4DjHkp
 etWSfnMHA9MUAoWKmg8EvncB2G+ZQFiYCpjazj5tKHD9S2+psGMuL8kq6qzMJE83
 uhpb8WutUl4aSIXbMSfyGlwBhI1MjjRbbWlIBmg4yC8BWt1sH8Qn2L2GNVylEIcX
 U8At3KLgPGn0axSg4yGMAwTqtGhL/jwdDyeczbmRlXuAr4iVL9UX/yADCYkazt6U
 ttQ2/H+cxCwfES84COx9EteAatlbZxo6wjGvZ3xOMiMJVTjYe1x6Gkcckq+LrZX6
 tjofi2KK78qkrMXk1mZMkZjpyUWgRtCswhDllbQyqFs0SwzQtno2//Rk8HU9dhbt
 pkpryDbGFki9X3upcNyEYp5TYflpW6YhAzShYgmE6KXim2fV8SeFLviy0erKOAl+
 fwjTE6KQ5QoQv0s3WxkWa4lREm34O6IHrCUmbiPm5CruJnQDhqAN2QZIDgYC4WAf
 0gu9cR/O4Vxu7TQXrumPs5q+gCyDU0u0B8C3mG2s+rIo+PI5cVZKs2OIZ8HiPo0F
 x73kR/pX3DMe35ZQkQX22ymMuowV+aQouDLY9DTwakP5acdcg7h7GZKABk6VLB06
 gUIsnxURiQ==
 =jNzW
 -----END PGP SIGNATURE-----

Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:

 - Support for batch request processing for ublk, improving the
   efficiency of the kernel/ublk server communication. This can yield
   nice 7-12% performance improvements

 - Support for integrity data for ublk

 - Various other ublk improvements and additions, including a ton of
   selftests additions and updated

 - Move the handling of blk-crypto software fallback from below the
   block layer to above it. This reduces the complexity of dealing with
   bio splitting

 - Series fixing a number of potential deadlocks in blk-mq related to
   the queue usage counter and writeback throttling and rq-qos debugfs
   handling

 - Add an async_depth queue attribute, to resolve a performance
   regression that's been around for a qhilw related to the scheduler
   depth handling

 - Only use task_work for IOPOLL completions on NVMe, if it is necessary
   to do so. An earlier fix for an issue resulted in all these
   completions being punted to task_work, to guarantee that completions
   were only run for a given io_uring ring when it was local to that
   ring. With the new changes, we can detect if it's necessary to use
   task_work or not, and avoid it if possible.

 - rnbd fixes:
      - Fix refcount underflow in device unmap path
      - Handle PREFLUSH and NOUNMAP flags properly in protocol
      - Fix server-side bi_size for special IOs
      - Zero response buffer before use
      - Fix trace format for flags
      - Add .release to rnbd_dev_ktype

 - MD pull requests via Yu Kuai
      - Fix raid5_run() to return error when log_init() fails
      - Fix IO hang with degraded array with llbitmap
      - Fix percpu_ref not resurrected on suspend timeout in llbitmap
      - Fix GPF in write_page caused by resize race
      - Fix NULL pointer dereference in process_metadata_update
      - Fix hang when stopping arrays with metadata through dm-raid
      - Fix any_working flag handling in raid10_sync_request
      - Refactor sync/recovery code path, improve error handling for
        badblocks, and remove unused recovery_disabled field
      - Consolidate mddev boolean fields into mddev_flags
      - Use mempool to allocate stripe_request_ctx and make sure
        max_sectors is not less than io_opt in raid5
      - Fix return value of mddev_trylock
      - Fix memory leak in raid1_run()
      - Add Li Nan as mdraid reviewer

 - Move phys_vec definitions to the kernel types, mostly in preparation
   for some VFIO and RDMA changes

 - Improve the speed for secure erase for some devices

 - Various little rust updates

 - Various other minor fixes, improvements, and cleanups

* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
  blk-mq: ABI/sysfs-block: fix docs build warnings
  selftests: ublk: organize test directories by test ID
  block: decouple secure erase size limit from discard size limit
  block: remove redundant kill_bdev() call in set_blocksize()
  blk-mq: add documentation for new queue attribute async_dpeth
  block, bfq: convert to use request_queue->async_depth
  mq-deadline: covert to use request_queue->async_depth
  kyber: covert to use request_queue->async_depth
  blk-mq: add a new queue sysfs attribute async_depth
  blk-mq: factor out a helper blk_mq_limit_depth()
  blk-mq-sched: unify elevators checking for async requests
  block: convert nr_requests to unsigned int
  block: don't use strcpy to copy blockdev name
  blk-mq-debugfs: warn about possible deadlock
  blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
  blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
  blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
  blk-rq-qos: fix possible debugfs_mutex deadlock
  blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
  blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
  ...
2026-02-09 17:57:21 -08:00
Steven Rostedt
b4bade506b tracing: Move d_max_latency out of CONFIG_FSNOTIFY protection
The tracing_max_latency shouldn't be limited if CONFIG_FSNOTIFY is defined
or not and it was moved out of that protection to be always available with
CONFIG_TRACER_MAX_TRACE. All was moved out except the dentry descriptor
for it (d_max_latency) and it failed to build on some configs.

Move that out of the CONFIG_FSNOTIFY protection too.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260209194631.788bfc85@fedora
Fixes: ba73713da5 ("tracing: Clean up use of trace_create_maxlat_file()")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602092133.fTdojd95-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-09 19:52:37 -05:00
Steven Rostedt
c4f1fe47b1 tracing: Better separate SNAPSHOT and MAX_TRACE options
The latency tracers (scheduler, irqsoff, etc) were created when tracing
was first added. These tracers required a "snapshot" buffer that was the
same size as the ring buffer being written to. When a new max latency was
hit, the main ring buffer would swap with the snapshot buffer so that the
trace leading up to the latency would be saved in the snapshot buffer (The
snapshot buffer is never written to directly and the data within it can be
viewed without fear of being overwritten).

Later, a new feature was added to allow snapshots to be taken by user
space or even event triggers. This created a "snapshot" file that allowed
users to trigger a snapshot from user space to save the current trace.

The config for this new feature (CONFIG_TRACER_SNAPSHOT) would select the
latency tracer config (CONFIG_TRACER_MAX_LATENCY) as it would need all the
functionality from it as it already existed. But this was incorrect. As
the snapshot feature is really what the latency tracers need and not the
other way around.

Have CONFIG_TRACER_MAX_TRACE select CONFIG_TRACER_SNAPSHOT where the
tracers that needs the max latency buffer selects the TRACE_MAX_TRACE
which will then select TRACER_SNAPSHOT.

Also, go through trace.c and trace.h and make the code that only needs the
TRACER_MAX_TRACE protected by that and the code that always requires the
snapshot to be protected by TRACER_SNAPSHOT.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208183856.767870992@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:13 -05:00
Steven Rostedt
e4c1a09afb tracing: Add tracer_uses_snapshot() helper to remove #ifdefs
Instead of having #ifdef CONFIG_TRACER_MAX_TRACE around every access to
the struct tracer's use_max_tr field, add a helper function for that
access and if CONFIG_TRACER_MAX_TRACE is not configured it just returns
false.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208183856.599390238@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:13 -05:00
Steven Rostedt
694b3f6fe0 tracing: Rename trace_array field max_buffer to snapshot_buffer
When tracing was first added, there were latency tracers that would take a
snapshot of the current trace when a new max latency was hit. This
snapshot buffer was called "max_buffer". Since then, a snapshot feature
was added that allowed user space or event triggers to trigger a snapshot
of the current buffer using the same max_buffer of the trace_array.

As this snapshot buffer now has a more generic use case, calling it
"max_buffer" is confusing. Rename it to snapshot_buffer.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208183856.428446729@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:13 -05:00
Steven Rostedt
98021e37d6 tracing: Move pid filtering into trace_pid.c
The trace.c file was a dumping ground for most tracing code. Start
organizing it better by moving various functions out into their own files.
Move the PID filtering functions from trace.c into its own trace_pid.c
file.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.998330662@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:13 -05:00
Steven Rostedt
27931ee8f4 tracing: Move trace_printk functions out of trace.c and into trace_printk.c
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.

Move the functions associated to the trace_printk operations out of trace.c and
into trace_printk.c.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.828744197@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:12 -05:00
Steven Rostedt
af1eea12ad tracing: Use system_state in trace_printk_init_buffers()
The function trace_printk_init_buffers() is used to expand tha
trace_printk buffers when trace_printk() is used within the kernel or in
modules. On kernel boot up, it holds off from starting the sched switch
cmdline recorder, but will start it immediately when it is added by a
module.

Currently it uses a trick to see if the global_trace buffer has been
allocated or not to know if it was called by module load or not. But this
is more of a hack, and can not be used when this code is moved out of
trace.c. Instead simply look at the system_state and if it is running then
it is know that it could only be called by module load.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.660237094@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:12 -05:00
Steven Rostedt
f377912b3d tracing: Have trace_printk functions use flags instead of using global_trace
The trace.c file has become a dumping ground for all tracing code and has
become quite large. In order to move the trace_printk functions out of it
these functions can not access global_trace directly, as that is something
that needs to stay static in trace.c.

Instead of testing the trace_array tr pointer to &global_trace, test the
tr->flags to see if TRACE_ARRAY_FL_GLOBAL set.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.491116245@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:12 -05:00
Steven Rostedt
93c88d06ac tracing: Make tracing_update_buffers() take NULL for global_trace
The trace.c file has become a dumping ground for all tracing code and has
become quite large. In order to move the trace_printk functions out of it
these functions can not access global_trace directly, as that is something
that needs to stay static in trace.c.

Have tracing_update_buffers() take NULL for its trace_array to denote it
should work on the global_trace top level trace_array allows that function
to be used outside of trace.c and still update the global_trace
trace_array.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.318864210@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:12 -05:00
Steven Rostedt
1c53d781d4 tracing: Make printk_trace global for tracing system
The printk_trace is used to determine which trace_array trace_printk()
writes to. By making it a global variable among the tracing subsystem it
will allow the trace_printk functions to be moved out of trace.c and still
have direct access to that variable.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.144525891@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:12 -05:00
Steven Rostedt
3e6c8f80e5 tracing: Move ftrace_trace_stack() out of trace.c and into trace.h
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.

Make ftrace_trace_stack() into a static inline that tests if stack tracing
is enabled and if so to call __ftrace_trace_stack() to do the stack trace.
This keeps the test inlined in the fast paths and only does the function
call if stack tracing is enabled.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.974218132@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:12 -05:00
Steven Rostedt
0e730bc067 tracing: Move __trace_buffer_{un}lock_*() functions to trace.h
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.

Move the __always_inline functions __trace_buffer_lock_reserve(),
__trace_buffer_unlock_commit() and trace_event_setup() into trace.h.

The trace.c file will be split up and these functions will be used in more
than one of these files. As they are already __always_inline they can
easily be moved into the trace.h header file.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.813550600@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:11 -05:00
Steven Rostedt
a4f77ffc8e tracing: Make tracing_selftest_running global to the tracing subsystem
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.

Make the variable tracing_selftest_running global so that it can be used
by other files in the tracing subsystem and trace.c can be split up.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.648932796@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:11 -05:00
Steven Rostedt
64dee86ad7 tracing: Make tracing_disabled global for tracing system
The tracing_disabled variable is set to one on boot up to prevent some
parts of tracing to access the tracing infrastructure before it is set up.
It also can be set after boot if an anomaly is discovered.

It is currently a static variable in trace.c and can be accessed via a
function call trace_is_disabled(). There's really no reason to use a
function call as the tracing subsystem should be able to access it
directly.

By making the variable accessed directly, code can be moved out of trace.c
without adding overhead of a function call to see if tracing is disabled
or not.

Make tracing_disabled global and remove the tracing_is_disabled() helper
function. Also add some "unlikely()"s around tracing_disabled where it's
checked in hot paths.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.483690153@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:11 -05:00
Steven Rostedt
ba73713da5 tracing: Clean up use of trace_create_maxlat_file()
In trace.c, the function trace_create_maxlat_file() is defined behind the
 #ifdef CONFIG_TRACER_MAX_TRACE block. The #else part defines it as:

 #define trace_create_maxlat_file(tr, d_tracer)				\
	trace_create_file("tracing_max_latency", TRACE_MODE_WRITE,	\
			  d_tracer, tr, &tracing_max_lat_fops)

But the one place that it it used has:

 #ifdef CONFIG_TRACER_MAX_TRACE
	trace_create_maxlat_file(tr, d_tracer);
 #endif

Which is pointless and also wrong!

It only gets created when both CONFIG_TRACE_MAX_TRACE and CONFIG_FS_NOTIFY
is defined, but the file itself should not be dependent on
CONFIG_FS_NOTIFY. Always create that file when TRACE_MAX_TRACE is defined
regardless if FS_NOTIFY is or is not.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260207191101.0e014abd@robin
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:11 -05:00
Steven Rostedt
326669faf3 tracing: Move tracing_set_filter_buffering() into trace_events_hist.c
The function tracing_set_filter_buffering() is only used in
trace_events_hist.c. Move it to that file and make it static.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260206195936.617080218@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:11 -05:00
Steven Rostedt
c8b039c3e3 tracing: Have all triggers expect a file parameter
When the triggers were first created, they may not have had a file
parameter passed to them and things needed to be done generically.

But today, all triggers have a file parameter passed to them. Remove the
generic code and add a "if (WARN_ON_ONCE(!file))" to each trigger.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260206101351.609d8906@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:00:57 -05:00
Yaxiong Tian
2cdfe39dc9 tracing/kprobes: Skip setup_boot_kprobe_events() when no cmdline event
When the 'kprobe_event=' kernel command-line parameter is not provided,
there is no need to execute setup_boot_kprobe_events().

This change optimizes the initialization function init_kprobe_trace()
by skipping unnecessary work and effectively prevents potential blocking
that could arise from contention on the event_mutex lock in subsequent
operations.

Link: https://patch.msgid.link/20260204015401.163748-1-tianyaxiong@kylinos.cn
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-06 15:27:00 -05:00
Yaxiong Tian
0c2580a809 blktrace: Make init_blk_tracer() asynchronous
The init_blk_tracer() function causes significant boot delay as it
waits for the trace_event_sem lock held by trace_event_update_all().
Specifically, its child function register_trace_event() requires
this lock, which is occupied for an extended period during boot.

To resolve this, the execution of primary init_blk_tracer() is moved
to the trace_init_wq workqueue, allowing it to run asynchronously,
and prevent blocking the main boot thread.

Link: https://patch.msgid.link/20260204015353.163331-1-tianyaxiong@kylinos.cn
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-06 15:27:00 -05:00
Yaxiong Tian
1c48f7ab72 tracing: Rename eval_map_wq and allow other parts of tracing use it
The eval_map_work_func() function, though queued in eval_map_wq,
holds the trace_event_sem read-write lock for a long time during
kernel boot. This causes blocking issues for other functions.

Rename eval_map_wq to trace_init_wq and make it global, thereby
allowing other parts of tracing to schedule work on this queue
asynchronously and avoiding blockage of the main boot thread.

Link: https://patch.msgid.link/20260204015344.162818-1-tianyaxiong@kylinos.cn
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-06 15:26:59 -05:00
Steven Rostedt
033c55fe2e tracing: Fix ftrace event field alignments
The fields of ftrace specific events (events used to save ftrace internal
events like function traces and trace_printk) are generated similarly to
how normal trace event fields are generated. That is, the fields are added
to a trace_events_fields array that saves the name, offset, size,
alignment and signness of the field. It is used to produce the output in
the format file in tracefs so that tooling knows how to parse the binary
data of the trace events.

The issue is that some of the ftrace event structures are packed. The
function graph exit event structures are one of them. The 64 bit calltime
and rettime fields end up 4 byte aligned, but the algorithm to show to
userspace shows them as 8 byte aligned.

The macros that create the ftrace events has one for embedded structure
fields. There's two macros for theses fields:

  __field_desc() and __field_packed()

The difference of the latter macro is that it treats the field as packed.

Rename that field to __field_desc_packed() and create replace the
__field_packed() to be a normal field that is packed and have the calltime
and rettime use those.

This showed up on 32bit architectures for function graph time fields. It
had:

 ~# cat /sys/kernel/tracing/events/ftrace/funcgraph_exit/format
[..]
        field:unsigned long func;       offset:8;       size:4; signed:0;
        field:unsigned int depth;       offset:12;      size:4; signed:0;
        field:unsigned int overrun;     offset:16;      size:4; signed:0;
        field:unsigned long long calltime;      offset:24;      size:8; signed:0;
        field:unsigned long long rettime;       offset:32;      size:8; signed:0;

Notice that overrun is at offset 16 with size 4, where in the structure
calltime is at offset 20 (16 + 4), but it shows the offset at 24. That's
because it used the alignment of unsigned long long when used as a
declaration and not as a member of a structure where it would be aligned
by word size (in this case 4).

By using the proper structure alignment, the format has it at the correct
offset:

 ~# cat /sys/kernel/tracing/events/ftrace/funcgraph_exit/format
[..]
        field:unsigned long func;       offset:8;       size:4; signed:0;
        field:unsigned int depth;       offset:12;      size:4; signed:0;
        field:unsigned int overrun;     offset:16;      size:4; signed:0;
        field:unsigned long long calltime;      offset:20;      size:8; signed:0;
        field:unsigned long long rettime;       offset:28;      size:8; signed:0;

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: "jempty.liang" <imntjempty@163.com>
Link: https://patch.msgid.link/20260204113628.53faec78@gandalf.local.home
Fixes: 04ae87a520 ("ftrace: Rework event_create_dir()")
Closes: https://lore.kernel.org/all/20260130015740.212343-1-imntjempty@163.com/
Closes: https://lore.kernel.org/all/20260202123342.2544795-1-imntjempty@163.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-05 09:47:11 -05:00