
Merge tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix thresh_return of function graph tracer

   The update to store data on the shadow stack removed the abuse of
   using the task recursion word as a way to keep track of which
   functions to ignore. The trace_graph_return() function was updated
   to handle this, but when the function_graph tracer uses a threshold
   (only trace functions that took longer than a specified time), it
   uses trace_graph_thresh_return() instead.

   This function was still incorrectly using the task struct recursion
   word, causing the function graph tracer to permanently mark all
   functions as "notrace".

 - Fix thresh_return nosleep accounting

   When the calltime was moved to the shadow stack storage instead of
   being on the fgraph descriptor, the calculation of the amount of
   sleep time was updated. The calculation was done in the
   trace_graph_thresh_return() function, which also called
   trace_graph_return(), which did the calculation again, causing the
   time to be doubled.

   Remove the call to trace_graph_return(), as the work it performed
   was small, and do it directly in trace_graph_thresh_return().

 - Fix syscall trace event activation on boot up

   The syscall trace events are pseudo events attached to the
   raw_syscall tracepoints. When the first syscall event is enabled, it
   enables the raw_syscall tracepoint and doesn't need to do anything
   when a second syscall event is also enabled.

   When events are enabled via the kernel command line, syscall events
   are partially enabled, as the enabling is called before rcu_init.
   This is done to allow early events to be enabled immediately.
   Because kernel command line events do not distinguish between
   different types of events, the syscall events are enabled here but
   are not fully functioning. After rcu_init, they are disabled and
   re-enabled so that they can be fully enabled.

   The problem is that this "disable-enable" is done one event at a
   time. If more than one syscall event is specified on the command
   line, disabling them one at a time means the counter never gets to
   zero, so the raw_syscall tracepoint is never disabled and
   re-enabled, keeping the syscall events in their non-fully-functional
   state.

   Instead, disable all events and then re-enable them all, as that
   ensures the raw_syscall tracepoint is also disabled and re-enabled.

 - Disable preemption in ftrace pid filtering

   The ftrace pid filtering attaches to the fork and exit tracepoints
   to add or remove pids that should be traced. The callbacks access
   variables protected by RCU-sched, which requires preemption to be
   disabled. Now that tracepoint callbacks are called with preemption
   enabled, this protection needs to be added explicitly, and not
   depend on the callbacks being called with preemption disabled.

 - Disable preemption in event pid filtering

   The event pid filtering needs the same preemption disabling guards as
   ftrace pid filtering.

 - Fix accounting of the memory mapped ring buffer on fork

   Memory mapping the ftrace ring buffer sets the vm_flags to
   VM_DONTCOPY. But this does not prevent the application from calling
   madvise(MADV_DOFORK). This causes the mapping to be copied on fork.
   After the first task exits, the mapping is considered unmapped by
   everyone. But when the second task exits, the counter goes below
   zero and triggers a WARN_ON.

   Since nothing prevents two separate tasks from mmapping the ftrace
   ring buffer (although two mappings may mess each other up), there's
   no reason to stop the memory from being copied on fork.

   Update the vm_operations to have an ".open" handler to update the
   accounting and let the ring buffer know someone else has it mapped.

 - Add all ftrace headers in MAINTAINERS file

   The MAINTAINERS file only specifies include/linux/ftrace.h, but
   misses ftrace_irq.h and ftrace_regs.h. Make the file use wildcards
   to pick up all *ftrace* files.

* tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Add MAINTAINERS entries for all ftrace headers
  tracing: Fix WARN_ON in tracing_buffers_mmap_close
  tracing: Disable preemption in the tracepoint callbacks handling filtered pids
  ftrace: Disable preemption in the tracepoint callbacks handling filtered pids
  tracing: Fix syscall events activation by ensuring refcount hits zero
  fgraph: Fix thresh_return nosleeptime double-adjust
  fgraph: Fix thresh_return clear per-task notrace
Commit 18ecff396c by Linus Torvalds, 2026-03-05 08:05:05 -08:00
7 changed files with 90 additions and 22 deletions


@@ -10484,7 +10484,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
F: Documentation/trace/ftrace*
F: arch/*/*/*/*ftrace*
F: arch/*/*/*ftrace*
F: include/*/ftrace.h
F: include/*/*ftrace*
F: kernel/trace/fgraph.c
F: kernel/trace/ftrace*
F: samples/ftrace


@@ -248,6 +248,7 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
int ring_buffer_map(struct trace_buffer *buffer, int cpu,
struct vm_area_struct *vma);
void ring_buffer_map_dup(struct trace_buffer *buffer, int cpu);
int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
#endif /* _LINUX_RING_BUFFER_H */


@@ -8611,6 +8611,7 @@ ftrace_pid_follow_sched_process_fork(void *data,
struct trace_pid_list *pid_list;
struct trace_array *tr = data;
guard(preempt)();
pid_list = rcu_dereference_sched(tr->function_pids);
trace_filter_add_remove_task(pid_list, self, task);
@@ -8624,6 +8625,7 @@ ftrace_pid_follow_sched_process_exit(void *data, struct task_struct *task)
struct trace_pid_list *pid_list;
struct trace_array *tr = data;
guard(preempt)();
pid_list = rcu_dereference_sched(tr->function_pids);
trace_filter_add_remove_task(pid_list, NULL, task);


@@ -7310,6 +7310,27 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
return err;
}
/*
* This is called when a VMA is duplicated (e.g., on fork()) to increment
* the user_mapped counter without remapping pages.
*/
void ring_buffer_map_dup(struct trace_buffer *buffer, int cpu)
{
struct ring_buffer_per_cpu *cpu_buffer;
if (WARN_ON(!cpumask_test_cpu(cpu, buffer->cpumask)))
return;
cpu_buffer = buffer->buffers[cpu];
guard(mutex)(&cpu_buffer->mapping_lock);
if (cpu_buffer->user_mapped)
__rb_inc_dec_mapped(cpu_buffer, true);
else
WARN(1, "Unexpected buffer stat, it should be mapped");
}
int ring_buffer_unmap(struct trace_buffer *buffer, int cpu)
{
struct ring_buffer_per_cpu *cpu_buffer;


@@ -8213,6 +8213,18 @@ static inline int get_snapshot_map(struct trace_array *tr) { return 0; }
static inline void put_snapshot_map(struct trace_array *tr) { }
#endif
/*
* This is called when a VMA is duplicated (e.g., on fork()) to increment
* the user_mapped counter without remapping pages.
*/
static void tracing_buffers_mmap_open(struct vm_area_struct *vma)
{
struct ftrace_buffer_info *info = vma->vm_file->private_data;
struct trace_iterator *iter = &info->iter;
ring_buffer_map_dup(iter->array_buffer->buffer, iter->cpu_file);
}
static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
{
struct ftrace_buffer_info *info = vma->vm_file->private_data;
@@ -8232,6 +8244,7 @@ static int tracing_buffers_may_split(struct vm_area_struct *vma, unsigned long a
}
static const struct vm_operations_struct tracing_buffers_vmops = {
.open = tracing_buffers_mmap_open,
.close = tracing_buffers_mmap_close,
.may_split = tracing_buffers_may_split,
};


@@ -1039,6 +1039,7 @@ event_filter_pid_sched_process_exit(void *data, struct task_struct *task)
struct trace_pid_list *pid_list;
struct trace_array *tr = data;
guard(preempt)();
pid_list = rcu_dereference_raw(tr->filtered_pids);
trace_filter_add_remove_task(pid_list, NULL, task);
@@ -1054,6 +1055,7 @@ event_filter_pid_sched_process_fork(void *data,
struct trace_pid_list *pid_list;
struct trace_array *tr = data;
guard(preempt)();
pid_list = rcu_dereference_sched(tr->filtered_pids);
trace_filter_add_remove_task(pid_list, self, task);
@@ -4668,26 +4670,22 @@ static __init int event_trace_memsetup(void)
return 0;
}
__init void
early_enable_events(struct trace_array *tr, char *buf, bool disable_first)
/*
* Helper function to enable or disable a comma-separated list of events
* from the bootup buffer.
*/
static __init void __early_set_events(struct trace_array *tr, char *buf, bool enable)
{
char *token;
int ret;
while (true) {
token = strsep(&buf, ",");
if (!token)
break;
while ((token = strsep(&buf, ","))) {
if (*token) {
/* Restarting syscalls requires that we stop them first */
if (disable_first)
if (enable) {
if (ftrace_set_clr_event(tr, token, 1))
pr_warn("Failed to enable trace event: %s\n", token);
} else {
ftrace_set_clr_event(tr, token, 0);
ret = ftrace_set_clr_event(tr, token, 1);
if (ret)
pr_warn("Failed to enable trace event: %s\n", token);
}
}
/* Put back the comma to allow this to be called again */
@@ -4696,6 +4694,32 @@ early_enable_events(struct trace_array *tr, char *buf, bool disable_first)
}
}
/**
* early_enable_events - enable events from the bootup buffer
* @tr: The trace array to enable the events in
* @buf: The buffer containing the comma separated list of events
* @disable_first: If true, disable all events in @buf before enabling them
*
* This function enables events from the bootup buffer. If @disable_first
* is true, it will first disable all events in the buffer before enabling
* them.
*
* For syscall events, which rely on a global refcount to register the
* SYSCALL_WORK_SYSCALL_TRACEPOINT flag (especially for pid 1), we must
* ensure the refcount hits zero before re-enabling them. A simple
* "disable then enable" per-event is not enough if multiple syscalls are
* used, as the refcount will stay above zero. Thus, we need a two-phase
* approach: disable all, then enable all.
*/
__init void
early_enable_events(struct trace_array *tr, char *buf, bool disable_first)
{
if (disable_first)
__early_set_events(tr, buf, false);
__early_set_events(tr, buf, true);
}
static __init int event_trace_enable(void)
{
struct trace_array *tr = top_trace_array();


@@ -400,14 +400,19 @@ static void trace_graph_thresh_return(struct ftrace_graph_ret *trace,
struct fgraph_ops *gops,
struct ftrace_regs *fregs)
{
unsigned long *task_var = fgraph_get_task_var(gops);
struct fgraph_times *ftimes;
struct trace_array *tr;
unsigned int trace_ctx;
u64 calltime, rettime;
int size;
rettime = trace_clock_local();
ftrace_graph_addr_finish(gops, trace);
if (trace_recursion_test(TRACE_GRAPH_NOTRACE_BIT)) {
trace_recursion_clear(TRACE_GRAPH_NOTRACE_BIT);
if (*task_var & TRACE_GRAPH_NOTRACE) {
*task_var &= ~TRACE_GRAPH_NOTRACE;
return;
}
@@ -418,11 +423,13 @@ static void trace_graph_thresh_return(struct ftrace_graph_ret *trace,
tr = gops->private;
handle_nosleeptime(tr, trace, ftimes, size);
if (tracing_thresh &&
(trace_clock_local() - ftimes->calltime < tracing_thresh))
calltime = ftimes->calltime;
if (tracing_thresh && (rettime - calltime < tracing_thresh))
return;
else
trace_graph_return(trace, gops, fregs);
trace_ctx = tracing_gen_ctx();
__trace_graph_return(tr, trace, trace_ctx, calltime, rettime);
}
static struct fgraph_ops funcgraph_ops = {