Commit graph

1542 commits

Author SHA1 Message Date
Calvin Owens
d008ba8be8 tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2G
Some of the sizing logic through tracer_alloc_buffers() uses int
internally, causing unexpected behavior if the user passes a value that
does not fit in an int (on my x86 machine, the result is uselessly tiny
buffers).

Fix by plumbing the parameter's real type (unsigned long) through to the
ring buffer allocation functions, which already use unsigned long.

It has always been possible to create larger ring buffers via the sysfs
interface: this only affects the cmdline parameter.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/bff42a4288aada08bdf74da3f5b67a2c28b761f8.1772852067.git.calvin@wbinvd.org
Fixes: 73c5162aa3 ("tracing: keep ring buffer to minimum size till used")
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-06 22:25:53 -05:00
Qing Wang
e39bb9e02b tracing: Fix WARN_ON in tracing_buffers_mmap_close
When a process forks, the child process copies the parent's VMAs but the
user_mapped reference count is not incremented. As a result, when both the
parent and child processes exit, tracing_buffers_mmap_close() is called
twice. On the second call, user_mapped is already 0, causing the function to
return -ENODEV and triggering a WARN_ON.

Normally, this isn't an issue as the memory is mapped with VM_DONTCOPY set.
But this is only a hint, and the application can call
madvise(MADVISE_DOFORK) which resets the VM_DONTCOPY flag. When the
application does that, it can trigger this issue on fork.

Fix it by incrementing the user_mapped reference count without re-mapping
the pages in the VMA's open callback.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://patch.msgid.link/20260227025842.1085206-1-wangqing7171@gmail.com
Fixes: cf9f0f7c4c ("tracing: Allow user-space mapping of the ring-buffer")
Reported-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=3b5dd2030fe08afdf65d
Tested-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com
Signed-off-by: Qing Wang <wangqing7171@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-03 22:25:32 -05:00
Linus Torvalds
32a92f8c89 Convert more 'alloc_obj' cases to default GFP_KERNEL arguments
This converts some of the visually simpler cases that have been split
over multiple lines.  I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.

Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script.  I probably had made it a bit _too_ trivial.

So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.

The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 20:03:00 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Haoyang LIU
fa4820b893 tracing: Fix indentation of return statement in print_trace_fmt()
The return statement inside the nested if block in print_trace_fmt()
is not properly indented, making the code structure unclear. This was
flagged by smatch as a warning.

Add proper indentation to the return statement to match the kernel
coding style and improve readability.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260210153903.8041-1-tttturtleruss@gmail.com
Signed-off-by: Haoyang LIU <tttturtleruss@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-11 21:58:21 -05:00
Masami Hiramatsu (Google)
804c4a2209 tracing: Reset last_boot_info if ring buffer is reset
Commit 32dc004252 ("tracing: Reset last-boot buffers when reading
out all cpu buffers") resets the last_boot_info when user read out
all data via trace_pipe* files. But it is not reset when user
resets the buffer from other files. (e.g. write `trace` file)

Reset it when the corresponding ring buffer is reset too.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/177071302364.2293046.17895165659153977720.stgit@mhiramat.tok.corp.google.com
Fixes: 32dc004252 ("tracing: Reset last-boot buffers when reading out all cpu buffers")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-11 10:49:48 -05:00
Masami Hiramatsu (Google)
f844282dee tracing: Fix to set write permission to per-cpu buffer_size_kb
Since the per-cpu buffer_size_kb file is writable for changing
per-cpu ring buffer size, the file should have the write access
permission.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/177071301597.2293046.11683339475076917920.stgit@mhiramat.tok.corp.google.com
Fixes: 21ccc9cd72 ("tracing: Disable "other" permission bits in the tracefs files")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-11 10:48:44 -05:00
Steven Rostedt
c4f1fe47b1 tracing: Better separate SNAPSHOT and MAX_TRACE options
The latency tracers (scheduler, irqsoff, etc) were created when tracing
was first added. These tracers required a "snapshot" buffer that was the
same size as the ring buffer being written to. When a new max latency was
hit, the main ring buffer would swap with the snapshot buffer so that the
trace leading up to the latency would be saved in the snapshot buffer (The
snapshot buffer is never written to directly and the data within it can be
viewed without fear of being overwritten).

Later, a new feature was added to allow snapshots to be taken by user
space or even event triggers. This created a "snapshot" file that allowed
users to trigger a snapshot from user space to save the current trace.

The config for this new feature (CONFIG_TRACER_SNAPSHOT) would select the
latency tracer config (CONFIG_TRACER_MAX_LATENCY) as it would need all the
functionality from it as it already existed. But this was incorrect. As
the snapshot feature is really what the latency tracers need and not the
other way around.

Have CONFIG_TRACER_MAX_TRACE select CONFIG_TRACER_SNAPSHOT where the
tracers that needs the max latency buffer selects the TRACE_MAX_TRACE
which will then select TRACER_SNAPSHOT.

Also, go through trace.c and trace.h and make the code that only needs the
TRACER_MAX_TRACE protected by that and the code that always requires the
snapshot to be protected by TRACER_SNAPSHOT.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208183856.767870992@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:13 -05:00
Steven Rostedt
e4c1a09afb tracing: Add tracer_uses_snapshot() helper to remove #ifdefs
Instead of having #ifdef CONFIG_TRACER_MAX_TRACE around every access to
the struct tracer's use_max_tr field, add a helper function for that
access and if CONFIG_TRACER_MAX_TRACE is not configured it just returns
false.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208183856.599390238@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:13 -05:00
Steven Rostedt
694b3f6fe0 tracing: Rename trace_array field max_buffer to snapshot_buffer
When tracing was first added, there were latency tracers that would take a
snapshot of the current trace when a new max latency was hit. This
snapshot buffer was called "max_buffer". Since then, a snapshot feature
was added that allowed user space or event triggers to trigger a snapshot
of the current buffer using the same max_buffer of the trace_array.

As this snapshot buffer now has a more generic use case, calling it
"max_buffer" is confusing. Rename it to snapshot_buffer.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208183856.428446729@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:13 -05:00
Steven Rostedt
98021e37d6 tracing: Move pid filtering into trace_pid.c
The trace.c file was a dumping ground for most tracing code. Start
organizing it better by moving various functions out into their own files.
Move the PID filtering functions from trace.c into its own trace_pid.c
file.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.998330662@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:13 -05:00
Steven Rostedt
27931ee8f4 tracing: Move trace_printk functions out of trace.c and into trace_printk.c
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.

Move the functions associated to the trace_printk operations out of trace.c and
into trace_printk.c.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.828744197@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:12 -05:00
Steven Rostedt
af1eea12ad tracing: Use system_state in trace_printk_init_buffers()
The function trace_printk_init_buffers() is used to expand tha
trace_printk buffers when trace_printk() is used within the kernel or in
modules. On kernel boot up, it holds off from starting the sched switch
cmdline recorder, but will start it immediately when it is added by a
module.

Currently it uses a trick to see if the global_trace buffer has been
allocated or not to know if it was called by module load or not. But this
is more of a hack, and can not be used when this code is moved out of
trace.c. Instead simply look at the system_state and if it is running then
it is know that it could only be called by module load.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.660237094@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:12 -05:00
Steven Rostedt
f377912b3d tracing: Have trace_printk functions use flags instead of using global_trace
The trace.c file has become a dumping ground for all tracing code and has
become quite large. In order to move the trace_printk functions out of it
these functions can not access global_trace directly, as that is something
that needs to stay static in trace.c.

Instead of testing the trace_array tr pointer to &global_trace, test the
tr->flags to see if TRACE_ARRAY_FL_GLOBAL set.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.491116245@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:12 -05:00
Steven Rostedt
93c88d06ac tracing: Make tracing_update_buffers() take NULL for global_trace
The trace.c file has become a dumping ground for all tracing code and has
become quite large. In order to move the trace_printk functions out of it
these functions can not access global_trace directly, as that is something
that needs to stay static in trace.c.

Have tracing_update_buffers() take NULL for its trace_array to denote it
should work on the global_trace top level trace_array allows that function
to be used outside of trace.c and still update the global_trace
trace_array.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.318864210@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:12 -05:00
Steven Rostedt
1c53d781d4 tracing: Make printk_trace global for tracing system
The printk_trace is used to determine which trace_array trace_printk()
writes to. By making it a global variable among the tracing subsystem it
will allow the trace_printk functions to be moved out of trace.c and still
have direct access to that variable.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.144525891@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:12 -05:00
Steven Rostedt
3e6c8f80e5 tracing: Move ftrace_trace_stack() out of trace.c and into trace.h
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.

Make ftrace_trace_stack() into a static inline that tests if stack tracing
is enabled and if so to call __ftrace_trace_stack() to do the stack trace.
This keeps the test inlined in the fast paths and only does the function
call if stack tracing is enabled.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.974218132@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:12 -05:00
Steven Rostedt
0e730bc067 tracing: Move __trace_buffer_{un}lock_*() functions to trace.h
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.

Move the __always_inline functions __trace_buffer_lock_reserve(),
__trace_buffer_unlock_commit() and trace_event_setup() into trace.h.

The trace.c file will be split up and these functions will be used in more
than one of these files. As they are already __always_inline they can
easily be moved into the trace.h header file.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.813550600@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:11 -05:00
Steven Rostedt
a4f77ffc8e tracing: Make tracing_selftest_running global to the tracing subsystem
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.

Make the variable tracing_selftest_running global so that it can be used
by other files in the tracing subsystem and trace.c can be split up.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.648932796@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:11 -05:00
Steven Rostedt
64dee86ad7 tracing: Make tracing_disabled global for tracing system
The tracing_disabled variable is set to one on boot up to prevent some
parts of tracing to access the tracing infrastructure before it is set up.
It also can be set after boot if an anomaly is discovered.

It is currently a static variable in trace.c and can be accessed via a
function call trace_is_disabled(). There's really no reason to use a
function call as the tracing subsystem should be able to access it
directly.

By making the variable accessed directly, code can be moved out of trace.c
without adding overhead of a function call to see if tracing is disabled
or not.

Make tracing_disabled global and remove the tracing_is_disabled() helper
function. Also add some "unlikely()"s around tracing_disabled where it's
checked in hot paths.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.483690153@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:11 -05:00
Steven Rostedt
ba73713da5 tracing: Clean up use of trace_create_maxlat_file()
In trace.c, the function trace_create_maxlat_file() is defined behind the
 #ifdef CONFIG_TRACER_MAX_TRACE block. The #else part defines it as:

 #define trace_create_maxlat_file(tr, d_tracer)				\
	trace_create_file("tracing_max_latency", TRACE_MODE_WRITE,	\
			  d_tracer, tr, &tracing_max_lat_fops)

But the one place that it it used has:

 #ifdef CONFIG_TRACER_MAX_TRACE
	trace_create_maxlat_file(tr, d_tracer);
 #endif

Which is pointless and also wrong!

It only gets created when both CONFIG_TRACE_MAX_TRACE and CONFIG_FS_NOTIFY
is defined, but the file itself should not be dependent on
CONFIG_FS_NOTIFY. Always create that file when TRACE_MAX_TRACE is defined
regardless if FS_NOTIFY is or is not.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260207191101.0e014abd@robin
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:11 -05:00
Steven Rostedt
326669faf3 tracing: Move tracing_set_filter_buffering() into trace_events_hist.c
The function tracing_set_filter_buffering() is only used in
trace_events_hist.c. Move it to that file and make it static.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260206195936.617080218@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-08 21:01:11 -05:00
Yaxiong Tian
1c48f7ab72 tracing: Rename eval_map_wq and allow other parts of tracing use it
The eval_map_work_func() function, though queued in eval_map_wq,
holds the trace_event_sem read-write lock for a long time during
kernel boot. This causes blocking issues for other functions.

Rename eval_map_wq to trace_init_wq and make it global, thereby
allowing other parts of tracing to schedule work on this queue
asynchronously and avoiding blockage of the main boot thread.

Link: https://patch.msgid.link/20260204015344.162818-1-tianyaxiong@kylinos.cn
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-06 15:26:59 -05:00
Steven Rostedt
6bdf07302f tracing: Disable trace_printk buffer on warning too
When /proc/sys/kernel/traceoff_on_warning is set to 1, the top level
tracing buffer is disabled when a warning happens. This is very useful
when debugging and want the tracing buffer to stop taking new data when a
warning triggers keeping the events that lead up to the warning from being
overwritten.

Now that there is also a persistent ring buffer and an option to have
trace_printk go to that buffer, the same holds true for that buffer. A
warning could happen just before a crash but still write enough events to
lose the events that lead up to the first warning that was the reason for
the crash.

When /proc/sys/kernel/traceoff_on_warning is set to 1 and a warning is
triggered, not only disable the top level tracing buffer, but also disable
the buffer that trace_printk()s are written to.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://patch.msgid.link/20260121093858.5c5d7e7b@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26 17:45:17 -05:00
Steven Rostedt
e4ef389e76 tracing: Check the return value of tracing_update_buffers()
In the very unlikely event that tracing_update_buffers() fails in
trace_printk_init_buffers(), report the failure so that it is known.

Link: https://lore.kernel.org/all/20220917020353.3836285-1-floridsleeves@gmail.com/

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260107161510.4dc98b15@gandalf.local.home
Suggested-by: Li Zhong <floridsleeves@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26 17:44:40 -05:00
Ian Rogers
00f13e28a9 tracing: Avoid possible signed 64-bit truncation
64-bit truncation to 32-bit can result in the sign of the truncated
value changing. The cmp_mod_entry is used in bsearch and so the
truncation could result in an invalid search order. This would only
happen were the addresses more than 2GB apart and so unlikely, but
let's fix the potentially broken compare anyway.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260108002625.333331-1-irogers@google.com
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-23 13:34:30 -05:00
Ben Dooks
1e2ed4bfd5 trace: ftrace_dump_on_oops[] is not exported, make it static
The ftrace_dump_on_oops string is not used outside of trace.c so
make it static to avoid the export warning from sparse:

kernel/trace/trace.c:141:6: warning: symbol 'ftrace_dump_on_oops' was not declared. Should it be static?

Fixes: dd293df639 ("tracing: Move trace sysctls into trace.c")
Link: https://patch.msgid.link/20260106231054.84270-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-07 14:52:22 -05:00
Steven Rostedt
5f1ef0dfcb tracing: Add recursion protection in kernel stack trace recording
A bug was reported about an infinite recursion caused by tracing the rcu
events with the kernel stack trace trigger enabled. The stack trace code
called back into RCU which then called the stack trace again.

Expand the ftrace recursion protection to add a set of bits to protect
events from recursion. Each bit represents the context that the event is
in (normal, softirq, interrupt and NMI).

Have the stack trace code use the interrupt context to protect against
recursion.

Note, the bug showed an issue in both the RCU code as well as the tracing
stacktrace code. This only handles the tracing stack trace side of the
bug. The RCU fix will be handled separately.

Link: https://lore.kernel.org/all/20260102122807.7025fc87@gandalf.local.home/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Link: https://patch.msgid.link/20260105203141.515cd49f@gandalf.local.home
Reported-by: Yao Kai <yaokai34@huawei.com>
Tested-by: Yao Kai <yaokai34@huawei.com>
Fixes: 5f5fa7ea89 ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-07 14:52:22 -05:00
Darrick J. Wong
74bf97e9a8 tracing: Fix UBSAN warning in __remove_instance()
xfs/558 triggers the following UBSAN warning:

 ------------[ cut here ]------------
 UBSAN: shift-out-of-bounds in kernel/trace/trace.c:10510:10
 shift exponent 32 is too large for 32-bit type 'int'
 CPU: 1 UID: 0 PID: 888674 Comm: rmdir Not tainted 6.19.0-rc1-xfsx #rc1 PREEMPT(lazy)  dbf607ef4c142c563f76d706e71af9731d7b9c90
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-4.module+el8.8.0+21164+ed375313 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x4a/0x70
  ubsan_epilogue+0x5/0x2b
  __ubsan_handle_shift_out_of_bounds.cold+0x5e/0x113
  __remove_instance.part.0.constprop.0.cold+0x18/0x26f
  instance_rmdir+0xf3/0x110
  tracefs_syscall_rmdir+0x4d/0x90
  vfs_rmdir+0x139/0x230
  do_rmdir+0x143/0x230
  __x64_sys_rmdir+0x1d/0x20
  do_syscall_64+0x44/0x230
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
 RIP: 0033:0x7f7ae8e51f17
 Code: f0 ff ff 73 01 c3 48 8b 0d de 2e 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 54 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 b1 2e 0e 00 f7 d8 64 89 02 b8
 RSP: 002b:00007ffd90743f08 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
 RAX: ffffffffffffffda RBX: 00007ffd907440f8 RCX: 00007f7ae8e51f17
 RDX: 00007f7ae8f3c5c0 RSI: 00007ffd90744a21 RDI: 00007ffd90744a21
 RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f7ae8f35ac0 R11: 0000000000000246 R12: 00007ffd90744a21
 R13: 0000000000000001 R14: 00007f7ae8f8b000 R15: 000055e5283e6a98
  </TASK>
 ---[ end trace ]---

whilst tearing down an ftrace instance.  TRACE_FLAGS_MAX_SIZE is now 64bit,
so the mask comparison expression must be typecast to a u64 value to
avoid an overflow.  AFAICT, ZEROED_TRACE_FLAGS is already cast to ULL
so this is ok.

Link: https://patch.msgid.link/20251216174950.GA7705@frogsfrogsfrogs
Fixes: bbec8e28ca ("tracing: Allow tracer to add more than 32 options")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-17 17:50:04 -05:00
Maurice Hieronymus
8d4cdbd45c tracing: Fix multiple typos in trace.c
Fix multiple typos in comments:
"alse" -> "also"
"enabed" -> "enabled"
"instane" -> "instance"
"outputing" -> "outputting"
"seperated" -> "separated"

Link: https://patch.msgid.link/20251121221835.28032-7-mhi@mailbox.org
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:43:40 -05:00
Steven Rostedt
02e7769e38 tracing: Fix enabling of tracing on file release
The trace file will pause tracing if the tracing instance has the
"pause-on-trace" option is set. This happens when the file is opened, and
it is unpaused when the file is closed. When this was first added, there
was only one user that paused tracing. On open, the check to pause was:

   if (!iter->snapshot && (tr->trace_flags & TRACE_ITER(PAUSE_ON_TRACE)))

Where if it is not the snapshot tracer and the "pause-on-trace" option is
set, then it increments a "stop_count" of the trace instance.

On close, the check is:

   if (!iter->snapshot && tr->stop_count)

That is, if it is not the snapshot buffer and it was stopped, it will
re-enable tracing.

Now there's more places that stop tracing. This means, if something else
stops tracing the tr->stop_count will be non-zero, and that means if the
trace file is closed, it will decrement the stop_count even though it
never incremented it. This causes a warning because when the user that
stopped tracing enables it again, the stop_count goes below zero.

Instead of relying on the stop_count being set to know if the close of
the trace file should enable tracing again, add a new flag to the trace
iterator. The trace iterator is unique per open of the trace file, and if
the open stops tracing set the trace iterator PAUSE flag. On close, if the
PAUSE flag is set, then re-enable it again.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251202161751.24abaaf1@gandalf.local.home
Fixes: 06e0a548ba ("tracing: Do not disable tracing when reading the trace file")
Reported-by: syzbot+ccdec3bfe0beec58a38d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/692f44a5.a70a0220.2ea503.00c8.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:17:56 -05:00
Linus Torvalds
0b1b4a3d8e Runtime verifier updates for v6.19:
- Adapt the ftracetest script to be run from a different folder
 
   This uses the already existing OPT_TEST_DIR but extends it further to run
   independent tests, then add an --rv flag to allow using the script for
   testing RV (mostly) independently on ftrace.
 
 - Add basic RV selftests in selftests/verification for more validations
 
   Add more validations for available/enabled monitors and reactors. This
   could have caught the bug introducing kernel panic solved above. Tests use
   ftracetest.
 
 - Convert react() function in reactor to use va_list directly
 
   Use a central helper to handle the variadic arguments. Clean up macros
   and mark functions as static.
 
 - Add lockdep annotations to reactors to have lockdep complain of errors
 
   If the reactors are called from improper context. Useful to develop new
   reactors. This highlights a warning in the panic reactor that is related
   to the printk subsystem and not to RV.
 
 - Convert core RV code to use lock guards and __free helpers
 
   This completely removes goto statements.
 
 - Fix compilation if !CONFIG_RV_REACTORS
 
   Fix the warning by keeping LTL monitor variable as always static.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaTBoVxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qtWpAQDxPQAJQvBZ41l9q9Cis7PqGGezT4Nv
 g6Fh/ydMOlJCsQD/R0Xd5JxPmBI8FLCwCfqHo7wYKUhP8GfL/ORPEWhU2gI=
 =EEot
 -----END PGP SIGNATURE-----

Merge tag 'trace-rv-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verifier updates from Steven Rostedt:

 - Adapt the ftracetest script to be run from a different folder

   This uses the already existing OPT_TEST_DIR but extends it further to
   run independent tests, then add an --rv flag to allow using the
   script for testing RV (mostly) independently on ftrace.

 - Add basic RV selftests in selftests/verification for more validations

   Add more validations for available/enabled monitors and reactors.
   This could have caught the bug introducing kernel panic solved above.
   Tests use ftracetest.

 - Convert react() function in reactor to use va_list directly

   Use a central helper to handle the variadic arguments. Clean up
   macros and mark functions as static.

 - Add lockdep annotations to reactors to have lockdep complain of
   errors

   If the reactors are called from improper context. Useful to develop
   new reactors. This highlights a warning in the panic reactor that is
   related to the printk subsystem and not to RV.

 - Convert core RV code to use lock guards and __free helpers

   This completely removes goto statements.

 - Fix compilation if !CONFIG_RV_REACTORS

   Fix the warning by keeping LTL monitor variable as always static.

* tag 'trace-rv-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rv: Fix compilation if !CONFIG_RV_REACTORS
  rv: Convert to use __free
  rv: Convert to use lock guard
  rv: Add explicit lockdep context for reactors
  rv: Make rv_reacting_on() static
  rv: Pass va_list to reactors
  selftests/verification: Add initial RV tests
  selftest/ftrace: Generalise ftracetest to use with RV
2025-12-05 10:17:00 -08:00
Linus Torvalds
69c5079b49 tracing updates for v6.19:
- Merge branch shared with kprobes on extending trace options
 
   The trace options were defined by a 32 bit variable. This limits the
   tracing instances to have a total of 32 different options. As that limit
   has been hit, and more options are being added, increase the option mask
   to a 64 bit number, doubling the number of options available.
 
   As this is required for the kprobe topic branches as well as the tracing
   topic branch, a separate branch was created and merged into both.
 
 - Make trace_user_fault_read() available for the rest of tracing
 
   The function trace_user_fault_read() is used by trace_marker file read to
   allow reading user space to be done fast and without locking or
   allocations. Make this available so that the system call trace events can
   use it too.
 
 - Have system call trace events read user space values
 
   Now that the system call trace events callbacks are called in a faultable
   context, take advantage of this and read the user space buffers for
   various system calls. For example, show the path name of the openat system
   call instead of just showing the pointer to that path name in user space.
   Also show the contents of the buffer of the write system call. Several
   system call trace events are updated to make tracing into a light weight
   strace tool for all applications in the system.
 
 - Update perf system call tracing to do the same
 
 - And a config and syscall_user_buf_size file to control the size of the buffer
 
   Limit the amount of data that can be read from user space. The default
   size is 63 bytes but that can be expanded to 165 bytes.
 
 - Allow the persistent ring buffer to print system calls normally
 
   The persistent ring buffer prints trace events by their type and ignores
   the print_fmt. This is because the print_fmt may change from kernel to
   kernel. As the system call output is fixed by the system call ABI itself,
   there's no reason to limit that. This makes reading the system call events
   in the persistent ring buffer much nicer and easier to understand.
 
 - Add options to show text offset to function profiler
 
   The function profiler that counts the number of times a function is hit
   currently lists all functions by its name and offset. But this becomes
   ambiguous when there are several functions with the same name. Add a
   tracing option that changes the output to be that of _text+offset
   instead. Now a user space tool can use this information to map the
   _text+offset to the unique function it is counting.
 
 - Report bad dynamic event command
 
   If a bad command is passed to the dynamic_events file, report it properly
   in the error log.
 
 - Clean up tracer options
 
   Clean up the tracer option code a bit, by removing some useless code and
   also using switch statements instead of a series of if statements.
 
 - Have tracing options be instance specific
 
   Tracers can have their own options (function tracer, irqsoff tracer,
   function graph tracer, etc). But now that the same tracer can be enabled
   in multiple trace instances, their options are still global. The API is
   per instance, thus changing one affects other instances. This isn't even
   consistent, as the option take affect differently depending on when an
   tracer started in an instance.  Make the options for instances only affect
   the instance it is changed under.
 
 - Optimize pid_list lock contention
 
   Whenever the pid_list is read, it uses a spin lock. This happens at every
   sched switch. Taking the lock at sched switch can be removed by instead
   using a seqlock counter.
 
 - Clean up the trace trigger structures
 
   The trigger code uses two different structures to implement a single
   tigger. This was due to trying to reuse code for the two different types
   of triggers (always on trigger, and count limited trigger). But by adding
   a single field to one structure, the other structure could be absorbed
   into the first structure making he code easier to understand.
 
 - Create a bulk garbage collector for trace triggers
 
   If user space has triggers for several hundreds of events and then removes
   them, it can take several seconds to complete. This is because each
   removal calls the slow tracepoint_synchronize_unregister() that can take
   hundreds of milliseconds to complete. Instead, create a helper thread that
   will do the clean up. When a trigger is removed, it will create the
   kthread if it isn't already created, and then add the trigger to a llist.
   The kthread will take the items off the llist, call
   tracepoint_synchronize_unregister(), and then remove the items it took
   off. It will then check if there's more items to free before sleeping.
 
   This makes user space removing all these triggers to finish in less than a
   second.
 
 - Allow function tracing of some of the tracing infrastructure code
 
   Because the tracing code can cause recursion issues if it is traced by the
   function tracer the entire tracing directory disables function tracing.
   But not all of tracing causes issues if it is traced. Namely, the event
   tracing code. Add a config that enables some of the tracing code to be
   traced to help in debugging it. Note, when this is enabled, it does add
   noise to general function tracing, especially if events are enabled as
   well (which is a common case).
 
 - Add boot-time backup instance for persistent buffer
 
   The persistent ring buffer is used mostly for kernel crash analysis in the
   field. One issue is that if there's a crash, the data in the persistent
   ring buffer must be read before tracing can begin using it. This slows
   down the boot process. Once tracing starts in the persistent ring buffer,
   the old data must be freed and the addresses no longer match and old
   events can't be in the buffer with new events.
 
   Create a way to create a backup buffer that copies the persistent ring
   buffer at boot up. Then after a crash, the always on tracer can begin
   immediately as well as the normal boot process while the crash analysis
   tooling uses the backup buffer. After the backup buffer is finished being
   read, it can be removed.
 
 - Enable function graph args and return address options at the same time
 
   Currently the when reading of arguments in the function graph tracer is
   enabled, the option to record the parent function in the entry event can
   not be enabled. Update the code so that it can.
 
 - Add new struct_offset() helper macro
 
   Add a new macro that takes a pointer to a structure and a name of one of
   its members and it will return the offset of that member. This allows the
   ring buffer code to simplify the following:
 
   From:  size = struct_size(entry, buf, cnt - sizeof(entry->id));
     To:  size = struct_offset(entry, id) + cnt;
 
   There should be other simplifications that this macro can help out with as
   well.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaS9xqxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qj6tAQD4MR1lsE3XpH09asO4CDDfhbtRSQVD
 o8bVKVihWx/j5gD/XezjqE2Q2+DO6dhnsQY6pbtNdXoKgaMuQJGA+dvPsQc=
 =HilC
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Extend tracing option mask to 64 bits

   The trace options were defined by a 32 bit variable. This limits the
   tracing instances to have a total of 32 different options. As that
   limit has been hit, and more options are being added, increase the
   option mask to a 64 bit number, doubling the number of options
   available.

   As this is required for the kprobe topic branches as well as the
   tracing topic branch, a separate branch was created and merged into
   both.

 - Make trace_user_fault_read() available for the rest of tracing

   The function trace_user_fault_read() is used by trace_marker file
   read to allow reading user space to be done fast and without locking
   or allocations. Make this available so that the system call trace
   events can use it too.

 - Have system call trace events read user space values

   Now that the system call trace events callbacks are called in a
   faultable context, take advantage of this and read the user space
   buffers for various system calls. For example, show the path name of
   the openat system call instead of just showing the pointer to that
   path name in user space. Also show the contents of the buffer of the
   write system call. Several system call trace events are updated to
   make tracing into a light weight strace tool for all applications in
   the system.

 - Update perf system call tracing to do the same

 - And a config and syscall_user_buf_size file to control the size of
   the buffer

   Limit the amount of data that can be read from user space. The
   default size is 63 bytes but that can be expanded to 165 bytes.

 - Allow the persistent ring buffer to print system calls normally

   The persistent ring buffer prints trace events by their type and
   ignores the print_fmt. This is because the print_fmt may change from
   kernel to kernel. As the system call output is fixed by the system
   call ABI itself, there's no reason to limit that. This makes reading
   the system call events in the persistent ring buffer much nicer and
   easier to understand.

 - Add options to show text offset to function profiler

   The function profiler that counts the number of times a function is
   hit currently lists all functions by its name and offset. But this
   becomes ambiguous when there are several functions with the same
   name.

   Add a tracing option that changes the output to be that of
   '_text+offset' instead. Now a user space tool can use this
   information to map the '_text+offset' to the unique function it is
   counting.

 - Report bad dynamic event command

   If a bad command is passed to the dynamic_events file, report it
   properly in the error log.

 - Clean up tracer options

   Clean up the tracer option code a bit, by removing some useless code
   and also using switch statements instead of a series of if
   statements.

 - Have tracing options be instance specific

   Tracers can have their own options (function tracer, irqsoff tracer,
   function graph tracer, etc). But now that the same tracer can be
   enabled in multiple trace instances, their options are still global.
   The API is per instance, thus changing one affects other instances.
   This isn't even consistent, as the option take affect differently
   depending on when an tracer started in an instance. Make the options
   for instances only affect the instance it is changed under.

 - Optimize pid_list lock contention

   Whenever the pid_list is read, it uses a spin lock. This happens at
   every sched switch. Taking the lock at sched switch can be removed by
   instead using a seqlock counter.

 - Clean up the trace trigger structures

   The trigger code uses two different structures to implement a single
   tigger. This was due to trying to reuse code for the two different
   types of triggers (always on trigger, and count limited trigger). But
   by adding a single field to one structure, the other structure could
   be absorbed into the first structure making he code easier to
   understand.

 - Create a bulk garbage collector for trace triggers

   If user space has triggers for several hundreds of events and then
   removes them, it can take several seconds to complete. This is
   because each removal calls tracepoint_synchronize_unregister() that
   can take hundreds of milliseconds to complete.

   Instead, create a helper thread that will do the clean up. When a
   trigger is removed, it will create the kthread if it isn't already
   created, and then add the trigger to a llist. The kthread will take
   the items off the llist, call tracepoint_synchronize_unregister(),
   and then remove the items it took off. It will then check if there's
   more items to free before sleeping.

   This makes user space removing all these triggers to finish in less
   than a second.

 - Allow function tracing of some of the tracing infrastructure code

   Because the tracing code can cause recursion issues if it is traced
   by the function tracer the entire tracing directory disables function
   tracing. But not all of tracing causes issues if it is traced.
   Namely, the event tracing code. Add a config that enables some of the
   tracing code to be traced to help in debugging it. Note, when this is
   enabled, it does add noise to general function tracing, especially if
   events are enabled as well (which is a common case).

 - Add boot-time backup instance for persistent buffer

   The persistent ring buffer is used mostly for kernel crash analysis
   in the field. One issue is that if there's a crash, the data in the
   persistent ring buffer must be read before tracing can begin using
   it. This slows down the boot process. Once tracing starts in the
   persistent ring buffer, the old data must be freed and the addresses
   no longer match and old events can't be in the buffer with new
   events.

   Create a way to create a backup buffer that copies the persistent
   ring buffer at boot up. Then after a crash, the always on tracer can
   begin immediately as well as the normal boot process while the crash
   analysis tooling uses the backup buffer. After the backup buffer is
   finished being read, it can be removed.

 - Enable function graph args and return address options at the same
   time

   Currently the when reading of arguments in the function graph tracer
   is enabled, the option to record the parent function in the entry
   event can not be enabled. Update the code so that it can.

 - Add new struct_offset() helper macro

   Add a new macro that takes a pointer to a structure and a name of one
   of its members and it will return the offset of that member. This
   allows the ring buffer code to simplify the following:

   From:  size = struct_size(entry, buf, cnt - sizeof(entry->id));
     To:  size = struct_offset(entry, id) + cnt;

   There should be other simplifications that this macro can help out
   with as well

* tag 'trace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (42 commits)
  overflow: Introduce struct_offset() to get offset of member
  function_graph: Enable funcgraph-args and funcgraph-retaddr to work simultaneously
  tracing: Add boot-time backup of persistent ring buffer
  ftrace: Allow tracing of some of the tracing code
  tracing: Use strim() in trigger_process_regex() instead of skip_spaces()
  tracing: Add bulk garbage collection of freeing event_trigger_data
  tracing: Remove unneeded event_mutex lock in event_trigger_regex_release()
  tracing: Merge struct event_trigger_ops into struct event_command
  tracing: Remove get_trigger_ops() and add count_func() from trigger ops
  tracing: Show the tracer options in boot-time created instance
  ftrace: Avoid redundant initialization in register_ftrace_direct
  tracing: Remove unused variable in tracing_trace_options_show()
  fgraph: Make fgraph_no_sleep_time signed
  tracing: Convert function graph set_flags() to use a switch() statement
  tracing: Have function graph tracer option sleep-time be per instance
  tracing: Move graph-time out of function graph options
  tracing: Have function graph tracer option funcgraph-irqs be per instance
  trace/pid_list: optimize pid_list->lock contention
  tracing: Have function graph tracer define options per instance
  tracing: Have function tracer define options per instance
  ...
2025-12-05 09:51:37 -08:00
Nam Cao
b30f635bb6 rv: Convert to use __free
Convert to use __free to tidy up the code.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/62854e2fcb8f8dd2180a98a9700702dcf89a6980.1763370183.git.namcao@linutronix.de
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-12-02 07:28:32 +01:00
Steven Rostedt
f6ed9c5d31 overflow: Introduce struct_offset() to get offset of member
The trace_marker_raw file in tracefs takes a buffer from user space that
contains an id as well as a raw data string which is usually a binary
structure. The structure used has the following:

	struct raw_data_entry {
		struct trace_entry	ent;
		unsigned int		id;
		char			buf[];
	};

Since the passed in "cnt" variable is both the size of buf as well as the
size of id, the code to allocate the location on the ring buffer had:

   size = struct_size(entry, buf, cnt - sizeof(entry->id));

Which is quite ugly and hard to understand. Instead, add a helper macro
called struct_offset() which then changes the above to a simple and easy
to understand:

   size = struct_offset(entry, id) + cnt;

This will likely come in handy for other use cases too.

Link: https://lore.kernel.org/all/CAHk-=whYZVoEdfO1PmtbirPdBMTV9Nxt9f09CK0k6S+HJD3Zmg@mail.gmail.com/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://patch.msgid.link/20251126145249.05b1770a@gandalf.local.home
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-27 20:18:05 -05:00
Masami Hiramatsu (Google)
20e7168326 tracing: Add boot-time backup of persistent ring buffer
Currently, the persistent ring buffer instance needs to be read before
using it. This means we have to wait for boot up user space and dump
the persistent ring buffer. However, in that case we can not start
tracing on it from the kernel cmdline.

To solve this limitation, this adds an option which allows to create
a trace instance as a backup of the persistent ring buffer at boot.
If user specifies trace_instance=<BACKUP>=<PERSIST_RB> then the
<BACKUP> instance is made as a copy of the <PERSIST_RB> instance.

For example, the below kernel cmdline records all syscalls, scheduler
and interrupt events on the persistent ring buffer `boot_map` but
before starting the tracing, it makes a `backup` instance from the
`boot_map`. Thus, the `backup` instance has the previous boot events.

'reserve_mem=12M:4M:trace trace_instance=boot_map@trace,syscalls:*,sched:*,irq:* trace_instance=backup=boot_map'

As you can see, this just make a copy of entire reserved area and
make a backup instance on it. So you can release (or shrink) the
backup instance after use it to save the memory usage.

  /sys/kernel/tracing/instances # free
                total        used        free      shared  buff/cache   available
  Mem:        1999284       55704     1930520       10132       13060     1914628
  Swap:             0           0           0
  /sys/kernel/tracing/instances # rmdir backup/
  /sys/kernel/tracing/instances # free
                total        used        free      shared  buff/cache   available
  Mem:        1999284       40640     1945584       10132       13060     1929692
  Swap:             0           0           0

Note: since there is no reason to make a copy of empty buffer, this
backup only accepts a persistent ring buffer as the original instance.
Also, since this backup is based on vmalloc(), it does not support
user-space mmap().

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/176377150002.219692.9425536150438129267.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-26 15:13:30 -05:00
Masami Hiramatsu (Google)
23c0e9cc76 tracing: Show the tracer options in boot-time created instance
Since tracer_init_tracefs_work_func() only updates the tracer options
for the global_trace, the instances created by the kernel cmdline
do not have those options.

Fix to update tracer options for those boot-time created instances
to show those options.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/176354112555.2356172.3989277078358802353.stgit@mhiramat.tok.corp.google.com
Fixes: 428add559b ("tracing: Have tracer option be instance specific")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-26 15:13:29 -05:00
Steven Rostedt
49c1364c7c tracing: Remove unused variable in tracing_trace_options_show()
The flags and opts used in tracing_trace_options_show() now come directly
from the trace array "current_trace_flags" and not the current_trace. The
variable "trace" was still being assigned to tr->current_trace but never
used. This caused a warning in clang.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251117120637.43ef995d@gandalf.local.home
Reported-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Closes: https://lore.kernel.org/all/aRtHWXzYa8ijUIDa@black.igk.intel.com/
Fixes: 428add559b ("tracing: Have tracer option be instance specific")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-26 15:13:28 -05:00
Deepanshu Kartikey
b042fdf18e tracing: Fix WARN_ON in tracing_buffers_mmap_close for split VMAs
When a VMA is split (e.g., by partial munmap or MAP_FIXED), the kernel
calls vm_ops->close on each portion. For trace buffer mappings, this
results in ring_buffer_unmap() being called multiple times while
ring_buffer_map() was only called once.

This causes ring_buffer_unmap() to return -ENODEV on subsequent calls
because user_mapped is already 0, triggering a WARN_ON.

Trace buffer mappings cannot support partial mappings because the ring
buffer structure requires the complete buffer including the meta page.

Fix this by adding a may_split callback that returns -EINVAL to prevent
VMA splits entirely.

Cc: stable@vger.kernel.org
Fixes: cf9f0f7c4c ("tracing: Allow user-space mapping of the ring-buffer")
Link: https://patch.msgid.link/20251119064019.25904-1-kartikey406@gmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a72c325b042aae6403c7
Tested-by: syzbot+a72c325b042aae6403c7@syzkaller.appspotmail.com
Reported-by: syzbot+a72c325b042aae6403c7@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-25 15:21:16 -05:00
Steven Rostedt
4132886e1b tracing: Move graph-time out of function graph options
The option "graph-time" affects the function profiler when it is using the
function graph infrastructure. It has nothing to do with the function
graph tracer itself. The option only affects the global function profiler
and does nothing to the function graph tracer.

Move it out of the function graph tracer options and make it a global
option that is only available at the top level instance.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20251114192318.781711154@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-14 14:30:55 -05:00
Steven Rostedt
428add559b tracing: Have tracer option be instance specific
Tracers can add specify options to modify them. This logic was added
before instances were created and the tracer flags were global variables.
After instances were created where a tracer may exist in more than one
instance, the flags were not updated from being global into instance
specific. This causes confusion with these options. For example, the
function tracer has an option to enable function arguments:

  # cd /sys/kernel/tracing
  # mkdir instances/foo
  # echo function > instances/foo/current_tracer
  # echo 1 > options/func-args
  # echo function > current_tracer
  # cat trace
[..]
  <idle>-0       [005] d..3.  1050.656187: rcu_needs_cpu() <-tick_nohz_next_event
  <idle>-0       [005] d..3.  1050.656188: get_next_timer_interrupt(basej=0x10002dbad, basem=0xf45fd7d300) <-tick_nohz_next_event
  <idle>-0       [005] d..3.  1050.656189: _raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt
  <idle>-0       [005] d..4.  1050.656190: do_raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt
  <idle>-0       [005] d..4.  1050.656191: _raw_spin_lock_nested(lock=0xffff8944bdf5f140, subclass=1) <-__get_next_timer_interrupt
 # cat instances/foo/options/func-args
 1
 # cat instances/foo/trace
[..]
  kworker/4:1-88      [004] ...1.   298.127735: next_zone <-refresh_cpu_vm_stats
  kworker/4:1-88      [004] ...1.   298.127736: first_online_pgdat <-refresh_cpu_vm_stats
  kworker/4:1-88      [004] ...1.   298.127738: next_online_pgdat <-refresh_cpu_vm_stats
  kworker/4:1-88      [004] ...1.   298.127739: fold_diff <-refresh_cpu_vm_stats
  kworker/4:1-88      [004] ...1.   298.127741: round_jiffies_relative <-vmstat_update
[..]

The above shows that setting "func-args" in the top level instance also
set it in the instance "foo", but since the interface of the trace flags
are per instance, the update didn't take affect in the "foo" instance.

Update the infrastructure to allow tracers to add a "default_flags" field
in the tracer structure that can be set instead of "flags" which will make
the flags per instance. If a tracer needs to keep the flags global (like
blktrace), keeping the "flags" field set will keep the old behavior.

This does not update function or the function graph tracers. That will be
handled later.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20251111232429.305317942@kernel.org
Fixes: f20a580627 ("ftrace: Allow instances to use function tracing")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-12 09:59:54 -05:00
Steven Rostedt
3a0d5bc76f tracing: Use switch statement instead of ifs in set_tracer_flag()
The "mask" passed in to set_trace_flag() has a single bit set. The
function then checks if the mask is equal to one of the option masks and
performs the appropriate function associated to that option.

Instead of having a bunch of "if ()" statement, use a "switch ()"
statement instead to make it cleaner and a bit more optimal.

No function changes.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20251106003501.890298562@kernel.org
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10 14:33:51 -05:00
Steven Rostedt
9c5053083e tracing: Exit out immediately after update_marker_trace()
The call to update_marker_trace() in set_tracer_flag() performs the update
to the tr->trace_flags. There's no reason to perform it again after it is
called. Return immediately instead.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20251106003501.726406870@kernel.org
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10 14:33:43 -05:00
Steven Rostedt
5aa0d18df0 tracing: Have add_tracer_options() error pass up to callers
The function add_tracer_options() can fail, but currently it is ignored.
Pass the status of add_tracer_options() up to adding a new tracer as well
as when an instance is created. Have the instance creation fail if the
add_tracer_options() fail.

Only print a warning for the top level instance, like it does with other
failures.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20251105161935.375299297@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10 14:28:31 -05:00
Steven Rostedt
c7bed15ccf tracing: Remove dummy options and flags
When a tracer does not define their own flags, dummy options and flags are
used so that the values are always valid. There's not that many locations
that reference these values so having dummy versions just complicates the
code. Remove the dummy values and just check for NULL when appropriate.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20251105161935.206093132@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10 14:27:04 -05:00
Steven Rostedt
2f294c35c0 Merge branch 'topic/func-profiler-offset' of git://git.kernel.org/pub/scm/linux/kernel/git/mhiramat/linux into trace/trace/core
Updates to the function profiler adds new options to tracefs. The options
are currently defined by an enum as flags. The added options brings the
number of options over 32, which means they can no longer be held in a 32
bit enum. The TRACE_ITER_* flags are converted to a macro TRACE_ITER(*) to
allow the creation of options to still be done by macros.

This change is intrusive, as it affects all TRACE_ITER* options throughout
the trace code. Merge the branch that added these options and converted
the TRACE_ITER_* enum into a TRACE_ITER(*) macro, to allow the topic
branches to still be developed without conflict.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-04 10:12:32 -05:00
Masami Hiramatsu (Google)
1149fcf759 tracing: Add an option to show symbols in _text+offset for function profiler
Function profiler shows the hit count of each function using its symbol
name. However, there are some same-name local symbols, which we can not
distinguish.
To solve this issue, this introduces an option to show the symbols
in "_text+OFFSET" format. This can avoid exposing the random shift of
KASLR. The functions in modules are shown as "MODNAME+OFFSET" where the
offset is from ".text".

E.g. for the kernel text symbols, specify vmlinux and the output to
 addr2line, you can find the actual function and source info;

  $ addr2line -fie vmlinux _text+3078208
  __balance_callbacks
  kernel/sched/core.c:5064

for modules, specify the module file and .text+OFFSET;

  $ addr2line -fie samples/trace_events/trace-events-sample.ko .text+8224
  do_simple_thread_func
  samples/trace_events/trace-events-sample.c:23

Link: https://lore.kernel.org/all/176187878064.994619.8878296550240416558.stgit@devnote2/

Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-11-04 21:44:18 +09:00
Masami Hiramatsu (Google)
bbec8e28ca tracing: Allow tracer to add more than 32 options
Since enum trace_iterator_flags is 32bit, the max number of the
option flags is limited to 32 and it is fully used now. To add
a new option, we need to expand it.

So replace the TRACE_ITER_##flag with TRACE_ITER(flag) macro which
is 64bit bitmask.

Link: https://lore.kernel.org/all/176187877103.994619.166076000668757232.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-11-04 21:44:00 +09:00
Steven Rostedt
25bd47a592 tracing: Have persistent ring buffer print syscalls normally
The persistent ring buffer from a previous boot has to be careful printing
events as the print formats of random events can have pointers to strings
and such that are not available.

Ftrace static events (like the function tracer event) are stable and are
printed normally.

System call event formats are also stable. Allow them to be printed
normally as well:

Instead of:

  <...>-1       [005] ...1.    57.240405: sys_enter_waitid: __syscall_nr=0xf7 (247) which=0x1 (1) upid=0x499 (1177) infop=0x7ffd5294d690 (140725988939408) options=0x5 (5) ru=0x0 (0)
  <...>-1       [005] ...1.    57.240433: sys_exit_waitid: __syscall_nr=0xf7 (247) ret=0x0 (0)
  <...>-1       [005] ...1.    57.240437: sys_enter_rt_sigprocmask: __syscall_nr=0xe (14) how=0x2 (2) nset=0x7ffd5294d7c0 (140725988939712) oset=0x0 (0) sigsetsize=0x8 (8)
  <...>-1       [005] ...1.    57.240438: sys_exit_rt_sigprocmask: __syscall_nr=0xe (14) ret=0x0 (0)
  <...>-1       [005] ...1.    57.240442: sys_enter_close: __syscall_nr=0x3 (3) fd=0x4 (4)
  <...>-1       [005] ...1.    57.240463: sys_exit_close: __syscall_nr=0x3 (3) ret=0x0 (0)
  <...>-1       [005] ...1.    57.240485: sys_enter_openat: __syscall_nr=0x101 (257) dfd=0xffffffffffdfff9c (-2097252) filename=(0xffff8b81639ca01c) flags=0x80000 (524288) mode=0x0 (0) __filename_val=/run/systemd/reboot-param
  <...>-1       [005] ...1.    57.240555: sys_exit_openat: __syscall_nr=0x101 (257) ret=0xffffffffffdffffe (-2097154)
  <...>-1       [005] ...1.    57.240571: sys_enter_openat: __syscall_nr=0x101 (257) dfd=0xffffffffffdfff9c (-2097252) filename=(0xffff8b81639ca01c) flags=0x80000 (524288) mode=0x0 (0) __filename_val=/run/systemd/reboot-param
  <...>-1       [005] ...1.    57.240620: sys_exit_openat: __syscall_nr=0x101 (257) ret=0xffffffffffdffffe (-2097154)
  <...>-1       [005] ...1.    57.240629: sys_enter_writev: __syscall_nr=0x14 (20) fd=0x3 (3) vec=0x7ffd5294ce50 (140725988937296) vlen=0x7 (7)
  <...>-1       [005] ...1.    57.242281: sys_exit_writev: __syscall_nr=0x14 (20) ret=0x24 (36)
  <...>-1       [005] ...1.    57.242286: sys_enter_reboot: __syscall_nr=0xa9 (169) magic1=0xfee1dead (4276215469) magic2=0x28121969 (672274793) cmd=0x1234567 (19088743) arg=0x0 (0)

Have:

  <...>-1       [000] ...1.    91.446011: sys_waitid(which: 1, upid: 0x4d2, infop: 0x7ffdccdadfd0, options: 5, ru: 0)
  <...>-1       [000] ...1.    91.446042: sys_waitid -> 0x0
  <...>-1       [000] ...1.    91.446045: sys_rt_sigprocmask(how: 2, nset: 0x7ffdccdae100, oset: 0, sigsetsize: 8)
  <...>-1       [000] ...1.    91.446047: sys_rt_sigprocmask -> 0x0
  <...>-1       [000] ...1.    91.446051: sys_close(fd: 4)
  <...>-1       [000] ...1.    91.446073: sys_close -> 0x0
  <...>-1       [000] ...1.    91.446095: sys_openat(dfd: 18446744073709551516, filename: 139732544945794 "/run/systemd/reboot-param", flags: O_RDONLY|O_CLOEXEC)
  <...>-1       [000] ...1.    91.446165: sys_openat -> 0xfffffffffffffffe
  <...>-1       [000] ...1.    91.446182: sys_openat(dfd: 18446744073709551516, filename: 139732544945794 "/run/systemd/reboot-param", flags: O_RDONLY|O_CLOEXEC)
  <...>-1       [000] ...1.    91.446233: sys_openat -> 0xfffffffffffffffe
  <...>-1       [000] ...1.    91.446242: sys_writev(fd: 3, vec: 0x7ffdccdad790, vlen: 7)
  <...>-1       [000] ...1.    91.447877: sys_writev -> 0x24
  <...>-1       [000] ...1.    91.447883: sys_reboot(magic1: 0xfee1dead, magic2: 0x28121969, cmd: 0x1234567, arg: 0)

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Takaya Saeki <takayas@google.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ian Rogers <irogers@google.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Link: https://lore.kernel.org/20251028231149.097404581@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-28 20:11:00 -04:00