Merge tag 'block-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Improve quirk visibility and configurability (Maurizio)
- Fix runtime user modification to queue setup (Keith)
- Fix multipath leak on try_module_get failure (Keith)
- Ignore ambiguous spec definitions for better atomics support
(John)
- Fix admin queue leak on controller reset (Ming)
- Fix large allocation in persistent reservation read keys
(Sungwoo Kim)
- Fix fcloop callback handling (Justin)
- Securely free DHCHAP secrets (Daniel)
- Various cleanups and typo fixes (John, Wilfred)
- Avoid a circular lock dependency issue in the sysfs nr_requests or
scheduler store handling
- Fix a circular lock dependency with the pcpu mutex and the queue
freeze lock
- Cleanup for bio_copy_kern(), using __bio_add_page() rather than
bio_add_page(), as adding a page here cannot fail. The existing code
had broken cleanup for the error condition, so make it clear that the
error condition cannot happen
- Fix for a __this_cpu_read() in preemptible context splat
* tag 'block-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
block: use trylock to avoid lockdep circular dependency in sysfs
nvme: fix memory allocation in nvme_pr_read_keys()
block: use __bio_add_page in bio_copy_kern
block: break pcpu_alloc_mutex dependency on freeze_lock
blktrace: fix __this_cpu_read/write in preemptible context
nvme-multipath: fix leak on try_module_get failure
nvmet-fcloop: Check remoteport port_state before calling done callback
nvme-pci: do not try to add queue maps at runtime
nvme-pci: cap queue creation to used queues
nvme-pci: ensure we're polling a polled queue
nvme: fix memory leak in quirks_param_set()
nvme: correct comment about nvme_ns_remove()
nvme: stop setting namespace gendisk device driver data
nvme: add support for dynamic quirk configuration via module parameter
nvme: fix admin queue leak on controller reset
nvme-fabrics: use kfree_sensitive() for DHCHAP secrets
nvme: stop using AWUPF
nvme: expose active quirks in sysfs
nvme/host: fixup some typos
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
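For readers unfamiliar with the pattern, the three replacements described
above can be modelled in user space. This is only a sketch under stated
assumptions: the demo_* names are hypothetical, malloc()/calloc() stand in for
the kmalloc family, no GFP argument is modelled, and __typeof__ (GNU C) is used
so that the first argument may be either a type name or an expression such as
*ptr.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Size of a struct plus COUNT flexible-array elements, saturating on overflow. */
static size_t demo_flex_size(size_t base, size_t elem, size_t count)
{
	if (elem != 0 && count > (SIZE_MAX - base) / elem)
		return SIZE_MAX;	/* malloc(SIZE_MAX) fails, so callers see NULL */
	return base + elem * count;
}

/* Hypothetical analogue of kmalloc_obj(): one object, typed result. */
#define demo_alloc_obj(P) \
	((__typeof__(P) *)malloc(sizeof(P)))

/* Hypothetical analogue of kmalloc_objs(): COUNT objects (calloc also zeroes). */
#define demo_alloc_objs(P, COUNT) \
	((__typeof__(P) *)calloc((COUNT), sizeof(P)))

/* Hypothetical analogue of kmalloc_flex(): object plus COUNT elements of FAM. */
#define demo_alloc_flex(P, FAM, COUNT) \
	((__typeof__(P) *)malloc(demo_flex_size(sizeof(P), \
		sizeof(((__typeof__(P) *)0)->FAM[0]), (COUNT))))

struct demo_point {
	int x, y;
};

struct demo_msg {
	size_t len;
	char data[];		/* flexible array member */
};

/* Allocate a message with room for len payload bytes. */
static struct demo_msg *demo_msg_new(size_t len)
{
	struct demo_msg *m;

	m = demo_alloc_flex(*m, data, len);
	if (m)
		m->len = len;
	return m;
}
```

Because each macro casts to "TYPE *" rather than yielding "void *", assigning
the result of demo_alloc_obj(struct demo_point) to an unrelated pointer type
draws a compiler diagnostic, which is the property the conversion is after.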
Merge tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull more block updates from Jens Axboe:
- Fix partial IOVA mapping cleanup in error handling
- Minor prep series ignoring discard return value, as
the inline value is always known
- Ensure BLK_FEAT_STABLE_WRITES is set for drbd
- Fix leak of folio in bio_iov_iter_bounce_read()
- Allow IOC_PR_READ_* for read-only open
- Another debugfs deadlock fix
- A few doc updates
* tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
blk-mq: use NOIO context to prevent deadlock during debugfs creation
blk-stat: convert struct blk_stat_callback to kernel-doc
block: fix enum descriptions kernel-doc
block: update docs for bio and bvec_iter
block: change return type to void
nvmet: ignore discard return value
md: ignore discard return value
block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova
block: fix folio leak in bio_iov_iter_bounce_read()
block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ
drbd: always set BLK_FEAT_STABLE_WRITES
Creating debugfs entries can trigger fs reclaim, which can enter back
into the block layer request_queue. This can cause deadlock if the
queue is frozen.
Previously, a WARN_ON_ONCE check was used in debugfs_create_files()
to detect this condition, but it was racy since the queue can be frozen
from another context at any time.
Introduce blk_debugfs_lock()/blk_debugfs_unlock() helpers that combine
the debugfs_mutex with memalloc_noio_save()/restore() to prevent fs
reclaim from triggering block I/O. Also add blk_debugfs_lock_nomemsave()
and blk_debugfs_unlock_nomemrestore() variants for callers that don't
need NOIO protection (e.g., debugfs removal or read-only operations).
Replace all raw debugfs_mutex lock/unlock pairs with these helpers,
using the _nomemsave/_nomemrestore variants where appropriate.
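The shape of the combined helper can be sketched as a small user-space model.
Everything here is an assumption made for illustration: the demo_* names are
hypothetical, a plain flag stands in for the per-task NOIO allocation scope, a
bool stands in for debugfs_mutex, and the ordering (enter the NOIO scope, then
take the lock) is a guess rather than a statement about the actual patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the per-task NOIO scope (per task in the kernel, global here). */
static bool demo_noio_active;

/* Models memalloc_noio_save(): enter the scope, return the previous state. */
static bool demo_noio_save(void)
{
	bool old = demo_noio_active;

	demo_noio_active = true;
	return old;
}

/* Models memalloc_noio_restore(): put back whatever was saved. */
static void demo_noio_restore(bool old)
{
	demo_noio_active = old;
}

struct demo_queue {
	bool locked;		/* stand-in for debugfs_mutex */
};

/* Take the lock with reclaim-driven I/O forbidden for the critical section. */
static bool demo_debugfs_lock(struct demo_queue *q)
{
	bool old = demo_noio_save();

	assert(!q->locked);
	q->locked = true;
	return old;
}

/* Drop the lock, then restore the caller's previous allocation scope. */
static void demo_debugfs_unlock(struct demo_queue *q, bool old)
{
	assert(q->locked);
	q->locked = false;
	demo_noio_restore(old);
}
```

Returning the saved state from the lock helper keeps the save/restore pair
nested correctly even when the caller is already inside a NOIO scope.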
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs9gNKEYAPagD9JADfO5UH+OiCr4P7OO2wjpfOYeM-RV=A@mail.gmail.com/
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/all/aYWQR7CtYdk3K39g@shinmob/
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
User visible changes:
- Add an entry into MAINTAINERS file for RUST versions of code
There's now RUST code for tracing and static branches. To differentiate
that code from the C code, add entries in for the RUST version (with "[RUST]"
around it) so that the right maintainers get notified on changes.
- New bitmask-list option added to tracefs
When this is set, bitmasks in trace events are not displayed as hex
numbers, but instead as lists: e.g. 0-4,6,8 instead of 0000015f
- New show_event_filters file in tracefs
Instead of having to search all events/*/*/filter for any active filters
enabled in the trace instance, the file show_event_filters will list them
so that there's only one file that needs to be examined to see if any
filters are active.
- New show_event_triggers file in tracefs
Instead of having to search all events/*/*/trigger for any active triggers
enabled in the trace instance, the file show_event_triggers will list them
so that there's only one file that needs to be examined to see if any
triggers are active.
- Have traceoff_on_warning disable trace_printk buffer too
Recently recording of trace_printk() could go to other trace instances
instead of the top level instance. But if traceoff_on_warning triggers, it
doesn't stop the buffer with trace_printk() and that data can easily be
lost by being overwritten. Have traceoff_on_warning also disable the
instance that has trace_printk() being written to it.
- Update the hist_debug file to show what function the field uses
When CONFIG_HIST_TRIGGERS_DEBUG is enabled, a hist_debug file exists for
every event. This displays the internal data of any histogram enabled for
that event. But it is lacking the function that is called to process one
of its fields. This is very useful information that was missing when
debugging histograms.
- Up the histogram stack size from 16 to 31
Stack traces can be used as keys for event histograms. Currently the size
of the stack that is stored is limited to just 16 entries. But the storage
space in the histogram is 256 bytes, meaning that it can store up to 31
entries (plus one for the count of entries). Instead of letting that space
go to waste, up the limit from 16 to 31. This makes the keys much more
useful.
- Fix permissions of per CPU file buffer_size_kb
The per CPU file of buffer_size_kb was incorrectly set to read only in a
previous cleanup. It should be writable.
- Reset "last_boot_info" if the persistent buffer is cleared
The last_boot_info shows address information of a persistent ring buffer
if it contains data from a previous boot. It is cleared when recording
starts again, but it is not cleared when the buffer is reset. The data is
useless after a reset so clear it on reset too.
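As an illustration of the bitmask-list item above, the hex-to-list rendering
can be sketched as follows. This is not the tracefs implementation; the
function name is hypothetical, and the sketch assumes bit 0 is the least
significant bit, under which 0x15f (bits 0 to 4, 6 and 8) renders as
"0-4,6,8".

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Render the set bits of mask as a range list into buf.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int demo_mask_to_list(uint32_t mask, char *buf, size_t size)
{
	size_t len = 0;
	int bit = 0;

	assert(buf != NULL && size > 0);
	buf[0] = '\0';

	while (bit < 32) {
		int first, last, n;

		if (!(mask & (UINT32_C(1) << bit))) {
			bit++;
			continue;
		}

		/* Extend the run for as long as consecutive bits are set. */
		first = bit;
		while (bit < 32 && (mask & (UINT32_C(1) << bit)))
			bit++;
		last = bit - 1;

		if (first == last)
			n = snprintf(buf + len, size - len, "%s%d",
				     len ? "," : "", first);
		else
			n = snprintf(buf + len, size - len, "%s%d-%d",
				     len ? "," : "", first, last);
		if (n < 0 || (size_t)n >= size - len)
			return -1;
		len += (size_t)n;
	}
	return (int)len;
}
```

An empty mask yields an empty string, and a mask such as 0x2bf (bits 0 to 5, 7
and 9) yields "0-5,7,9".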
Internal changes:
- A change was made to allow tracepoint callbacks to have preemption
enabled, and instead be protected by SRCU. This required some updates to
the callbacks for perf and BPF.
perf needed to disable preemption directly in its callback because it
expects preemption disabled in the later code.
BPF needed to disable migration, as its code expects to run completely on
the same CPU.
- Have irq_work wake up other CPU if current CPU is "isolated"
When there's a waiter waiting on ring buffer data and a new event happens,
an irq work is triggered to wake up that waiter. This is noisy on isolated
CPUs (running NO_HZ_FULL). Trigger an IPI to a housekeeping CPU instead.
- Use proper free of trigger_data instead of open coding it in.
- Remove redundant call of event_trigger_reset_filter()
It was called immediately in a function that was called right after it.
- Workqueue cleanups
- Report errors if tracing_update_buffers() were to fail.
- Make the enum update workqueue generic for other parts of tracing
On boot up, a work queue is created to convert enum names into their
numbers in the trace event format files. This work queue can also be used
for other aspects of tracing that take some time and shouldn't be called
by the init call code.
The blk_trace initialization takes a bit of time. Have the initialization
code moved to the new tracing generic work queue function.
- Skip kprobe boot event creation call if there's no kprobes defined on cmdline
The kprobe initialization to set up kprobes if they are defined on the
cmdline requires taking the event_mutex lock. This can be held by other
tracing code doing initialization for a long time. Since kprobes added to
the kernel command line need to be setup immediately, as they may be
tracing early initialization code, they cannot be postponed in a work
queue and must be setup in the initcall code.
If there's no kprobe on the kernel cmdline, there's no reason to take the
mutex and slow down the boot up code waiting to get the lock only to find
out there's nothing to do. Simply exit out early if there's no kprobes on
the kernel cmdline.
If there are kprobes on the cmdline, then someone cares more about tracing
over the speed of boot up.
- Clean up the trigger code a bit
- Move code out of trace.c and into their own files
trace.c is now over 11,000 lines of code and has become more difficult to
maintain. Start splitting it up so that related code is in their own
files.
Move all the trace_printk() related code into trace_printk.c.
Move the __always_inline stack functions into trace.h.
Move the pid filtering code into a new trace_pid.c file.
- Better define the max latency and snapshot code
The latency tracers have a "max latency" buffer that is a copy of the main
buffer and gets swapped with it when a new high latency is detected. This
keeps the trace up to the highest latency around where this max_latency
buffer is never written to. It is only used to save the last max latency
trace.
A while ago a snapshot feature was added to tracefs to allow user space to
perform the same logic. It could also enable events to trigger a
"snapshot" if one of their fields hit a new high. This was built on top of
the latency max_latency buffer logic.
Because snapshots came later, they were dependent on the latency tracers
to be enabled. In reality, the latency tracers depend on the snapshot code
and not the other way around. It was just that they came first.
Restructure the code and the kconfigs to have the latency tracers depend
on snapshot code instead. This actually simplifies the logic a bit and
allows more code to be disabled when the latency tracers are not defined
and the snapshot code is.
- Fix a "false sharing" in the hwlat tracer code
The loop to search for latency in hardware was using a variable that could
be changed by user space for each sample. If the user changes this
variable, it could cause bus contention, and reading that variable can
show up as a large latency in the trace causing a false positive. Read
this variable at the start of the sample with a READ_ONCE() into a local
variable and keep the code from sharing cache lines with readers.
- Fix function graph tracer static branch optimization code
When only one tracer is defined for function graph tracing, it uses a
static branch to call that tracer directly. When another tracer is added,
it goes into loop logic to call all the registered callbacks.
The code was incorrect when going back to one tracer and never re-enabled
the static branch again to do the optimization code.
- And other small fixes and cleanups.
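The function graph item above (direct call with one tracer, loop with several)
can be modelled in user space as follows. This is a sketch only: the demo_*
names are hypothetical, a bool stands in for the static branch, and the point
shown is simply that the mode is recomputed on removal as well as on addition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_CB 4

typedef void (*demo_cb_t)(int *counter);

static demo_cb_t demo_cbs[DEMO_MAX_CB];
static size_t demo_nr_cbs;
static demo_cb_t demo_direct;	/* valid only while demo_direct_mode is true */
static bool demo_direct_mode;	/* stands in for the static branch */

/* Recompute the mode after every change, in both directions. */
static void demo_update_mode(void)
{
	demo_direct_mode = (demo_nr_cbs == 1);
	demo_direct = demo_direct_mode ? demo_cbs[0] : NULL;
}

static void demo_register(demo_cb_t cb)
{
	assert(demo_nr_cbs < DEMO_MAX_CB);
	demo_cbs[demo_nr_cbs++] = cb;
	demo_update_mode();
}

static void demo_unregister(demo_cb_t cb)
{
	for (size_t i = 0; i < demo_nr_cbs; i++) {
		if (demo_cbs[i] != cb)
			continue;
		demo_cbs[i] = demo_cbs[--demo_nr_cbs];
		break;
	}
	/* Going from two callbacks back to one must re-enable direct mode. */
	demo_update_mode();
}

static void demo_call(int *counter)
{
	if (demo_direct_mode) {
		demo_direct(counter);	/* fast path: single callback */
		return;
	}
	for (size_t i = 0; i < demo_nr_cbs; i++)
		demo_cbs[i](counter);	/* slow path: every registered callback */
}

static void demo_cb_a(int *c) { *c += 1; }
static void demo_cb_b(int *c) { *c += 10; }
```

Registering demo_cb_a and demo_cb_b and then unregistering one of them leaves
demo_direct_mode set again, which is the transition the item describes as
having been missed.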
Merge tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
"User visible changes:
- Add an entry into MAINTAINERS file for RUST versions of code
There's now RUST code for tracing and static branches. To
differentiate that code from the C code, add entries in for the
RUST version (with "[RUST]" around it) so that the right
maintainers get notified on changes.
- New bitmask-list option added to tracefs
When this is set, bitmasks in trace events are not displayed as hex
numbers, but instead as lists: e.g. 0-4,6,8 instead of 0000015f
- New show_event_filters file in tracefs
Instead of having to search all events/*/*/filter for any active
filters enabled in the trace instance, the file show_event_filters
will list them so that there's only one file that needs to be
examined to see if any filters are active.
- New show_event_triggers file in tracefs
Instead of having to search all events/*/*/trigger for any active
triggers enabled in the trace instance, the file
show_event_triggers will list them so that there's only one file
that needs to be examined to see if any triggers are active.
- Have traceoff_on_warning disable trace_printk buffer too
Recently recording of trace_printk() could go to other trace
instances instead of the top level instance. But if
traceoff_on_warning triggers, it doesn't stop the buffer with
trace_printk() and that data can easily be lost by being
overwritten. Have traceoff_on_warning also disable the instance
that has trace_printk() being written to it.
- Update the hist_debug file to show what function the field uses
When CONFIG_HIST_TRIGGERS_DEBUG is enabled, a hist_debug file
exists for every event. This displays the internal data of any
histogram enabled for that event. But it is lacking the function
that is called to process one of its fields. This is very useful
information that was missing when debugging histograms.
- Up the histogram stack size from 16 to 31
Stack traces can be used as keys for event histograms. Currently
the size of the stack that is stored is limited to just 16 entries.
But the storage space in the histogram is 256 bytes, meaning that
it can store up to 31 entries (plus one for the count of entries).
Instead of letting that space go to waste, up the limit from 16 to
31. This makes the keys much more useful.
- Fix permissions of per CPU file buffer_size_kb
The per CPU file of buffer_size_kb was incorrectly set to read only
in a previous cleanup. It should be writable.
- Reset "last_boot_info" if the persistent buffer is cleared
The last_boot_info shows address information of a persistent ring
buffer if it contains data from a previous boot. It is cleared when
recording starts again, but it is not cleared when the buffer is
reset. The data is useless after a reset so clear it on reset too.
Internal changes:
- A change was made to allow tracepoint callbacks to have preemption
enabled, and instead be protected by SRCU. This required some
updates to the callbacks for perf and BPF.
perf needed to disable preemption directly in its callback because
it expects preemption disabled in the later code.
BPF needed to disable migration, as its code expects to run
completely on the same CPU.
- Have irq_work wake up other CPU if current CPU is "isolated"
When there's a waiter waiting on ring buffer data and a new event
happens, an irq work is triggered to wake up that waiter. This is
noisy on isolated CPUs (running NO_HZ_FULL). Trigger an IPI to a
housekeeping CPU instead.
- Use proper free of trigger_data instead of open coding it in.
- Remove redundant call of event_trigger_reset_filter()
It was called immediately in a function that was called right after
it.
- Workqueue cleanups
- Report errors if tracing_update_buffers() were to fail.
- Make the enum update workqueue generic for other parts of tracing
On boot up, a work queue is created to convert enum names into
their numbers in the trace event format files. This work queue can
also be used for other aspects of tracing that take some time and
shouldn't be called by the init call code.
The blk_trace initialization takes a bit of time. Have the
initialization code moved to the new tracing generic work queue
function.
- Skip kprobe boot event creation call if there's no kprobes defined
on cmdline
The kprobe initialization to set up kprobes if they are defined on
the cmdline requires taking the event_mutex lock. This can be held
by other tracing code doing initialization for a long time. Since
kprobes added to the kernel command line need to be setup
immediately, as they may be tracing early initialization code, they
cannot be postponed in a work queue and must be setup in the
initcall code.
If there's no kprobe on the kernel cmdline, there's no reason to
take the mutex and slow down the boot up code waiting to get the
lock only to find out there's nothing to do. Simply exit out early
if there's no kprobes on the kernel cmdline.
If there are kprobes on the cmdline, then someone cares more about
tracing over the speed of boot up.
- Clean up the trigger code a bit
- Move code out of trace.c and into their own files
trace.c is now over 11,000 lines of code and has become more
difficult to maintain. Start splitting it up so that related code
is in their own files.
Move all the trace_printk() related code into trace_printk.c.
Move the __always_inline stack functions into trace.h.
Move the pid filtering code into a new trace_pid.c file.
- Better define the max latency and snapshot code
The latency tracers have a "max latency" buffer that is a copy of
the main buffer and gets swapped with it when a new high latency is
detected. This keeps the trace up to the highest latency around
where this max_latency buffer is never written to. It is only used
to save the last max latency trace.
A while ago a snapshot feature was added to tracefs to allow user
space to perform the same logic. It could also enable events to
trigger a "snapshot" if one of their fields hit a new high. This
was built on top of the latency max_latency buffer logic.
Because snapshots came later, they were dependent on the latency
tracers to be enabled. In reality, the latency tracers depend on
the snapshot code and not the other way around. It was just that
they came first.
Restructure the code and the kconfigs to have the latency tracers
depend on snapshot code instead. This actually simplifies the logic
a bit and allows more code to be disabled when the latency tracers
are not defined and the snapshot code is.
- Fix a "false sharing" in the hwlat tracer code
The loop to search for latency in hardware was using a variable
that could be changed by user space for each sample. If the user
changes this variable, it could cause bus contention, and reading
that variable can show up as a large latency in the trace causing a
false positive. Read this variable at the start of the sample with
a READ_ONCE() into a local variable and keep the code from sharing
cache lines with readers.
- Fix function graph tracer static branch optimization code
When only one tracer is defined for function graph tracing, it uses
a static branch to call that tracer directly. When another tracer
is added, it goes into loop logic to call all the registered
callbacks.
The code was incorrect when going back to one tracer and never
re-enabled the static branch again to do the optimization code.
- And other small fixes and cleanups"
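The arithmetic behind the "16 to 31" histogram stack item above can be checked
with a short sketch. The names are illustrative, and 8-byte stack entries are
an assumption of this sketch (that is, a 64-bit kernel); the 256-byte storage
size and the reserved count slot come from the description.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_HIST_KEY_BYTES	256
#define DEMO_STACK_DEPTH	31

struct demo_stack_key {
	uint64_t nr_entries;			/* one slot holds the entry count */
	uint64_t entries[DEMO_STACK_DEPTH];	/* stack entries, 8 bytes each */
};

/* One count slot plus 31 entries fill the 256 bytes exactly. */
static_assert(sizeof(struct demo_stack_key) == DEMO_HIST_KEY_BYTES,
	      "stack key must fill the histogram storage exactly");

/* Largest depth that fits once one word is reserved for the count. */
static unsigned int demo_max_depth(unsigned int key_bytes)
{
	assert(key_bytes >= sizeof(uint64_t));
	return key_bytes / sizeof(uint64_t) - 1;
}
```

With 256 bytes of storage, 256 / 8 = 32 slots, and 32 - 1 = 31 usable entries,
so the old limit of 16 left 15 slots unused.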
* tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (46 commits)
function_graph: Restore direct mode when callbacks drop to one
tracing: Fix indentation of return statement in print_trace_fmt()
tracing: Reset last_boot_info if ring buffer is reset
tracing: Fix to set write permission to per-cpu buffer_size_kb
tracing: Fix false sharing in hwlat get_sample()
tracing: Move d_max_latency out of CONFIG_FSNOTIFY protection
tracing: Better separate SNAPSHOT and MAX_TRACE options
tracing: Add tracer_uses_snapshot() helper to remove #ifdefs
tracing: Rename trace_array field max_buffer to snapshot_buffer
tracing: Move pid filtering into trace_pid.c
tracing: Move trace_printk functions out of trace.c and into trace_printk.c
tracing: Use system_state in trace_printk_init_buffers()
tracing: Have trace_printk functions use flags instead of using global_trace
tracing: Make tracing_update_buffers() take NULL for global_trace
tracing: Make printk_trace global for tracing system
tracing: Move ftrace_trace_stack() out of trace.c and into trace.h
tracing: Move __trace_buffer_{un}lock_*() functions to trace.h
tracing: Make tracing_selftest_running global to the tracing subsystem
tracing: Make tracing_disabled global for tracing system
tracing: Clean up use of trace_create_maxlat_file()
...
The init_blk_tracer() function causes significant boot delay as it
waits for the trace_event_sem lock held by trace_event_update_all().
Specifically, its child function register_trace_event() requires
this lock, which is occupied for an extended period during boot.
To resolve this, the execution of primary init_blk_tracer() is moved
to the trace_init_wq workqueue, allowing it to run asynchronously,
and preventing it from blocking the main boot thread.
Link: https://patch.msgid.link/20260204015353.163331-1-tianyaxiong@kylinos.cn
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
0-day bot flagged the use of strcpy() in blk_trace_setup(), because the
source buffer can theoretically be bigger than the destination buffer.
While none of the current callers pass a string bigger than
BLKTRACE_BDEV_SIZE, use strscpy() to prevent eventual future misuse and
silence the checker warnings.
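The difference in behaviour can be sketched in user space. demo_strscpy() below
is a hypothetical stand-in written for illustration, not the kernel helper: it
copies at most size - 1 bytes, always NUL-terminates, and returns -1 where the
kernel's strscpy() would return -E2BIG.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bounded copy: returns the number of bytes copied (excluding the NUL),
 * or -1 if src did not fit and was truncated.
 */
static long demo_strscpy(char *dst, const char *src, size_t size)
{
	size_t len;

	assert(dst != NULL && src != NULL);
	if (size == 0)
		return -1;

	for (len = 0; len < size; len++) {
		dst[len] = src[len];
		if (src[len] == '\0')
			return (long)len;	/* whole string fit */
	}

	dst[size - 1] = '\0';			/* truncated: still terminate */
	return -1;
}
```

Unlike strcpy(), an oversized source can neither write past the destination
nor go unnoticed, since the caller can check the return value.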
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202602020718.GUEIRyG9-lkp@intel.com/
Fixes: 113cbd6282 ("blktrace: pass blk_user_trace2 to setup functions")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Merge branch shared with kprobes on extending trace options
The trace options were defined by a 32 bit variable. This limits the
tracing instances to have a total of 32 different options. As that limit
has been hit, and more options are being added, increase the option mask
to a 64 bit number, doubling the number of options available.
As this is required for the kprobe topic branches as well as the tracing
topic branch, a separate branch was created and merged into both.
- Make trace_user_fault_read() available for the rest of tracing
The function trace_user_fault_read() is used by trace_marker file read to
allow reading user space to be done fast and without locking or
allocations. Make this available so that the system call trace events can
use it too.
- Have system call trace events read user space values
Now that the system call trace events callbacks are called in a faultable
context, take advantage of this and read the user space buffers for
various system calls. For example, show the path name of the openat system
call instead of just showing the pointer to that path name in user space.
Also show the contents of the buffer of the write system call. Several
system call trace events are updated to make tracing into a light weight
strace tool for all applications in the system.
- Update perf system call tracing to do the same
- Add a config and syscall_user_buf_size file to control the size of the buffer
Limit the amount of data that can be read from user space. The default
size is 63 bytes but that can be expanded to 165 bytes.
- Allow the persistent ring buffer to print system calls normally
The persistent ring buffer prints trace events by their type and ignores
the print_fmt. This is because the print_fmt may change from kernel to
kernel. As the system call output is fixed by the system call ABI itself,
there's no reason to limit that. This makes reading the system call events
in the persistent ring buffer much nicer and easier to understand.
- Add options to show text offset to function profiler
The function profiler that counts the number of times a function is hit
currently lists all functions by its name and offset. But this becomes
ambiguous when there are several functions with the same name. Add a
tracing option that changes the output to be that of _text+offset
instead. Now a user space tool can use this information to map the
_text+offset to the unique function it is counting.
- Report bad dynamic event command
If a bad command is passed to the dynamic_events file, report it properly
in the error log.
- Clean up tracer options
Clean up the tracer option code a bit, by removing some useless code and
also using switch statements instead of a series of if statements.
- Have tracing options be instance specific
Tracers can have their own options (function tracer, irqsoff tracer,
function graph tracer, etc). But now that the same tracer can be enabled
in multiple trace instances, their options are still global. The API is
per instance, thus changing one affects other instances. This isn't even
consistent, as the options take effect differently depending on when a
tracer started in an instance. Make the options for instances only affect
the instance they are changed under.
- Optimize pid_list lock contention
Whenever the pid_list is read, it uses a spin lock. This happens at every
sched switch. Taking the lock at sched switch can be removed by instead
using a seqlock counter.
- Clean up the trace trigger structures
The trigger code uses two different structures to implement a single
trigger. This was due to trying to reuse code for the two different types
of triggers (always on trigger, and count limited trigger). But by adding
a single field to one structure, the other structure could be absorbed
into the first structure making the code easier to understand.
- Create a bulk garbage collector for trace triggers
If user space has triggers for several hundreds of events and then removes
them, it can take several seconds to complete. This is because each
removal calls the slow tracepoint_synchronize_unregister() that can take
hundreds of milliseconds to complete. Instead, create a helper thread that
will do the clean up. When a trigger is removed, it will create the
kthread if it isn't already created, and then add the trigger to a llist.
The kthread will take the items off the llist, call
tracepoint_synchronize_unregister(), and then remove the items it took
off. It will then check if there are more items to free before sleeping.
This lets user space finish removing all these triggers in less than a
second.
- Allow function tracing of some of the tracing infrastructure code
Because the tracing code can cause recursion issues if it is traced by the
function tracer, the entire tracing directory disables function tracing.
But not all of tracing causes issues if it is traced. Namely, the event
tracing code. Add a config that enables some of the tracing code to be
traced to help in debugging it. Note, when this is enabled, it does add
noise to general function tracing, especially if events are enabled as
well (which is a common case).
- Add boot-time backup instance for persistent buffer
The persistent ring buffer is used mostly for kernel crash analysis in the
field. One issue is that if there's a crash, the data in the persistent
ring buffer must be read before tracing can begin using it. This slows
down the boot process. Once tracing starts in the persistent ring buffer,
the old data must be freed and the addresses no longer match and old
events can't be in the buffer with new events.
Create a way to create a backup buffer that copies the persistent ring
buffer at boot up. Then after a crash, the always on tracer can begin
immediately as well as the normal boot process while the crash analysis
tooling uses the backup buffer. After the backup buffer is finished being
read, it can be removed.
- Enable function graph args and return address options at the same time
Currently, when reading of arguments in the function graph tracer is
enabled, the option to record the parent function in the entry event
cannot be enabled. Update the code so that it can.
- Add new struct_offset() helper macro
Add a new macro that takes a pointer to a structure and a name of one of
its members and it will return the offset of that member. This allows the
ring buffer code to simplify the following:
From: size = struct_size(entry, buf, cnt - sizeof(entry->id));
To: size = struct_offset(entry, id) + cnt;
There should be other simplifications that this macro can help out with as
well.
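The From/To equivalence in the struct_offset() item can be checked with a
user-space sketch. The macro body and the entry layout below are assumptions
made for illustration (the real definitions may differ), with __typeof__ and
offsetof() standing in for whatever the kernel macro uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of MEMBER within the structure that P points to. */
#define demo_struct_offset(P, MEMBER) offsetof(__typeof__(*(P)), MEMBER)

/* Illustrative entry layout: a header word, an id, then the payload. */
struct demo_entry {
	uint32_t type;
	uint32_t id;
	char buf[];
};

/* "From" form: whole struct plus the payload that follows id. */
static size_t demo_size_from(struct demo_entry *entry, size_t cnt)
{
	assert(cnt >= sizeof(entry->id));
	return sizeof(*entry) +
	       (cnt - sizeof(entry->id)) * sizeof(entry->buf[0]);
}

/* "To" form: cnt bytes counted from the start of id. */
static size_t demo_size_to(struct demo_entry *entry, size_t cnt)
{
	return demo_struct_offset(entry, id) + cnt;
}
```

The two forms agree here because id is immediately followed by buf with no
padding, so sizeof(*entry) equals the offset of id plus sizeof(id); a layout
with padding after id would need a closer look.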
Merge tag 'trace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Extend tracing option mask to 64 bits
The trace options were defined by a 32 bit variable. This limits the
tracing instances to have a total of 32 different options. As that
limit has been hit, and more options are being added, increase the
option mask to a 64 bit number, doubling the number of options
available.
As this is required for the kprobe topic branches as well as the
tracing topic branch, a separate branch was created and merged into
both.
- Make trace_user_fault_read() available for the rest of tracing
The function trace_user_fault_read() is used by trace_marker file
read to allow reading user space to be done fast and without locking
or allocations. Make this available so that the system call trace
events can use it too.
- Have system call trace events read user space values
Now that the system call trace events callbacks are called in a
faultable context, take advantage of this and read the user space
buffers for various system calls. For example, show the path name of
the openat system call instead of just showing the pointer to that
path name in user space. Also show the contents of the buffer of the
write system call. Several system call trace events are updated to
make tracing into a light weight strace tool for all applications in
the system.
- Update perf system call tracing to do the same
- Add a config and syscall_user_buf_size file to control the size of
the buffer
Limit the amount of data that can be read from user space. The
default size is 63 bytes but that can be expanded to 165 bytes.
- Allow the persistent ring buffer to print system calls normally
The persistent ring buffer prints trace events by their type and
ignores the print_fmt. This is because the print_fmt may change from
kernel to kernel. As the system call output is fixed by the system
call ABI itself, there's no reason to limit that. This makes reading
the system call events in the persistent ring buffer much nicer and
easier to understand.
- Add options to show text offset to function profiler
The function profiler that counts the number of times a function is
hit currently lists all functions by their name and offset. But this
becomes ambiguous when there are several functions with the same
name.
Add a tracing option that changes the output to be that of
'_text+offset' instead. Now a user space tool can use this
information to map the '_text+offset' to the unique function it is
counting.
- Report bad dynamic event command
If a bad command is passed to the dynamic_events file, report it
properly in the error log.
- Clean up tracer options
Clean up the tracer option code a bit, by removing some useless code
and also using switch statements instead of a series of if
statements.
- Have tracing options be instance specific
Tracers can have their own options (function tracer, irqsoff tracer,
function graph tracer, etc). But now that the same tracer can be
enabled in multiple trace instances, their options are still global.
The API is per instance, thus changing one affects other instances.
This isn't even consistent, as the options take effect differently
depending on when a tracer started in an instance. Make the options
for instances only affect the instance they are changed under.
- Optimize pid_list lock contention
Whenever the pid_list is read, it uses a spin lock. This happens at
every sched switch. Taking the lock at sched switch can be removed by
instead using a seqlock counter.
- Clean up the trace trigger structures
The trigger code uses two different structures to implement a single
trigger. This was due to trying to reuse code for the two different
types of triggers (always on trigger, and count limited trigger). But
by adding a single field to one structure, the other structure could
be absorbed into the first structure, making the code easier to
understand.
- Create a bulk garbage collector for trace triggers
If user space has triggers for several hundreds of events and then
removes them, it can take several seconds to complete. This is
because each removal calls tracepoint_synchronize_unregister() that
can take hundreds of milliseconds to complete.
Instead, create a helper thread that will do the clean up. When a
trigger is removed, it will create the kthread if it isn't already
created, and then add the trigger to a llist. The kthread will take
the items off the llist, call tracepoint_synchronize_unregister(),
and then remove the items it took off. It will then check if there
are more items to free before sleeping.
This lets user space finish removing all these triggers in less than
a second.
- Allow function tracing of some of the tracing infrastructure code
Because the tracing code can cause recursion issues if it is traced
by the function tracer, the entire tracing directory disables function
tracing. But not all of the tracing code causes issues if it is
traced; the event tracing code, for example, does not. Add a config
that enables some of the tracing code to be traced to help in
debugging it. Note, when this is enabled, it does add noise to general
function tracing, especially if events are enabled as well (which is a
common case).
- Add boot-time backup instance for persistent buffer
The persistent ring buffer is used mostly for kernel crash analysis
in the field. One issue is that if there's a crash, the data in the
persistent ring buffer must be read before tracing can begin using
it. This slows down the boot process. Once tracing starts in the
persistent ring buffer, the old data must be freed and the addresses
no longer match and old events can't be in the buffer with new
events.
Create a way to create a backup buffer that copies the persistent
ring buffer at boot up. Then after a crash, the always on tracer can
begin immediately as well as the normal boot process while the crash
analysis tooling uses the backup buffer. After the backup buffer is
finished being read, it can be removed.
- Enable function graph args and return address options at the same
time
Currently, when reading of arguments in the function graph tracer
is enabled, the option to record the parent function in the entry
event cannot be enabled. Update the code so that it can.
- Add new struct_offset() helper macro
Add a new macro that takes a pointer to a structure and a name of one
of its members and it will return the offset of that member. This
allows the ring buffer code to simplify the following:
From: size = struct_size(entry, buf, cnt - sizeof(entry->id));
To: size = struct_offset(entry, id) + cnt;
There should be other simplifications that this macro can help out
with as well.
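The struct_offset() simplification above can be checked with a small userspace sketch. The macro definitions and the entry layout below are assumptions made for illustration (they are not copied from the kernel headers), and __typeof__ is a GCC/Clang extension:
```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel macros (assumed definitions). */
#define struct_offset(p, member)  offsetof(__typeof__(*(p)), member)
#define struct_size(p, member, n) (sizeof(*(p)) + (n) * sizeof((p)->member[0]))

/* Hypothetical entry: a fixed header, an id, then a variable payload. */
struct raw_entry {
	unsigned short type;
	unsigned char  flags;
	unsigned char  preempt_count;
	int            pid;
	unsigned int   id;
	char           buf[];
};

/* cnt is the number of bytes supplied by the writer, id included. */
static inline size_t size_old(const struct raw_entry *entry, size_t cnt)
{
	return struct_size(entry, buf, cnt - sizeof(entry->id));
}

static inline size_t size_new(const struct raw_entry *entry, size_t cnt)
{
	return struct_offset(entry, id) + cnt;
}

/* Both forms agree as long as buf directly follows id without padding. */
static void struct_offset_demo(void)
{
	const struct raw_entry *entry = NULL;	/* only used for its type */

	assert(size_old(entry, 20) == size_new(entry, 20));
}
```
The two forms are equal here only because the flexible array starts right after id; a layout with padding between them would need a closer look.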
* tag 'trace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (42 commits)
overflow: Introduce struct_offset() to get offset of member
function_graph: Enable funcgraph-args and funcgraph-retaddr to work simultaneously
tracing: Add boot-time backup of persistent ring buffer
ftrace: Allow tracing of some of the tracing code
tracing: Use strim() in trigger_process_regex() instead of skip_spaces()
tracing: Add bulk garbage collection of freeing event_trigger_data
tracing: Remove unneeded event_mutex lock in event_trigger_regex_release()
tracing: Merge struct event_trigger_ops into struct event_command
tracing: Remove get_trigger_ops() and add count_func() from trigger ops
tracing: Show the tracer options in boot-time created instance
ftrace: Avoid redundant initialization in register_ftrace_direct
tracing: Remove unused variable in tracing_trace_options_show()
fgraph: Make fgraph_no_sleep_time signed
tracing: Convert function graph set_flags() to use a switch() statement
tracing: Have function graph tracer option sleep-time be per instance
tracing: Move graph-time out of function graph options
tracing: Have function graph tracer option funcgraph-irqs be per instance
trace/pid_list: optimize pid_list->lock contention
tracing: Have function graph tracer define options per instance
tracing: Have function tracer define options per instance
...
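The bulk garbage collection of trace triggers described in the pull message can be sketched in userspace as follows. This is only a model of the idea (queue lock-lessly, synchronize once per batch); the names are hypothetical and synchronize_stub() merely counts calls in place of tracepoint_synchronize_unregister():
```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical trigger object queued for deferred freeing. */
struct trigger {
	struct trigger *next;
};

static _Atomic(struct trigger *) free_list;	/* lock-less LIFO, like an llist */
static int sync_calls;				/* number of expensive waits */
static int freed;				/* number of triggers freed */

/* Stand-in for the expensive synchronization; it is only counted here. */
static void synchronize_stub(void)
{
	sync_calls++;
}

/* Removal path: queue one trigger without blocking. */
static void trigger_defer_free(struct trigger *t)
{
	struct trigger *head = atomic_load(&free_list);

	do {
		t->next = head;
	} while (!atomic_compare_exchange_weak(&free_list, &head, t));
}

/* Worker: take the whole batch, synchronize once, then free every item. */
static void trigger_gc_run(void)
{
	struct trigger *batch = atomic_exchange(&free_list, (struct trigger *)NULL);

	if (!batch)
		return;
	synchronize_stub();
	while (batch) {
		struct trigger *next = batch->next;

		free(batch);
		freed++;
		batch = next;
	}
}

static void trigger_gc_demo(void)
{
	for (int i = 0; i < 300; i++) {
		struct trigger *t = malloc(sizeof(*t));

		assert(t);
		trigger_defer_free(t);
	}
	trigger_gc_run();
	/* 300 removals cost a single synchronization instead of 300. */
	assert(sync_calls == 1 && freed == 300);
}
```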
Since enum trace_iterator_flags is 32bit, the max number of the
option flags is limited to 32, and all of them are in use now. To add
a new option, we need to expand it.
So replace TRACE_ITER_##flag with a TRACE_ITER(flag) macro, which
is a 64bit bitmask.
Link: https://lore.kernel.org/all/176187877103.994619.166076000668757232.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
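A minimal sketch of the TRACE_ITER(flag) idea from the commit above, assuming the macro builds a 64-bit mask from a per-option bit number. The bit numbers and the NEW_OPTION name are made up for illustration:
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit numbers; only the naming pattern follows the text. */
enum trace_iter_bits {
	TRACE_ITER_PRINT_PARENT_BIT = 0,
	TRACE_ITER_SYM_OFFSET_BIT   = 1,
	TRACE_ITER_NEW_OPTION_BIT   = 32,	/* does not fit in a 32-bit mask */
};

/* Assumed shape of the macro: a 64-bit mask with one bit set. */
#define TRACE_ITER(flag) (UINT64_C(1) << TRACE_ITER_##flag##_BIT)

static inline int trace_option_set(uint64_t mask, uint64_t flag)
{
	return (mask & flag) != 0;
}

static void trace_iter_demo(void)
{
	uint64_t mask = TRACE_ITER(PRINT_PARENT) | TRACE_ITER(NEW_OPTION);

	assert(trace_option_set(mask, TRACE_ITER(NEW_OPTION)));
	assert(!trace_option_set(mask, TRACE_ITER(SYM_OFFSET)));
	/* Storing the mask in 32 bits silently drops the new option. */
	assert(!trace_option_set((uint32_t)mask, TRACE_ITER(NEW_OPTION)));
}
```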
The ftrace blktrace path allocates buffers and writes trace events but
was using the wrong recording function. After
commit 4d8bc7bd4f ("blktrace: move ftrace blk_io_tracer to blk_io_trace2"),
the ftrace interface was moved to use blk_io_trace2 format, but
__blk_add_trace() still called record_blktrace_event() which writes in
blk_io_trace (v1) format.
This causes critical data corruption:
- blk_io_trace (v1) has 32-bit 'action' field at offset 28
- blk_io_trace2 (v2) has 32-bit 'pid' at offset 28 and 64-bit 'action'
at offset 32
- When record_blktrace_event() writes to a v2 buffer:
* Writing pid (offset 32 in v1) corrupts the v2 action field
* Writing action (offset 28 in v1) corrupts the v2 pid field
* The 64-bit action is truncated to 32-bit via lower_32_bits()
Fix by:
1. Adding version switch to select correct format (v1 vs v2)
2. Calling appropriate recording function based on version
3. Defaulting to v2 for ftrace (as intended by commit 4d8bc7bd4f)
4. Adding WARN_ONCE for unexpected version values
Without this patch :-
linux-block (for-next) # sh reproduce_blktrace_bug.sh
dd-14242 [033] d..1. 3903.022308: Unknown action 36a2
dd-14242 [033] d..1. 3903.022333: Unknown action 36a2
dd-14242 [033] d..1. 3903.022365: Unknown action 36a2
dd-14242 [033] d..1. 3903.022366: Unknown action 36a2
dd-14242 [033] d..1. 3903.022369: Unknown action 36a2
The action field is corrupted because:
- ftrace allocated blk_io_trace2 buffer (64 bytes)
- But called record_blktrace_event() (writes v1, 48 bytes)
- Field offsets don't match, causing corruption
The hex value shown 0x30e3 is actually a PID, not an action code!
linux-block (for-next) #
linux-block (for-next) #
linux-block (for-next) # sh reproduce_blktrace_bug.sh
Trace output looks correct:
dd-2420 [019] d..1. 59.641742: 251,0 Q RS 0 + 8 [dd]
dd-2420 [019] d..1. 59.641775: 251,0 G RS 0 + 8 [dd]
dd-2420 [019] d..1. 59.641784: 251,0 P N [dd]
dd-2420 [019] d..1. 59.641785: 251,0 U N [dd] 1
dd-2420 [019] d..1. 59.641788: 251,0 D RS 0 + 8 [dd]
Fixes: 4d8bc7bd4f ("blktrace: move ftrace blk_io_tracer to blk_io_trace2")
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
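The offset mismatch described in the commit above can be reproduced in userspace. The structs below follow the offsets and sizes quoted in the message (action at 28 and pid at 32 in v1, pid at 28 and a 64-bit action at 32 in v2, 48 vs 64 bytes); the remaining v2 fields and the padding are an assumption for this sketch:
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* v1 record: 48 bytes, 32-bit action at offset 28, pid at offset 32. */
struct trace_v1 {
	uint32_t magic, sequence;
	uint64_t time, sector;
	uint32_t bytes;
	uint32_t action;
	uint32_t pid;
	uint32_t device, cpu;
	uint16_t error, pdu_len;
};

/* v2 record: 64 bytes, pid at offset 28, 64-bit action at offset 32.
 * Everything after action is assumed for illustration. */
struct trace_v2 {
	uint32_t magic, sequence;
	uint64_t time, sector;
	uint32_t bytes;
	uint32_t pid;
	uint64_t action;
	uint32_t device, cpu;
	uint16_t error, pdu_len;
	uint8_t  pad[12];
};

/* Write a v1 record into a v2-sized buffer, then read the v2 pid field. */
static uint32_t v2_pid_after_v1_write(uint32_t action, uint32_t pid)
{
	unsigned char buf[sizeof(struct trace_v2)] = { 0 };
	struct trace_v1 w = { .action = action, .pid = pid };
	struct trace_v2 r;

	memcpy(buf, &w, sizeof(w));	/* v1 writer fills the buffer */
	memcpy(&r, buf, sizeof(r));	/* v2 reader interprets the same bytes */
	return r.pid;			/* holds the v1 action, not the pid */
}

static void layout_demo(void)
{
	assert(sizeof(struct trace_v1) == 48 && sizeof(struct trace_v2) == 64);
	assert(offsetof(struct trace_v1, action) == offsetof(struct trace_v2, pid));
	assert(offsetof(struct trace_v1, pid) == offsetof(struct trace_v2, action));
	assert(v2_pid_after_v1_write(0x11, 14242) == 0x11);
}
```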
The WARN_ON_ONCE introduced in
commit f9ee38bbf7 ("blktrace: add block trace commands for zone operations")
triggers kernel warnings when zone operations are traced with blktrace
version 1. This can spam the kernel log during normal operation with
zoned block devices when userspace is using the legacy blktrace
protocol.
Currently the blktrace implementation drops the newly added
REQ_OP_ZONE_XXX operations when the blktrace userspace version is set to 1.
Remove the WARN_ON_ONCE and quietly filter these events. Add a
rate-limited debug message to help diagnose potential issues without
flooding the kernel log. The debug message can be enabled via dynamic
debug when needed for troubleshooting.
This approach is more appropriate as encountering zone operations with
blktrace v1 is an expected condition that should be handled gracefully
rather than warned about, since users may be running older blktrace
userspace tools that only support version 1 of the protocol.
With this patch :-
linux-block (for-next) # git log -1
commit c8966006a0971d2b4bf94c0426eb7e4407c6853f (HEAD -> for-next)
Author: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Date: Mon Oct 27 19:26:53 2025 -0700
blktrace: use debug print to report dropped events
linux-block (for-next) # cdblktests
blktests (master) # ./check blktrace
blktrace/001 (blktrace zone management command tracing) [passed]
runtime 3.805s ... 3.889s
blktests (master) # dmesg -c
blktests (master) # echo "file kernel/trace/blktrace.c +p" > /sys/kernel/debug/dynamic_debug/control
blktests (master) # ./check blktrace
blktrace/001 (blktrace zone management command tracing) [passed]
runtime 3.889s ... 3.881s
blktests (master) # dmesg -c
[ 77.826237] blktrace: blktrace v1 cannot trace zone operation 0x1000190001
[ 77.826260] blktrace: blktrace v1 cannot trace zone operation 0x1000190004
[ 77.826282] blktrace: blktrace v1 cannot trace zone operation 0x1001490007
[ 77.826288] blktrace: blktrace v1 cannot trace zone operation 0x1001890008
[ 77.826343] blktrace: blktrace v1 cannot trace zone operation 0x1000190001
[ 77.826347] blktrace: blktrace v1 cannot trace zone operation 0x1000190004
[ 77.826350] blktrace: blktrace v1 cannot trace zone operation 0x1001490007
[ 77.826354] blktrace: blktrace v1 cannot trace zone operation 0x1001890008
[ 77.826373] blktrace: blktrace v1 cannot trace zone operation 0x1000190001
[ 77.826377] blktrace: blktrace v1 cannot trace zone operation 0x1000190004
blktests (master) # echo "file kernel/trace/blktrace.c -p" > /sys/kernel/debug/dynamic_debug/control
blktests (master) # ./check blktrace
blktrace/001 (blktrace zone management command tracing) [passed]
runtime 3.881s ... 3.824s
blktests (master) # dmesg -c
blktests (master) #
Reported-by: syzbot+153e64c0aa875d7e4c37@syzkaller.appspotmail.com
Fixes: f9ee38bbf7 ("blktrace: add block trace commands for zone operations")
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
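The quiet filtering described above might look roughly like the following userspace sketch. The helper names are hypothetical, and the test for "fits in v1" (no bits set above bit 31) is an assumption based on the 64-bit values shown in the dmesg output; the kernel's actual check may differ:
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: can a 64-bit action be stored in a v1 record? */
static inline bool blktrace_v1_can_record(uint64_t action)
{
	return (action >> 32) == 0;
}

/* Returns true if the event should be recorded, false if it is dropped.
 * Dropped events are only counted here; the patch emits a rate-limited
 * debug message at this point instead of a warning. */
static inline bool blktrace_should_record(int version, uint64_t action,
					  unsigned int *dropped)
{
	if (version == 1 && !blktrace_v1_can_record(action)) {
		(*dropped)++;
		return false;
	}
	return true;
}

static void zone_filter_demo(void)
{
	unsigned int dropped = 0;

	/* Value taken from the dmesg output above. */
	assert(!blktrace_should_record(1, 0x1000190001ULL, &dropped));
	assert(blktrace_should_record(2, 0x1000190001ULL, &dropped));
	assert(dropped == 1);
}
```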
Handle the BLKTRACESETUP2 ioctl, requesting an extended version of the
blktrace protocol from user-space.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Trace zone write plugging operations on block devices.
As tracing of zoned block commands needs the upper 32bit of the widened
64bit action, only add traces to blktrace if user-space has requested
version 2 of the blktrace protocol.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Expose ZONE APPEND completions as a block trace completion action to
blktrace.
As tracing of zoned block commands needs the upper 32bit of the widened
64bit action, only add traces to blktrace if user-space has requested
version 2 of the blktrace protocol.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add block trace commands for zone operations. These commands can only be
handled with version 2 of the blktrace protocol. For version 1, warn if a
command that does not fit into the 16 bits reserved for the command in
this version is passed in.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move ftrace's blk_io_tracer to the new blk_io_trace2 infrastructure.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move trace_note() to the new blk_io_trace2 infrastructure.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Differentiate between blk_io_trace and blk_io_trace2 when relaying to
user-space depending on which version has been requested by the blktrace
utility.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add definitions for the extended version of the blktrace protocol using a
wider action type to be able to record new actions in the kernel.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass struct blk_user_trace_setup2 to blktrace_setup_finalize(). This
prepares for the incoming extension of the blktrace protocol with a 64bit
act_mask.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add definitions for a version 2 of the blk_user_trace_setup ioctl. This
new ioctl will enable a different struct layout of the binary data passed
to user-space when using a new version of the blktrace utility requesting
the new struct layout.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split do_blk_trace_setup into two functions to prepare for an incoming
new BLKTRACESETUP2 ioctl(2) which can receive extended parameters from
user-space.
Also move the size verification logic to the callers in preparation for
using a new internal structure later.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change the internal use of the action in blktrace to 64bit, although for
now only the lower 32bits will be used.
With the upcoming version 2 of the blktrace user-space protocol the upper
32bit will also be utilized.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Untangle the if/else sequence setting the trace action in
__blk_add_trace() and turn it into a switch statement for better
extensibility.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split out the code relaying a blktrace event to user-space using relayfs.
This enables adding a second version supporting a new version of the
protocol.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Factor out the recording of a blktrace event into its own function,
deduplicating the code.
This also enables recording different versions of the blktrace protocol
later on.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
De-duplicate the calculation of the trace length instead of doing the
calculation twice, once for calling trace_buffer_lock_reserve() and once
for calling relay_reserve().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- The 2 patch series "squashfs: Remove page->mapping references" from
Matthew Wilcox gets us closer to being able to remove page->mapping.
- The 5 patch series "relayfs: misc changes" from Jason Xing does some
maintenance and minor feature addition work in relayfs.
- The 5 patch series "kdump: crashkernel reservation from CMA" from Jiri
Bohac switches us from static preallocation of the kdump crashkernel's
working memory over to dynamic allocation. So the difficulty of
a-priori estimation of the second kernel's needs is removed and the
first kernel obtains extra memory.
- The 5 patch series "generalize panic_print's dump function to be used
by other kernel parts" from Feng Tang implements some consolidation and
rationalization of the various ways in which a failing kernel splats
information at the operator.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaI+82gAKCRDdBJ7gKXxA
jj4JAP9xb+w9DrBY6sa+7KTPIb+aTqQ7Zw3o9O2m+riKQJv6jAEA6aEwRnDA0451
fDT5IqVlCWGvnVikdZHSnvhdD7TGsQ0=
=rT71
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Significant patch series in this pull request:
- "squashfs: Remove page->mapping references" (Matthew Wilcox) gets
us closer to being able to remove page->mapping
- "relayfs: misc changes" (Jason Xing) does some maintenance and
minor feature addition work in relayfs
- "kdump: crashkernel reservation from CMA" (Jiri Bohac) switches
us from static preallocation of the kdump crashkernel's working
memory over to dynamic allocation. So the difficulty of a-priori
estimation of the second kernel's needs is removed and the first
kernel obtains extra memory
- "generalize panic_print's dump function to be used by other
kernel parts" (Feng Tang) implements some consolidation and
rationalization of the various ways in which a failing kernel
splats information at the operator
* tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (80 commits)
tools/getdelays: add backward compatibility for taskstats version
kho: add test for kexec handover
delaytop: enhance error logging and add PSI feature description
samples: Kconfig: fix spelling mistake "instancess" -> "instances"
fat: fix too many log in fat_chain_add()
scripts/spelling.txt: add notifer||notifier to spelling.txt
xen/xenbus: fix typo "notifer"
net: mvneta: fix typo "notifer"
drm/xe: fix typo "notifer"
cxl: mce: fix typo "notifer"
KVM: x86: fix typo "notifer"
MAINTAINERS: add maintainers for delaytop
ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()
ucount: fix atomic_long_inc_below() argument type
kexec: enable CMA based contiguous allocation
stackdepot: make max number of pools boot-time configurable
lib/xxhash: remove unused functions
init/Kconfig: restore CONFIG_BROKEN help text
lib/raid6: update recov_rvv.c zero page usage
docs: update docs after introducing delaytop
...
Add zoned block commands to blk_fill_rwbs:
- ZONE APPEND will be decoded as 'ZA'
- ZONE RESET will be decoded as 'ZR'
- ZONE RESET ALL will be decoded as 'ZRA'
- ZONE FINISH will be decoded as 'ZF'
- ZONE OPEN will be decoded as 'ZO'
- ZONE CLOSE will be decoded as 'ZC'
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250715115324.53308-2-johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
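The decoding listed in the commit above amounts to a lookup from zone operation to a short string. A minimal sketch, using a hypothetical enum rather than the kernel's REQ_OP_* values, and "?" as a placeholder for anything else:
```c
#include <assert.h>
#include <string.h>

/* Hypothetical operation codes for this sketch only. */
enum zone_op {
	ZONE_APPEND,
	ZONE_RESET,
	ZONE_RESET_ALL,
	ZONE_FINISH,
	ZONE_OPEN,
	ZONE_CLOSE,
	ZONE_OP_OTHER,
};

/* Map a zone operation to the rwbs string listed in the commit message. */
static const char *zone_op_rwbs(enum zone_op op)
{
	switch (op) {
	case ZONE_APPEND:    return "ZA";
	case ZONE_RESET:     return "ZR";
	case ZONE_RESET_ALL: return "ZRA";
	case ZONE_FINISH:    return "ZF";
	case ZONE_OPEN:      return "ZO";
	case ZONE_CLOSE:     return "ZC";
	default:             return "?";	/* placeholder, not kernel behaviour */
	}
}

static void zone_rwbs_demo(void)
{
	assert(strcmp(zone_op_rwbs(ZONE_APPEND), "ZA") == 0);
	assert(strcmp(zone_op_rwbs(ZONE_RESET_ALL), "ZRA") == 0);
}
```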
Replace internal subbuf_start in blktrace with the default policy in
relayfs.
Remove dropped field from struct blktrace. Correspondingly, call the
common helper in relay. By incrementing full_count to keep track of how
many times we encountered a full buffer issue, user space will know how
many events were lost.
Link: https://lkml.kernel.org/r/20250612061201.34272-5-kerneljasonxing@gmail.com
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Yushan Zhou <katrinzhou@tencent.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "relayfs: misc changes", v5.
The series mostly focuses on the error counters, which help every user
debug their own kernel module.
This patch (of 5):
prev_padding represents the unused space of a given subbuffer. If the
content of a relay_write() call exceeds the space remaining in the
current subbuffer, the write skips the remaining space and records the
start point as buf->prev_padding in relay_switch_subbuf(). Since buf is
one large per-cpu buffer, keeping prev_padding as a single value for the
whole buffer, rather than per subbuffer (whose padding info is already
stored in buf->padding[]), seems meaningless in real use cases, so we
don't bother to record it any more.
Link: https://lkml.kernel.org/r/20250612061201.34272-1-kerneljasonxing@gmail.com
Link: https://lkml.kernel.org/r/20250612061201.34272-2-kerneljasonxing@gmail.com
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Yushan Zhou <katrinzhou@tencent.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Filesystems like XFS can implement atomic write I/O using either
REQ_ATOMIC flag set in the bio or via CoW operation. It will be useful
if we have a flag in trace events to distinguish between the two. This
patch adds char 'U' (Untorn writes) to rwbs field of the trace events
if REQ_ATOMIC flag is set in the bio.
<W/ REQ_ATOMIC>
=================
xfs_io-4238 [009] ..... 4148.126843: block_rq_issue: 259,0 WFSU 16384 () 768 + 32 none,0,0 [xfs_io]
<idle>-0 [009] d.h1. 4148.129864: block_rq_complete: 259,0 WFSU () 768 + 32 none,0,0 [0]
<W/O REQ_ATOMIC>
===============
xfs_io-4237 [010] ..... 4143.325616: block_rq_issue: 259,0 WS 16384 () 768 + 32 none,0,0 [xfs_io]
<idle>-0 [010] d.H1. 4143.329138: block_rq_complete: 259,0 WS () 768 + 32 none,0,0 [0]
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/44317cb2ec4588f6a2c1501a96684e6a1196e8ba.1747921498.git.ritesh.list@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block layer bounce buffering support is unused now, remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250505081138.3435992-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A recent change added return 0 before an existing return statement
at the end of function blk_trace_setup. The final return is now
redundant, so remove it.
Fixes: 64d124798244 ("blktrace: move copy_[to|from]_user() out of ->debugfs_lock")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20241204150450.399005-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Call each handler directly and have each handler grab q->debugfs_mutex,
in preparation for killing the dependency between ->debugfs_mutex and
->mmap_lock.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241128125029.4152292-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
gcc-9 warns about a possibly non-terminated string copy:
kernel/trace/blktrace.c: In function 'do_blk_trace_setup':
kernel/trace/blktrace.c:527:2: error: 'strncpy' specified bound 32 equals destination size [-Werror=stringop-truncation]
Newer versions are fine here because they see the following explicit
nul-termination. Using strscpy_pad() avoids the warning and
simplifies the code a little. The padding helps give a clean
buffer to userspace.
Link: https://lkml.kernel.org/r/20240409140059.3806717-5-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Justin Stitt <justinstitt@google.com>
Cc: Alexey Starikovskiy <astarikovskiy@suse.de>
Cc: Bob Moore <robert.moore@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Len Brown <lenb@kernel.org>
Cc: Lin Ming <ming.m.lin@intel.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Richard Russon (FlatCap)" <ldm@flatcap.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
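The behaviour that strscpy_pad() provides in the commit above (bounded copy, guaranteed NUL termination, zero padding of the rest of the buffer) can be illustrated with a userspace stand-in. copy_name_padded() is a hypothetical helper written for this sketch, not the kernel function, and its -1 return on truncation is a simplification:
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy at most size - 1 bytes, always NUL-terminate, zero-fill the rest.
 * Returns the copied length, or -1 if src did not fit. */
static long copy_name_padded(char *dst, const char *src, size_t size)
{
	size_t len = 0;
	int truncated;

	if (size == 0)
		return -1;
	while (len < size - 1 && src[len] != '\0')
		len++;
	truncated = src[len] != '\0';
	memcpy(dst, src, len);
	memset(dst + len, 0, size - len);	/* terminator plus padding */
	return truncated ? -1 : (long)len;
}

static void copy_name_demo(void)
{
	char name[32];

	memset(name, 'x', sizeof(name));	/* simulate stale buffer contents */
	assert(copy_name_padded(name, "sda", sizeof(name)) == 3);
	assert(strcmp(name, "sda") == 0 && name[31] == '\0');

	/* A source longer than 31 bytes is cut but still terminated. */
	assert(copy_name_padded(name, "a-device-name-that-is-longer-than-31-bytes",
				sizeof(name)) == -1);
	assert(strlen(name) == 31);
}
```
Unlike strncpy() with a bound equal to the destination size, the result is terminated even when the source is too long, and no stale bytes are left behind the name.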
bdev_get_queue() never returns NULL. Several commits [1][2] have been made
before to remove such superfluous checks, but some still remained.
For places where bdev_get_queue() is called solely for NULL checks, it is
removed entirely.
[1] commit ec9fd2a13d ("blk-lib: don't check bdev_get_queue() NULL check")
[2] commit fea127b36c ("block: remove superfluous check for request queue in bdev_is_zoned()")
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20230203024029.48260-1-qkrwngud825@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When calling debugfs_lookup() the result must have dput() called on it,
otherwise the memory will leak over time. To make things simpler, just
call debugfs_lookup_and_remove() instead which handles all of the logic
at once.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230202141956.2299521-1-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the blk_classic option is enabled, non-blktrace events must be
filtered out. Otherwise, events of other types are output in the blktrace
classic format, which is unexpected.
The problem can be triggered in the following ways:
# echo 1 > /sys/kernel/debug/tracing/options/blk_classic
# echo 1 > /sys/kernel/debug/tracing/events/enable
# echo blk > /sys/kernel/debug/tracing/current_tracer
# cat /sys/kernel/debug/tracing/trace_pipe
Fixes: c71a896154 ("blktrace: add ftrace plugin")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Link: https://lore.kernel.org/r/20221122040410.85113-1-yangjihong1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use only one hyphen in kernel-doc notation between the function name
and its short description.
This is the documented kernel-doc format. It also fixes the HTML
presentation to be consistent with other functions.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20221201070331.25685-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As of the previous commit, 'blk_trace_cleanup' stops the block trace if
its state is 'Blktrace_running'.
So remove the unnecessary stop of the block trace in 'blk_trace_shutdown'.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019033602.752383-4-yebin@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When testing as follows:
step1: ioctl(sda, BLKTRACESETUP, &arg)
step2: ioctl(sda, BLKTRACESTART, NULL)
step3: ioctl(sda, BLKTRACETEARDOWN, NULL)
step4: ioctl(sda, BLKTRACESETUP, &arg)
the following messages are reported:
debugfs: File 'dropped' in directory 'sda' already present!
debugfs: File 'msg' in directory 'sda' already present!
debugfs: File 'trace0' in directory 'sda' already present!
syzkaller also reports "KASAN: use-after-free Read in relay_switch_subbuf":
https://syzkaller.appspot.com/bug?id=13849f0d9b1b818b087341691be6cc3ac6a6bfb7
If the block trace is removed without first being stopped (BLKTRACESTOP),
'__blk_trace_remove' only sets 'q->blk_trace' to NULL. The debugfs files are
not removed, so the next BLKTRACESETUP reports that they are already present.
static int __blk_trace_remove(struct request_queue *q)
{
	struct blk_trace *bt;

	bt = rcu_replace_pointer(q->blk_trace, NULL,
				 lockdep_is_held(&q->debugfs_mutex));
	if (!bt)
		return -EINVAL;

	if (bt->trace_state != Blktrace_running)
		blk_trace_cleanup(q, bt);

	return 0;
}
If the test is instead run as follows:
step1: ioctl(sda, BLKTRACESETUP, &arg)
step2: ioctl(sda, BLKTRACESTART, NULL)
step3: ioctl(sda, BLKTRACETEARDOWN, NULL)
step4: remove sda
removing sda removes the debugfs directory, which recursively removes all
files under it.
>> blk_release_queue
   >> debugfs_remove_recursive(q->debugfs_dir)
So all files created in 'do_blk_trace_setup' are removed, and
'dentry->d_inode' is NULL. But 'q->blk_trace' is still on 'running_trace_list',
and 'trace_note_tsk' traverses every node on that list.
>> trace_note_tsk
   >> trace_note
      >> relay_reserve
         >> relay_switch_subbuf
            >> d_inode(buf->dentry)->i_size
To solve the above issues, with reference to commit '5afedf670c', call
'blk_trace_cleanup' unconditionally in '__blk_trace_remove' and stop the block
trace first in 'blk_trace_cleanup'.
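The before and after behaviour can be sketched with a simplified user-space
model. Everything here is hypothetical ('model_queue', 'model_setup',
'model_start', 'model_cleanup', 'model_teardown', the -1 return value); a
boolean stands in for the debugfs files and another for membership of the
running trace list. The 'fixed' parameter selects the unconditional cleanup.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_state { MODEL_SETUP, MODEL_RUNNING, MODEL_STOPPED };

struct model_trace {
	enum model_state state;
};

struct model_queue {
	struct model_trace *trace;	/* stands in for q->blk_trace */
	struct model_trace storage;	/* backing object for the trace */
	bool files_present;		/* stands in for the debugfs files */
	bool on_running_list;		/* stands in for the running list */
};

/* Models BLKTRACESETUP: fails if the old files were never removed. */
static int model_setup(struct model_queue *q)
{
	if (q->files_present)
		return -1;	/* "already present" */
	q->files_present = true;
	q->storage.state = MODEL_SETUP;
	q->trace = &q->storage;
	return 0;
}

/* Models BLKTRACESTART. */
static int model_start(struct model_queue *q)
{
	if (!q->trace)
		return -1;
	q->trace->state = MODEL_RUNNING;
	q->on_running_list = true;
	return 0;
}

/* Cleanup stops a running trace first, then removes the files. */
static void model_cleanup(struct model_queue *q, struct model_trace *t)
{
	if (t->state == MODEL_RUNNING) {
		t->state = MODEL_STOPPED;
		q->on_running_list = false;
	}
	q->files_present = false;
}

/* Models BLKTRACETEARDOWN; 'fixed' selects the unconditional cleanup. */
static int model_teardown(struct model_queue *q, bool fixed)
{
	struct model_trace *t = q->trace;

	q->trace = NULL;
	if (!t)
		return -1;
	if (fixed || t->state != MODEL_RUNNING)
		model_cleanup(q, t);
	return 0;
}
```

With fixed == false, setup, start, teardown, setup fails on the last step and
the trace stays on the running list. With fixed == true, the second setup
succeeds and nothing is left on the list.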
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019033602.752383-3-yebin@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reflect recent changes in the blk_fill_rwbs() kernel-doc header.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 919dbca867 ("blktrace: Use the new blk_opf_t type")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220715184735.2326034-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Improve static type checking by using the new blk_opf_t type for a function
argument that represents a combination of a request operation and request
flags. Rename that argument from 'op' to 'opf' to make its role
clearer.
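A user-space sketch of a combined operation-and-flags value, for illustration
only. The names ('sketch_opf_t', 'SKETCH_OP_*', 'SKETCH_FLAG_SYNC',
'sketch_op', 'sketch_flags'), the 8-bit operation field and the flag bit
position are assumptions of this sketch. A plain typedef gives no extra
checking from the C compiler; the kernel type relies on sparse annotations
for that.

```c
#include <stdint.h>

/* One 32-bit value carrying both the operation and the request flags. */
typedef uint32_t sketch_opf_t;

/* Assumed layout: the low 8 bits hold the operation, the rest hold flags. */
#define SKETCH_OP_BITS	8
#define SKETCH_OP_MASK	((sketch_opf_t)((1u << SKETCH_OP_BITS) - 1))

#define SKETCH_OP_READ		((sketch_opf_t)0)
#define SKETCH_OP_WRITE		((sketch_opf_t)1)
#define SKETCH_FLAG_SYNC	((sketch_opf_t)(1u << 11))	/* illustrative bit */

/* Extract only the operation from a combined value. */
static sketch_opf_t sketch_op(sketch_opf_t opf)
{
	return opf & SKETCH_OP_MASK;
}

/* Extract only the flags from a combined value. */
static sketch_opf_t sketch_flags(sketch_opf_t opf)
{
	return opf & ~SKETCH_OP_MASK;
}
```

Naming the parameter 'opf' signals that callers pass the combined value, for
example SKETCH_OP_WRITE | SKETCH_FLAG_SYNC, rather than a bare operation.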
Cc: Christoph Hellwig <hch@lst.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-12-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Trace the remapped operation and its flags instead of only the data
direction of remapped operations. This issue was detected by analyzing
the warnings reported by sparse related to the new blk_opf_t type.
Reviewed-by: Jun'ichi Nomura <junichi.nomura@nec.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Fixes: 1b9a9ab78b ("blktrace: use op accessors")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-11-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>