Commit graph

2020 commits

Author SHA1 Message Date
Jens Axboe
e7c30675a7 io_uring/bpf_filter: cache lookup table in ctx->bpf_filters
Currently a few pointer dereferences need to be made to both check if
BPF filters are installed, and then also to retrieve the actual filter
for the opcode. Cache the table in ctx->bpf_filters to avoid that.

Add a bit of debug info on ring exit to show if we ever got this wrong.
Small risk of that given that the table is currently only updated in one
spot, but once task forking is enabled, that will add one more spot.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-27 11:10:46 -07:00
Jens Axboe
8768770cf5 io_uring/bpf_filter: allow filtering on contents of struct open_how
This adds custom filtering for IORING_OP_OPENAT and IORING_OP_OPENAT2,
where the open_how flags, mode, and resolve can be checked by filters.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-27 11:10:46 -07:00
Jens Axboe
cff1c26b42 io_uring/net: allow filtering on IORING_OP_SOCKET data
Example population method for the BPF based opcode filtering. This
exposes the socket family, type, and protocol to a registered BPF
filter. This in turn enables the filter to make decisions based on
what was passed in to the IORING_OP_SOCKET request type.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-27 11:10:46 -07:00
Jens Axboe
d42eb05e60 io_uring: add support for BPF filtering for opcode restrictions
Add support for loading classic BPF programs with io_uring to provide
fine-grained filtering of SQE operations. Unlike
IORING_REGISTER_RESTRICTIONS which only allows bitmap-based allow/deny
of opcodes, BPF filters can inspect request attributes and make dynamic
decisions.

The filter is registered via IORING_REGISTER_BPF_FILTER with a struct
io_uring_bpf:

struct io_uring_bpf_filter {
	__u32	opcode;		/* io_uring opcode to filter */
	__u32	flags;
	__u32	filter_len;	/* number of BPF instructions */
	__u32	resv;
	__u64	filter_ptr;	/* pointer to BPF filter */
	__u64	resv2[5];
};

enum {
	IO_URING_BPF_CMD_FILTER	= 1,
};

struct io_uring_bpf {
	__u16	cmd_type;	/* IO_URING_BPF_* values */
	__u16	cmd_flags;	/* none so far */
	__u32	resv;
	union {
		struct io_uring_bpf_filter	filter;
	};
};

and the filters get supplied a struct io_uring_bpf_ctx:

struct io_uring_bpf_ctx {
	__u64	user_data;
	__u8	opcode;
	__u8	sqe_flags;
	__u8	pdu_size;
	__u8	pad[5];
};

where it's possible to filter on opcode and sqe_flags, with pdu_size
indicating how much extra data is being passed in beyond the pad field.
This will used for specific finer grained filtering inside an opcode.
An example of that for sockets is in one of the following patches.
Anything the opcode supports can end up in this struct, populated by
the opcode itself, and hence can be filtered for.

Filters have the following semantics:
  - Return 1 to allow the request
  - Return 0 to deny the request with -EACCES
  - Multiple filters can be stacked per opcode. All filters must
    return 1 for the opcode to be allowed.
  - Filters are evaluated in registration order (most recent first)

The implementation uses classic BPF (cBPF) rather than eBPF for as
that's required for containers, and since they can be used by any
user in the system.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-27 11:09:57 -07:00
Jens Axboe
e26f51f6f6 io_uring/rsrc: use GFP_KERNEL_ACCOUNT consistently
For potential long term allocations, ensure that we play nicer with
memcg and use the accounting variant of the GFP_KERNEL allocation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-25 10:07:35 -07:00
Jens Axboe
6e0d71c288 io_uring/futex: use GFP_KERNEL_ACCOUNT for futex data allocation
Be a bit nicer and ensure it plays nice with memcg accounting.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-25 10:07:09 -07:00
Pavel Begunkov
795663b4d1 io_uring/zcrx: implement large rx buffer support
There are network cards that support receive buffers larger than 4K, and
that can be vastly beneficial for performance, and benchmarks for this
patch showed up to 30% CPU util improvement for 32K vs 4K buffers.

Allows zcrx users to specify the size in struct
io_uring_zcrx_ifq_reg::rx_buf_len. If set to zero, zcrx will use a
default value. zcrx will check and fail if the memory backing the area
can't be split into physically contiguous chunks of the required size.
It's more restrictive as it only needs dma addresses to be contig, but
that's beyond this series.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: kill duplicate netdev_queues.h include]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-24 08:33:03 -07:00
Jens Axboe
816095894c io_uring/io-wq: handle !sysctl_hung_task_timeout_secs
If the hung_task_timeout sysctl is set to 0, then we'll end up busy
looping inside io_wq_exit_workers() after an earlier commit switched to
using wait_for_completion_timeout(). Use the maximum schedule timeout
value for that case.

Fixes: 1f293098a3 ("io_uring/io-wq: don't trigger hung task for syzbot craziness")
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-23 13:58:03 -07:00
Linus Torvalds
7907f673d0 io_uring-6.19-20260122
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmly5p8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqzlD/4+wBnwP7VGJSQJRHkWnewgbLk4CZcDbFdw
 jlv2Hp8AymmhNkMQmCauKJMahUYcDz0Gb+sSs3/MdOMqNvHkh1G6QVEMbK0W9N35
 HuMmimb9Pac2JSMrTiJnVcogi371P/iD1Ps7qlo08QvxV6uSQtHxtoUvsuQqwkxn
 WxhDVIxlLOZgWL0OiSAJ9maEvGWiMYeI1Me+MKatQQbkEybTBOMy34HhTe3Fi3zQ
 X3jVD2loyu/dP309/yjr2ZS2b3SGJyMCqpiICunpAeAafvnPr4vXiuUhATZHmQ1B
 XnAkkb4p0k8nECMDNZjSgqcFbDzkbHvsuyFMiSFgEELwV87qusZnbL66jmcPC3eh
 JDd6SuE4FpXNUhdGc7h6ma+vNwcr1ZNdPlUmqvlDbhcl57fcge14930OXybFPW8Q
 z3+LC5Jo37e61wYTv7M84wtsjdzdADkUb71y482U2bpeIZyg2YhTZ39FO9u6GETW
 2JhBXuikE1fanTji9klh0de2+LIbtlGZlJA1SkQiWsPRxWZB6Jg4wa2kkXZBLb94
 eR026zmg0h5Wt8g815E0XEHzv8hEqy14vGc+dj3Mw5pMYQGONNr5UOog50yihVNF
 q8dTYmtI4EN6Vbqq7usZ7Q/mN72PUzlykIT8OABPoWzvzvOZh4b2pZtKH1kfLUfN
 zMtq1Y7+dw==
 =n+AN
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fixes from Jens Axboe:

 - Fix for a potential leak of an iovec, if a specific cleanup path is
   used and the rw_cache is full at the time of the call

 - Fix for a regression added in this cycle, where waitid should be
   using prober release/acquire semantics for updating the wait queue
   head

 - Check for the cancelation bit being set for every work item processed
   by io-wq, not just at the start of the loop. Has no real practical
   implications other than to shut up syzbot doing crazy things that
   grossly overload a system, hence slowing down ring exit

 - A few selftest additions, updating the mini_liburing that selftests
   use

* tag 'io_uring-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  selftests/io_uring: support NO_SQARRAY in miniliburing
  selftests/io_uring: add io_uring_queue_init_params
  io_uring/io-wq: check IO_WQ_BIT_EXIT inside work run loop
  io_uring/waitid: fix KCSAN warning on io_waitid->head
  io_uring/rw: free potentially allocated iovec on cache put failure
2026-01-23 12:51:00 -08:00
Jens Axboe
1edf0891d0 io_uring: fix bad indentation for setup flags if statement
smatch complains about this:

smatch warnings:
io_uring/io_uring.c:2741 io_uring_sanitise_params() warn: if statement not indented

hence fix it up.

Link: https://lore.kernel.org/all/202601231651.HeTmPS8C-lkp@intel.com/
Fixes: 5247c034a6 ("io_uring: introduce non-circular SQ")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202601231651.HeTmPS8C-lkp@intel.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-23 05:09:08 -07:00
Caleb Sander Mateos
82dadc8a49 io_uring/rsrc: take unsigned index in io_rsrc_node_lookup()
io_rsrc_node_lookup() takes a signed int index as input and compares it
to an unsigned length. Since the signed int is implicitly cast to an
unsigned int for the comparison and the length is bounded by
IORING_MAX_FIXED_FILES/IORING_MAX_REG_BUFFERS, negative indices are
already rejected on architectures where int is at least 32 bits. Make
this a bit clearer and avoid compiler warnings for comparisons of
signed and unsigned values by taking an unsigned int index instead.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 15:58:17 -07:00
Pavel Begunkov
5247c034a6 io_uring: introduce non-circular SQ
Outside of SQPOLL, normally SQ entries are consumed by the time the
submission syscall returns. For those cases we don't need a circular
buffer and the head/tail tracking, instead the kernel can assume that
entries always start from the beginning of the SQ at index 0. This patch
introduces a setup flag doing exactly that. It's a simpler and helps
to keeps SQEs hot in cache.

The feature is optional and enabled by setting IORING_SETUP_SQ_REWIND.
The flag is rejected if passed together with SQPOLL as it'd require
waiting for SQ before each submission. It also requires
IORING_SETUP_NO_SQARRAY, which can be supported but it's unlikely there
will be users, so leave more space for future optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 15:47:23 -07:00
Jens Axboe
0105b0562a io_uring: split out CQ waiting code into wait.c
Move the completion queue waiting and scheduling code out of io_uring.c
into a dedicated wait.c file. This further removes code out of the
main io_uring C and header file, and into a topical new file.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 09:21:16 -07:00
Jens Axboe
7642e66860 io_uring: split out task work code into tw.c
Move the task work handling code out of io_uring.c into a new tw.c file.
This includes the local work, normal work, and fallback work handling
infrastructure.

The associated tw.h header contains io_should_terminate_tw() as a static
inline helper, along with the necessary function declarations.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 09:20:17 -07:00
Jens Axboe
1f293098a3 io_uring/io-wq: don't trigger hung task for syzbot craziness
Use the same trick that blk_io_schedule() does to avoid triggering the
hung task warning (and potential reboot/panic, depending on system
settings), and only wait for half the hung task timeout at the time.
If we exceed the default IO_URING_EXIT_WAIT_MAX period where we expect
things to certainly have finished unless there's a bug, then throw a
WARN_ON_ONCE() for that case.

Reported-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com
Tested-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 07:25:35 -07:00
Jens Axboe
dd120bddc4 io_uring: add IO_URING_EXIT_WAIT_MAX definition
Add the timeout we normally wait before complaining about things being
stuck waiting for cancelations to complete as a define, and use it in
io_ring_exit_work().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 07:25:30 -07:00
Jens Axboe
649dd18f55 io_uring/sync: validate passed in offset
Check if the passed in offset is negative once cast to sync->off. This
ensures that -EINVAL is returned for that case, like it would be for
sync_file_range(2).

Fixes: c992fe2925 ("io_uring: add fsync support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-21 11:50:59 -07:00
Ming Lei
f7bc22ca0d nvme/io_uring: optimize IOPOLL completions for local ring context
When multiple io_uring rings poll on the same NVMe queue, one ring can
find completions belonging to another ring. The current code always
uses task_work to handle this, but this adds overhead for the common
single-ring case.

This patch passes the polling io_ring_ctx through io_comp_batch's new
poll_ctx field. In io_do_iopoll(), the polling ring's context is stored
in iob.poll_ctx before calling the iopoll callbacks.

In nvme_uring_cmd_end_io(), we now compare iob->poll_ctx with the
request's owning io_ring_ctx (via io_uring_cmd_ctx_handle()). If they
match (local context), we complete inline with io_uring_cmd_done32().
If they differ (remote context) or iob is NULL (non-iopoll path), we
use task_work as before.

This optimization eliminates task_work scheduling overhead for the
common case where a ring polls and finds its own completions.

~10% IOPS improvement is observed in the following benchmark:

fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -O0 -P1 -u1 -n1 /dev/ng0n1

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-20 10:18:01 -07:00
Jens Axboe
42b12cb5fd io_uring/timeout: annotate data race in io_flush_timeouts()
syzbot correctly reports this as a KCSAN race, as ctx->cached_cq_tail
should be read under ->uring_lock. This isn't immediately feasible in
io_flush_timeouts(), but as long as we read a stable value, that should
be good enough. If two io-wq threads compete on this value, then they
will both end up calling io_flush_timeouts() and at least one of them
will see the correct value.

Reported-by: syzbot+6c48db7d94402407301e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-20 09:54:17 -07:00
Jens Axboe
10dc959398 io_uring/io-wq: check IO_WQ_BIT_EXIT inside work run loop
Currently this is checked before running the pending work. Normally this
is quite fine, as work items either end up blocking (which will create a
new worker for other items), or they complete fairly quickly. But syzbot
reports an issue where io-wq takes seemingly forever to exit, and with a
bit of debugging, this turns out to be because it queues a bunch of big
(2GB - 4096b) reads with a /dev/msr* file. Since this file type doesn't
support ->read_iter(), loop_rw_iter() ends up handling them. Each read
returns 16MB of data read, which takes 20 (!!) seconds. With a bunch of
these pending, processing the whole chain can take a long time. Easily
longer than the syzbot uninterruptible sleep timeout of 140 seconds.
This then triggers a complaint off the io-wq exit path:

INFO: task syz.4.135:6326 blocked for more than 143 seconds.
      Not tainted syzkaller #0
      Blocked by coredump.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz.4.135       state:D stack:26824 pid:6326  tgid:6324  ppid:5957   task_flags:0x400548 flags:0x00080000
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5256 [inline]
 __schedule+0x1139/0x6150 kernel/sched/core.c:6863
 __schedule_loop kernel/sched/core.c:6945 [inline]
 schedule+0xe7/0x3a0 kernel/sched/core.c:6960
 schedule_timeout+0x257/0x290 kernel/time/sleep_timeout.c:75
 do_wait_for_common kernel/sched/completion.c:100 [inline]
 __wait_for_common+0x2fc/0x4e0 kernel/sched/completion.c:121
 io_wq_exit_workers io_uring/io-wq.c:1328 [inline]
 io_wq_put_and_exit+0x271/0x8a0 io_uring/io-wq.c:1356
 io_uring_clean_tctx+0x10d/0x190 io_uring/tctx.c:203
 io_uring_cancel_generic+0x69c/0x9a0 io_uring/cancel.c:651
 io_uring_files_cancel include/linux/io_uring.h:19 [inline]
 do_exit+0x2ce/0x2bd0 kernel/exit.c:911
 do_group_exit+0xd3/0x2a0 kernel/exit.c:1112
 get_signal+0x2671/0x26d0 kernel/signal.c:3034
 arch_do_signal_or_restart+0x8f/0x7e0 arch/x86/kernel/signal.c:337
 __exit_to_user_mode_loop kernel/entry/common.c:41 [inline]
 exit_to_user_mode_loop+0x8c/0x540 kernel/entry/common.c:75
 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
 syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
 syscall_exit_to_user_mode_work include/linux/entry-common.h:159 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:194 [inline]
 do_syscall_64+0x4ee/0xf80 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fa02738f749
RSP: 002b:00007fa0281ae0e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 00007fa0275e6098 RCX: 00007fa02738f749
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007fa0275e6098
RBP: 00007fa0275e6090 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fa0275e6128 R14: 00007fff14e4fcb0 R15: 00007fff14e4fd98

There's really nothing wrong here, outside of processing these reads
will take a LONG time. However, we can speed up the exit by checking the
IO_WQ_BIT_EXIT inside the io_worker_handle_work() loop, as syzbot will
exit the ring after queueing up all of these reads. Then once the first
item is processed, io-wq will simply cancel the rest. That should avoid
syzbot running into this complaint again.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/68a2decc.050a0220.e29e5.0099.GAE@google.com/
Reported-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-20 07:55:59 -07:00
Jens Axboe
b994ace83a io_uring/waitid: fix KCSAN warning on io_waitid->head
Storing of the iw->head entry inside the wait_queue callback, or when
removing a waitid item, really should use proper load/store
acquire/release semantics, and KCSAN correctly warns of that. Ensure
that they do so.

Reported-by: syzbot+eb441775f4f948a0902f@syzkaller.appspotmail.com
Fixes: a48c0cbf28 ("io_uring/waitid: have io_waitid_complete() remove wait queue entry")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-19 19:55:30 -07:00
Jens Axboe
4b97480554 io_uring/rw: free potentially allocated iovec on cache put failure
If a read/write request goes through io_req_rw_cleanup() and has an
allocated iovec attached and fails to put to the rw_cache, then it may
end up with an unaccounted iovec pointer. Have io_rw_recycle() return
whether it recycled the request or not, and use that to gauge whether to
free a potential iovec or not.

Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-19 06:59:06 -07:00
Linus Torvalds
216c7a0326 io_uring-6.19-20260116
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmlrBd8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkJGEADcc/bWQrzg/vjW+FA05ddxJ3fQJZmtxveq
 jqODAk7zf04CkD1PrpdhWWx77Y/2wnP8sgMsAVdSQLkG+j5GWRI4DDmAOt7MxHw7
 SzT1QxQkQb2aXgC2qtnqJnWrfj6If/gWEyBapdwaxOsABOu/zi4TgmjqFn6TO30D
 MG/+CnciabJCLLa27CkmSgIkA4enND23s3UnUX6qu9PET337xkNKm8i9pOWJN5q8
 pAgwHTRTMT75KtlM7oU6RLm4h+ifHUZC2mhZ3gDsAHfzlXEHlc58iVF3+QdzylLg
 6OUIdH/l4BnM5spez2sLijmq9rUJXg/AYyDBpK7riVcTu6BbhqbbTaM6uHGVNPZd
 ScqsHHeURRCUk0Q9SPiS8e34JOFgt2yOoTMks22YvSDYcaTgYpjzNIEy659sxXcy
 JZWmSqs4Hubicr/nv7yY4Fapsv0MB+m/Av0QQ6ftCnVUa5nA54unEF861+YYT0oF
 sfUozOulLjuXgP1qp0uI66E3Db9bH1W69qP/vGKvgOkxySo/lGz15DZ40hazkUxF
 YACITZmhzZ5WVERfBA5elmZHiYPOHqwMYU5qAHth2BrsK/91+3yg85vscOCHdSYL
 ch0JUYlhEZ/LfKRqrA6+1cwxor7Jog5Q5BuWcfqBG3ZiV1+LOopYLCFcGyeBzxne
 tTgyow5vhQ==
 =zH2C
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fix from Jens Axboe:
 "Just a single fix moving local task_work inside the cancelation loop,
  rather than only before cancelations.

  If any cancelations generate task_work, we do need to re-run it"

* tag 'io_uring-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring: move local task_work in exit cancel loop
2026-01-16 20:56:56 -08:00
Al Viro
5b9d406ff7 filename_...xattr(): don't consume filename reference
Callers switched to CLASS(filename_maybe_null) (in fs/xattr.c)
and CLASS(filename_complete_delayed) (in io_uring/xattr.c).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-16 12:52:03 -05:00
Al Viro
e50aae1d39 non-consuming variants of do_{unlinkat,rmdir}()
similar to previous commit; replacements are filename_{unlinkat,rmdir}()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-16 12:51:50 -05:00
Al Viro
dc912db15a non-consuming variant of do_mkdirat()
similar to previous commit; replacement is filename_mkdirat()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-16 12:48:49 -05:00
Al Viro
da72b76aae non-consuming variant of do_symlinkat()
similar to previous commit; replacement is filename_symlinkat()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-16 12:48:16 -05:00
Al Viro
037193b0ae non-consuming variant of do_linkat()
similar to previous commit; replacement is filename_linkat()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-16 12:47:42 -05:00
Al Viro
e6d50234cc non-consuming variant of do_renameat2()
filename_renameat2() replaces do_renameat2(); unlike the latter,
it does not drop filename references - these days it can be just
as easily arranged in the caller.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-16 12:46:57 -05:00
Jens Axboe
8661d0b142 io_uring/uring_cmd: explicitly disallow cancelations for IOPOLL
This currently isn't supported, and due to a recent commit, it also
cannot easily be supported by io_uring due to hash_node and IOPOLL
completion data overlapping.

This can be revisited if we ever do support cancelations of requests
that have gone to the block stack.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-14 22:04:11 -07:00
Jens Axboe
697a5284ad io_uring: fix IOPOLL with passthrough I/O
A previous commit improving IOPOLL made an incorrect assumption that
task_work isn't used with IOPOLL. This can cause crashes when doing
passthrough I/O on nvme, where queueing the completion task_work will
trample on the same memory that holds the completed list of requests.

Fix it up by shuffling the members around, so we're not sharing any
parts that end up getting used in this path.

Fixes: 3c7d76d612 ("io_uring: IOPOLL polling improvements")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/linux-block/CAHj4cs_SLPj9v9w5MgfzHKy+983enPx3ZQY2kMuMJ1202DBefw@mail.gmail.com/
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-14 22:03:49 -07:00
Ming Lei
da579f05ef io_uring: move local task_work in exit cancel loop
With IORING_SETUP_DEFER_TASKRUN, task work is queued to ctx->work_llist
(local work) rather than the fallback list. During io_ring_exit_work(),
io_move_task_work_from_local() was called once before the cancel loop,
moving work from work_llist to fallback_llist.

However, task work can be added to work_llist during the cancel loop
itself. There are two cases:

1) io_kill_timeouts() is called from io_uring_try_cancel_requests() to
cancel pending timeouts, and it adds task work via io_req_queue_tw_complete()
for each cancelled timeout:

2) URING_CMD requests like ublk can be completed via
io_uring_cmd_complete_in_task() from ublk_queue_rq() during canceling,
given ublk request queue is only quiesced when canceling the 1st uring_cmd.

Since io_allowed_defer_tw_run() returns false in io_ring_exit_work()
(kworker != submitter_task), io_run_local_work() is never invoked,
and the work_llist entries are never processed. This causes
io_uring_try_cancel_requests() to loop indefinitely, resulting in
100% CPU usage in kworker threads.

Fix this by moving io_move_task_work_from_local() inside the cancel
loop, ensuring any work on work_llist is moved to fallback before
each cancel attempt.

Cc: stable@vger.kernel.org
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-14 10:18:19 -07:00
Al Viro
541003b576 rename do_filp_open() to do_file_open()
"filp" thing never made sense; seeing that there are exactly 4 callers
in the entire tree (and it's neither exported nor even declared in
linux/*/*.h), there's no point keeping that ugliness.

FWIW, the 'filp' thing did originate in OSD&I; for some reason Tanenbaum
decided to call the object representing an opened file 'struct filp',
the last letter standing for 'position'.  In all Unices, Linux included,
the corresponding object had always been 'struct file'...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13 15:18:07 -05:00
Al Viro
9fa3ec8458 allow incomplete imports of filenames
There are two filename-related problems in io_uring and its
interplay with audit.

Filenames are imported when request is submitted and used when
it is processed.  Unfortunately, the latter may very well
happen in a different thread.  In that case the reference to
filename is put into the wrong audit_context - that of submitting
thread, not the processing one.  Audit logics is called by
the latter, and it really wants to be able to find the names
in audit_context current (== processing) thread.

Another related problem is the headache with refcounts -
normally all references to given struct filename are visible
only to one thread (the one that uses that struct filename).
io_uring violates that - an extra reference is stashed in
audit_context of submitter.  It gets dropped when submitter
returns to userland, which can happen simultaneously with
processing thread deciding to drop the reference it got.

We paper over that by making refcount atomic, but that means
pointless headache for everyone.

Solution: the notion of partially imported filenames.  Namely,
already copied from userland, but *not* exposed to audit yet.

io_uring can create that in submitter thread, and complete the
import (obtaining the usual reference to struct filename) in
processing thread.

Object: struct delayed_filename.

Primitives for working with it:

delayed_getname(&delayed_filename, user_string) - copies the name from
userland, returning 0 and stashing the address of (still incomplete)
struct filename in delayed_filename on success and returning -E... on
error.

delayed_getname_uflags(&delayed_filename, user_string, atflags) -
similar, in the same relation to delayed_getname() as getname_uflags()
is to getname()

complete_getname(&delayed_filename) - completes the import of filename
stashed in delayed_filename and returns struct filename to caller,
emptying delayed_filename.

CLASS(filename_complete_delayed, name)(&delayed_filename) - variant of
CLASS(filename) with complete_getname() for constructor.

dismiss_delayed_filename(&delayed_filename) - destructor; drops whatever
might be stashed in delayed_filename, emptying it.

putname_to_delayed(&delayed_filename, name) - if name is shared, stashes
its copy into delayed_filename and drops the reference to name, otherwise
stashes the name itself in there.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13 15:18:07 -05:00
Jens Axboe
d6406c45f1 io_uring: track restrictions separately for IORING_OP and IORING_REGISTER
It's quite likely that only register opcode restrictions exists, in
which case we'd never need to check the normal opcodes. Split
ctx->restricted into two separate fields, one for I/O opcodes, and one
for register opcodes.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-13 10:31:48 -07:00
Jens Axboe
991fb85a1d io_uring: move ctx->restricted check into io_check_restriction()
Just a cleanup, makes the code easier to read without too many dependent
nested checks.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-13 10:31:48 -07:00
Jens Axboe
09bd84421d io_uring/register: set ctx->restricted when restrictions are parsed
Rather than defer this until the rings are enabled, just set it
upfront when the restrictions are parsed and enabled anyway. There's
no reason to defer this setting until the rings are enabled.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-13 10:31:48 -07:00
Jens Axboe
e6ed0f051d io_uring/register: have io_parse_restrictions() set restrictions enabled
Rather than leave this to the caller, have io_parse_restrictions() set
->registered = true if restrictions have been enabled. This is in
preparation for having finer grained restrictions.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-13 10:30:41 -07:00
Jens Axboe
51fff55a66 io_uring/register: have io_parse_restrictions() return number of ops
Rather than return 0 on success, return >= 0 for success, where the
return value is that number of parsed entries. As before, any < 0
return is an error.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-13 10:30:41 -07:00
Caleb Sander Mateos
130a827607 io_uring/register: drop io_register_enable_rings() submitter_task check
io_register_enable_rings() checks that the io_ring_ctx is
IORING_SETUP_R_DISABLED, which ensures submitter_task hasn't been
assigned by io_uring_create() or a previous io_register_enable_rings()
call. So drop the redundant check that submitter_task is NULL.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 11:21:38 -07:00
Caleb Sander Mateos
bcd4c95737 io_uring/msg_ring: drop unnecessary submitter_task checks
__io_msg_ring_data() checks that the target_ctx isn't
IORING_SETUP_R_DISABLED before calling io_msg_data_remote(), which calls
io_msg_remote_post(). So submitter_task can't be modified concurrently
with the read in io_msg_remote_post(). Additionally, submitter_task must
exist, as io_msg_data_remote() is only called for io_msg_need_remote(),
i.e. task_complete is set, which requires IORING_SETUP_DEFER_TASKRUN,
which in turn requires IORING_SETUP_SINGLE_ISSUER. And submitter_task is
assigned in io_uring_create() or io_register_enable_rings() before
enabling any IORING_SETUP_SINGLE_ISSUER io_ring_ctx.
Similarly, io_msg_send_fd() checks IORING_SETUP_R_DISABLED and
io_msg_need_remote() before calling io_msg_fd_remote(). submitter_task
therefore can't be modified concurrently with the read in
io_msg_fd_remote() and must be non-null.
io_register_enable_rings() can't run concurrently because it's called
from io_uring_register() -> __io_uring_register() with uring_lock held.
Thus, replace the READ_ONCE() and WRITE_ONCE() of submitter_task with
plain loads and stores. And remove the NULL checks of submitter_task in
io_msg_remote_post() and io_msg_fd_remote().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 11:21:38 -07:00
Caleb Sander Mateos
7a8737e113 io_uring: use release-acquire ordering for IORING_SETUP_R_DISABLED
io_uring_enter(), __io_msg_ring_data(), and io_msg_send_fd() read
ctx->flags and ctx->submitter_task without holding the ctx's uring_lock.
This means they may race with the assignment to ctx->submitter_task and
the clearing of IORING_SETUP_R_DISABLED from ctx->flags in
io_register_enable_rings(). Ensure the correct ordering of the
ctx->flags and ctx->submitter_task memory accesses by storing to
ctx->flags using release ordering and loading it using acquire ordering.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 4add705e4e ("io_uring: remove io_register_submitter")
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-12 11:21:38 -07:00
Linus Torvalds
68ad2095ca io_uring-6.19-20260109
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmlhS1wQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphGvD/9NR0RzWZdM0DwfbK4kyzfmQPCSs1kkqQF4
 LECsSc3B7OrJ/4yX27CiWNRlGdHWpmrOc8mtlAiUv+eArpoBmatjfn1UZACN0u/t
 CC0/ZXeYA6NQ8vnbnQZk+guHE7r9K66EFPFvMcEWmGGQ/CUBUKt1gkDkB1gD8qBp
 pdI/A+tZujCQA3XuyCE+qc5GJX+cFXqEx06GRDGQ+UnnAsJmSrtbUtZrEZsladMj
 16dclDfOX2X0bu9+P42rSkV2IrjwddNntDsLyF933uaayAJX9HbTTWxxw/mWr2Bt
 Be0Xh1+FniVAQFFM7qFQRqrWqecKrnh6RKg9lufWiuB4d5rq7eJT1xGb+IXX6xlr
 j/Lwbi8UkjpJmG1xnyWtk9oDQK4h+7p5MvCgCSqLrp1rY8nYT1CeCEzt1OJjeVWj
 cqm2hhkEUcioCz4gTHU8PBRxhfd4PRr/GBwZJ4jBFBeFTip1vy9kAn94Afrk/VLH
 HAreWZtsNsTDTF9cUSXyKDHYR9uGSi8NpZSEEV8dUaAKYpYBNSIutX5uqT9NBs8y
 3TJ0NrhlpVJIGwa5XtwKli62CXNsibQlNbnsd092+zvkiAAiYUiLNRaQnt3MGctk
 4eBp0sWTUkHxxKO3njUIDXZPB5g9jZvJxpjqK+V0CYzVdeimw2qs8fBQJvnYPtNw
 k3C0E2aJKQ==
 =ObKS
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.19-20260109' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fixes from Jens Axboe:
 "A single fix for a regression introduced in 6.15, where a failure to
  wake up idle io-wq workers at ring exit will wait for the timeout to
  expire.

  This isn't normally noticeable, as the exit is async.

  But if a parent task created a thread that sets up a ring and uses
  requests that cause io-wq threads to be created, and the parent task
  then waits for the thread to exit, then it can take 5 seconds for that
  pthread_join() to succeed as the child thread is waiting for its
  children to exit.

  On top of that, just a basic cleanup as well"

* tag 'io_uring-6.19-20260109' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/io-wq: remove io_wq_for_each_worker() return value
  io_uring/io-wq: fix incorrect io_wq_for_each_worker() termination logic
2026-01-09 15:21:15 -10:00
Ming Lei
15f506a77a io_uring: remove nr_segs recalculation in io_import_kbuf()
io_import_kbuf() recalculates iter->nr_segs to reflect only the bvecs
needed for the requested byte range. This was added to provide an
accurate segment count to bio_iov_bvec_set(), which copied nr_segs to
bio->bi_vcnt for use as a bio split hint.

The previous two patches eliminated this dependency:
 - bio_may_need_split() now uses bi_iter instead of bi_vcnt for split
   decisions
 - bio_iov_bvec_set() no longer copies nr_segs to bi_vcnt

Since nr_segs is no longer used for bio split decisions, the
recalculation loop is unnecessary. The iov_iter already has the correct
bi_size to cap iteration, so an oversized nr_segs is harmless.

Link: https://lkml.org/lkml/2025/4/16/351
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-07 08:06:33 -07:00
Gabriel Krisman Bertazi
48ed70131e io_uring: Trim out unused includes
Clean up some left overs of refactoring io_uring into multiple files.
Compile tested with a few configurations.

Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-05 17:08:20 -07:00
Jens Axboe
e4fdbca2dc io_uring/io-wq: remove io_wq_for_each_worker() return value
The only use of this helper is to iterate all of the workers, and
hence all callers will pass in a func that always returns false to do
that. As none of the callers use the return value, get rid of it.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-05 15:39:20 -07:00
Jens Axboe
e0392a10c9 io_uring/io-wq: fix incorrect io_wq_for_each_worker() termination logic
A previous commit added this helper, and had it terminate if false is
returned from the handler. However, that is completely opposite, it
should abort the loop if true is returned.

Fix this up by having io_wq_for_each_worker() keep iterating as long
as false is returned, and only abort if true is returned.

Cc: stable@vger.kernel.org
Fixes: 751eedc4b4 ("io_uring/io-wq: move worker lists to struct io_wq_acct")
Reported-by: Lewis Campbell <info@lewiscampbell.tech>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-05 14:37:33 -07:00
Linus Torvalds
509b5b1152 io_uring-6.19-20260102
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmlX7O8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkzbD/0SoEnTZ+jlbJojq6eAFtYAU3ial6sRdKC9
 15+WqlsMN5MHoV/xLMqHGxofpxCyXMmZZSPholWaUIiGJDcf4Q4olFFDTAgZPZYk
 XxpN9KeE4/n17eFXe+TE/D172MVM0gt9QbJFoV+TLyayrGiB5QyocH6Vg4FoWvjr
 YvyicIRE3SLiBQ8zdfPC4SR28VBE3LKZxjZJxr2HQjJQw4O4/+gKkYz7upACc4Xk
 qN3JioIayuM3hrqcBSm7P0t4tlTCYHZvcGr7WI26CV6hcHD7j7N9jOVPZb4ce8et
 GIYwASYx4FTPrzAebQXXNL39RjoSeaRa/ppcdFHbT9ZZkI9yY9g3umg3kEml8RkF
 DFFwmPxlz2RuRLs+KdZ4UjLRf14W5qYlcThN7bgpTH4H0XUeDzT7HI9BiXBC7gjl
 p0Z1Y3NPAzMxil48ZPpopJxmQGcBIC8fMnDT0KVpvuILrN3ME0TMg82lQ2X/eTwf
 S/oPLebqqWy4N8Ff5x+GYmWxZvFEOxmO0AoSSiN3nlZ1skNqRlpMISTsFJXy+luq
 V31d0cLBfrWL9MNTE+yjLNT/5pc1l+HgVLxdoxCioEKWXXdB27YEDlh0CVNtjZ9j
 /ZVMJcZhzRBUvWLUvzQrtY65m0I8h6XYJAr7TXbbsL70yFAsgQmBUZPklqe6eijy
 HFYYO4vnJg==
 =KgoB
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fixes from Jens Axboe:

 - Removed dead argument length for io_uring_validate_mmap_request()

 - Use GFP_NOWAIT for overflow CQEs on legacy ring setups rather than
   GFP_ATOMIC, which makes it play nicer with memcg limits

 - Fix a potential circular locking issue with tctx node removal and
   exec based cancelations

* tag 'io_uring-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/memmap: drop unused sz param in io_uring_validate_mmap_request()
  io_uring/tctx: add separate lock for list of tctx's in ctx
  io_uring: use GFP_NOWAIT for overflow CQEs on legacy rings
2026-01-02 12:07:55 -08:00
Caleb Sander Mateos
70eafc7430 io_uring/memmap: drop unused sz param in io_uring_validate_mmap_request()
io_uring_validate_mmap_request() doesn't use its size_t sz argument, so
remove it.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-01 08:16:48 -07:00
Jens Axboe
5623eb1ed0 io_uring/tctx: add separate lock for list of tctx's in ctx
ctx->tcxt_list holds the tasks using this ring, and it's currently
protected by the normal ctx->uring_lock. However, this can cause a
circular locking issue, as reported by syzbot, where cancelations off
exec end up needing to remove an entry from this list:

======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G             L
------------------------------------------------------
syz.0.9999/12287 is trying to acquire lock:
ffff88805851c0a8 (&ctx->uring_lock){+.+.}-{4:4}, at: io_uring_del_tctx_node+0xf0/0x2c0 io_uring/tctx.c:179

but task is already holding lock:
ffff88802db5a2e0 (&sig->cred_guard_mutex){+.+.}-{4:4}, at: prepare_bprm_creds fs/exec.c:1360 [inline]
ffff88802db5a2e0 (&sig->cred_guard_mutex){+.+.}-{4:4}, at: bprm_execve+0xb9/0x1400 fs/exec.c:1733

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&sig->cred_guard_mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/mutex.c:614 [inline]
       __mutex_lock+0x187/0x1350 kernel/locking/mutex.c:776
       proc_pid_attr_write+0x547/0x630 fs/proc/base.c:2837
       vfs_write+0x27e/0xb30 fs/read_write.c:684
       ksys_write+0x145/0x250 fs/read_write.c:738
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #1 (sb_writers#3){.+.+}-{0:0}:
       percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline]
       percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline]
       __sb_start_write include/linux/fs/super.h:19 [inline]
       sb_start_write+0x4d/0x1c0 include/linux/fs/super.h:125
       mnt_want_write+0x41/0x90 fs/namespace.c:499
       open_last_lookups fs/namei.c:4529 [inline]
       path_openat+0xadd/0x3dd0 fs/namei.c:4784
       do_filp_open+0x1fa/0x410 fs/namei.c:4814
       io_openat2+0x3e0/0x5c0 io_uring/openclose.c:143
       __io_issue_sqe+0x181/0x4b0 io_uring/io_uring.c:1792
       io_issue_sqe+0x165/0x1060 io_uring/io_uring.c:1815
       io_queue_sqe io_uring/io_uring.c:2042 [inline]
       io_submit_sqe io_uring/io_uring.c:2320 [inline]
       io_submit_sqes+0xbf4/0x2140 io_uring/io_uring.c:2434
       __do_sys_io_uring_enter io_uring/io_uring.c:3280 [inline]
       __se_sys_io_uring_enter+0x2e0/0x2b60 io_uring/io_uring.c:3219
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (&ctx->uring_lock){+.+.}-{4:4}:
       check_prev_add kernel/locking/lockdep.c:3165 [inline]
       check_prevs_add kernel/locking/lockdep.c:3284 [inline]
       validate_chain kernel/locking/lockdep.c:3908 [inline]
       __lock_acquire+0x15a6/0x2cf0 kernel/locking/lockdep.c:5237
       lock_acquire+0x107/0x340 kernel/locking/lockdep.c:5868
       __mutex_lock_common kernel/locking/mutex.c:614 [inline]
       __mutex_lock+0x187/0x1350 kernel/locking/mutex.c:776
       io_uring_del_tctx_node+0xf0/0x2c0 io_uring/tctx.c:179
       io_uring_clean_tctx+0xd4/0x1a0 io_uring/tctx.c:195
       io_uring_cancel_generic+0x6ca/0x7d0 io_uring/cancel.c:646
       io_uring_task_cancel include/linux/io_uring.h:24 [inline]
       begin_new_exec+0x10ed/0x2440 fs/exec.c:1131
       load_elf_binary+0x9f8/0x2d70 fs/binfmt_elf.c:1010
       search_binary_handler fs/exec.c:1669 [inline]
       exec_binprm fs/exec.c:1701 [inline]
       bprm_execve+0x92e/0x1400 fs/exec.c:1753
       do_execveat_common+0x510/0x6a0 fs/exec.c:1859
       do_execve fs/exec.c:1933 [inline]
       __do_sys_execve fs/exec.c:2009 [inline]
       __se_sys_execve fs/exec.c:2004 [inline]
       __x64_sys_execve+0x94/0xb0 fs/exec.c:2004
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

other info that might help us debug this:

Chain exists of:
  &ctx->uring_lock --> sb_writers#3 --> &sig->cred_guard_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&sig->cred_guard_mutex);
                               lock(sb_writers#3);
                               lock(&sig->cred_guard_mutex);
  lock(&ctx->uring_lock);

 *** DEADLOCK ***

1 lock held by syz.0.9999/12287:
 #0: ffff88802db5a2e0 (&sig->cred_guard_mutex){+.+.}-{4:4}, at: prepare_bprm_creds fs/exec.c:1360 [inline]
 #0: ffff88802db5a2e0 (&sig->cred_guard_mutex){+.+.}-{4:4}, at: bprm_execve+0xb9/0x1400 fs/exec.c:1733

stack backtrace:
CPU: 0 UID: 0 PID: 12287 Comm: syz.0.9999 Tainted: G             L      syzkaller #0 PREEMPT(full)
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_circular_bug+0x2e2/0x300 kernel/locking/lockdep.c:2043
 check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2175
 check_prev_add kernel/locking/lockdep.c:3165 [inline]
 check_prevs_add kernel/locking/lockdep.c:3284 [inline]
 validate_chain kernel/locking/lockdep.c:3908 [inline]
 __lock_acquire+0x15a6/0x2cf0 kernel/locking/lockdep.c:5237
 lock_acquire+0x107/0x340 kernel/locking/lockdep.c:5868
 __mutex_lock_common kernel/locking/mutex.c:614 [inline]
 __mutex_lock+0x187/0x1350 kernel/locking/mutex.c:776
 io_uring_del_tctx_node+0xf0/0x2c0 io_uring/tctx.c:179
 io_uring_clean_tctx+0xd4/0x1a0 io_uring/tctx.c:195
 io_uring_cancel_generic+0x6ca/0x7d0 io_uring/cancel.c:646
 io_uring_task_cancel include/linux/io_uring.h:24 [inline]
 begin_new_exec+0x10ed/0x2440 fs/exec.c:1131
 load_elf_binary+0x9f8/0x2d70 fs/binfmt_elf.c:1010
 search_binary_handler fs/exec.c:1669 [inline]
 exec_binprm fs/exec.c:1701 [inline]
 bprm_execve+0x92e/0x1400 fs/exec.c:1753
 do_execveat_common+0x510/0x6a0 fs/exec.c:1859
 do_execve fs/exec.c:1933 [inline]
 __do_sys_execve fs/exec.c:2009 [inline]
 __se_sys_execve fs/exec.c:2004 [inline]
 __x64_sys_execve+0x94/0xb0 fs/exec.c:2004
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff3a8b8f749
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ff3a9a97038 EFLAGS: 00000246 ORIG_RAX: 000000000000003b
RAX: ffffffffffffffda RBX: 00007ff3a8de5fa0 RCX: 00007ff3a8b8f749
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000200000000400
RBP: 00007ff3a8c13f91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ff3a8de6038 R14: 00007ff3a8de5fa0 R15: 00007ff3a8f0fa28
 </TASK>

Add a separate lock just for the tctx_list, tctx_lock. This can nest
under ->uring_lock, where necessary, and be used separately for list
manipulation. For the cancelation off exec side, this removes the
need to grab ->uring_lock, hence fixing the circular locking
dependency.

Reported-by: syzbot+b0e3b77ffaa8a4067ce5@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-01 08:16:40 -07:00