Currently a few pointer dereferences are needed both to check whether
BPF filters are installed and to retrieve the actual filter for the
opcode. Cache the table in ctx->bpf_filters to avoid that.
Add a bit of debug info on ring exit to show if we ever got this wrong.
The risk of that is small, given that the table is currently only
updated in one spot, but once task forking is enabled, that will add
one more spot.
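A minimal sketch of what the cached lookup could look like - the table
type and helper name here are illustrative assumptions, not the actual
implementation:

/* sketch only: io_bpf_filter_table is an assumed type */
static struct io_bpf_filter *io_bpf_filter_get(struct io_ring_ctx *ctx,
					       u8 opcode)
{
	struct io_bpf_filter_table *table = ctx->bpf_filters;

	/* one cached load replaces the chain of dereferences */
	if (!table)
		return NULL;
	return table->filters[opcode];
}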
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds custom filtering for IORING_OP_OPENAT and IORING_OP_OPENAT2,
where the open_how flags, mode, and resolve fields can be checked by
filters.
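As a hedged sketch, the extra data handed to a filter for these opcodes
might look like the following - the struct name and layout are
assumptions, only the three fields are given above:

/* hypothetical per-opcode filter data for openat/openat2 */
struct io_openat_bpf_pdu {
	__u64	flags;		/* open_how::flags */
	__u64	mode;		/* open_how::mode */
	__u64	resolve;	/* open_how::resolve */
};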
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Example population method for the BPF-based opcode filtering. This
exposes the socket family, type, and protocol to a registered BPF
filter. This in turn enables the filter to make decisions based on
what was passed in to the IORING_OP_SOCKET request type.
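In sketch form, with an assumed struct name and layout, the opcode
copies the socket attributes into the pdu area that follows the generic
filter context:

/* hypothetical per-opcode filter data for IORING_OP_SOCKET */
struct io_socket_bpf_pdu {
	__u32	family;		/* e.g. AF_INET */
	__u32	type;		/* e.g. SOCK_STREAM */
	__u32	protocol;
};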
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add support for loading classic BPF programs with io_uring to provide
fine-grained filtering of SQE operations. Unlike
IORING_REGISTER_RESTRICTIONS which only allows bitmap-based allow/deny
of opcodes, BPF filters can inspect request attributes and make dynamic
decisions.
The filter is registered via IORING_REGISTER_BPF_FILTER with a struct
io_uring_bpf:
struct io_uring_bpf_filter {
	__u32	opcode;		/* io_uring opcode to filter */
	__u32	flags;
	__u32	filter_len;	/* number of BPF instructions */
	__u32	resv;
	__u64	filter_ptr;	/* pointer to BPF filter */
	__u64	resv2[5];
};
enum {
	IO_URING_BPF_CMD_FILTER = 1,
};

struct io_uring_bpf {
	__u16	cmd_type;	/* IO_URING_BPF_* values */
	__u16	cmd_flags;	/* none so far */
	__u32	resv;
	union {
		struct io_uring_bpf_filter filter;
	};
};
and the filters get supplied a struct io_uring_bpf_ctx:
struct io_uring_bpf_ctx {
	__u64	user_data;
	__u8	opcode;
	__u8	sqe_flags;
	__u8	pdu_size;
	__u8	pad[5];
};
where it's possible to filter on opcode and sqe_flags, with pdu_size
indicating how much extra data is being passed in beyond the pad field.
This will be used for specific finer-grained filtering inside an opcode.
An example of that for sockets is in one of the following patches.
Anything the opcode supports can end up in this struct, populated by
the opcode itself, and hence can be filtered for.
Filters have the following semantics:
- Return 1 to allow the request
- Return 0 to deny the request with -EACCES
- Multiple filters can be stacked per opcode. All filters must
return 1 for the opcode to be allowed.
- Filters are evaluated in reverse registration order (most recent first)
The implementation uses classic BPF (cBPF) rather than eBPF, as that's
required for containers, and since the filters can be used by any
user in the system.
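As a rough userspace sketch, a filter that denies any request carrying
sqe_flags for a given opcode could be registered as below. The register
call shape (nr_args == 1) is an assumption; the structs are the ones
quoted above:

#include <linux/filter.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

static int register_deny_sqe_flags(int ring_fd, __u32 opcode)
{
	struct sock_filter prog[] = {
		/* A = sqe_flags byte from struct io_uring_bpf_ctx */
		BPF_STMT(BPF_LD | BPF_B | BPF_ABS,
			 offsetof(struct io_uring_bpf_ctx, sqe_flags)),
		/* any flag bit set? skip the allow and hit the deny */
		BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, 0xff, 1, 0),
		BPF_STMT(BPF_RET | BPF_K, 1),	/* allow */
		BPF_STMT(BPF_RET | BPF_K, 0),	/* deny -> -EACCES */
	};
	struct io_uring_bpf bpf = {
		.cmd_type = IO_URING_BPF_CMD_FILTER,
		.filter = {
			.opcode		= opcode,
			.filter_len	= sizeof(prog) / sizeof(prog[0]),
			.filter_ptr	= (__u64)(unsigned long)prog,
		},
	};

	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_BPF_FILTER, &bpf, 1);
}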
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For potential long term allocations, ensure that we play nicer with
memcg and use the accounting variant of the GFP_KERNEL allocation.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are network cards that support receive buffers larger than 4K,
which can be vastly beneficial for performance; benchmarks for this
patch showed up to a 30% CPU utilization improvement for 32K vs 4K
buffers.
Allow zcrx users to specify the size in struct
io_uring_zcrx_ifq_reg::rx_buf_len. If set to zero, zcrx will use a
default value. zcrx will check and fail if the memory backing the area
can't be split into physically contiguous chunks of the required size.
That's more restrictive than necessary, as only the DMA addresses need
to be contiguous, but relaxing it is beyond this series.
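For example, a zcrx user wanting 32K chunks would set the new field at
registration time (other fields elided):

struct io_uring_zcrx_ifq_reg reg = {
	/* ... interface, queue, and area setup as usual ... */
	.rx_buf_len	= 32 * 1024,	/* 0 picks the default size */
};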
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: kill duplicate netdev_queues.h include]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the hung_task_timeout sysctl is set to 0, then we'll end up busy
looping inside io_wq_exit_workers() after an earlier commit switched to
using wait_for_completion_timeout(). Use the maximum schedule timeout
value for that case.
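A minimal sketch of the fix, assuming the wait loop in
io_wq_exit_workers() looks roughly like this:

/* sysctl_hung_task_timeout_secs == 0 must not become a zero timeout */
long timeout = sysctl_hung_task_timeout_secs * HZ / 2;

if (!timeout)
	timeout = MAX_SCHEDULE_TIMEOUT;
wait_for_completion_timeout(&wq->worker_done, timeout);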
Fixes: 1f293098a3 ("io_uring/io-wq: don't trigger hung task for syzbot craziness")
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'io_uring-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
- Fix for a potential leak of an iovec, if a specific cleanup path is
used and the rw_cache is full at the time of the call
- Fix for a regression added in this cycle, where waitid should be
using proper release/acquire semantics for updating the wait queue
head
- Check for the cancelation bit being set for every work item processed
by io-wq, not just at the start of the loop. Has no real practical
implications other than to shut up syzbot doing crazy things that
grossly overload a system, hence slowing down ring exit
- A few selftest additions, updating the mini_liburing that selftests
use
* tag 'io_uring-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
selftests/io_uring: support NO_SQARRAY in miniliburing
selftests/io_uring: add io_uring_queue_init_params
io_uring/io-wq: check IO_WQ_BIT_EXIT inside work run loop
io_uring/waitid: fix KCSAN warning on io_waitid->head
io_uring/rw: free potentially allocated iovec on cache put failure
io_rsrc_node_lookup() takes a signed int index as input and compares it
to an unsigned length. Since the signed int is implicitly cast to an
unsigned int for the comparison and the length is bounded by
IORING_MAX_FIXED_FILES/IORING_MAX_REG_BUFFERS, negative indices are
already rejected on architectures where int is at least 32 bits. Make
this a bit clearer and avoid compiler warnings for comparisons of
signed and unsigned values by taking an unsigned int index instead.
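A simplified sketch of the helper's new shape, with details elided from
the actual definition:

static inline struct io_rsrc_node *
io_rsrc_node_lookup(struct io_rsrc_data *data, unsigned int index)
{
	/* unsigned compare: negative callers wrap and fail the check */
	if (index < data->nr)
		return data->nodes[array_index_nospec(index, data->nr)];
	return NULL;
}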
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Outside of SQPOLL, normally SQ entries are consumed by the time the
submission syscall returns. For those cases we don't need a circular
buffer and the head/tail tracking, instead the kernel can assume that
entries always start from the beginning of the SQ at index 0. This patch
introduces a setup flag doing exactly that. It's simpler and helps
keep SQEs hot in cache.
The feature is optional and enabled by setting IORING_SETUP_SQ_REWIND.
The flag is rejected if passed together with SQPOLL as it'd require
waiting for the SQ before each submission. It also requires
IORING_SETUP_NO_SQARRAY; the SQ array variant could be supported, but
it's unlikely there will be users, so leave more space for future
optimisations.
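From userspace, enabling the feature would look roughly like this
sketch (QUEUE_DEPTH and the raw io_uring_setup(2) call stand in for
whatever helper a real application uses):

struct io_uring_params p = { };
int ring_fd;

p.flags = IORING_SETUP_SQ_REWIND | IORING_SETUP_NO_SQARRAY;
ring_fd = io_uring_setup(QUEUE_DEPTH, &p);
/* fill SQEs from index 0 before each io_uring_enter(2) */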
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the completion queue waiting and scheduling code out of io_uring.c
into a dedicated wait.c file. This moves more code out of the main
io_uring C and header files and into a new topical file.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the task work handling code out of io_uring.c into a new tw.c file.
This includes the local work, normal work, and fallback work handling
infrastructure.
The associated tw.h header contains io_should_terminate_tw() as a static
inline helper, along with the necessary function declarations.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the same trick that blk_io_schedule() does to avoid triggering the
hung task warning (and potential reboot/panic, depending on system
settings), and only wait for half the hung task timeout at a time.
If we exceed the default IO_URING_EXIT_WAIT_MAX period where we expect
things to certainly have finished unless there's a bug, then throw a
WARN_ON_ONCE() for that case.
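In sketch form, assuming the exit path loops on a completion ('done'
and 'start' are illustrative names):

/* wait in chunks of half the hung task timeout, like blk_io_schedule() */
long timeout = sysctl_hung_task_timeout_secs * HZ / 2 ?: MAX_SCHEDULE_TIMEOUT;

while (!wait_for_completion_timeout(&done, timeout)) {
	if (time_after(jiffies, start + IO_URING_EXIT_WAIT_MAX))
		WARN_ON_ONCE(1);
}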
Reported-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com
Tested-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the timeout we normally wait before complaining about things being
stuck waiting for cancelations to complete as a define, and use it in
io_ring_exit_work().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Check if the passed-in offset is negative once cast to sync->off. This
ensures that -EINVAL is returned for that case, like it would be for
sync_file_range(2).
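The prep-time check itself is a one-liner, sketched here:

sync->off = READ_ONCE(sqe->off);
if (sync->off < 0)		/* negative once cast to loff_t */
	return -EINVAL;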
Fixes: c992fe2925 ("io_uring: add fsync support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When multiple io_uring rings poll on the same NVMe queue, one ring can
find completions belonging to another ring. The current code always
uses task_work to handle this, but this adds overhead for the common
single-ring case.
This patch passes the polling io_ring_ctx through io_comp_batch's new
poll_ctx field. In io_do_iopoll(), the polling ring's context is stored
in iob.poll_ctx before calling the iopoll callbacks.
In nvme_uring_cmd_end_io(), we now compare iob->poll_ctx with the
request's owning io_ring_ctx (via io_uring_cmd_ctx_handle()). If they
match (local context), we complete inline with io_uring_cmd_done32().
If they differ (remote context) or iob is NULL (non-iopoll path), we
use task_work as before.
This optimization eliminates task_work scheduling overhead for the
common case where a ring polls and finds its own completions.
~10% IOPS improvement is observed in the following benchmark:
fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -O0 -P1 -u1 -n1 /dev/ng0n1
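The dispatch in nvme_uring_cmd_end_io() then boils down to the
following simplified sketch (argument lists elided):

if (iob && iob->poll_ctx == io_uring_cmd_ctx_handle(ioucmd)) {
	/* the polling ring owns this request: complete inline */
	io_uring_cmd_done32(...);
} else {
	/* remote ring or non-iopoll completion: defer as before */
	io_uring_cmd_complete_in_task(...);
}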
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot correctly reports this as a KCSAN race, as ctx->cached_cq_tail
should be read under ->uring_lock. This isn't immediately feasible in
io_flush_timeouts(), but as long as we read a stable value, that should
be good enough. If two io-wq threads compete on this value, then they
will both end up calling io_flush_timeouts() and at least one of them
will see the correct value.
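The fix is essentially taking a stable snapshot, e.g.:

/* lockless, but every reader sees one consistent value */
u32 tail = READ_ONCE(ctx->cached_cq_tail);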
Reported-by: syzbot+6c48db7d94402407301e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently this is checked before running the pending work. Normally this
is quite fine, as work items either end up blocking (which will create a
new worker for other items), or they complete fairly quickly. But syzbot
reports an issue where io-wq takes seemingly forever to exit, and with a
bit of debugging, this turns out to be because it queues a bunch of big
(2GB - 4096b) reads with a /dev/msr* file. Since this file type doesn't
support ->read_iter(), loop_rw_iter() ends up handling them. Each read
returns 16MB of data, which takes 20 (!!) seconds. With a bunch of
these pending, processing the whole chain can take a long time. Easily
longer than the syzbot uninterruptible sleep timeout of 140 seconds.
This then triggers a complaint off the io-wq exit path:
INFO: task syz.4.135:6326 blocked for more than 143 seconds.
Not tainted syzkaller #0
Blocked by coredump.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz.4.135 state:D stack:26824 pid:6326 tgid:6324 ppid:5957 task_flags:0x400548 flags:0x00080000
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5256 [inline]
__schedule+0x1139/0x6150 kernel/sched/core.c:6863
__schedule_loop kernel/sched/core.c:6945 [inline]
schedule+0xe7/0x3a0 kernel/sched/core.c:6960
schedule_timeout+0x257/0x290 kernel/time/sleep_timeout.c:75
do_wait_for_common kernel/sched/completion.c:100 [inline]
__wait_for_common+0x2fc/0x4e0 kernel/sched/completion.c:121
io_wq_exit_workers io_uring/io-wq.c:1328 [inline]
io_wq_put_and_exit+0x271/0x8a0 io_uring/io-wq.c:1356
io_uring_clean_tctx+0x10d/0x190 io_uring/tctx.c:203
io_uring_cancel_generic+0x69c/0x9a0 io_uring/cancel.c:651
io_uring_files_cancel include/linux/io_uring.h:19 [inline]
do_exit+0x2ce/0x2bd0 kernel/exit.c:911
do_group_exit+0xd3/0x2a0 kernel/exit.c:1112
get_signal+0x2671/0x26d0 kernel/signal.c:3034
arch_do_signal_or_restart+0x8f/0x7e0 arch/x86/kernel/signal.c:337
__exit_to_user_mode_loop kernel/entry/common.c:41 [inline]
exit_to_user_mode_loop+0x8c/0x540 kernel/entry/common.c:75
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode_work include/linux/entry-common.h:159 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:194 [inline]
do_syscall_64+0x4ee/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fa02738f749
RSP: 002b:00007fa0281ae0e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 00007fa0275e6098 RCX: 00007fa02738f749
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007fa0275e6098
RBP: 00007fa0275e6090 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fa0275e6128 R14: 00007fff14e4fcb0 R15: 00007fff14e4fd98
There's really nothing wrong here, outside of the fact that processing
these reads will take a LONG time. However, we can speed up the exit by
checking the
IO_WQ_BIT_EXIT inside the io_worker_handle_work() loop, as syzbot will
exit the ring after queueing up all of these reads. Then once the first
item is processed, io-wq will simply cancel the rest. That should avoid
syzbot running into this complaint again.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/68a2decc.050a0220.e29e5.0099.GAE@google.com/
Reported-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Storing of the iw->head entry inside the wait_queue callback, or when
removing a waitid item, really should use proper load/store
acquire/release semantics, and KCSAN correctly warns of that. Ensure
that they do so.
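In sketch form, the store and its lockless reader pair up as:

/* writer: clear the head with release semantics */
smp_store_release(&iw->head, NULL);

/* reader: observe it with acquire semantics */
head = smp_load_acquire(&iw->head);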
Reported-by: syzbot+eb441775f4f948a0902f@syzkaller.appspotmail.com
Fixes: a48c0cbf28 ("io_uring/waitid: have io_waitid_complete() remove wait queue entry")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a read/write request goes through io_req_rw_cleanup() with an
allocated iovec attached and the put to the rw_cache fails, then it may
end up leaking the iovec. Have io_rw_recycle() return
whether it recycled the request or not, and use that to gauge whether to
free a potential iovec or not.
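The cleanup path then becomes, roughly (assuming the allocated iovec
pointer is at hand):

/* the cache didn't take the entry, so the iovec is still ours */
if (!io_rw_recycle(req, issue_flags))
	kfree(iovec);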
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'io_uring-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fix from Jens Axboe:
"Just a single fix moving local task_work inside the cancelation loop,
rather than only before cancelations.
If any cancelations generate task_work, we do need to re-run it"
* tag 'io_uring-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring: move local task_work in exit cancel loop
Callers switched to CLASS(filename_maybe_null) (in fs/xattr.c)
and CLASS(filename_complete_delayed) (in io_uring/xattr.c).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
filename_renameat2() replaces do_renameat2(); unlike the latter,
it does not drop filename references - these days it can be just
as easily arranged in the caller.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This currently isn't supported, and due to a recent commit, it also
cannot easily be supported by io_uring due to hash_node and IOPOLL
completion data overlapping.
This can be revisited if we ever do support cancelations of requests
that have gone to the block stack.
Suggested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit improving IOPOLL made an incorrect assumption that
task_work isn't used with IOPOLL. This can cause crashes when doing
passthrough I/O on nvme, where queueing the completion task_work will
trample on the same memory that holds the completed list of requests.
Fix it up by shuffling the members around, so we're not sharing any
parts that end up getting used in this path.
Fixes: 3c7d76d612 ("io_uring: IOPOLL polling improvements")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/linux-block/CAHj4cs_SLPj9v9w5MgfzHKy+983enPx3ZQY2kMuMJ1202DBefw@mail.gmail.com/
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With IORING_SETUP_DEFER_TASKRUN, task work is queued to ctx->work_llist
(local work) rather than the fallback list. During io_ring_exit_work(),
io_move_task_work_from_local() was called once before the cancel loop,
moving work from work_llist to fallback_llist.
However, task work can be added to work_llist during the cancel loop
itself. There are two cases:
1) io_kill_timeouts() is called from io_uring_try_cancel_requests() to
cancel pending timeouts, and it adds task work via io_req_queue_tw_complete()
for each cancelled timeout.
2) URING_CMD requests like ublk can be completed via
io_uring_cmd_complete_in_task() from ublk_queue_rq() during canceling,
given ublk request queue is only quiesced when canceling the 1st uring_cmd.
Since io_allowed_defer_tw_run() returns false in io_ring_exit_work()
(kworker != submitter_task), io_run_local_work() is never invoked,
and the work_llist entries are never processed. This causes
io_uring_try_cancel_requests() to loop indefinitely, resulting in
100% CPU usage in kworker threads.
Fix this by moving io_move_task_work_from_local() inside the cancel
loop, ensuring any work on work_llist is moved to fallback before
each cancel attempt.
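In sketch form, with argument lists elided, the exit work loop becomes:

do {
	/* flush anything queued to ctx->work_llist during cancel */
	io_move_task_work_from_local(ctx);
	loop = io_uring_try_cancel_requests(ctx, ...);
} while (loop);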
Cc: stable@vger.kernel.org
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
"filp" thing never made sense; seeing that there are exactly 4 callers
in the entire tree (and it's neither exported nor even declared in
linux/*/*.h), there's no point keeping that ugliness.
FWIW, the 'filp' thing did originate in OSD&I; for some reason Tanenbaum
decided to call the object representing an opened file 'struct filp',
the last letter standing for 'position'. In all Unices, Linux included,
the corresponding object had always been 'struct file'...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are two filename-related problems in io_uring and its
interplay with audit.
Filenames are imported when a request is submitted and used when
it is processed. Unfortunately, the latter may very well
happen in a different thread. In that case the reference to
the filename is put into the wrong audit_context - that of the
submitting thread, not the processing one. Audit logic is called
by the latter, and it really wants to be able to find the names
in the audit_context of the current (== processing) thread.
Another related problem is the headache with refcounts -
normally all references to a given struct filename are visible
only to one thread (the one that uses that struct filename).
io_uring violates that - an extra reference is stashed in
audit_context of submitter. It gets dropped when submitter
returns to userland, which can happen simultaneously with
processing thread deciding to drop the reference it got.
We paper over that by making refcount atomic, but that means
pointless headache for everyone.
Solution: the notion of partially imported filenames. Namely,
already copied from userland, but *not* exposed to audit yet.
io_uring can create that in submitter thread, and complete the
import (obtaining the usual reference to struct filename) in
processing thread.
Object: struct delayed_filename.
Primitives for working with it:
delayed_getname(&delayed_filename, user_string) - copies the name from
userland, returning 0 and stashing the address of (still incomplete)
struct filename in delayed_filename on success and returning -E... on
error.
delayed_getname_uflags(&delayed_filename, user_string, atflags) -
similar, in the same relation to delayed_getname() as getname_uflags()
is to getname()
complete_getname(&delayed_filename) - completes the import of filename
stashed in delayed_filename and returns struct filename to caller,
emptying delayed_filename.
CLASS(filename_complete_delayed, name)(&delayed_filename) - variant of
CLASS(filename) with complete_getname() for constructor.
dismiss_delayed_filename(&delayed_filename) - destructor; drops whatever
might be stashed in delayed_filename, emptying it.
putname_to_delayed(&delayed_filename, name) - if name is shared, stashes
its copy into delayed_filename and drops the reference to name, otherwise
stashes the name itself in there.
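Putting the primitives together, the intended io_uring flow looks
roughly like this (a sketch, error handling abbreviated):

struct delayed_filename dfn;
int err;

/* submitter thread: copy from userland, audit not involved yet */
err = delayed_getname(&dfn, user_string);
if (err)
	return err;

/* ... request handed off to a processing thread ... */

/* processing thread: complete the import in *its* audit_context */
{
	CLASS(filename_complete_delayed, name)(&dfn);
	/* use name like any other struct filename */
}

/* alternatively, if the request dies before processing: */
dismiss_delayed_filename(&dfn);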
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's quite likely that only register opcode restrictions exist, in
which case we'd never need to check the normal opcodes. Split
ctx->restricted into two separate fields, one for I/O opcodes, and one
for register opcodes.
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just a cleanup, makes the code easier to read without too many dependent
nested checks.
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than defer this until the rings are enabled, just set it
upfront when the restrictions are parsed and enabled anyway; there's
no reason to wait for the rings to be enabled.
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than leave this to the caller, have io_parse_restrictions() set
->registered = true if restrictions have been enabled. This is in
preparation for having finer grained restrictions.
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than return 0 on success, return >= 0 for success, where the
return value is the number of parsed entries. As before, any < 0
return is an error.
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_register_enable_rings() checks that the io_ring_ctx is
IORING_SETUP_R_DISABLED, which ensures submitter_task hasn't been
assigned by io_uring_create() or a previous io_register_enable_rings()
call. So drop the redundant check that submitter_task is NULL.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_msg_ring_data() checks that the target_ctx isn't
IORING_SETUP_R_DISABLED before calling io_msg_data_remote(), which calls
io_msg_remote_post(). So submitter_task can't be modified concurrently
with the read in io_msg_remote_post(). Additionally, submitter_task must
exist, as io_msg_data_remote() is only called for io_msg_need_remote(),
i.e. task_complete is set, which requires IORING_SETUP_DEFER_TASKRUN,
which in turn requires IORING_SETUP_SINGLE_ISSUER. And submitter_task is
assigned in io_uring_create() or io_register_enable_rings() before
enabling any IORING_SETUP_SINGLE_ISSUER io_ring_ctx.
Similarly, io_msg_send_fd() checks IORING_SETUP_R_DISABLED and
io_msg_need_remote() before calling io_msg_fd_remote(). submitter_task
therefore can't be modified concurrently with the read in
io_msg_fd_remote() and must be non-null.
io_register_enable_rings() can't run concurrently because it's called
from io_uring_register() -> __io_uring_register() with uring_lock held.
Thus, replace the READ_ONCE() and WRITE_ONCE() of submitter_task with
plain loads and stores. And remove the NULL checks of submitter_task in
io_msg_remote_post() and io_msg_fd_remote().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_enter(), __io_msg_ring_data(), and io_msg_send_fd() read
ctx->flags and ctx->submitter_task without holding the ctx's uring_lock.
This means they may race with the assignment to ctx->submitter_task and
the clearing of IORING_SETUP_R_DISABLED from ctx->flags in
io_register_enable_rings(). Ensure the correct ordering of the
ctx->flags and ctx->submitter_task memory accesses by storing to
ctx->flags using release ordering and loading it using acquire ordering.
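Sketched, the pairing looks like this (the get_task_struct() detail is
an assumption):

/* io_register_enable_rings(): publish submitter_task, then flags */
ctx->submitter_task = get_task_struct(current);
smp_store_release(&ctx->flags, ctx->flags & ~IORING_SETUP_R_DISABLED);

/* readers: acquire on flags orders the submitter_task load after it */
if (!(smp_load_acquire(&ctx->flags) & IORING_SETUP_R_DISABLED))
	task = ctx->submitter_task;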
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 4add705e4e ("io_uring: remove io_register_submitter")
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'io_uring-6.19-20260109' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
"A single fix for a regression introduced in 6.15, where a failure to
wake up idle io-wq workers at ring exit means the exit waits for the
timeout to expire.
This isn't normally noticeable, as the exit is async.
But if a parent task created a thread that sets up a ring and uses
requests that cause io-wq threads to be created, and the parent task
then waits for the thread to exit, then it can take 5 seconds for that
pthread_join() to succeed as the child thread is waiting for its
children to exit.
On top of that, just a basic cleanup as well"
* tag 'io_uring-6.19-20260109' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/io-wq: remove io_wq_for_each_worker() return value
io_uring/io-wq: fix incorrect io_wq_for_each_worker() termination logic
io_import_kbuf() recalculates iter->nr_segs to reflect only the bvecs
needed for the requested byte range. This was added to provide an
accurate segment count to bio_iov_bvec_set(), which copied nr_segs to
bio->bi_vcnt for use as a bio split hint.
The previous two patches eliminated this dependency:
- bio_may_need_split() now uses bi_iter instead of bi_vcnt for split
decisions
- bio_iov_bvec_set() no longer copies nr_segs to bi_vcnt
Since nr_segs is no longer used for bio split decisions, the
recalculation loop is unnecessary. The iov_iter already has the correct
bi_size to cap iteration, so an oversized nr_segs is harmless.
Link: https://lkml.org/lkml/2025/4/16/351
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Clean up some leftovers of refactoring io_uring into multiple files.
Compile tested with a few configurations.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only use of this helper is to iterate all of the workers, and
hence all callers will pass in a func that always returns false to do
that. As none of the callers use the return value, get rid of it.
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit added this helper, and had it terminate if false is
returned from the handler. However, that is the complete opposite: it
should abort the loop if true is returned.
Fix this up by having io_wq_for_each_worker() keep iterating as long
as false is returned, and only abort if true is returned.
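The fixed iteration, in sketch form (the list and type names are
assumed from the Fixes commit):

static void io_wq_for_each_worker(struct io_wq_acct *acct,
				  bool (*func)(struct io_worker *, void *),
				  void *data)
{
	struct io_worker *worker;

	list_for_each_entry_rcu(worker, &acct->all_list, all_list) {
		/* a true return from the handler aborts the walk */
		if (func(worker, data))
			break;
	}
}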
Cc: stable@vger.kernel.org
Fixes: 751eedc4b4 ("io_uring/io-wq: move worker lists to struct io_wq_acct")
Reported-by: Lewis Campbell <info@lewiscampbell.tech>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'io_uring-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
- Removed dead argument length for io_uring_validate_mmap_request()
- Use GFP_NOWAIT for overflow CQEs on legacy ring setups rather than
GFP_ATOMIC, which makes it play nicer with memcg limits
- Fix a potential circular locking issue with tctx node removal and
exec based cancelations
* tag 'io_uring-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/memmap: drop unused sz param in io_uring_validate_mmap_request()
io_uring/tctx: add separate lock for list of tctx's in ctx
io_uring: use GFP_NOWAIT for overflow CQEs on legacy rings
io_uring_validate_mmap_request() doesn't use its size_t sz argument, so
remove it.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>