Commit graph

54 commits

Author SHA1 Message Date
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Linus Torvalds
f5d4feed17 for-7.0/io_uring-20260206
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmGJxsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpk6+EACamMdw6WU4VVNjUtjT93FuXxor4ioyhowJ
 myRtKG3ZvYrE63Z8F1dCQE28RXi9n6MhGxabCq8WZVGkhTv27DuaBkDjU4T8oCnP
 EYhs5a3sdRXfKuIlqVbxuiFdmiPHEP0vh3/MviKx9Ju3/Po3OEWKBalNMevfGkS4
 bRNp9IQkAYNSRhGma2ni9Rnc5welWmhpsxUKFdGtPRX53ZlYegiZxKlfKMB4/SQ+
 7XAWKhy9dOGVo4DpLof7mCX6hMeX+FoNkJzF6cTMO/IF//lCLjI9BN4SMiI6mmEN
 RY6PLJiFraoQx8wdr3J1LtBCNXzzj6cPk6PNHKtsodoafe2oYFNLNgfAa9pHDzfM
 12kvy58au0cQG6TnS2eNlqM2GN116mJi+k00E+UW4iaXXtpqcdcBrLlS+Q5hJ78C
 9MBLQofv7D06C6kbpxV2pVS1u4oxefjl19wWLqLKx/VytCHrsaTm50n1r0k7YLCc
 plvPkQRQobqpp2GtcaXcfmsi1Vfu4jzMBAN+rTN4/te0kudNqL9+hPvrejIMEURc
 2AcktMAHC8wjpr93dFASXiWh/fdyhV4e2a/D/ML4PXxhnCfnGx5s5Tp/pGjePHEU
 dLZm9vadmr/Yrdgycf9gQ8mz9IxI9FNJCKbI7lf7+/KJXe7DwngOa6VHNblWBRHv
 YoX6bG1yQQ==
 =Q248
 -----END PGP SIGNATURE-----

Merge tag 'for-7.0/io_uring-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring updates from Jens Axboe:

 - Clean up the IORING_SETUP_R_DISABLED and submitter task checking,
   mostly just in preparation for relaxing the locking for SINGLE_ISSUER
   in the future.

 - Improve IOPOLL by using a doubly linked list to manage completions.

   Previously it was singly listed, which meant that to complete request
   N in the chain 0..N-1 had to have completed first. With a doubly
   linked list we can complete whatever request completes in that order,
   rather than need to wait for a consecutive range to be available.
   This reduces latencies.

 - Improve the restriction setup and checking. Mostly in preparation for
   adding further features on top of that. Coming in a separate pull
   request.

 - Split out task_work and wait handling into separate files. These are
   mostly nicely abstracted already, but still remained in the
   io_uring.c file which is on the larger side.

 - Use GFP_KERNEL_ACCOUNT in a few more spots, where appropriate.

 - Ensure even the idle io-wq worker exits if a task no longer has any
   rings open.

 - Add support for a non-circular submission queue.

   By default, the SQ ring keeps moving around, even if only a few
   entries are used for each submission. This can be wasteful in terms
   of cachelines.

   If IORING_SETUP_SQ_REWIND is set for the ring when created, each
   submission will start at offset 0 instead of where we last left off
   doing submissions.

 - Various little cleanups

* tag 'for-7.0/io_uring-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (30 commits)
  io_uring/kbuf: fix memory leak if io_buffer_add_list fails
  io_uring: Add SPDX id lines to remaining source files
  io_uring: allow io-wq workers to exit when unused
  io_uring/io-wq: add exit-on-idle state
  io_uring/net: don't continue send bundle if poll was required for retry
  io_uring/rsrc: use GFP_KERNEL_ACCOUNT consistently
  io_uring/futex: use GFP_KERNEL_ACCOUNT for futex data allocation
  io_uring/io-wq: handle !sysctl_hung_task_timeout_secs
  io_uring: fix bad indentation for setup flags if statement
  io_uring/rsrc: take unsigned index in io_rsrc_node_lookup()
  io_uring: introduce non-circular SQ
  io_uring: split out CQ waiting code into wait.c
  io_uring: split out task work code into tw.c
  io_uring/io-wq: don't trigger hung task for syzbot craziness
  io_uring: add IO_URING_EXIT_WAIT_MAX definition
  io_uring/sync: validate passed in offset
  io_uring/eventfd: remove unused ctx->evfd_last_cq_tail member
  io_uring/timeout: annotate data race in io_flush_timeouts()
  io_uring/uring_cmd: explicitly disallow cancelations for IOPOLL
  io_uring: fix IOPOLL with passthrough I/O
  ...
2026-02-09 17:22:00 -08:00
Li Chen
38aa434ab9 io_uring/io-wq: add exit-on-idle state
io-wq uses an idle timeout to shrink the pool, but keeps the last worker
around indefinitely to avoid churn.

For tasks that used io_uring for file I/O and then stop using io_uring,
this can leave an iou-wrk-* thread behind even after all io_uring
instances are gone. This is unnecessary overhead and also gets in the
way of process checkpoint/restore.

Add an exit-on-idle state that makes all io-wq workers exit as soon as
they become idle, and provide io_wq_set_exit_on_idle() to toggle it.

Signed-off-by: Li Chen <me@linux.beauty>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-02 08:10:23 -07:00
Jens Axboe
816095894c io_uring/io-wq: handle !sysctl_hung_task_timeout_secs
If the hung_task_timeout sysctl is set to 0, then we'll end up busy
looping inside io_wq_exit_workers() after an earlier commit switched to
using wait_for_completion_timeout(). Use the maximum schedule timeout
value for that case.

Fixes: 1f293098a3 ("io_uring/io-wq: don't trigger hung task for syzbot craziness")
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-23 13:58:03 -07:00
Jens Axboe
1f293098a3 io_uring/io-wq: don't trigger hung task for syzbot craziness
Use the same trick that blk_io_schedule() does to avoid triggering the
hung task warning (and potential reboot/panic, depending on system
settings), and only wait for half the hung task timeout at the time.
If we exceed the default IO_URING_EXIT_WAIT_MAX period where we expect
things to certainly have finished unless there's a bug, then throw a
WARN_ON_ONCE() for that case.

Reported-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com
Tested-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22 07:25:35 -07:00
Jens Axboe
10dc959398 io_uring/io-wq: check IO_WQ_BIT_EXIT inside work run loop
Currently this is checked before running the pending work. Normally this
is quite fine, as work items either end up blocking (which will create a
new worker for other items), or they complete fairly quickly. But syzbot
reports an issue where io-wq takes seemingly forever to exit, and with a
bit of debugging, this turns out to be because it queues a bunch of big
(2GB - 4096b) reads with a /dev/msr* file. Since this file type doesn't
support ->read_iter(), loop_rw_iter() ends up handling them. Each read
returns 16MB of data read, which takes 20 (!!) seconds. With a bunch of
these pending, processing the whole chain can take a long time. Easily
longer than the syzbot uninterruptible sleep timeout of 140 seconds.
This then triggers a complaint off the io-wq exit path:

INFO: task syz.4.135:6326 blocked for more than 143 seconds.
      Not tainted syzkaller #0
      Blocked by coredump.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz.4.135       state:D stack:26824 pid:6326  tgid:6324  ppid:5957   task_flags:0x400548 flags:0x00080000
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5256 [inline]
 __schedule+0x1139/0x6150 kernel/sched/core.c:6863
 __schedule_loop kernel/sched/core.c:6945 [inline]
 schedule+0xe7/0x3a0 kernel/sched/core.c:6960
 schedule_timeout+0x257/0x290 kernel/time/sleep_timeout.c:75
 do_wait_for_common kernel/sched/completion.c:100 [inline]
 __wait_for_common+0x2fc/0x4e0 kernel/sched/completion.c:121
 io_wq_exit_workers io_uring/io-wq.c:1328 [inline]
 io_wq_put_and_exit+0x271/0x8a0 io_uring/io-wq.c:1356
 io_uring_clean_tctx+0x10d/0x190 io_uring/tctx.c:203
 io_uring_cancel_generic+0x69c/0x9a0 io_uring/cancel.c:651
 io_uring_files_cancel include/linux/io_uring.h:19 [inline]
 do_exit+0x2ce/0x2bd0 kernel/exit.c:911
 do_group_exit+0xd3/0x2a0 kernel/exit.c:1112
 get_signal+0x2671/0x26d0 kernel/signal.c:3034
 arch_do_signal_or_restart+0x8f/0x7e0 arch/x86/kernel/signal.c:337
 __exit_to_user_mode_loop kernel/entry/common.c:41 [inline]
 exit_to_user_mode_loop+0x8c/0x540 kernel/entry/common.c:75
 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
 syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
 syscall_exit_to_user_mode_work include/linux/entry-common.h:159 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:194 [inline]
 do_syscall_64+0x4ee/0xf80 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fa02738f749
RSP: 002b:00007fa0281ae0e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 00007fa0275e6098 RCX: 00007fa02738f749
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007fa0275e6098
RBP: 00007fa0275e6090 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fa0275e6128 R14: 00007fff14e4fcb0 R15: 00007fff14e4fd98

There's really nothing wrong here, outside of processing these reads
will take a LONG time. However, we can speed up the exit by checking the
IO_WQ_BIT_EXIT inside the io_worker_handle_work() loop, as syzbot will
exit the ring after queueing up all of these reads. Then once the first
item is processed, io-wq will simply cancel the rest. That should avoid
syzbot running into this complaint again.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/68a2decc.050a0220.e29e5.0099.GAE@google.com/
Reported-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-20 07:55:59 -07:00
Jens Axboe
e4fdbca2dc io_uring/io-wq: remove io_wq_for_each_worker() return value
The only use of this helper is to iterate all of the workers, and
hence all callers will pass in a func that always returns false to do
that. As none of the callers use the return value, get rid of it.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-05 15:39:20 -07:00
Jens Axboe
e0392a10c9 io_uring/io-wq: fix incorrect io_wq_for_each_worker() termination logic
A previous commit added this helper, and had it terminate if false is
returned from the handler. However, that is completely opposite, it
should abort the loop if true is returned.

Fix this up by having io_wq_for_each_worker() keep iterating as long
as false is returned, and only abort if true is returned.

Cc: stable@vger.kernel.org
Fixes: 751eedc4b4 ("io_uring/io-wq: move worker lists to struct io_wq_acct")
Reported-by: Lewis Campbell <info@lewiscampbell.tech>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-05 14:37:33 -07:00
Caleb Sander Mateos
34c78b8610 io_uring/io-wq: always retry worker create on ERESTART*
If a task has a pending signal when create_io_thread() is called,
copy_process() will return -ERESTARTNOINTR. io_should_retry_thread()
will request a retry of create_io_thread() up to WORKER_INIT_LIMIT = 3
times. If all retries fail, the io_uring request will fail with
ECANCELED.
Commit 3918315c5dc ("io-wq: backoff when retrying worker creation")
added a linear backoff to allow the thread to handle its signal before
the retry. However, a thread receiving frequent signals may get unlucky
and have a signal pending at every retry. Since the userspace task
doesn't control when it receives signals, there's no easy way for it to
prevent the create_io_thread() failure due to pending signals. The task
may also lack the information necessary to regenerate the canceled SQE.
So always retry the create_io_thread() on the ERESTART* errors,
analogous to what a fork() syscall would do. EAGAIN can occur due to
various persistent conditions such as exceeding RLIMIT_NPROC, so respect
the WORKER_INIT_LIMIT retry limit for EAGAIN errors.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 07:18:02 -07:00
Max Kellermann
cd4ea81be3 io_uring/io-wq: fix max_workers breakage and nr_workers underflow
Commit 88e6c42e40de ("io_uring/io-wq: add check free worker before
create new worker") reused the variable `do_create` for something
else, abusing it for the free worker check.

This caused the value to effectively always be `true` at the time
`nr_workers < max_workers` was checked, but it should really be
`false`.  This means the `max_workers` setting was ignored, and worse:
if the limit had already been reached, incrementing `nr_workers` was
skipped even though another worker would be created.

When later lots of workers exit, the `nr_workers` field could easily
underflow, making the problem worse because more and more workers
would be created without incrementing `nr_workers`.

The simple solution is to use a different variable for the free worker
check instead of using one variable for two different things.

Cc: stable@vger.kernel.org
Fixes: 88e6c42e40de ("io_uring/io-wq: add check free worker before create new worker")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-15 10:46:13 -06:00
Fengnan Chang
9d83e1f05c io_uring/io-wq: add check free worker before create new worker
After commit 0b2b066f8a ("io_uring/io-wq: only create a new worker
if it can make progress"), in our produce environment, we still
observe that part of io_worker threads keeps creating and destroying.
After analysis, it was confirmed that this was due to a more complex
scenario involving a large number of fsync operations, which can be
abstracted as frequent write + fsync operations on multiple files in
a single uring instance. Since write is a hash operation while fsync
is not, and fsync is likely to be suspended during execution, the
action of checking the hash value in
io_wqe_dec_running cannot handle such scenarios.
Similarly, if hash-based work and non-hash-based work are sent at the
same time, similar issues are likely to occur.
Returning to the starting point of the issue, when a new work
arrives, io_wq_enqueue may wake up free worker A, while
io_wq_dec_running may create worker B. Ultimately, only one of A and
B can obtain and process the task, leaving the other in an idle
state. In the end, the issue is caused by inconsistent logic in the
checks performed by io_wq_enqueue and io_wq_dec_running.
Therefore, the problem can be resolved by checking for available
workers in io_wq_dec_running.

Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Reviewed-by: Diangang Li <lidiangang@bytedance.com>
Link: https://lore.kernel.org/r/20250813120214.18729-1-changfengnan@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-13 06:31:10 -06:00
Penglei Jiang
89465d923b io_uring: fix task leak issue in io_wq_create()
Add missing put_task_struct() in the error path

Cc: stable@vger.kernel.org
Fixes: 0f8baa3c98 ("io-wq: fully initialize wqe before calling cpuhp_state_add_instance_nocalls()")
Signed-off-by: Penglei Jiang <superman.xpt@gmail.com>
Link: https://lore.kernel.org/r/20250615163906.2367-1-superman.xpt@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-15 12:58:39 -06:00
Jens Axboe
0b2b066f8a io_uring/io-wq: only create a new worker if it can make progress
Hashed work is serialized by io-wq, intended to be used for cases like
serializing buffered writes to a regular file, where the file system
will serialize the workers anyway with a mutex or similar. Since they
would be forcibly serialized and blocked, it's more efficient for io-wq
to handle these individually rather than issue them in parallel.

If a worker is currently handling a hashed work item and gets blocked,
don't create a new worker if the next work item is also hashed and
mapped to the same bucket. That new worker would not be able to make any
progress anyway.

Reported-by: Fengnan Chang <changfengnan@bytedance.com>
Reported-by: Diangang Li <lidiangang@bytedance.com>
Link: https://lore.kernel.org/io-uring/20250522090909.73212-1-changfengnan@bytedance.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-23 06:14:10 -06:00
Jens Axboe
8343cae362 io_uring/io-wq: ignore non-busy worker going to sleep
When an io-wq worker goes to sleep, it checks if there's work to do.
If there is, it'll create a new worker. But if this worker is currently
idle, it'll either get woken right back up immediately, or someone
else has already created the necessary worker to handle this work.

Only go through the worker creation logic if the current worker is
currently handling a work item. That means it's being scheduled out as
part of handling that work, not just going to sleep on its own.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-23 06:14:07 -06:00
Jens Axboe
e37dfc0530 io_uring/io-wq: move hash helpers to the top
Just in preparation for using them higher up in the function, move
these generic helpers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-23 06:14:04 -06:00
Caleb Sander Mateos
9fe99eed91 io_uring/wq: avoid indirect do_work/free_work calls
struct io_wq stores do_work and free_work function pointers which are
called on each work item. But these function pointers are always set to
io_wq_submit_work and io_wq_free_work, respectively. So remove these
function pointers and just call the functions directly.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250329161527.3281314-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:06:58 -06:00
Jens Axboe
c0d8c0362b Merge branch 'io_uring-6.14' into for-6.15/io_uring
Merge mainline fixes into 6.15 branch, as upcoming patches depend on
fixes that went into the 6.14 mainline branch.

* io_uring-6.14:
  io_uring/net: save msg_control for compat
  io_uring/rw: clean up mshot forced sync mode
  io_uring/rw: move ki_complete init into prep
  io_uring/rw: don't directly use ki_complete
  io_uring/rw: forbid multishot async reads
  io_uring/rsrc: remove unused constants
  io_uring: fix spelling error in uapi io_uring.h
  io_uring: prevent opcode speculation
  io-wq: backoff when retrying worker creation
2025-02-27 07:18:01 -07:00
Max Kellermann
7d568502ef io_uring/io-wq: pass io_wq to io_get_next_work()
The only caller has already determined this pointer, so let's skip
the redundant dereference.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-7-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Max Kellermann
486ba4d84d io_uring/io-wq: do not use bogus hash value
Previously, the `hash` variable was initialized with `-1` and only
updated by io_get_next_work() if the current work was hashed.  Commit
60cf46ae60 ("io-wq: hash dependent work") changed this to always
call io_get_work_hash() even if the work was not hashed.  This caused
the `hash != -1U` check to always be true, adding some overhead for
the `hash->wait` code.

This patch fixes the regression by checking the `IO_WQ_WORK_HASHED`
flag.

Perf diff for a flood of `IORING_OP_NOP` with `IOSQE_ASYNC`:

    38.55%     -1.57%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     6.86%     -0.72%  [kernel.kallsyms]  [k] io_worker_handle_work
     0.10%     +0.67%  [kernel.kallsyms]  [k] put_prev_entity
     1.96%     +0.59%  [kernel.kallsyms]  [k] io_nop_prep
     3.31%     -0.51%  [kernel.kallsyms]  [k] try_to_wake_up
     7.18%     -0.47%  [kernel.kallsyms]  [k] io_wq_free_work

Fixes: 60cf46ae60 ("io-wq: hash dependent work")
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-6-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Max Kellermann
6ee78354ea io_uring/io-wq: cache work->flags in variable
This eliminates several redundant atomic reads and therefore reduces
the duration the surrounding spinlocks are held.

In several io_uring benchmarks, this reduced the CPU time spent in
queued_spin_lock_slowpath() considerably:

io_uring benchmark with a flood of `IORING_OP_NOP` and `IOSQE_ASYNC`:

    38.86%     -1.49%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     6.75%     +0.36%  [kernel.kallsyms]  [k] io_worker_handle_work
     2.60%     +0.19%  [kernel.kallsyms]  [k] io_nop
     3.92%     +0.18%  [kernel.kallsyms]  [k] io_req_task_complete
     6.34%     -0.18%  [kernel.kallsyms]  [k] io_wq_submit_work

HTTP server, static file:

    42.79%     -2.77%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
     2.08%     +0.23%  [kernel.kallsyms]     [k] io_wq_submit_work
     1.19%     +0.20%  [kernel.kallsyms]     [k] amd_iommu_iotlb_sync_map
     1.46%     +0.15%  [kernel.kallsyms]     [k] ep_poll_callback
     1.80%     +0.15%  [kernel.kallsyms]     [k] io_worker_handle_work

HTTP server, PHP:

    35.03%     -1.80%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
     0.84%     +0.21%  [kernel.kallsyms]     [k] amd_iommu_iotlb_sync_map
     1.39%     +0.12%  [kernel.kallsyms]     [k] _copy_to_iter
     0.21%     +0.10%  [kernel.kallsyms]     [k] update_sd_lb_stats

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-5-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Max Kellermann
751eedc4b4 io_uring/io-wq: move worker lists to struct io_wq_acct
Have separate linked lists for bounded and unbounded workers.  This
way, io_acct_activate_free_worker() sees only workers relevant to it
and doesn't need to skip irrelevant ones.  This speeds up the
linked list traversal (under acct->lock).

The `io_wq.lock` field is moved to `io_wq_acct.workers_lock`.  It did
not actually protect "access to elements below", that is, not all of
them; it only protected access to the worker lists.  By having two
locks instead of one, contention on this lock is reduced.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-4-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Max Kellermann
3d3bafd35f io_uring/io-wq: add io_worker.acct pointer
This replaces the `IO_WORKER_F_BOUND` flag.  All code that checks this
flag is not interested in knowing whether this is a "bound" worker;
all it does with this flag is determine the `io_wq_acct` pointer.  At
the cost of an extra pointer field, we can eliminate some fragile
pointer arithmetic.  In turn, the `create_index` and `index` fields
are not needed anymore.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-3-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Max Kellermann
3c75635f8e io_uring/io-wq: eliminate redundant io_work_get_acct() calls
Instead of calling io_work_get_acct() again, pass acct to
io_wq_insert_work() and io_wq_remove_pending().

This atomic access in io_work_get_acct() was done under the
`acct->lock`, and optimizing it away reduces lock contention a bit.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-2-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Uday Shankar
13918315c5 io-wq: backoff when retrying worker creation
When io_uring submission goes async for the first time on a given task,
we'll try to create a worker thread to handle the submission. Creating
this worker thread can fail due to various transient conditions, such as
an outstanding signal in the forking thread, so we have retry logic with
a limit of 3 retries. However, this retry logic appears to be too
aggressive/fast - we've observed a thread blowing through the retry
limit while having the same outstanding signal the whole time. Here's an
excerpt of some tracing that demonstrates the issue:

First, signal 26 is generated for the process. It ends up getting routed
to thread 92942.

 0)   cbd-92284    /* signal_generate: sig=26 errno=0 code=-2 comm=psblkdASD pid=92934 grp=1 res=0 */

This causes create_io_thread in the signalled thread to fail with
ERESTARTNOINTR, and thus a retry is queued.

13) task_th-92942  /* io_uring_queue_async_work: ring 000000007325c9ae, request 0000000080c96d8e, user_data 0x0, opcode URING_CMD, flags 0x8240001, normal queue, work 000000006e96dd3f */
13) task_th-92942  io_wq_enqueue() {
13) task_th-92942    _raw_spin_lock();
13) task_th-92942    io_wq_activate_free_worker();
13) task_th-92942    _raw_spin_lock();
13) task_th-92942    create_io_worker() {
13) task_th-92942      __kmalloc_cache_noprof();
13) task_th-92942      __init_swait_queue_head();
13) task_th-92942      kprobe_ftrace_handler() {
13) task_th-92942        get_kprobe();
13) task_th-92942        aggr_pre_handler() {
13) task_th-92942          pre_handler_kretprobe();
13) task_th-92942          /* create_enter: (create_io_thread+0x0/0x50) fn=0xffffffff8172c0e0 arg=0xffff888996bb69c0 node=-1 */
13) task_th-92942        } /* aggr_pre_handler */
...
13) task_th-92942        } /* copy_process */
13) task_th-92942      } /* create_io_thread */
13) task_th-92942      kretprobe_rethook_handler() {
13) task_th-92942        /* create_exit: (create_io_worker+0x8a/0x1a0 <- create_io_thread) arg1=0xfffffffffffffdff */
13) task_th-92942      } /* kretprobe_rethook_handler */
13) task_th-92942    queue_work_on() {
...

The CPU is then handed to a kworker to process the queued retry:

------------------------------------------
 13) task_th-92942  => kworker-54154
------------------------------------------
13) kworker-54154  io_workqueue_create() {
13) kworker-54154    io_queue_worker_create() {
13) kworker-54154      task_work_add() {
13) kworker-54154        wake_up_state() {
13) kworker-54154          try_to_wake_up() {
13) kworker-54154            _raw_spin_lock_irqsave();
13) kworker-54154            _raw_spin_unlock_irqrestore();
13) kworker-54154          } /* try_to_wake_up */
13) kworker-54154        } /* wake_up_state */
13) kworker-54154        kick_process();
13) kworker-54154      } /* task_work_add */
13) kworker-54154    } /* io_queue_worker_create */
13) kworker-54154  } /* io_workqueue_create */

And then we immediately switch back to the original task to try creating
a worker again. This fails, because the original task still hasn't
handled its signal.

-----------------------------------------
 13) kworker-54154  => task_th-92942
------------------------------------------
13) task_th-92942  create_worker_cont() {
13) task_th-92942    kprobe_ftrace_handler() {
13) task_th-92942      get_kprobe();
13) task_th-92942      aggr_pre_handler() {
13) task_th-92942        pre_handler_kretprobe();
13) task_th-92942        /* create_enter: (create_io_thread+0x0/0x50) fn=0xffffffff8172c0e0 arg=0xffff888996bb69c0 node=-1 */
13) task_th-92942      } /* aggr_pre_handler */
13) task_th-92942    } /* kprobe_ftrace_handler */
13) task_th-92942    create_io_thread() {
13) task_th-92942      copy_process() {
13) task_th-92942        task_active_pid_ns();
13) task_th-92942        _raw_spin_lock_irq();
13) task_th-92942        recalc_sigpending();
13) task_th-92942        _raw_spin_lock_irq();
13) task_th-92942      } /* copy_process */
13) task_th-92942    } /* create_io_thread */
13) task_th-92942    kretprobe_rethook_handler() {
13) task_th-92942      /* create_exit: (create_worker_cont+0x35/0x1b0 <- create_io_thread) arg1=0xfffffffffffffdff */
13) task_th-92942    } /* kretprobe_rethook_handler */
13) task_th-92942    io_worker_release();
13) task_th-92942    queue_work_on() {
13) task_th-92942      clear_pending_if_disabled();
13) task_th-92942      __queue_work() {
13) task_th-92942      } /* __queue_work */
13) task_th-92942    } /* queue_work_on */
13) task_th-92942  } /* create_worker_cont */

The pattern repeats another couple times until we blow through the retry
counter, at which point we give up. All outstanding work is canceled,
and the io_uring command which triggered all this is failed with
ECANCELED:

13) task_th-92942  io_acct_cancel_pending_work() {
...
13) task_th-92942  /* io_uring_complete: ring 000000007325c9ae, req 0000000080c96d8e, user_data 0x0, result -125, cflags 0x0 extra1 0 extra2 0  */

Finally, the task gets around to processing its outstanding signal 26,
but it's too late.

13) task_th-92942  /* signal_deliver: sig=26 errno=0 code=-2 sa_handler=59566a0 sa_flags=14000000 */

Try to address this issue by adding a small scaling delay when retrying
worker creation. This should give the forking thread time to handle its
signal in the above case. This isn't a particularly satisfying solution,
as sufficiently paradoxical scheduling would still have us hitting the
same issue, and I'm open to suggestions for something better. But this
is likely to prevent this (already rare) issue from hitting in practice.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250208-wq_retry-v2-1-4f6f5041d303@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-14 15:34:28 -07:00
Kees Cook
3a3f61ce5e exec: Make sure task->comm is always NUL-terminated
Using strscpy() meant that the final character in task->comm may be
non-NUL for a moment before the "string too long" truncation happens.

Instead of adding a new use of the ambiguous strncpy(), we'd want to
use memtostr_pad() which enforces being able to check at compile time
that sizes are sensible, but this requires being able to see string
buffer lengths. Instead of trying to inline __set_task_comm() (which
needs to call trace and perf functions), just open-code it. But to
make sure we're always safe, add compile-time checking like we already
do for get_task_comm().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kees Cook <kees@kernel.org>
2024-12-16 16:53:00 -08:00
Felix Moessbauer
84eacf177f io_uring/io-wq: inherit cpuset of cgroup in io worker
The io worker threads are userland threads that just never exit to the
userland. By that, they are also assigned to a cgroup (the group of the
creating task).

When creating a new io worker, this worker should inherit the cpuset
of the cgroup.

Fixes: da64d6db3b ("io_uring: One wqe per wq")
Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
Link: https://lore.kernel.org/r/20240910171157.166423-3-felix.moessbauer@siemens.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11 07:27:56 -06:00
Felix Moessbauer
0997aa5497 io_uring/io-wq: do not allow pinning outside of cpuset
The io worker threads are userland threads that just never exit to the
userland. By that, they are also assigned to a cgroup (the group of the
creating task).

When changing the affinity of the io_wq thread via syscall, we must only
allow cpumasks within the limits defined by the cpuset controller of the
cgroup (if enabled).

Fixes: da64d6db3b ("io_uring: One wqe per wq")
Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
Link: https://lore.kernel.org/r/20240910171157.166423-2-felix.moessbauer@siemens.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11 07:27:56 -06:00
Pavel Begunkov
0453aad676 io_uring/io-wq: limit retrying worker initialisation
If io-wq worker creation fails, we retry it by queueing up a task_work.
tasK_work is needed because it should be done from the user process
context. The problem is that retries are not limited, and if queueing a
task_work is the reason for the failure, we might get into an infinite
loop.

It doesn't seem to happen now but it would with the following patch
executing task_work in the freezer's loop. For now, arbitrarily limit the
number of attempts to create a worker.

Cc: stable@vger.kernel.org
Fixes: 3146cba99a ("io-wq: make worker creation resilient against signals")
Reported-by: Julian Orth <ju.orth@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8280436925db88448c7c85c6656edee1a43029ea.1720634146.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-11 01:51:44 -06:00
Jens Axboe
3474d1b93f io_uring/io-wq: make io_wq_work flags atomic
The work flags can be set/accessed from different tasks, both the
originator of the request, and the io-wq workers. While modifications
aren't concurrent, it still makes KMSAN unhappy. There's no real
downside to just making the flag reading/manipulation use proper
atomics here.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-16 14:54:55 -06:00
Su Hui
91215f70ea io_uring/io-wq: avoid garbage value of 'match' in io_wq_enqueue()
Clang static checker (scan-build) warning:
o_uring/io-wq.c:line 1051, column 3
The expression is an uninitialized value. The computed value will
also be garbage.

'match.nr_pending' is used in io_acct_cancel_pending_work(), but it is
not fully initialized. Change the order of assignment for 'match' to fix
this problem.

Fixes: 42abc95f05 ("io-wq: decouple work_list protection from the big wqe->lock")
Signed-off-by: Su Hui <suhui@nfschina.com>
Link: https://lore.kernel.org/r/20240604121242.2661244-1-suhui@nfschina.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-04 07:39:00 -06:00
Breno Leitao
8a56530492 io_uring/io-wq: Use set_bit() and test_bit() at worker->flags
Utilize set_bit() and test_bit() on worker->flags within io_uring/io-wq
to address potential data races.

The structure io_worker->flags may be accessed through various data
paths, leading to concurrency issues. When KCSAN is enabled, it reveals
data races occurring in io_worker_handle_work and
io_wq_activate_free_worker functions.

	 BUG: KCSAN: data-race in io_worker_handle_work / io_wq_activate_free_worker
	 write to 0xffff8885c4246404 of 4 bytes by task 49071 on cpu 28:
	 io_worker_handle_work (io_uring/io-wq.c:434 io_uring/io-wq.c:569)
	 io_wq_worker (io_uring/io-wq.c:?)
<snip>

	 read to 0xffff8885c4246404 of 4 bytes by task 49024 on cpu 5:
	 io_wq_activate_free_worker (io_uring/io-wq.c:? io_uring/io-wq.c:285)
	 io_wq_enqueue (io_uring/io-wq.c:947)
	 io_queue_iowq (io_uring/io_uring.c:524)
	 io_req_task_submit (io_uring/io_uring.c:1511)
	 io_handle_tw_list (io_uring/io_uring.c:1198)
<snip>

Line numbers against commit 18daea77cc ("Merge tag 'for-linus' of
git://git.kernel.org/pub/scm/virt/kvm/kvm").

These races involve writes and reads to the same memory location by
different tasks running on different CPUs. To mitigate this, refactor
the code to use atomic operations such as set_bit(), test_bit(), and
clear_bit() instead of basic "and" and "or" operations. This ensures
thread-safe manipulation of worker flags.

Also, move `create_index` to avoid holes in the structure.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240507170002.2269003-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07 13:17:19 -06:00
Gabriel Krisman Bertazi
24c3fc5c75 io-wq: Drop intermediate step between pending list and active work
next_work is only used to make the work visible for
cancellation. Instead, we can just directly write to cur_work before
dropping the acct_lock and avoid the extra hop.

Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20240416021054.3940-3-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:20:32 -06:00
Gabriel Krisman Bertazi
068c27e32e io-wq: write next_work before dropping acct_lock
Commit 361aee450c ("io-wq: add intermediate work step between pending
list and active work") closed a race between a cancellation and the work
being removed from the wq for execution.  To ensure the request is
always reachable by the cancellation, we need to move it within the wq
lock, which also synchronizes the cancellation.  But commit
42abc95f05 ("io-wq: decouple work_list protection from the big
wqe->lock") replaced the wq lock here and accidentally reintroduced the
race by releasing the acct_lock too early.

In other words:

        worker                |     cancellation
work = io_get_next_work()     |
raw_spin_unlock(&acct->lock); |
			      |
                              | io_acct_cancel_pending_work
                              | io_wq_worker_cancel()
worker->next_work = work

Using acct_lock is still enough since we synchronize on it on
io_acct_cancel_pending_work.

Fixes: 42abc95f05 ("io-wq: decouple work_list protection from the big wqe->lock")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20240416021054.3940-2-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:20:32 -06:00
Jeff Moyer
0f8baa3c98 io-wq: fully initialize wqe before calling cpuhp_state_add_instance_nocalls()
I received a bug report with the following signature:

[ 1759.937637] BUG: unable to handle page fault for address: ffffffffffffffe8
[ 1759.944564] #PF: supervisor read access in kernel mode
[ 1759.949732] #PF: error_code(0x0000) - not-present page
[ 1759.954901] PGD 7ab615067 P4D 7ab615067 PUD 7ab617067 PMD 0
[ 1759.960596] Oops: 0000 1 PREEMPT SMP PTI
[ 1759.964804] CPU: 15 PID: 109 Comm: cpuhp/15 Kdump: loaded Tainted: G X ------- — 5.14.0-362.3.1.el9_3.x86_64 #1
[ 1759.976609] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/20/2018
[ 1759.985181] RIP: 0010:io_wq_for_each_worker.isra.0+0x24/0xa0
[ 1759.990877] Code: 90 90 90 90 90 90 0f 1f 44 00 00 41 56 41 55 41 54 55 48 8d 6f 78 53 48 8b 47 78 48 39 c5 74 4f 49 89 f5 49 89 d4 48 8d 58 e8 <8b> 13 85 d2 74 32 8d 4a 01 89 d0 f0 0f b1 0b 75 5c 09 ca 78 3d 48
[ 1760.009758] RSP: 0000:ffffb6f403603e20 EFLAGS: 00010286
[ 1760.015013] RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: 0000000000000000
[ 1760.022188] RDX: ffffb6f403603e50 RSI: ffffffffb11e95b0 RDI: ffff9f73b09e9400
[ 1760.029362] RBP: ffff9f73b09e9478 R08: 000000000000000f R09: 0000000000000000
[ 1760.036536] R10: ffffffffffffff00 R11: ffffb6f403603d80 R12: ffffb6f403603e50
[ 1760.043712] R13: ffffffffb11e95b0 R14: ffffffffb28531e8 R15: ffff9f7a6fbdf548
[ 1760.050887] FS: 0000000000000000(0000) GS:ffff9f7a6fbc0000(0000) knlGS:0000000000000000
[ 1760.059025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1760.064801] CR2: ffffffffffffffe8 CR3: 00000007ab610002 CR4: 00000000007706e0
[ 1760.071976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1760.079150] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1760.086325] PKRU: 55555554
[ 1760.089044] Call Trace:
[ 1760.091501] <TASK>
[ 1760.093612] ? show_trace_log_lvl+0x1c4/0x2df
[ 1760.097995] ? show_trace_log_lvl+0x1c4/0x2df
[ 1760.102377] ? __io_wq_cpu_online+0x54/0xb0
[ 1760.106584] ? __die_body.cold+0x8/0xd
[ 1760.110356] ? page_fault_oops+0x134/0x170
[ 1760.114479] ? kernelmode_fixup_or_oops+0x84/0x110
[ 1760.119298] ? exc_page_fault+0xa8/0x150
[ 1760.123247] ? asm_exc_page_fault+0x22/0x30
[ 1760.127458] ? __pfx_io_wq_worker_affinity+0x10/0x10
[ 1760.132453] ? __pfx_io_wq_worker_affinity+0x10/0x10
[ 1760.137446] ? io_wq_for_each_worker.isra.0+0x24/0xa0
[ 1760.142527] __io_wq_cpu_online+0x54/0xb0
[ 1760.146558] cpuhp_invoke_callback+0x109/0x460
[ 1760.151029] ? __pfx_io_wq_cpu_offline+0x10/0x10
[ 1760.155673] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 1760.160320] cpuhp_thread_fun+0x8d/0x140
[ 1760.164266] smpboot_thread_fn+0xd3/0x1a0
[ 1760.168297] kthread+0xdd/0x100
[ 1760.171457] ? __pfx_kthread+0x10/0x10
[ 1760.175225] ret_from_fork+0x29/0x50
[ 1760.178826] </TASK>
[ 1760.181022] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat dm_multipath intel_rapl_msr intel_rapl_common isst_if_common ipmi_ssif nfit libnvdimm mgag200 i2c_algo_bit ioatdma drm_shmem_helper drm_kms_helper acpi_ipmi syscopyarea x86_pkg_temp_thermal sysfillrect ipmi_si intel_powerclamp sysimgblt ipmi_devintf coretemp acpi_power_meter ipmi_msghandler rapl pcspkr dca intel_pch_thermal intel_cstate ses lpc_ich intel_uncore enclosure hpilo mei_me mei acpi_tad fuse drm xfs sd_mod sg bnx2x nvme nvme_core crct10dif_pclmul crc32_pclmul nvme_common ghash_clmulni_intel smartpqi tg3 t10_pi mdio uas libcrc32c crc32c_intel scsi_transport_sas usb_storage hpwdt wmi dm_mirror dm_region_hash dm_log dm_mod
[ 1760.248623] CR2: ffffffffffffffe8

A cpu hotplug callback was issued before wq->all_list was initialized.
This results in a null pointer dereference.  The fix is to fully setup
the io_wq before calling cpuhp_state_add_instance_nocalls().

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Link: https://lore.kernel.org/r/x49y1ghnecs.fsf@segfault.boston.devel.redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-05 14:11:18 -06:00
Pavel Begunkov
45500dc4e0 io_uring: break out of iowq iopoll on teardown
io-wq will retry iopoll even when it failed with -EAGAIN. If that
races with task exit, which sets TIF_NOTIFY_SIGNAL for all its workers,
such workers might potentially infinitely spin retrying iopoll again and
again and each time failing on some allocation / waiting / etc. Don't
keep spinning if io-wq is dying.

Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-07 09:02:27 -06:00
Jens Axboe
ebdfefc09c io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used
If we setup the ring with SQPOLL, then that polling thread has its
own io-wq setup. This means that if the application uses
IORING_REGISTER_IOWQ_AFF to set the io-wq affinity, we should not be
setting it for the invoking task, but rather the sqpoll task.

Add an sqpoll helper that parks the thread and updates the affinity,
and use that one if we're using SQPOLL.

Fixes: fe76421d1d ("io_uring: allow user configurable IO thread CPU affinity")
Cc: stable@vger.kernel.org # 5.10+
Link: https://github.com/axboe/liburing/discussions/884
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-16 13:40:28 -06:00
Jens Axboe
22f7fb80e6 io_uring/io-wq: don't gate worker wake up success on wake_up_process()
All we really care about is finding a free worker. If said worker is
already running, it's either starting new work already or it's just
finishing up existing work. For the latter, we'll be finding this work
item next anyway, and for the former, if the worker does go to sleep,
it'll create a new worker anyway as we have pending items.

This reduces try_to_wake_up() overhead considerably:

23.16%    -10.46%  [kernel.kallsyms]      [k] try_to_wake_up

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 10:36:20 -06:00
Jens Axboe
de36a15f9a io_uring/io-wq: reduce frequency of acct->lock acquisitions
When we check if we have work to run, we grab the acct lock, check,
drop it, and then return the result. If we do have work to run, then
running the work will again grab acct->lock and get the work item.

This causes us to grab acct->lock more frequently than we need to.
If we have work to do, have io_acct_run_queue() return with the acct
lock still acquired. io_worker_handle_work() is then always invoked
with the acct lock already held.

In a simple test cases that stats files (IORING_OP_STATX always hits
io-wq), we see a nice reduction in locking overhead with this change:

19.32%   -12.55%  [kernel.kallsyms]      [k] __cmpwait_case_32
20.90%   -12.07%  [kernel.kallsyms]      [k] queued_spin_lock_slowpath

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 10:36:17 -06:00
Jens Axboe
78848b9b05 io_uring/io-wq: don't grab wq->lock for worker activation
The worker free list is RCU protected, and checks for workers going away
when iterating it. There's no need to hold the wq->lock around the
lookup.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 10:36:12 -06:00
Jens Axboe
adeaa3f290 io_uring/io-wq: clear current->worker_private on exit
A recent fix stopped clearing PF_IO_WORKER from current->flags on exit,
which meant that we can now call inc/dec running on the worker after it
has been removed if it ends up scheduling in/out as part of exit.

If this happens after an RCU grace period has passed, then the struct
pointed to by current->worker_private may have been freed, and we can
now be accessing memory that is freed.

Ensure this doesn't happen by clearing the task worker_private field.
Both io_wq_worker_running() and io_wq_worker_sleeping() check this
field before going any further, and we don't need any accounting etc
done after this worker has exited.

Fixes: fd37b88400 ("io_uring/io-wq: don't clear PF_IO_WORKER on exit")
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14 12:54:55 -06:00
Jens Axboe
fd37b88400 io_uring/io-wq: don't clear PF_IO_WORKER on exit
A recent commit gated the core dumping task exit logic on current->flags
remaining consistent in terms of PF_{IO,USER}_WORKER at task exit time.
This exposed a problem with the io-wq handling of that, which explicitly
clears PF_IO_WORKER before calling do_exit().

The reasons for this manual clear of PF_IO_WORKER is historical, where
io-wq used to potentially trigger a sleep on exit. As the io-wq thread
is exiting, it should not participate any further accounting. But these
days we don't need to rely on current->flags anymore, so we can safely
remove the PF_IO_WORKER clearing.

Reported-by: Zorro Lang <zlang@redhat.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Link: https://lore.kernel.org/all/ZIZSPyzReZkGBEFy@dread.disaster.area/
Fixes: f9010dbdce ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-12 11:16:09 -07:00
Jens Axboe
07d99096e1 io_uring/io-wq: drop outdated comment
Since the move to PF_IO_WORKER, we don't juggle memory context manually
anymore. Remove that outdated part of the comment for __io_worker_idle().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:15 -06:00
Gabriel Krisman Bertazi
eb47943f22 io-wq: Drop struct io_wqe
Since commit 0654b05e7e65 ("io_uring: One wqe per wq"), we have just a
single io_wqe instance embedded per io_wq.  Drop the extra structure in
favor of accessing struct io_wq directly, cleaning up quite a bit of
dereferences and backpointers.

No functional changes intended.  Tested with liburing's testsuite
and mmtests performance microbenchmarks.  I didn't observe any
performance regressions.

Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20230322011628.23359-2-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:15 -06:00
Gabriel Krisman Bertazi
dfd63baf89 io-wq: Move wq accounting to io_wq
Since we now have a single io_wqe per io_wq instead of per-node, and in
preparation to its removal, move the accounting into the parent
structure.

Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20230322011628.23359-2-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:14 -06:00
Breno Leitao
da64d6db3b io_uring: One wqe per wq
Right now io_wq allocates one io_wqe per NUMA node.  As io_wq is now
bound to a task, the task basically uses only the NUMA local io_wqe, and
almost never changes NUMA nodes, thus, the other wqes are mostly
unused.

Allocate just one io_wqe embedded into io_wq, and uses all possible cpus
(cpu_possible_mask) in the io_wqe->cpumask.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230310201107.4020580-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:14:21 -06:00
Jens Axboe
01e68ce08a io_uring/io-wq: stop setting PF_NO_SETAFFINITY on io-wq workers
Every now and then reports come in that are puzzled on why changing
affinity on the io-wq workers fails with EINVAL. This happens because they
set PF_NO_SETAFFINITY as part of their creation, as io-wq organizes
workers into groups based on what CPU they are running on.

However, this is purely an optimization and not a functional requirement.
We can allow setting affinity, and just lazily update our worker to wqe
mappings. If a given io-wq thread times out, it normally exits if there's
no more work to do. The exception is if it's the last worker available.
For the timeout case, check the affinity of the worker against group mask
and exit even if it's the last worker. New workers should be created with
the right mask and in the right location.

Reported-by:Daniel Dao <dqminh@cloudflare.com>
Link: https://lore.kernel.org/io-uring/CA+wXwBQwgxB3_UphSny-yAP5b26meeOu1W4TwYVcD_+5gOhvPw@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-08 08:48:13 -07:00
Jens Axboe
e6db6f9398 io_uring/io-wq: only free worker if it was allocated for creation
We have two types of task_work based creation, one is using an existing
worker to setup a new one (eg when going to sleep and we have no free
workers), and the other is allocating a new worker. Only the latter
should be freed when we cancel task_work creation for a new worker.

Fixes: af82425c6a ("io_uring/io-wq: free worker if task_work creation is canceled")
Reported-by: syzbot+d56ec896af3637bdb7e4@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-08 10:39:17 -07:00
Jens Axboe
af82425c6a io_uring/io-wq: free worker if task_work creation is canceled
If we cancel the task_work, the worker will never come into existance.
As this is the last reference to it, ensure that we get it freed
appropriately.

Cc: stable@vger.kernel.org
Reported-by: 진호 <wnwlsgh98@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-02 16:49:46 -07:00
Rafael Mendonca
996d3efeb0 io-wq: Fix memory leak in worker creation
If the CPU mask allocation for a node fails, then the memory allocated for
the 'io_wqe' struct of the current node doesn't get freed on the error
handling path, since it has not yet been added to the 'wqes' array.

This was spotted when fuzzing v6.1-rc1 with Syzkaller:
BUG: memory leak
unreferenced object 0xffff8880093d5000 (size 1024):
  comm "syz-executor.2", pid 7701, jiffies 4295048595 (age 13.900s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000cb463369>] __kmem_cache_alloc_node+0x18e/0x720
    [<00000000147a3f9c>] kmalloc_node_trace+0x2a/0x130
    [<000000004e107011>] io_wq_create+0x7b9/0xdc0
    [<00000000c38b2018>] io_uring_alloc_task_context+0x31e/0x59d
    [<00000000867399da>] __io_uring_add_tctx_node.cold+0x19/0x1ba
    [<000000007e0e7a79>] io_uring_setup.cold+0x1b80/0x1dce
    [<00000000b545e9f6>] __x64_sys_io_uring_setup+0x5d/0x80
    [<000000008a8a7508>] do_syscall_64+0x5d/0x90
    [<000000004ac08bec>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: 0e03496d19 ("io-wq: use private CPU mask")
Cc: stable@vger.kernel.org
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Link: https://lore.kernel.org/r/20221020014710.902201-1-rafaelmendsr@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-20 05:48:59 -07:00