linux

mirror of https://github.com/torvalds/linux.git synced 2026-03-08 02:44:41 +01:00

Author	SHA1	Message	Date
Jens Axboe	f4d0668b38	io_uring/openclose: fix io_pipe_fixed() slot tracking for specific slots __io_fixed_fd_install() returns 0 on success for non-alloc mode (specific slot), not the slot index. io_pipe_fixed() used this return value directly as the slot index in fds[], which can cause the reported values returned via copy_to_user() to be incorrect, or the error path operating on the incorrect direct descriptor. Fix by computing the actual 0-based slot index (slot - 1) for specific slot mode, while preserving the existing behavior for auto-alloc mode where __io_fixed_fd_install() already returns the allocated index. Cc: stable@vger.kernel.org Fixes: `53db8a71ec` ("io_uring: add support for IORING_OP_PIPE") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-11 20:31:21 -07:00
Linus Torvalds	591beb0e3a	io_uring-bpf-restrictions.4-20260206 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmGJ1kQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpky8EAChIL3uJ5Vmv+oQTxT4EVb1wpc8U/XzXWU5 Q5F9IpZZCGO7+i015Y7iTTqDRixjblRaWpWzZZP8vflWDUS8LESNZLQdcoEnxaiv P367KNPUGwxejcKsu8PvZvfnX6JWSQoNstcDmrwkCF0ND2UUfvvMZyn3uKhkbBRY h5Ehcqkvqc1OJDAWC7+yPzYAmB01uRPQ6sc9/GeujznHPlfbvie4u6gBvvfXeirT 592zbVftINMrm6Twd6zl4n+HNAn+CUoyVMppeeddv5IcyFPm9uz/dLOZBXTz6552 jFYNmB0U4g+SxGXMyqp37YISTALnuY+57y5eXmEAtgkEeE3HrF+F/ZdxQHwXSpo3 T2Lb9IOqFyHtSvq678HZ37JB6aIYbBE/mZdNf8FFFpnPJGb5Ey7d50qPp/ywVq0H p9CahbpkzGUBMsZ+koew0YHiFdWV9tww+/Bnk5dTtn2197uyaHsLdmbf4C36GWke Bk5cwNgU+3DMFAfTiL9m+AIXYsJkBayRJn+hViTrF5AL7gcGiBryGF43FOSKoYuq f0mniDnGSwvn86VZPuZQ6wBRHZPEMR3OlaUXn6XrUU6cYyvMg0pBZV+QHF7zlsSP 2sdfUbPL5TxexF3G8dsxlDIypz9Z6TCoUCfU0WiiUETnCrVNkXfIY846A+w08p0b ejBjzrwRtQ== =CqJq -----END PGP SIGNATURE----- Merge tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring bpf filters from Jens Axboe: "This adds support for both cBPF filters for io_uring, as well as task inherited restrictions and filters. seccomp and io_uring don't play along nicely, as most of the interesting data to filter on resides somewhat out-of-band, in the submission queue ring. As a result, things like containers and systemd that apply seccomp filters, can't filter io_uring operations. That leaves them with just one choice if filtering is critical - filter the actual io_uring_setup(2) system call to simply disallow io_uring. That's rather unfortunate, and has limited us because of it. io_uring already has some filtering support. It requires the ring to be setup in a disabled state, and then a filter set can be applied. This filter set is completely bi-modal - an opcode is either enabled or it's not. Once a filter set is registered, the ring can be enabled. This is very restrictive, and it's not useful at all to systemd or containers which really want both broader and more specific control. This first adds support for cBPF filters for opcodes, which enables tighter control over what exactly a specific opcode may do. As examples, specific support is added for IORING_OP_OPENAT/OPENAT2, allowing filtering on resolve flags. And another example is added for IORING_OP_SOCKET, allowing filtering on domain/type/protocol. These are both common use cases. cBPF was chosen rather than eBPF, because the latter is often restricted in containers as well. These filters are run post the init phase of the request, which allows filters to even dip into data that is being passed in struct in user memory, as the init side of requests make that data stable by bringing it into the kernel. This allows filtering without needing to copy this data twice, or have filters etc know about the exact layout of the user data. The filters get the already copied and sanitized data passed. On top of that support is added for per-task filters, meaning that any ring created with a task that has a per-task filter will get those filters applied when it's created. These filters are inherited across fork as well. Once a filter has been registered, any further added filters may only further restrict what operations are permitted. Filters cannot change the return value of an operation, they can only permit or deny it based on the contents" * tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring: allow registration of per-task restrictions io_uring: add task fork hook io_uring/bpf_filter: add ref counts to struct io_bpf_filter io_uring/bpf_filter: cache lookup table in ctx->bpf_filters io_uring/bpf_filter: allow filtering on contents of struct open_how io_uring/net: allow filtering on IORING_OP_SOCKET data io_uring: add support for BPF filtering for opcode restrictions	2026-02-09 17:31:17 -08:00
Jens Axboe	8768770cf5	io_uring/bpf_filter: allow filtering on contents of struct open_how This adds custom filtering for IORING_OP_OPENAT and IORING_OP_OPENAT2, where the open_how flags, mode, and resolve can be checked by filters. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-27 11:10:46 -07:00
Al Viro	541003b576	rename do_filp_open() to do_file_open() "filp" thing never made sense; seeing that there are exactly 4 callers in the entire tree (and it's neither exported nor even declared in linux//.h), there's no point keeping that ugliness. FWIW, the 'filp' thing did originate in OSD&I; for some reason Tanenbaum decided to call the object representing an opened file 'struct filp', the last letter standing for 'position'. In all Unices, Linux included, the corresponding object had always been 'struct file'... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-13 15:18:07 -05:00
Al Viro	9fa3ec8458	allow incomplete imports of filenames There are two filename-related problems in io_uring and its interplay with audit. Filenames are imported when request is submitted and used when it is processed. Unfortunately, the latter may very well happen in a different thread. In that case the reference to filename is put into the wrong audit_context - that of submitting thread, not the processing one. Audit logics is called by the latter, and it really wants to be able to find the names in audit_context current (== processing) thread. Another related problem is the headache with refcounts - normally all references to given struct filename are visible only to one thread (the one that uses that struct filename). io_uring violates that - an extra reference is stashed in audit_context of submitter. It gets dropped when submitter returns to userland, which can happen simultaneously with processing thread deciding to drop the reference it got. We paper over that by making refcount atomic, but that means pointless headache for everyone. Solution: the notion of partially imported filenames. Namely, already copied from userland, but not exposed to audit yet. io_uring can create that in submitter thread, and complete the import (obtaining the usual reference to struct filename) in processing thread. Object: struct delayed_filename. Primitives for working with it: delayed_getname(&delayed_filename, user_string) - copies the name from userland, returning 0 and stashing the address of (still incomplete) struct filename in delayed_filename on success and returning -E... on error. delayed_getname_uflags(&delayed_filename, user_string, atflags) - similar, in the same relation to delayed_getname() as getname_uflags() is to getname() complete_getname(&delayed_filename) - completes the import of filename stashed in delayed_filename and returns struct filename to caller, emptying delayed_filename. CLASS(filename_complete_delayed, name)(&delayed_filename) - variant of CLASS(filename) with complete_getname() for constructor. dismiss_delayed_filename(&delayed_filename) - destructor; drops whatever might be stashed in delayed_filename, emptying it. putname_to_delayed(&delayed_filename, name) - if name is shared, stashes its copy into delayed_filename and drops the reference to name, otherwise stashes the name itself in there. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-13 15:18:07 -05:00
Prithvi Tambewagh	b14fad5553	io_uring: fix filename leak in __io_openat_prep() __io_openat_prep() allocates a struct filename using getname(). However, for the condition of the file being installed in the fixed file table as well as having O_CLOEXEC flag set, the function returns early. At that point, the request doesn't have REQ_F_NEED_CLEANUP flag set. Due to this, the memory for the newly allocated struct filename is not cleaned up, causing a memory leak. Fix this by setting the REQ_F_NEED_CLEANUP for the request just after the successful getname() call, so that when the request is torn down, the filename will be cleaned up, along with other resources needing cleanup. Reported-by: syzbot+00e61c43eb5e4740438f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=00e61c43eb5e4740438f Tested-by: syzbot+00e61c43eb5e4740438f@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Prithvi Tambewagh <activprithvi@gmail.com> Fixes: `b9445598d8` ("io_uring: openat directly into fixed fd table") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-25 07:58:33 -07:00
Caleb Sander Mateos	5d4c52bfa8	io_uring: don't include filetable.h in io_uring.h io_uring/io_uring.h doesn't use anything declared in io_uring/filetable.h, so drop the unnecessary #include. Add filetable.h includes in .c files previously relying on the transitive include from io_uring.h. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-08 13:20:46 -06:00
Jens Axboe	dd765ba872	fs/pipe: set FMODE_NOWAIT in create_pipe_files() Rather than have the caller set the FMODE_NOWAIT flags for both output files, move it to create_pipe_files() where other f_mode flags are set anyway with stream_open(). With that, both __do_pipe_flags() and io_pipe() can remove the manual setting of the NOWAIT flags. No intended functional changes, just a code cleanup. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/1f0473f8-69f3-4eb1-aa77-3334c6a71d24@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-10 13:16:19 +02:00
Jens Axboe	8bb9d6ccd3	io_uring: finish IOU_OK -> IOU_COMPLETE transition IOU_COMPLETE is more descriptive, in that it explicitly says that the return value means "please post a completion for this request". This patch completes the transition from IOU_OK to IOU_COMPLETE, replacing existing IOU_OK users. This is a purely mechanical change. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-21 08:41:16 -06:00
Jens Axboe	53db8a71ec	io_uring: add support for IORING_OP_PIPE This works just like pipe2(2), except it also supports fixed file descriptors. Used in a similar fashion as for other fd instantiating opcodes (like accept, socket, open, etc), where sqe->file_slot is set appropriately if two direct descriptors are desired rather than a set of normal file descriptors. sqe->addr must be set to a pointer to an array of 2 integers, which is where the fixed/normal file descriptors are copied to. sqe->pipe_flags contains flags, same as what is allowed for pipe2(2). Future expansion of per-op private flags can go in sqe->ioprio, like we do for other opcodes that take both a "syscall" flag set and an io_uring opcode specific flag set. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:06:58 -06:00
Paul Moore	16bae3e137	io_uring: enable audit and restrict cred override for IORING_OP_FIXED_FD_INSTALL We need to correct some aspects of the IORING_OP_FIXED_FD_INSTALL command to take into account the security implications of making an io_uring-private file descriptor generally accessible to a userspace task. The first change in this patch is to enable auditing of the FD_INSTALL operation as installing a file descriptor into a task's file descriptor table is a security relevant operation and something that admins/users may want to audit. The second change is to disable the io_uring credential override functionality, also known as io_uring "personalities", in the FD_INSTALL command. The credential override in FD_INSTALL is particularly problematic as it affects the credentials used in the security_file_receive() LSM hook. If a task were to request a credential override via REQ_F_CREDS on a FD_INSTALL operation, the LSM would incorrectly check to see if the overridden credentials of the io_uring were able to "receive" the file as opposed to the task's credentials. After discussions upstream, it's difficult to imagine a use case where we would want to allow a credential override on a FD_INSTALL operation so we are simply going to block REQ_F_CREDS on IORING_OP_FIXED_FD_INSTALL operations. Fixes: `dc18b89ab1` ("io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL") Signed-off-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/20240123215501.289566-2-paul@paul-moore.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-23 15:25:14 -07:00
Jens Axboe	dc18b89ab1	io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL io_uring can currently open/close regular files or fixed/direct descriptors. Or you can instantiate a fixed descriptor from a regular one, and then close the regular descriptor. But you currently can't turn a purely fixed/direct descriptor into a regular file descriptor. IORING_OP_FIXED_FD_INSTALL adds support for installing a direct descriptor into the normal file table, just like receiving a file descriptor or opening a new file would do. This is all nicely abstracted into receive_fd(), and hence adding support for this is truly trivial. Since direct descriptors are only usable within io_uring itself, it can be useful to turn them into real file descriptors if they ever need to be accessed via normal syscalls. This can either be a transitory thing, or just a permanent transition for a given direct descriptor. By default, new fds are installed with O_CLOEXEC set. The application can disable O_CLOEXEC by setting IORING_FIXED_FD_NO_CLOEXEC in the sqe->install_fd_flags member. Suggested-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-12 07:42:57 -07:00
Christian Brauner	24fa3ae946	file: remove pointless wrapper Only io_uring uses __close_fd_get_file(). All it does is hide current->files but io_uring accesses files_struct directly right now anyway so it's a bit pointless. Just rename pick_file() to file_close_fd_locked() and let io_uring use it. Add a lockdep assert in there that we expect the caller to hold file_lock while we're at it. Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-2-e73ca6f4ea83@kernel.org Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-12-12 14:24:13 +01:00
Christian Brauner	50d910d273	io_uring: use files_lookup_fd_locked() While valid we don't need to open-code rcu dereferences if we're acquiring file_lock anyway. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20231010030615.GO800259@ZenIV Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-19 11:02:49 +02:00
Aleksa Sarai	72dbde0f2a	io_uring: correct check for O_TMPFILE O_TMPFILE is actually __O_TMPFILE\|O_DIRECTORY. This means that the old check for whether RESOLVE_CACHED can be used would incorrectly think that O_DIRECTORY could not be used with RESOLVE_CACHED. Cc: stable@vger.kernel.org # v5.12+ Fixes: `3a81fd0204` ("io_uring: enable LOOKUP_CACHED path resolution for filename lookups") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20230807-resolve_cached-o_tmpfile-v3-1-e49323e1ef6f@cyphar.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-07 12:34:23 -06:00
Amir Goldstein	7b8c9d7bb4	fsnotify: move fsnotify_open() hook into do_dentry_open() fsnotify_open() hook is called only from high level system calls context and not called for the very many helpers to open files. This may makes sense for many of the special file open cases, but it is inconsistent with fsnotify_close() hook that is called for every last fput() of on a file object with FMODE_OPENED. As a result, it is possible to observe ACCESS, MODIFY and CLOSE events without ever observing an OPEN event. Fix this inconsistency by replacing all the fsnotify_open() hooks with a single hook inside do_dentry_open(). If there are special cases that would like to opt-out of the possible overhead of fsnotify() call in fsnotify_open(), they would probably also want to avoid the overhead of fsnotify() call in the rest of the fsnotify hooks, so they should be opening that file with the __FMODE_NONOTIFY flag. However, in the majority of those cases, the s_fsnotify_connectors optimization in fsnotify_parent() would be sufficient to avoid the overhead of fsnotify() call anyway. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230611122429.1499617-1-amir73il@gmail.com>	2023-06-12 10:43:45 +02:00
Dylan Yudaken	0ffae640ad	io_uring: always go async for unsupported open flags No point in issuing -> return -EAGAIN -> go async, when it can be done upfront. Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20230127135227.3646353-5-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:26 -07:00
Stefan Metzmacher	f2ccb5aed7	io_uring: make io_kiocb_to_cmd() typesafe We need to make sure (at build time) that struct io_cmd_data is not casted to a structure that's larger. Signed-off-by: Stefan Metzmacher <metze@samba.org> Link: https://lore.kernel.org/r/c024cdf25ae19fc0319d4180e2298bade8ed17b8.1660201408.git.metze@samba.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-12 17:01:00 -06:00
Jens Axboe	f110ed8498	io_uring: split out fixed file installation and removal Put it with the filetable code, which is where it belongs. While doing so, have the helpers take a ctx rather than an io_kiocb. It doesn't make sense to use a request, as it's not an operation on the request itself. It applies to the ring itself. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:16 -06:00
Pavel Begunkov	27a9d66fec	io_uring: kill extra io_uring_types.h includes io_uring/io_uring.h already includes io_uring_types.h, no need to include it every time. Kill it in a bunch of places, it prepares us for following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/94d8c943fbe0ef949981c508ddcee7fc1c18850f.1655384063.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:14 -06:00
Jens Axboe	7357298448	io_uring: move rsrc related data, core, and commands Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:12 -06:00
Jens Axboe	cd40cae29e	io_uring: split out open/close operations Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:11 -06:00

22 commits