linux

mirror of https://github.com/torvalds/linux.git synced 2026-03-08 02:24:32 +01:00

Author	SHA1	Message	Date
Linus Torvalds	78d964c47f	block-7.0-20260227 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmhxjgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqZND/9ZHjUX2BhtEHWayTLpgBl5JYeYvNnxgeKw MsQjmkDBJDdpxz5s5Y6zY+OKDhWv+o4VgfgykK5bpn5/kXgwqvKSK9QroVSl6OPk JTFo3PQ0KhuZQyBxX+zZPNjgV/Kzop4j7NuNrovrIEJtvgVZuwJx/0h/DdpLez3N +l9jWQcjLjFJHHnTuV9Tv1jqhcPU5vxkkEWlZgwkWMxdghMWWw5JDHxGt1w3g6c8 4Qmn1G3SqAPEItQ+ugU+BL+vYVKVm9/Gaa4JWJTm8d1ADmQxd1xXfkDLSBL2L4EQ VU+lkHwFDhx8hjFSeOTps90lTlrUzIaAqYn532VVsU/RMWFzcdUbPJdED7QddUNG gGj40sUu5XaMoBKzCRJ6KznYMCK3x15gptz8MAj2hfqcUk+GRLKO9crXj0tU35iP sBVMlC66+nVDNQRybqVUMTpizARoFHiRCKSwaFjbsnoe1DkIayn8ghDLhMm9W8Ea hgroXlmgafB7Ad/Kt5SSYv0Nw644Zw4bLTrIO1G+OuQf6DRMGt6z5dmmo9arsgjU fKNs7Dwb+1d0cpKFNHpnQOc4htO3G1B1stGT6WEJUpd5MXDurw4VoNdaRkOSbi75 hd+VM+dyJEDHpc/QuRSFBQQqbFLJ4Hfi3z/+oUP8x1FVbDb2u7b+2H7BFgrn08Y+ EEKRaIQ6PQ== =gtEI -----END PGP SIGNATURE----- Merge tag 'block-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: "Two sets of fixes, one for drbd, and one for the zoned loop driver" * tag 'block-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: zloop: check for spurious options passed to remove zloop: advertise a volatile write cache drbd: fix null-pointer dereference on local read error drbd: Replace deprecated strcpy with strscpy drbd: fix "LOGIC BUG" in drbd_al_begin_io_nonblock()	2026-02-27 10:42:02 -08:00
Christoph Hellwig	3c4617117a	zloop: check for spurious options passed to remove Zloop uses a command option parser for all control commands, but most options are only valid for adding a new device. Check for incorrectly specified options in the remove handler. Fixes: `eb0570c7df` ("block: new zoned loop block device driver") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-24 14:50:18 -07:00
Christoph Hellwig	6acf7860dc	zloop: advertise a volatile write cache Zloop is file system backed and thus needs to sync the underlying file system to persist data. Set BLK_FEAT_WRITE_CACHE so that the block layer actually send flush commands, and fix the flush implementation as sync_filesystem requires s_umount to be held and the code currently misses that. Fixes: `eb0570c7df` ("block: new zoned loop block device driver") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-24 14:50:18 -07:00
Kees Cook	189f164e57	Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-22 08:26:33 -08:00
Linus Torvalds	32a92f8c89	Convert more 'alloc_obj' cases to default GFP_KERNEL arguments This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 20:03:00 -08:00
Linus Torvalds	323bbfcf1e	Convert 'alloc_flex' family to use the new default GFP_KERNEL argument This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/$alloc_objs(.*$, GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Linus Torvalds	8934827db5	kmalloc_obj treewide refactoring for v7.0-rc1 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCaZl14wAKCRA2KwveOeQk uz8aAQCBFLYlij3Y3ivVADkBxuVF3xECaznFya41ENYsBwlHdwEArXqMyNrw+DiG TvWCK/tiddNmGIRpI2sxBFzyRpsHfAY= =rVD3 -----END PGP SIGNATURE----- Merge tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kmalloc_obj conversion from Kees Cook: "This does the tree-wide conversion to kmalloc_obj() and friends using coccinelle, with a subsequent small manual cleanup of whitespace alignment that coccinelle does not handle. This uncovered a clang bug in __builtin_counted_by_ref(), so the conversion is preceded by disabling that for current versions of clang. The imminent clang 22.1 release has the fix. I've done allmodconfig build tests for x86_64, arm64, i386, and arm. I did defconfig builds for alpha, m68k, mips, parisc, powerpc, riscv, s390, sparc, sh, arc, csky, xtensa, hexagon, and openrisc" * tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kmalloc_obj: Clean up after treewide replacements treewide: Replace kmalloc with kmalloc_obj for non-scalar types compiler_types: Disable __builtin_counted_by_ref for Clang	2026-02-21 11:02:58 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Christoph Böhmwalder	0d195d3b20	drbd: fix null-pointer dereference on local read error In drbd_request_endio(), READ_COMPLETED_WITH_ERROR is passed to __req_mod() with a NULL peer_device: __req_mod(req, what, NULL, &m); The READ_COMPLETED_WITH_ERROR handler then unconditionally passes this NULL peer_device to drbd_set_out_of_sync(), which dereferences it, causing a null-pointer dereference. Fix this by obtaining the peer_device via first_peer_device(device), matching how drbd_req_destroy() handles the same situation. Cc: stable@vger.kernel.org Reported-by: Tuo Li <islituo@gmail.com> Link: https://lore.kernel.org/linux-block/20260104165355.151864-1-islituo@gmail.com Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-20 07:09:06 -07:00
Thorsten Blum	81b1f046ff	drbd: Replace deprecated strcpy with strscpy strcpy() has been deprecated [1] because it performs no bounds checking on the destination buffer, which can lead to buffer overflows. Replace it with the safer strscpy(). No functional changes. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy [1] Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-19 08:22:02 -07:00
Lars Ellenberg	ab140365fb	drbd: fix "LOGIC BUG" in drbd_al_begin_io_nonblock() Even though we check that we "should" be able to do lc_get_cumulative() while holding the device->al_lock spinlock, it may still fail, if some other code path decided to do lc_try_lock() with bad timing. If that happened, we logged "LOGIC BUG for enr=...", but still did not return an error. The rest of the code now assumed that this request has references for the relevant activity log extents. The implcations are that during an active resync, mutual exclusivity of resync versus application IO is not guaranteed. And a potential crash at this point may not realizs that these extents could have been target of in-flight IO and would need to be resynced just in case. Also, once the request completes, it will give up activity log references it does not even hold, which will trigger a BUG_ON(refcnt == 0) in lc_put(). Fix: Do not crash the kernel for a condition that is harmless during normal operation: also catch "e->refcnt == 0", not only "e == NULL" when being noisy about "al_complete_io() called on inactive extent %u\n". And do not try to be smart and "guess" whether something will work, then be surprised when it does not. Deal with the fact that it may or may not work. If it does not, remember a possible "partially in activity log" state (only possible for requests that cross extent boundaries), and return an error code from drbd_al_begin_io_nonblock(). A latter call for the same request will then resume from where we left off. Cc: stable@vger.kernel.org Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-19 08:21:53 -07:00
Govindarajulu Varadarajan	ea129e55c9	io_uring: Add size check for sqe->cmd For SQE128, sqe->cmd provides 80 bytes for uring_cmd. Add macro to check if size of user struct does not exceed 80 bytes at compile time. User doesn't have to track this manually during development. Replace io_uring_sqe_cmd() inline func with macro and add io_uring_sqe128_cmd() which checks struct size for 16 bytes cmd and 80 bytes cmd respectively. Signed-off-by: Govindarajulu Varadarajan <govind.varadar@gmail.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-19 07:26:26 -07:00
Linus Torvalds	99dfe2d4da	block-7.0-20260216 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmTrNEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjsOEACpUk78nFmLbEgJ5UH8+Z6daDzgoasb5YRT Mj4g+cM2J9Xc9JxgX8QR3F2EfolweTo/H6xlhnlPDcnpB+b3qj4WHuijR/wghphj MBKKqNXTEC+j0ra9uk8h3RmIKaK79xcUup7XfTcuWdYpSsMyYE/m/rck3thw6yNL OAjmWLTP4IwYzXip2AB+J7JbDDOV/qWK0aOYdWHCdbn9X8bBel/HDOITWPdybnSR DNKBeoi/Yv8KwA+axogqP213ifc3Xr6ejRDkqDOf1bgXsKkELkIxcfog6MhfHhxq 3Cqlj1pBuIBxGVU7wmBTDqL+aHrVb983tcA5x1NGZIzJao64b026o5DUhNPprwrZ HveU1MZ2jarAjAz85gE3S4oUY+6d47ooytfvO548Zp/1LY+fOxnjYqq5ksh8BBLk WyjfkJScgr17Z4SVOK8a9GboWO2WKiQJRg+hZ/TWX5fyvu5g9sbRasdwxnp1sl52 EayzkhYFq/Rdd8slwTIaccVUPl/xeEDeRG+jTJ+4Fj54TihKiJzXVsxDkSWKf46V CWmzDx+n6MlGPm9mShSERZ7HJh3VcSp4No/HAjf93u9/UXwubK/SKiV71nhpgJMf 9bWS2G3wPx/5LoME95YkF+CSgs0e/ROUusfGd8X6nIz9EBGzeabCG/mjqd5adC09 OZahOuqrIg== =PVoY -----END PGP SIGNATURE----- Merge tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull more block updates from Jens Axboe: - Fix partial IOVA mapping cleanup in error handling - Minor prep series ignoring discard return value, as the inline value is always known - Ensure BLK_FEAT_STABLE_WRITES is set for drbd - Fix leak of folio in bio_iov_iter_bounce_read() - Allow IOC_PR_READ_* for read-only open - Another debugfs deadlock fix - A few doc updates * tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: blk-mq: use NOIO context to prevent deadlock during debugfs creation blk-stat: convert struct blk_stat_callback to kernel-doc block: fix enum descriptions kernel-doc block: update docs for bio and bvec_iter block: change return type to void nvmet: ignore discard return value md: ignore discard return value block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova block: fix folio leak in bio_iov_iter_bounce_read() block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ drbd: always set BLK_FEAT_STABLE_WRITES	2026-02-17 08:48:45 -08:00
Linus Torvalds	8c0901b6f9	configfs changes for v7.0 -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEEsH5R1a/fCoV1sAS4bgaPnkoY3cFAmmMPp0WHGEuaGluZGJv cmdAa2VybmVsLm9yZwAKCRDhuBo+eShjdx9+D/4hz5pC/L0w21231hz00CLJqHps Mv4uPhy3vd0QqQ8/S25gQNesCirjGuC57NSQ4K1O3+UBJKaPNEnGJYS6nlc8/vsI 4heTR+F7QPu34RD/kgzeUu03/VhQyhQzHx10+e3qiQBEn6lwKqfZaKHnsUD5M2Zk eL8LydI87rXoC9TL5ASPEq33jB7d7ec8uH2fgPhUbysTmiFTIJkaOvIv6ukRo0y7 skdSJemXRjlknXqrhLtAxe5Bt7Ycq1BzEaPYc3xRGXT1B/jp/fJxhfK9AHc5ej+T 7A/Ewjdf9IEHZYKGnUDappXC/7SWsZw47BDrqljj/6a+aODYXlt+E4q1kNBBnA4L I8oVMf485BaZhLbpWdtVTztlLzO6DQg5MvN5KWxdhbz4JSn54eumLmvSjvvPfi0S FhRfR0+/vAvY7nSzZ2FXLQWt3Yg34mUUQUkJWb09TcNSts7WpdwcJP7gEwQSSVif wtEL65CNGoJlKH+KmImHFh4hINrwac53Jl5+qCbj+sfXmCRehb2XPSOZCHa+ivXq +0p9hneOgaPOfFZ67GSU6OAGtnT5X0tL69PiK3gvVD3P/t9W7/Qr8KF4E1Ksc00H CRxzhk2GkRgLaNR00xIY2L0vaeLO6R1Dsg+HFVJGMCM+8fm/Swsy3Mt0/LpG7Rl7 8EOqsbwRIBWySEz6og== =eFAh -----END PGP SIGNATURE----- Merge tag 'configfs-for-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/a.hindborg/linux Pull configfs updates from Andreas Hindborg: - Switch the configfs rust bindings to use c string literals provided by the compiler, rather than a macro - A follow up on constifying `configfs_item_operations`, applying the change to the configfs sample * tag 'configfs-for-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/a.hindborg/linux: samples: configfs: Constify struct configfs_item_operations and configfs_group_operations rust: configfs: replace `kernel::c_str!` with C-Strings	2026-02-12 14:01:38 -08:00
Linus Torvalds	136114e0ab	mm.git review status for linus..mm-nonmm-stable Total patches: 107 Reviews/patch: 1.07 Reviewed rate: 67% - The 2 patch series "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" from Heming Zhao saves disk space by teaching ocfs2 to reclaim suballocator block group space. - The 4 patch series "Add ARRAY_END(), and use it to fix off-by-one bugs" from Alejandro Colomar adds the ARRAY_END() macro and uses it in various places. - The 2 patch series "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" from Pnina Feder makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size. - The 7 patch series "kallsyms: Prevent invalid access when showing module buildid" from Petr Mladek cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces. - The 3 patch series "Address page fault in ima_restore_measurement_list()" from Harshit Mogalapalli fixes a kexec-related crash that can occur when booting the second-stage kernel on x86. - The 6 patch series "kho: ABI headers and Documentation updates" from Mike Rapoport updates the kexec handover ABI documentation. - The 4 patch series "Align atomic storage" from Finn Thain adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh. - The 2 patch series "kho: clean up page initialization logic" from Pratyush Yadav simplifies the page initialization logic in kho_restore_page(). - The 6 patch series "Unload linux/kernel.h" from Yury Norov moves several things out of kernel.h and into more appropriate places. - The 7 patch series "don't abuse task_struct.group_leader" from Oleg Nesterov removes the usage of ->group_leader when it is "obviously unnecessary". - The 5 patch series "list private v2 & luo flb" from Pasha Tatashin adds some infrastructure improvements to the live update orchestrator. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY4giAAKCRDdBJ7gKXxA jgusAQDnKkP8UWTqXPC1jI+OrDJGU5ciAx8lzLeBVqMKzoYk9AD/TlhT2Nlx+Ef6 0HCUHUD0FMvAw/7/Dfc6ZKxwBEIxyww= =mmsH -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...	2026-02-12 12:13:01 -08:00
Linus Torvalds	4cff5c05e0	mm.git review status for linus..mm-stable Everything: Total patches: 325 Reviews/patch: 1.39 Reviewed rate: 72% Excluding DAMON: Total patches: 262 Reviews/patch: 1.63 Reviewed rate: 82% Excluding DAMON and zram: Total patches: 248 Reviews/patch: 1.72 Reviewed rate: 86% - The 14 patch series "powerpc/64s: do not re-activate batched TLB flush" from Alexander Gordeev makes arch_{enter\|leave}_lazy_mmu_mode() nest properly. It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process. - The 7 patch series "zram: introduce compressed data writeback" from Richard Chang and Sergey Senozhatsky implements data compression for zram writeback. - The 8 patch series "mm: folio_zero_user: clear page ranges" from David Hildenbrand adds clearing of contiguous page ranges for hugepages. Large improvements during demand faulting are demonstrated. - The 2 patch series "memcg cleanups" from Chen Ridong tideis up some memcg code. - The 12 patch series "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" from SeongJae Park improves DAMOS stat's provided information, deterministic control, and readability. - The 3 patch series "selftests/mm: hugetlb cgroup charging: robustness fixes" from Li Wang fixes a few issues in the hugetlb cgroup charging selftests. - The 5 patch series "Fix va_high_addr_switch.sh test failure - again" from Chunyu Hu addresses several issues in the va_high_addr_switch test. - The 5 patch series "mm/damon/tests/core-kunit: extend existing test scenarios" from Shu Anzai improves the KUnit test coverage for DAMON. - The 2 patch series "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" from Shivank Garg fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN. - The 29 patch series "arch, mm: consolidate hugetlb early reservation" from Mike Rapoport reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb. - The 9 patch series "mm: clean up anon_vma implementation" from Lorenzo Stoakes cleans up the anon_vma implementation in various ways. - The 3 patch series "tweaks for __alloc_pages_slowpath()" from Vlastimil Babka does a little streamlining of the page allocator's slowpath code. - The 8 patch series "memcg: separate private and public ID namespaces" from Shakeel Butt cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace. - The 6 patch series "mm: hugetlb: allocate frozen gigantic folio" from Kefeng Wang cleans up the allocation of frozen folios and avoids some atomic refcount operations. - The 11 patch series "mm/damon: advance DAMOS-based LRU sorting" from SeongJae Park improves DAMOS's movement of memory betewwn the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals. - The 18 patch series "Support page table check on PowerPC" from Andrew Donnellan makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc. - The 3 patch series "nodemask: align nodes_and{,not} with underlying bitmap ops" from Yury Norov makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code. - The 5 patch series "mm/damon: hide kdamond and kdamond_lock from API callers" from SeongJae Park cleans up some DAMON internal interfaces. - The 4 patch series "mm/khugepaged: cleanups and scan limit fix" from Shivank Garg does some cleanup work in khupaged and fixes a scan limit accounting issue. - The 24 patch series "mm: balloon infrastructure cleanups" from David Hildenbrand goes to town on the balloon infrastructure and its page migration function. Mainly cleanups, also some locking simplification. - The 2 patch series "mm/vmscan: add tracepoint and reason for kswapd_failures reset" from Jiayuan Chen adds additional tracepoints to the page reclaim code. - The 3 patch series "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" from Marco Crivellari is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues. - The 9 patch series "Various mm kselftests improvements/fixes" from Kevin Brodsky provides various unrelated improvements/fixes for the mm kselftests. - The 5 patch series "mm: accelerate gigantic folio allocation" from Kefeng Wang greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig(). - The 5 patch series "selftests/damon: improve leak detection and wss estimation reliability" from SeongJae Park improves the reliability of two of the DAMON selftests. - The 8 patch series "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" from SeongJae Park does some cleanup work in the core DAMON code. - The 8 patch series "Docs/mm/damon: update intro, modules, maintainer profile, and misc" from SeongJae Park performs maintenance work on the DAMON documentation. - The 10 patch series "mm: add and use vma_assert_stabilised() helper" from Lorenzo Stoakes refactors and cleans up the core VMA code. The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires. - The 19 patch series "mm, swap: swap table phase II: unify swapin use" from Kairui Song removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark. - The 8 patch series "enable PT_RECLAIM on more 64-bit architectures" from Qi Zheng makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, um, Various cleanups were performed along the way. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY1HfAAKCRDdBJ7gKXxA jqhZAP9H8ZlKKqCEgnr6U5XXmJ63Ep2FDQpl8p35yr9yVuU9+gEAgfyWiJ43l1fP rT0yjsUW3KQFBi/SEA3R6aYarmoIBgI= =+HLt -----END PGP SIGNATURE----- Merge tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "powerpc/64s: do not re-activate batched TLB flush" makes arch_{enter\|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev) It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process. - "zram: introduce compressed data writeback" implements data compression for zram writeback (Richard Chang and Sergey Senozhatsky) - "mm: folio_zero_user: clear page ranges" adds clearing of contiguous page ranges for hugepages. Large improvements during demand faulting are demonstrated (David Hildenbrand) - "memcg cleanups" tidies up some memcg code (Chen Ridong) - "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" improves DAMOS stat's provided information, deterministic control, and readability (SeongJae Park) - "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few issues in the hugetlb cgroup charging selftests (Li Wang) - "Fix va_high_addr_switch.sh test failure - again" addresses several issues in the va_high_addr_switch test (Chunyu Hu) - "mm/damon/tests/core-kunit: extend existing test scenarios" improves the KUnit test coverage for DAMON (Shu Anzai) - "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN (Shivank Garg) - "arch, mm: consolidate hugetlb early reservation" reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb (Mike Rapoport) - "mm: clean up anon_vma implementation" cleans up the anon_vma implementation in various ways (Lorenzo Stoakes) - "tweaks for __alloc_pages_slowpath()" does a little streamlining of the page allocator's slowpath code (Vlastimil Babka) - "memcg: separate private and public ID namespaces" cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace (Shakeel Butt) - "mm: hugetlb: allocate frozen gigantic folio" cleans up the allocation of frozen folios and avoids some atomic refcount operations (Kefeng Wang) - "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement of memory betewwn the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals (SeongJae Park) - "Support page table check on PowerPC" makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan) - "nodemask: align nodes_and{,not} with underlying bitmap ops" makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code (Yury Norov) - "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up some DAMON internal interfaces (SeongJae Park) - "mm/khugepaged: cleanups and scan limit fix" does some cleanup work in khupaged and fixes a scan limit accounting issue (Shivank Garg) - "mm: balloon infrastructure cleanups" goes to town on the balloon infrastructure and its page migration function. Mainly cleanups, also some locking simplification (David Hildenbrand) - "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds additional tracepoints to the page reclaim code (Jiayuan Chen) - "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues (Marco Crivellari) - "Various mm kselftests improvements/fixes" provides various unrelated improvements/fixes for the mm kselftests (Kevin Brodsky) - "mm: accelerate gigantic folio allocation" greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig() (Kefeng Wang) - "selftests/damon: improve leak detection and wss estimation reliability" improves the reliability of two of the DAMON selftests (SeongJae Park) - "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" does some cleanup work in the core DAMON code (SeongJae Park) - "Docs/mm/damon: update intro, modules, maintainer profile, and misc" performs maintenance work on the DAMON documentation (SeongJae Park) - "mm: add and use vma_assert_stabilised() helper" refactors and cleans up the core VMA code. The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires (Lorenzo Stoakes) - "mm, swap: swap table phase II: unify swapin use" removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark (Kairui Song) - "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, and um. Various cleanups were performed along the way (Qi Zheng) * tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits) mm/memory: handle non-split locks correctly in zap_empty_pte_table() mm: move pte table reclaim code to memory.c mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config um: mm: enable MMU_GATHER_RCU_TABLE_FREE parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE mips: mm: enable MMU_GATHER_RCU_TABLE_FREE LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles zsmalloc: make common caches global mm: add SPDX id lines to some mm source files mm/zswap: use %pe to print error pointers mm/vmscan: use %pe to print error pointers mm/readahead: fix typo in comment mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file() mm: refactor vma_map_pages to use vm_insert_pages mm/damon: unify address range representation with damon_addr_range mm/cma: replace snprintf with strscpy in cma_new_area ...	2026-02-12 11:32:37 -08:00
Christoph Böhmwalder	2ebc8d600f	drbd: always set BLK_FEAT_STABLE_WRITES DRBD requires stable pages because it may read the same bio data multiple times for local disk I/O and network transmission, and in some cases for calculating checksums. The BLK_FEAT_STABLE_WRITES flag is set when the device is first created, but blk_set_stacking_limits() clears it whenever a backing device is attached. In some cases the flag may be inherited from the backing device, but we want it to be enabled at all times. Unconditionally re-enable BLK_FEAT_STABLE_WRITES in drbd_reconsider_queue_parameters() after the queue parameter negotiations. Also, document why we want this flag enabled in the first place. Fixes: `1a02f3a73f` ("block: move the stable_writes flag to queue_limits") Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-11 10:35:56 -07:00
Linus Torvalds	0c00ed308d	for-7.0/block-20260206 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmGLwcQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpv+TD/48S2HTnMhmW6AtFYWErQ+sEKXpHrxbYe7S +qR8/g/T+QSfhfqPwZEuagndFKtIP3LJfaXGSP1Lk1RfP9NLQy91v33Ibe4DjHkp etWSfnMHA9MUAoWKmg8EvncB2G+ZQFiYCpjazj5tKHD9S2+psGMuL8kq6qzMJE83 uhpb8WutUl4aSIXbMSfyGlwBhI1MjjRbbWlIBmg4yC8BWt1sH8Qn2L2GNVylEIcX U8At3KLgPGn0axSg4yGMAwTqtGhL/jwdDyeczbmRlXuAr4iVL9UX/yADCYkazt6U ttQ2/H+cxCwfES84COx9EteAatlbZxo6wjGvZ3xOMiMJVTjYe1x6Gkcckq+LrZX6 tjofi2KK78qkrMXk1mZMkZjpyUWgRtCswhDllbQyqFs0SwzQtno2//Rk8HU9dhbt pkpryDbGFki9X3upcNyEYp5TYflpW6YhAzShYgmE6KXim2fV8SeFLviy0erKOAl+ fwjTE6KQ5QoQv0s3WxkWa4lREm34O6IHrCUmbiPm5CruJnQDhqAN2QZIDgYC4WAf 0gu9cR/O4Vxu7TQXrumPs5q+gCyDU0u0B8C3mG2s+rIo+PI5cVZKs2OIZ8HiPo0F x73kR/pX3DMe35ZQkQX22ymMuowV+aQouDLY9DTwakP5acdcg7h7GZKABk6VLB06 gUIsnxURiQ== =jNzW -----END PGP SIGNATURE----- Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - Support for batch request processing for ublk, improving the efficiency of the kernel/ublk server communication. This can yield nice 7-12% performance improvements - Support for integrity data for ublk - Various other ublk improvements and additions, including a ton of selftests additions and updated - Move the handling of blk-crypto software fallback from below the block layer to above it. This reduces the complexity of dealing with bio splitting - Series fixing a number of potential deadlocks in blk-mq related to the queue usage counter and writeback throttling and rq-qos debugfs handling - Add an async_depth queue attribute, to resolve a performance regression that's been around for a qhilw related to the scheduler depth handling - Only use task_work for IOPOLL completions on NVMe, if it is necessary to do so. An earlier fix for an issue resulted in all these completions being punted to task_work, to guarantee that completions were only run for a given io_uring ring when it was local to that ring. With the new changes, we can detect if it's necessary to use task_work or not, and avoid it if possible. - rnbd fixes: - Fix refcount underflow in device unmap path - Handle PREFLUSH and NOUNMAP flags properly in protocol - Fix server-side bi_size for special IOs - Zero response buffer before use - Fix trace format for flags - Add .release to rnbd_dev_ktype - MD pull requests via Yu Kuai - Fix raid5_run() to return error when log_init() fails - Fix IO hang with degraded array with llbitmap - Fix percpu_ref not resurrected on suspend timeout in llbitmap - Fix GPF in write_page caused by resize race - Fix NULL pointer dereference in process_metadata_update - Fix hang when stopping arrays with metadata through dm-raid - Fix any_working flag handling in raid10_sync_request - Refactor sync/recovery code path, improve error handling for badblocks, and remove unused recovery_disabled field - Consolidate mddev boolean fields into mddev_flags - Use mempool to allocate stripe_request_ctx and make sure max_sectors is not less than io_opt in raid5 - Fix return value of mddev_trylock - Fix memory leak in raid1_run() - Add Li Nan as mdraid reviewer - Move phys_vec definitions to the kernel types, mostly in preparation for some VFIO and RDMA changes - Improve the speed for secure erase for some devices - Various little rust updates - Various other minor fixes, improvements, and cleanups * tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits) blk-mq: ABI/sysfs-block: fix docs build warnings selftests: ublk: organize test directories by test ID block: decouple secure erase size limit from discard size limit block: remove redundant kill_bdev() call in set_blocksize() blk-mq: add documentation for new queue attribute async_dpeth block, bfq: convert to use request_queue->async_depth mq-deadline: covert to use request_queue->async_depth kyber: covert to use request_queue->async_depth blk-mq: add a new queue sysfs attribute async_depth blk-mq: factor out a helper blk_mq_limit_depth() blk-mq-sched: unify elevators checking for async requests block: convert nr_requests to unsigned int block: don't use strcpy to copy blockdev name blk-mq-debugfs: warn about possible deadlock blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs() blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos() blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static blk-rq-qos: fix possible debugfs_mutex deadlock blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter ...	2026-02-09 17:57:21 -08:00
Linus Torvalds	240b8d8227	One RBD and two CephFS fixes which address potential oopses. The RBD thing is more of a rare edge case that pops up in our CI, while the two CephFS scenarios are regressions that were reported by users and can be triggered trivially in normal operation. All marked for stable. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmmGC6kTHGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzi+csB/9WgySi6UKkZM6w9A5HnkUnCHXf8cS2 RuYv0kBB0zv9SorBipve6PdxVQoqyCc6wuwbiUUhKGmwfQp6ubguh4vPK5zX98Cl tjPcDDUzmTvWvOgr3NEzD3T838hkS/K0FocR+ratoMoZ2Hl8K8NSXj53vAFrHT49 v0hP5YS3UaNeqSA4NjFeMtEoRWDJJrUyim5xtZWAAlL7gEsPkHzmx1NlMYa5Qe22 mmgjcU29zKXjVs96rQqAu0W5jL+M7TJfssy/Pof7t2wVV15F0VSrdqU6KlmLNDOT P8QMolYD6WlEG8RY/ryBzhS6eqTr4fbREuJLLVzLxPQKDeJrzyhumT++ =9NmM -----END PGP SIGNATURE----- Merge tag 'ceph-for-6.19-rc9' of https://github.com/ceph/ceph-client Pull ceph fixes from Ilya Dryomov: "One RBD and two CephFS fixes which address potential oopses. The RBD thing is more of a rare edge case that pops up in our CI, while the two CephFS scenarios are regressions that were reported by users and can be triggered trivially in normal operation. All marked for stable" * tag 'ceph-for-6.19-rc9' of https://github.com/ceph/ceph-client: ceph: fix NULL pointer dereference in ceph_mds_auth_match() ceph: fix oops due to invalid pointer for kfree() in parse_longname() rbd: check for EOD after exclusive lock is ensured to be held	2026-02-06 10:34:17 -08:00
Linus Torvalds	06bc4e2631	block-6.19-20260205 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmFEd8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvaYD/4jQ5jb3h4ytaf5P36+5jxW9BL/JJI6n87J /KU+a7x8AzvgyJu6woKy3LlBSrOLLgootKz7bjwKRvyxNYtYngdmCIHQPXYnABhT rJEQpiYBPjMVEllhlEECknbrl8u5NwuUpbG/LGf8NR8SSqMBGJdjwpvcF0bd7V3V BpS4bEla3tkEiVZQLYNxyFLNleBbRW+rZB8jaUvrDuILZe2W22dW5cDXLx/jo0JI +RQch0fXa26dNmIJMWpmPq+PTwFWtxoZUdPYxsNN2UAcR3W0fLOeWioSRQJqunwo rGemiqL0UiC20mxOXWhJUENm9GZtJIJuOVvQd4gMZVwdS9gVMmmPck90G5XNOTH4 BT1qQY+OSCd4xDNo/MozC6qSC/01mR525T278Y0cpwUvZDGK1Eb5dKyG/NncSASL zlKwBfC86M9J+nrDUSxBMXxYEfu6LnH4yiJuEWMmLxNBA98P7rsOUutPhgSfEKfy jJPuZNx4Mnmh5tu4c2C+IHUxpd0l5K3XZ4i1m/WF2JcMkWDZcLl9oHlbZY0f3GMb 7Lhc9xdHr/I491dzQj2mK5ix3PP6eCEhUnbND8gzL2WGy1DULfhaFdYRWO9EAlQu z7TxrKhvThBgA1P38GL6ALFLhkjkKjE4kV17OGjRPinx4HLmCAPtWEjr3A4wFWAH cXhh1gWzZQ== =0oAC -----END PGP SIGNATURE----- Merge tag 'block-6.19-20260205' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - Revert of a change for loop, which caused regressions for some users (Actually revert of two commits, where one is just an existing fix for the offending commit) - NVMe pull via Keith: - Fix NULL pointer access setting up dma mappings - Fix invalid memory access from malformed TCP PDU * tag 'block-6.19-20260205' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: loop: revert exclusive opener loop status change nvmet-tcp: add bounds checks in nvmet_tcp_build_pdu_iovec nvme-pci: handle changing device dma map requirements	2026-02-05 15:00:53 -08:00
Jens Axboe	a6abd64e14	loop: revert exclusive opener loop status change This commit effectively reverts the following two commits: `2704024d83` ("loop: add missing bd_abort_claiming in loop_set_status") `08e136ebd1` ("loop: don't change loop device under exclusive opener in loop_set_status") as there are reports of them causing issues with unmounting. As we're close to the 6.19 kernel release and the original author hasn't taken a closer look at this yet, revert them for release. Reported-by: nokangaroo <nokangaroo@aon.at> Link: https://lore.kernel.org/all/62de4453-17e8-47f6-a10b-39bf5a49fdee@leemhuis.info/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-05 09:26:53 -07:00
Ilya Dryomov	bd3884a204	rbd: check for EOD after exclusive lock is ensured to be held Similar to commit `870611e487` ("rbd: get snapshot context after exclusive lock is ensured to be held"), move the "beyond EOD" check into the image request state machine so that it's performed after exclusive lock is ensured to be held. This avoids various race conditions which can arise when the image is shrunk under I/O (in practice, mostly readahead). In one such scenario rbd_assert(objno < rbd_dev->object_map_size); can be triggered if a close-to-EOD read gets queued right before the shrink is initiated and the EOD check is performed against an outdated mapping_size. After the resize is done on the server side and exclusive lock is (re)acquired bringing along the new (now shrunk) object map, the read starts going through the state machine and rbd_obj_may_exist() gets invoked on an object that is out of bounds of rbd_dev->object_map array. Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@linux.dev>	2026-02-03 21:00:22 +01:00
Sergey Senozhatsky	6efc548d8a	zram: rename init_lock to dev_lock init_lock has completely outgrown its initial purpose and is no longer used only to "prevent concurrent execution of device init" as the stale comment suggests. The scope of this lock is much bigger now. These days this lock (rw_semaphore) controls how a task owns the corresponding zram device: either in shared mode or in exclusive mode. All zram device attribute writes should own the device in exclusive mode, which synchronizes these tasks and prevents, for example, concurrent execution of recompression and writeback. All zram device attribute reads should own the device in shared mode. Rename the lock to dev_lock to better reflect its current purpose. Link: https://lkml.kernel.org/r/20260115080807.3957860-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:36 -08:00
Caleb Sander Mateos	491af20b3c	ublk: remove "can't touch 'ublk_io' any more" comments The struct ublk_io is in fact accessed in __ublk_complete_rq() after the comment. But it's not racy to access the ublk_io between clearing its UBLK_IO_FLAG_OWNED_BY_SRV flag and completing the request, as no other thread can use the ublk_io in the meantime. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-31 06:38:43 -07:00
Ming Lei	8443e2087e	ublk: add UBLK_F_NO_AUTO_PART_SCAN feature flag Add a new feature flag UBLK_F_NO_AUTO_PART_SCAN to allow users to suppress automatic partition scanning when starting a ublk device. This is useful for some cases in which user don't want to scan partitions. Users still can manually trigger partition scanning later when appropriate using standard tools (e.g., partprobe, blockdev --rereadpt). Reported-by: Yoav Cohen <yoav@nvidia.com> Link: https://lore.kernel.org/linux-block/DM4PR12MB63280C5637917C071C2F0D65A9A8A@DM4PR12MB6328.namprd12.prod.outlook.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-31 06:36:41 -07:00
Ming Lei	66d3af8d5d	ublk: check list membership before cancelling batch fetch command Add !list_empty(&fcmd->node) check in ublk_batch_cancel_cmd() to ensure the fcmd hasn't already been removed from the list. Once an fcmd is removed from the list, it's considered claimed by whoever removed it and will be freed by that path. Meantime switch to list_del_init() for deleting it from list. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-31 06:36:41 -07:00
Caleb Sander Mateos	373df2c025	ublk: drop ublk_ctrl_start_recovery() header argument ublk_ctrl_start_recovery() only uses its const struct ublksrv_ctrl_cmd * header argument to log the dev_id. But this value is already available in struct ublk_device's ub_number field. So log ub_number instead and drop the unused header argument. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-31 06:36:11 -07:00
Caleb Sander Mateos	ed9f54cc1e	ublk: use READ_ONCE() to read struct ublksrv_ctrl_cmd struct ublksrv_ctrl_cmd is part of the io_uring_sqe, which may lie in userspace-mapped memory. It's racy to access its fields with normal loads, as userspace may write to them concurrently. Use READ_ONCE() to copy the ublksrv_ctrl_cmd from the io_uring_sqe to the stack. Use the local copy in place of the one in the io_uring_sqe. Fixes: `87213b0d84` ("ublk: allow non-blocking ctrl cmds in IO_URING_F_NONBLOCK issue") Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-31 06:36:11 -07:00
Govindarajulu Varadarajan	da7e4b75e5	ublk: Validate SQE128 flag before accessing the cmd ublk_ctrl_cmd_dump() accesses (header *)sqe->cmd before IO_URING_F_SQE128 flag check. This could cause out of boundary memory access. Move the SQE128 flag check earlier in ublk_ctrl_uring_cmd() to return -EINVAL immediately if the flag is not set. Fixes: `71f28f3136` ("ublk_drv: add io_uring based userspace block driver") Signed-off-by: Govindarajulu Varadarajan <govind.varadar@gmail.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-31 06:36:11 -07:00
Linus Torvalds	03610bd6b5	block-6.19-20260130 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAml8yn8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpm2READCHbwDQmW+uOdvP5uB5UwnOLustp3WfpAu gb2NTHy0KN4jX/iI0Ni0iKijakNFOlI4BPTopPOkFZxLddda6jc/iOXGTfh0iZ8N bHwxVY2FxmTIh171LUXlDBoKKwUt1xQR1baikwP2ew282AxOqouJnL/zHpJhVe+T KQguoUgNiYikwxIoluBK9s0zJanjLT9LLuYmPJtiRT3146QsHZpQBVmyJ+oZKFe3 bBwhLiY2hB8kH6FqigHrgKyx4+8ZFotVW4OmKdAtuieY8BOFwtMVHmETBmWpMdLA EYe9CIQT3gRxP0RBui4OaZoAWsyy54XdP8T2Kz74uNi+URniO5qbSbMKs/UrYzhA odGcSilTL5AtJqyxVJIij10ZBeBG5xge+xukF2VcV8hd95DGiGf4shdkT+uMUprw s6YOGWs21GYvSiW37tdqPA4ezlfPPLjLhTzAmQjspNMgQPSmKF/wFqPyTmp2wsDC iO7yoXe/4yGZHPKPJWxrMEgNnqicnQcfQTkKuyDn4f8t4/UlwtVUnQcSD8t97ASW XzuGuU3QIyWQwCQq5ZkdjA5Tq4TK2sPWMBPVciNUEIfoYZJLG95+h0sbu/jAaFNA 4krhXa5o65mwUIbMFY+ILvvN3F2G33qEbP4KwIdP3YnRl5gRu23DI0qBceOdYXbz enQWu9jZaA== =oo3J -----END PGP SIGNATURE----- Merge tag 'block-6.19-20260130' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - Fix for an accounting leak in bcache that's been there forever, and a related dead code removal - Revert of a fix for rnbd that went into this series, but depends on other changes that are staged for 7.0 - NVMe pull request via Keith: - TCP target completion race condition fix (Ming) - DMA descriptor cleanup fix (Roger) * tag 'block-6.19-20260130' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: bcache: fix I/O accounting leak in detached_dev_do_request bcache: remove dead code in detached_dev_do_request nvme-pci: DMA unmap the correct regions in nvme_free_sgls Revert "rnbd-clt: fix refcount underflow in device unmap path" nvmet: fix race in nvmet_bio_done() leading to NULL pointer dereference	2026-01-30 13:18:32 -08:00
Damien Le Moal	da562d92e6	block: introduce bdev_rot() Introduce the helper function bdev_rot() to test if a block device is a rotational one. The existing function bdev_nonrot() which tests for the opposite condition is redefined using this new helper. This avoids the double negation (operator and name) that appears when testing if a block device is a rotational device, thus making the code a little easier to read. Call sites of bdev_nonrot() in the block layer are updated to use this new helper. Remaining users in other subsystems are left unchanged for now. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-30 08:11:09 -07:00
Caleb Sander Mateos	ad5f2e2908	ublk: restore auto buf unregister refcount optimization Commit `1ceeedb597` ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task") optimized ublk request buffer unregistration to use a non-atomic reference count decrement when performed on the ublk_io's daemon task. The optimization applied to auto buffer unregistration, which happens as part of handling UBLK_IO_COMMIT_AND_FETCH_REQ on the daemon task. However, commit `b749965edd` ("ublk: remove ublk_commit_and_fetch()") reordered the ublk_sub_req_ref() for the completed request before the io_buffer_unregister_bvec() call. As a result, task_registered_buffers is already 0 when io_buffer_unregister_bvec() calls ublk_io_release() and the non-atomic refcount optimization doesn't apply. Move the io_buffer_unregister_bvec() call back to before ublk_need_complete_req() to restore the reference counting optimization. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: `b749965edd` ("ublk: remove ublk_commit_and_fetch()") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-29 19:53:54 -07:00
Ming Lei	0921abdcbd	ublk: document IO reference counting design Add comprehensive documentation for ublk's split reference counting model (io->ref + io->task_registered_buffers) above ublk_init_req_ref() given this model isn't very straightforward. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-29 05:47:21 -07:00
Chaitanya Kulkarni	7c746eb71f	rnbd-clt: fix refcount underflow in device unmap path During device unmapping (triggered by module unload or explicit unmap), a refcount underflow occurs causing a use-after-free warning: [14747.574913] ------------[ cut here ]------------ [14747.574916] refcount_t: underflow; use-after-free. [14747.574917] WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x55/0x90, CPU#9: kworker/9:1/378 [14747.574924] Modules linked in: rnbd_client(-) rtrs_client rnbd_server rtrs_server rtrs_core ... [14747.574998] CPU: 9 UID: 0 PID: 378 Comm: kworker/9:1 Tainted: G O N 6.19.0-rc3lblk-fnext+ #42 PREEMPT(voluntary) [14747.575005] Workqueue: rnbd_clt_wq unmap_device_work [rnbd_client] [14747.575010] RIP: 0010:refcount_warn_saturate+0x55/0x90 [14747.575037] Call Trace: [14747.575038] <TASK> [14747.575038] rnbd_clt_unmap_device+0x170/0x1d0 [rnbd_client] [14747.575044] process_one_work+0x211/0x600 [14747.575052] worker_thread+0x184/0x330 [14747.575055] ? __pfx_worker_thread+0x10/0x10 [14747.575058] kthread+0x10d/0x250 [14747.575062] ? __pfx_kthread+0x10/0x10 [14747.575066] ret_from_fork+0x319/0x390 [14747.575069] ? __pfx_kthread+0x10/0x10 [14747.575072] ret_from_fork_asm+0x1a/0x30 [14747.575083] </TASK> [14747.575096] ---[ end trace 0000000000000000 ]--- Befor this patch :- The bug is a double kobject_put() on dev->kobj during device cleanup. Kobject Lifecycle: kobject_init_and_add() sets kobj.kref = 1 (initialization) kobject_put() sets kobj.kref = 0 (should be called once) * Before this patch: rnbd_clt_unmap_device() rnbd_destroy_sysfs() kobject_del(&dev->kobj) [remove from sysfs] kobject_put(&dev->kobj) PUT #1 (WRONG!) kref: 1 to 0 rnbd_dev_release() kfree(dev) [DEVICE FREED!] rnbd_destroy_gen_disk() [use-after-free!] rnbd_clt_put_dev() refcount_dec_and_test(&dev->refcount) kobject_put(&dev->kobj) PUT #2 (UNDERFLOW!) kref: 0 to -1 [WARNING!] The first kobject_put() in rnbd_destroy_sysfs() prematurely frees the device via rnbd_dev_release(), then the second kobject_put() in rnbd_clt_put_dev() causes refcount underflow. * After this patch :- Remove kobject_put() from rnbd_destroy_sysfs(). This function should only remove sysfs visibility (kobject_del), not manage object lifetime. Call Graph (FIXED): rnbd_clt_unmap_device() rnbd_destroy_sysfs() kobject_del(&dev->kobj) [remove from sysfs only] [kref unchanged: 1] rnbd_destroy_gen_disk() [device still valid] rnbd_clt_put_dev() refcount_dec_and_test(&dev->refcount) kobject_put(&dev->kobj) ONLY PUT (CORRECT!) kref: 1 to 0 [BALANCED] rnbd_dev_release() kfree(dev) [CLEAN DESTRUCTION] This follows the kernel pattern where sysfs removal (kobject_del) is separate from object destruction (kobject_put). Fixes: `581cf833ca` ("block: rnbd: add .release to rnbd_dev_ktype") Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-27 21:15:11 -07:00
Jens Axboe	c87f15efeb	Revert "rnbd-clt: fix refcount underflow in device unmap path" This reverts commit `ec19ed2b3e`. This commit relies on changes queued for 7.0, and isn't safe in its current form for the 6.19 release. Revert it for now for 6.19. Link: https://lore.kernel.org/linux-block/aXhLQmRudk7cSAnT@shinmob/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-27 21:13:54 -07:00
Gary Guo	b2b2ce8706	block: rnull: remove imports available via prelude These imports are already in scope by importing `kernel::prelude::*` and does not need to be imported separately. Signed-off-by: Gary Guo <gary@garyguo.net> Acked-by: Andreas Hindborg <a.hindborg@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-27 09:45:42 -07:00
Sergey Senozhatsky	0be909f114	zsmalloc: use actual object size to detect spans Using class->size to detect spanning objects is not entirely correct, because some size classes can hold a range of object sizes of up to class->size bytes in length, due to size-classes merge. Such classes use padding for cases when actually written objects are smaller than class->size. zs_obj_read_begin() can incorrectly hit the slow path and perform memcpy of such objects, basically copying padding bytes. Instead of class->size zs_obj_read_begin() should use the actual compressed object length (both zram and zswap know it) so that it can correctly handle situations when a written object is small enough to fit into the first physical page. Link: https://lkml.kernel.org/r/20260107052145.3586917-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> [zsmalloc & zswap] Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 20:02:25 -08:00
Linus Torvalds	00d20db21e	block-6.19-20260122 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmly5rUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpv30EACNx0n1MlRvrIHvqIH+gq2NmpweLhJphars 3e4lyDD5bBUuQ1V2iRc5k7MiKPEtQJpF83m1WVIvycNjjx2qWLBsZyCc9PCNoZIn nBe1/qCuKOs8kmrNkTXLKjr1uFVp+0Nm5jWSjeIgNI1zd7IDe6fpkjyH6JrnKsw8 DD7aCYE4jHGJH8q9Ks3rOu3M2syiFwkqmOD12Xgqiz29fP4qJJDvC/y4yOMDWbZY G7ZKFHZK5QS8tmjRUhgJacVpJlN3NCL6lZ16f0fRTJoIlWzOAaBafeJTAjiS3S01 TS8sadfTLrVZocFYSKZjixVgUocHqE5QeKBrDe09N0A5XsWz/YJkcNisNw0oRcvb BKCrO1TCHiPzliPCobPcaFZII4z2HeiIfVGSTaJAzSM4ZCt/EH9BFTKrgohGRZi4 v+bPNdJaCLaFOipgsUeA63NZpt9xN+PsfmzFGC1PvF50gQMM+JeFpqjNh37XjCIt 6QtK9rzLzC9GvdFXV3o1CRbIBAli3UEgjyLeThhsCu2HPnab/yrrf9AwF5udZQlJ wK7S8kB3zl25nyM3jwKzzaVLqi+aUWr1Jn+D+DA2BwwPg3m8h7yIrysJryF5Leon 5VAYHCOg+MhPuvhNFjqkgahjreJqV/3Qou+sbGcLur4PXdKXeW8SY5wrDlplumFb +BZSUZ06Kw== =RK+t -----END PGP SIGNATURE----- Merge tag 'block-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - A set of selftest fixes for ublk - Fix for a pid mismatch in ublk, comparing PIDs in different namespaces if run inside a namespace - Fix for a regression added in this release with polling, where the nvme tcp connect code would spin forever - Zoned device error path fix - Tweak the blkzoned uapi additions from this kernel release, making them more easily discoverable - Fix for a regression in bcache with bio endio handling added in this release * tag 'block-6.19-20260122' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: bcache: use bio cloning for detached device requests blk-mq: use BLK_POLL_ONESHOT for synchronous poll completion selftests/ublk: fix garbage output in foreground mode selftests/ublk: fix error handling for starting device selftests/ublk: fix IO thread idle check block: make the new blkzoned UAPI constants discoverable ublk: fix ublksrv pid handling for pid namespaces block: Fix an error path in disk_update_zone_resources()	2026-01-23 12:53:56 -08:00
Ming Lei	f50af89693	ublk: rename auto buffer registration helpers Rename the auto buffer registration functions for clarity: - __ublk_do_auto_buf_reg() -> ublk_auto_buf_register() - ublk_prep_auto_buf_reg_io() -> ublk_auto_buf_io_setup() - ublk_do_auto_buf_reg() -> ublk_auto_buf_dispatch() Add comments documenting the locking requirements for each function. No functional change. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-23 12:02:18 -07:00
Ming Lei	e4c4bfec2b	ublk: fix canceling flag handling in batch I/O recovery Two issues with ubq->canceling flag handling: 1) In ublk_queue_reset_io_flags(), ubq->canceling is set outside cancel_lock, violating the locking requirement. Move it inside the spinlock-protected section. 2) In ublk_batch_unprep_io(), when rolling back after a batch prep failure, if the queue became ready during prep (which cleared canceling), the flag is not restored when the queue becomes not-ready again. This allows new requests to be queued to uninitialized IO slots. Fix by restoring ubq->canceling = true under cancel_lock when the queue transitions from ready to not-ready during rollback. Reported-by: Jens Axboe <axboe@kernel.dk> Fixes: `3f38507855` ("ublk: fix batch I/O recovery -ENODEV error") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-23 05:11:03 -07:00
Ming Lei	dbc635c4be	ublk: move ublk_mark_io_ready() out of __ublk_fetch() ublk_batch_prep_io() calls __ublk_fetch() while holding io->lock spinlock. When the last IO makes the device ready, ublk_mark_io_ready() tries to acquire ub->cancel_mutex which can sleep, causing a sleeping-while-atomic bug. Fix by moving ublk_mark_io_ready() out of __ublk_fetch() and into the callers (ublk_fetch and ublk_batch_prep_io) after the spinlock is released. Reported-by: Jens Axboe <axboe@kernel.dk> Fixes: `b256795b36` ("ublk: handle UBLK_U_IO_PREP_IO_CMDS") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-23 05:11:02 -07:00
Ming Lei	3f38507855	ublk: fix batch I/O recovery -ENODEV error During recovery with batch I/O, UBLK_U_IO_FETCH_IO_CMDS command fails with -ENODEV because ublk_batch_attach() rejects them when ubq->canceling is set. The canceling flag remains set until all queues are ready. Fix this by tracking per-queue readiness and clearing ubq->canceling as soon as each individual queue becomes ready, rather than waiting for all queues. This allows subsequent UBLK_U_IO_FETCH_IO_CMDS commands to succeed during recovery. Changes: - Add ubq->nr_io_ready to track I/Os ready per queue - Add ub->nr_queue_ready to track number of ready queues - Add ublk_queue_ready() helper to check queue readiness - Redefine ublk_dev_ready() based on queue count instead of I/O count - Clear ubq->canceling immediately when queue becomes ready - Add ublk_queue_reset_io_flags() to reset per-queue flags Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 20:05:41 -07:00
Ming Lei	7aa78d4a3c	ublk: implement batch request completion via blk_mq_end_request_batch() Reduce overhead when completing multiple requests in batch I/O mode by accumulating them in an io_comp_batch structure and completing them together via blk_mq_end_request_batch(). This minimizes per-request completion overhead and improves performance for high IOPS workloads. The implementation adds an io_comp_batch pointer to struct ublk_io and initializes it in __ublk_fetch(). For batch I/O, the pointer is set to the batch structure in ublk_batch_commit_io(). The __ublk_complete_rq() function uses io->iob to call blk_mq_add_to_batch() for batch mode. After processing all batch I/Os, the completion callback is invoked in ublk_handle_batch_commit_cmd() to complete all accumulated requests efficiently. So far just covers direct completion. For deferred completion(zero copy, auto buffer reg), ublk_io_release() is often delayed in freeing buffer consumer io_uring request's code path, so this patch often doesn't work, also it is hard to pass the per-task 'struct io_comp_batch' for deferred completion. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 20:05:40 -07:00
Ming Lei	e2723e6ce6	ublk: add new feature UBLK_F_BATCH_IO Add new feature UBLK_F_BATCH_IO which replaces the following two per-io commands: - UBLK_U_IO_FETCH_REQ - UBLK_U_IO_COMMIT_AND_FETCH_REQ with three per-queue batch io uring_cmd: - UBLK_U_IO_PREP_IO_CMDS - UBLK_U_IO_COMMIT_IO_CMDS - UBLK_U_IO_FETCH_IO_CMDS Then ublk can deliver batch io commands to ublk server in single multishort uring_cmd, also allows to prepare & commit multiple commands in batch style via single uring_cmd, communication cost is reduced a lot. This feature also doesn't limit task context any more for all supported commands, so any allowed uring_cmd can be issued in any task context. ublk server implementation becomes much easier. Meantime load balance becomes much easier to support with this feature. The command `UBLK_U_IO_FETCH_IO_CMDS` can be issued from multiple task contexts, so each task can adjust this command's buffer length or number of inflight commands for controlling how much load is handled by current task. Later, priority parameter will be added to command `UBLK_U_IO_FETCH_IO_CMDS` for improving load balance support. UBLK_U_IO_NEED_GET_DATA isn't supported in batch io yet, but it may be enabled in future via its batch pair. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 20:05:40 -07:00
Ming Lei	29d0a927f9	ublk: abort requests filled in event kfifo In case of BATCH_IO, any request filled in event kfifo, they don't get chance to be dispatched any more when releasing ublk char device, so we have to abort them too. Add ublk_abort_batch_queue() for aborting this kind of requests. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 20:05:40 -07:00
Ming Lei	3ac4796b88	ublk: refactor ublk_queue_rq() and add ublk_batch_queue_rq() Extract common request preparation and cancellation logic into __ublk_queue_rq_common() helper function. Add dedicated ublk_batch_queue_rq() for batch mode operations to eliminate runtime check in ublk_queue_rq(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 20:05:40 -07:00
Ming Lei	a4d8837553	ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing Add UBLK_U_IO_FETCH_IO_CMDS command to enable efficient batch processing of I/O requests. This multishot uring_cmd allows the ublk server to fetch multiple I/O commands in a single operation, significantly reducing submission overhead compared to individual FETCH_REQ* commands. Key Design Features: 1. Multishot Operation: One UBLK_U_IO_FETCH_IO_CMDS can fetch many I/O commands, with the batch size limited by the provided buffer length. 2. Dynamic Load Balancing: Multiple fetch commands can be submitted simultaneously, but only one is active at any time. This enables efficient load distribution across multiple server task contexts. 3. Implicit State Management: The implementation uses three key variables to track state: - evts_fifo: Queue of request tags awaiting processing - fcmd_head: List of available fetch commands - active_fcmd: Currently active fetch command (NULL = none active) States are derived implicitly: - IDLE: No fetch commands available - READY: Fetch commands available, none active - ACTIVE: One fetch command processing events 4. Lockless Reader Optimization: The active fetch command can read from evts_fifo without locking (single reader guarantee), while writers (ublk_queue_rq/ublk_queue_rqs) use evts_lock protection. The memory barrier pairing plays key role for the single lockless reader optimization. Implementation Details: - ublk_queue_rq() and ublk_queue_rqs() save request tags to evts_fifo - __ublk_acquire_fcmd() selects an available fetch command when events arrive and no command is currently active - ublk_batch_dispatch() moves tags from evts_fifo to the fetch command's buffer and posts completion via io_uring_mshot_cmd_post_cqe() - State transitions are coordinated via evts_lock to maintain consistency Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 20:05:40 -07:00
Ming Lei	7a1bb41947	ublk: add batch I/O dispatch infrastructure Add infrastructure for delivering I/O commands to ublk server in batches, preparing for the upcoming UBLK_U_IO_FETCH_IO_CMDS feature. Key components: - struct ublk_batch_fetch_cmd: Represents a batch fetch uring_cmd that will receive multiple I/O tags in a single operation, using io_uring's multishot command for efficient ublk IO delivery. - ublk_batch_dispatch(): Batch version of ublk_dispatch_req() that: * Pulls multiple request tags from the events FIFO (lock-free reader) * Prepares each I/O for delivery (including auto buffer registration) * Delivers tags to userspace via single uring_cmd notification * Handles partial failures by restoring undelivered tags to FIFO The batch approach significantly reduces notification overhead by aggregating multiple I/O completions into single uring_cmd, while maintaining the same I/O processing semantics as individual operations. Error handling ensures system consistency: if buffer selection or CQE posting fails, undelivered tags are restored to the FIFO for retry, meantime IO state has to be restored. This runs in task work context, scheduled via io_uring_cmd_complete_in_task() or called directly from ->uring_cmd(), enabling efficient batch processing without blocking the I/O submission path. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 20:05:40 -07:00
Ming Lei	f1f99ddf60	ublk: add io events fifo structure Add ublk io events fifo structure and prepare for supporting command batch, which will use io_uring multishot uring_cmd for fetching one batch of io commands each time. One nice feature of kfifo is to allow multiple producer vs single consumer. We just need lock the producer side, meantime the single consumer can be lockless. The producer is actually from ublk_queue_rq() or ublk_queue_rqs(), so lock contention can be eased by setting proper blk-mq nr_queues. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-22 20:05:40 -07:00

1 2 3 4 5 ...

8811 commits