From 4bb08cf974c54adc321b8505ce82720c2a15c451 Mon Sep 17 00:00:00 2001
From: Ming Lei
Date: Wed, 11 Jun 2025 16:49:38 +0800
Subject: [PATCH 1/9] loop: move lo_set_size() out of queue freeze

It isn't necessary to freeze the queue to update the disk size, since
submit_bio() doesn't grab the queue usage counter when checking EOD.
Many drivers also don't freeze the queue before calling
set_capacity_and_notify().

Move lo_set_size() out of the queue freeze to fix numerous lockdep
warning reports.

Link: https://lore.kernel.org/linux-block/67ea99e0.050a0220.3c3d88.0042.GAE@google.com/
Reported-by: syzbot+9dd7dbb1a4b915dee638@syzkaller.appspotmail.com
Signed-off-by: Ming Lei
Link: https://lore.kernel.org/r/20250611084938.108829-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe
---
 drivers/block/loop.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f8d136684109..500840e4a74e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1248,12 +1248,6 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 	lo->lo_flags &= ~LOOP_SET_STATUS_CLEARABLE_FLAGS;
 	lo->lo_flags |= (info->lo_flags & LOOP_SET_STATUS_SETTABLE_FLAGS);
 
-	if (size_changed) {
-		loff_t new_size = get_size(lo->lo_offset, lo->lo_sizelimit,
-					   lo->lo_backing_file);
-		loop_set_size(lo, new_size);
-	}
-
 	/* update the direct I/O flag if lo_offset changed */
 	loop_update_dio(lo);
 
@@ -1261,6 +1255,11 @@ out_unfreeze:
 	blk_mq_unfreeze_queue(lo->lo_queue, memflags);
 	if (partscan)
 		clear_bit(GD_SUPPRESS_PART_SCAN, &lo->lo_disk->state);
+	if (!err && size_changed) {
+		loff_t new_size = get_size(lo->lo_offset, lo->lo_sizelimit,
+					   lo->lo_backing_file);
+		loop_set_size(lo, new_size);
+	}
 out_unlock:
 	mutex_unlock(&lo->lo_mutex);
 	if (partscan)
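[Editor's note: the resulting ordering in loop_set_status() can be
summarized as follows. This is a hedged paraphrase of the hunks above,
with error handling and the other updates elided, not the verbatim
function:

	memflags = blk_mq_freeze_queue(lo->lo_queue);
	/* ... update lo_offset, lo_sizelimit, lo_flags; loop_update_dio() ... */
	blk_mq_unfreeze_queue(lo->lo_queue, memflags);

	/* Safe outside the freeze: submit_bio() checks EOD without
	 * holding the queue usage counter, so the capacity update via
	 * loop_set_size()/set_capacity_and_notify() needs no freeze. */
	if (!err && size_changed)
		loop_set_size(lo, get_size(lo->lo_offset, lo->lo_sizelimit,
					   lo->lo_backing_file));

Only the flag and offset updates remain under the freeze; the size
update runs after the queue is unfrozen.]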
From ff20c516485efbeb5c32bcb6aa5a24f73774185b Mon Sep 17 00:00:00 2001
From: Ming Lei
Date: Wed, 11 Jun 2025 16:56:32 +0800
Subject: [PATCH 2/9] ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)

Document the recently merged auto buffer registration feature
(UBLK_F_AUTO_BUF_REG).

Cc: Caleb Sander Mateos
Signed-off-by: Ming Lei
Link: https://lore.kernel.org/r/20250611085632.109719-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe
---
 Documentation/block/ublk.rst | 75 ++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
index c368e1081b41..abec524a04ed 100644
--- a/Documentation/block/ublk.rst
+++ b/Documentation/block/ublk.rst
@@ -352,6 +352,81 @@ For reaching best IO performance, ublk server should align its segment
 parameter of `struct ublk_param_segment` with backend for avoiding
 unnecessary IO split, which usually hurts io_uring performance.
 
+Auto Buffer Registration
+------------------------
+
+The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
+and unregistration for I/O requests, which simplifies the buffer management
+process and reduces overhead in the ublk server implementation.
+
+This is another feature flag for using zero copy, and it is compatible with
+``UBLK_F_SUPPORT_ZERO_COPY``.
+
+Feature Overview
+~~~~~~~~~~~~~~~~
+
+This feature automatically registers request buffers with the io_uring
+context before delivering I/O commands to the ublk server and unregisters
+them when completing I/O commands. This eliminates the need for manual
+buffer registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
+``UBLK_IO_UNREGISTER_IO_BUF`` commands, so IO handling in the ublk server
+no longer depends on these two uring_cmd operations.
+
+IOs can't be issued concurrently to io_uring if there is any dependency
+among them. Removing the dependency on the buffer registration &
+unregistration commands therefore not only simplifies the ublk server
+implementation, but also makes concurrent IO handling possible.
+
+Usage Requirements
+~~~~~~~~~~~~~~~~~~
+
+1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
+   used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``. If
+   the uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
+   unregistration is required.
+
+2. Buffer registration data must be passed via the uring_cmd's ``sqe->addr``
+   with the following structure::
+
+    struct ublk_auto_buf_reg {
+        __u16 index;     /* Buffer index for registration */
+        __u8 flags;      /* Registration flags */
+        __u8 reserved0;  /* Reserved for future use */
+        __u32 reserved1; /* Reserved for future use */
+    };
+
+   ublk_auto_buf_reg_to_sqe_addr() converts the above structure into
+   ``sqe->addr``.
+
+3. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
+
+4. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
+
+Fallback Behavior
+~~~~~~~~~~~~~~~~~
+
+If auto buffer registration fails:
+
+1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
+   - The uring_cmd is completed
+   - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
+   - The ublk server must handle the failure manually, for example by
+     registering the buffer itself or by using the user copy feature to
+     retrieve the data for the ublk IO
+
+2. If fallback is not enabled:
+   - The ublk I/O request fails silently
+   - The uring_cmd won't be completed
+
+Limitations
+~~~~~~~~~~~
+
+- Requires the same ``io_ring_ctx`` for all operations
+- May require manual buffer management in fallback cases
+- The io_ring_ctx buffer table has a maximum size of 16K, which may not be
+  enough when too many ublk devices are handled by a single io_ring_ctx
+  and each one has a very large queue depth
+
 References
 ==========
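[Editor's note: for illustration, a hedged sketch of how a ublk server
might prepare the registration data when queuing a
UBLK_IO_COMMIT_AND_FETCH_REQ SQE. The per-(queue, tag) index scheme is
an assumption made for this example, not something the interface
mandates; the authoritative definitions live in
include/uapi/linux/ublk_cmd.h:

	/* Hypothetical slot assignment: one buffer-table slot per tag. */
	struct ublk_auto_buf_reg reg = {
		.index = q_id * queue_depth + tag,
		.flags = UBLK_AUTO_BUF_REG_FALLBACK,	/* optional */
	};

	/* The designated initializer zeroes reserved0/reserved1, as
	 * required. Encode the structure into the uring_cmd SQE. */
	sqe->addr = ublk_auto_buf_reg_to_sqe_addr(&reg);

If registration later fails at IO delivery time, the server should check
UBLK_IO_F_NEED_REG_BUF in ublksrv_io_desc.op_flags and fall back as
described in the documentation above.]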
From f705d33c2f0353039d03e5d6f18f70467d86080e Mon Sep 17 00:00:00 2001
From: Damien Le Moal
Date: Wed, 11 Jun 2025 09:59:15 +0900
Subject: [PATCH 3/9] block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion

When blk_zone_write_plug_bio_endio() is called for a regular write BIO
used to emulate a zone append operation, that is, a BIO flagged with
BIO_EMULATES_ZONE_APPEND, the BIO operation code is restored to the
original REQ_OP_ZONE_APPEND, but the BIO_EMULATES_ZONE_APPEND flag is
not cleared. Clear it to fully return the BIO to its original
definition.

Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal
Reviewed-by: Johannes Thumshirn
Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Link: https://lore.kernel.org/r/20250611005915.89843-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe
---
 block/blk-zoned.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 8f15d1aa6eb8..cee7a0774a9b 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1225,6 +1225,7 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
 	if (bio_flagged(bio, BIO_EMULATES_ZONE_APPEND)) {
 		bio->bi_opf &= ~REQ_OP_MASK;
 		bio->bi_opf |= REQ_OP_ZONE_APPEND;
+		bio_clear_flag(bio, BIO_EMULATES_ZONE_APPEND);
 	}
 
 	/*

From cf625013d8741c01407bbb4a60c111b61b9fa69d Mon Sep 17 00:00:00 2001
From: Christoph Hellwig
Date: Wed, 11 Jun 2025 06:44:16 +0200
Subject: [PATCH 4/9] block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work

Bios queued up in the zone write plug have already gone through all the
preparation in the submit_bio path, including the freeze protection.
Submitting them through submit_bio_noacct_nocheck duplicates the work
and can cause deadlocks when freezing a queue with pending bio write
plugs.

Go straight to ->submit_bio or blk_mq_submit_bio to bypass the
superfluous freeze protection and checks.

Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation")
Reported-by: Bart Van Assche
Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Reviewed-by: Damien Le Moal
Tested-by: Damien Le Moal
Link: https://lore.kernel.org/r/20250611044416.2351850-1-hch@lst.de
Signed-off-by: Jens Axboe
---
 block/blk-zoned.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index cee7a0774a9b..351d659280e1 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1307,7 +1307,6 @@ again:
 	spin_unlock_irqrestore(&zwplug->lock, flags);
 
 	bdev = bio->bi_bdev;
-	submit_bio_noacct_nocheck(bio);
 
 	/*
 	 * blk-mq devices will reuse the extra reference on the request queue
@@ -1315,8 +1314,12 @@ again:
 	 * path for BIO-based devices will not do that. So drop this extra
 	 * reference here.
 	 */
-	if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO))
+	if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) {
+		bdev->bd_disk->fops->submit_bio(bio);
 		blk_queue_exit(bdev->bd_disk->queue);
+	} else {
+		blk_mq_submit_bio(bio);
+	}
 
 put_zwplug:
 	/* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */
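[Editor's note: to spell out the deadlock the commit message mentions,
here is a hedged reconstruction of the cycle (ours, not text from the
patch) that direct dispatch avoids:

	/* Thread A: freezing the queue */
	blk_freeze_queue_start(q);
	blk_mq_freeze_queue_wait(q);	/* waits for q_usage_counter to
					 * drain; the plugged bio still
					 * holds a reference to it */

	/* Thread B: blk_zone_wplug_bio_work(), before this patch */
	submit_bio_noacct_nocheck(bio);	/* re-enters the submission path,
					 * which blocks entering the
					 * now-freezing queue -> neither
					 * thread can make progress */

Dispatching directly to ->submit_bio or blk_mq_submit_bio() reuses the
queue usage reference the bio already holds, so no second queue entry
is attempted.]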
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-block/20250611121626.7252-1-abuehaze@amazon.com/
Reported-by: Hazem Mohamed Abuelfotoh
Fixes: e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug")
Signed-off-by: Jens Axboe
---
 block/blk-merge.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3af1d284add5..70d704615be5 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -998,20 +998,20 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 	if (!plug || rq_list_empty(&plug->mq_list))
 		return false;
 
-	rq_list_for_each(&plug->mq_list, rq) {
-		if (rq->q == q) {
-			if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
-			    BIO_MERGE_OK)
-				return true;
-			break;
-		}
+	rq = plug->mq_list.tail;
+	if (rq->q == q)
+		return blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
+			BIO_MERGE_OK;
+	else if (!plug->multiple_queues)
+		return false;
 
-		/*
-		 * Only keep iterating plug list for merges if we have multiple
-		 * queues
-		 */
-		if (!plug->multiple_queues)
-			break;
+	rq_list_for_each(&plug->mq_list, rq) {
+		if (rq->q != q)
+			continue;
+		if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
+		    BIO_MERGE_OK)
+			return true;
+		break;
 	}
 	return false;
 }

From f826ec7966a63d48e16e0868af4e038bf9a1a3ae Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Thu, 12 Jun 2025 15:41:25 +0100
Subject: [PATCH 6/9] bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP

It is possible for physically contiguous folios to have discontiguous
struct pages if SPARSEMEM is enabled and SPARSEMEM_VMEMMAP is not.
This is correctly handled by folio_page_idx(), so remove this
open-coded implementation.

Fixes: 640d1930bef4 ("block: Add bio_for_each_folio_all()")
Signed-off-by: Matthew Wilcox (Oracle)
Link: https://lore.kernel.org/r/20250612144126.2849931-1-willy@infradead.org
Signed-off-by: Jens Axboe
---
 include/linux/bio.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 9c37c66ef9ca..46ffac5caab7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -291,7 +291,7 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
 
 	fi->folio = page_folio(bvec->bv_page);
 	fi->offset = bvec->bv_offset +
-			PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
+			PAGE_SIZE * folio_page_idx(fi->folio, bvec->bv_page);
 	fi->_seg_count = bvec->bv_len;
 	fi->length = min(folio_size(fi->folio) - fi->offset, fi->_seg_count);
 	fi->_next = folio_next(fi->folio);
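[Editor's note: for context, folio_page_idx() handles both memory
models; a hedged paraphrase of its definition in include/linux/mm.h
(the exact form in any given tree may differ):

	#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
	/* struct pages may be discontiguous: compare page frame numbers */
	#define folio_page_idx(folio, p)  (page_to_pfn(p) - folio_pfn(folio))
	#else
	/* memmap is virtually contiguous: pointer arithmetic is valid */
	#define folio_page_idx(folio, p)  ((p) - &(folio)->page)
	#endif

The removed open-coded expression matched only the second case, which
is why it broke on SPARSEMEM without VMEMMAP.]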
From 5e223e06ee7c6d8f630041a0645ac90e39a42cc6 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)"
Date: Thu, 12 Jun 2025 15:42:53 +0100
Subject: [PATCH 7/9] block: Fix bvec_set_folio() for very large folios

Similarly to 26064d3e2b4d ("block: fix adding folio to bio"), if we
attempt to add a folio that is larger than 4GB, we'll silently truncate
the offset and len. Widen the parameters to size_t, assert that the
length is less than 4GB, and set the first page that contains the
interesting data rather than the first page of the folio.

Fixes: 26db5ee15851 ("block: add a bvec_set_folio helper")
Signed-off-by: Matthew Wilcox (Oracle)
Link: https://lore.kernel.org/r/20250612144255.2850278-1-willy@infradead.org
Signed-off-by: Jens Axboe
---
 include/linux/bvec.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 204b22a99c4b..0a80e1f9aa20 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -57,9 +57,12 @@ static inline void bvec_set_page(struct bio_vec *bv, struct page *page,
  * @offset: offset into the folio
  */
 static inline void bvec_set_folio(struct bio_vec *bv, struct folio *folio,
-		unsigned int len, unsigned int offset)
+		size_t len, size_t offset)
 {
-	bvec_set_page(bv, &folio->page, len, offset);
+	unsigned long nr = offset / PAGE_SIZE;
+
+	WARN_ON_ONCE(len > UINT_MAX);
+	bvec_set_page(bv, folio_page(folio, nr), len, offset % PAGE_SIZE);
 }
 
 /**
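[Editor's note: a hedged usage sketch of the widened helper, with
hypothetical values: an offset past 4GB into a very large folio now
resolves to the right page plus an in-page offset, where the old
unsigned int parameters would have silently truncated it:

	struct bio_vec bv;
	size_t offset = 5ULL << 30;	/* hypothetical: 5GB into the folio */

	bvec_set_folio(&bv, folio, PAGE_SIZE, offset);
	/* Now: bv.bv_page   == folio_page(folio, offset / PAGE_SIZE)
	 *      bv.bv_offset == offset % PAGE_SIZE (always < PAGE_SIZE)
	 * Before: offset was truncated to 32 bits at the call boundary,
	 * pointing the bvec at the wrong data. */
]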
From db3dfae1a2f662e69d535827703bcdbb04b8d72b Mon Sep 17 00:00:00 2001
From: Bagas Sanjaya
Date: Fri, 13 Jun 2025 09:38:57 +0700
Subject: [PATCH 8/9] Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists

Stephen Rothwell reports an htmldocs warning on the ublk docs:

Documentation/block/ublk.rst:414: ERROR: Unexpected indentation. [docutils]

Fix the warning by separating the auto buffer registration fallback
behavior sublists from their parent list items with a blank line.

Fixes: ff20c516485e ("ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)")
Reported-by: Stephen Rothwell
Closes: https://lore.kernel.org/linux-next/20250612132638.193de386@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya
Link: https://lore.kernel.org/r/20250613023857.15971-1-bagasdotme@gmail.com
Signed-off-by: Jens Axboe
---
 Documentation/block/ublk.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
index abec524a04ed..8c4030bcabb6 100644
--- a/Documentation/block/ublk.rst
+++ b/Documentation/block/ublk.rst
@@ -408,6 +408,7 @@ Fallback Behavior
 If auto buffer registration fails:
 
 1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
+
    - The uring_cmd is completed
    - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
    - The ublk server must handle the failure manually, for example by
@@ -415,6 +416,7 @@ If auto buffer registration fails:
     retrieve the data for the ublk IO
 
 2. If fallback is not enabled:
+
    - The ublk I/O request fails silently
    - The uring_cmd won't be completed

From 9ce6c9875f3e995be5fd720b65835291f8a609b1 Mon Sep 17 00:00:00 2001
From: Jens Axboe
Date: Fri, 13 Jun 2025 13:37:41 -0600
Subject: [PATCH 9/9] nvme: always punt polled uring_cmd end_io work to task_work

Currently, NVMe uring_cmd completions are completed locally if they are
polled. This is done because those completions are always invoked from
task context. And while that is true, there's no guarantee that the
completion is invoked under the right ring context, or even the right
task. If someone does NVMe passthrough via multiple threads with a
limited number of poll queues, then ringA may find completions from
ringB. In that case, completing the request locally is not sound.

Always punt the passthrough completions via task_work, which will
redirect the completion, if needed.

Cc: stable@vger.kernel.org
Fixes: 585079b6e425 ("nvme: wire up async polling for io passthrough commands")
Signed-off-by: Jens Axboe
---
 drivers/nvme/host/ioctl.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 0b50da2f1175..6b3ac8ae3f34 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -429,21 +429,14 @@ static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
 	pdu->result = le64_to_cpu(nvme_req(req)->result.u64);
 
 	/*
-	 * For iopoll, complete it directly. Note that using the uring_cmd
-	 * helper for this is safe only because we check blk_rq_is_poll().
-	 * As that returns false if we're NOT on a polled queue, then it's
-	 * safe to use the polled completion helper.
-	 *
-	 * Otherwise, move the completion to task work.
+	 * IOPOLL could potentially complete this request directly, but
+	 * if multiple rings are polling on the same queue, then it's
+	 * possible for one ring to find completions for another ring.
+	 * Punting the completion via task_work will always direct it to
+	 * the right location, rather than potentially completing requests
+	 * for ringA under iopoll invocations from ringB.
 	 */
-	if (blk_rq_is_poll(req)) {
-		if (pdu->bio)
-			blk_rq_unmap_user(pdu->bio);
-		io_uring_cmd_iopoll_done(ioucmd, pdu->result, pdu->status);
-	} else {
-		io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
-	}
-
+	io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
 	return RQ_END_IO_FREE;
 }
 
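[Editor's note: every completion now funnels through the existing
task_work callback. A hedged sketch of its shape, paraphrased from
drivers/nvme/host/ioctl.c as of this series (details may differ): it
runs in the submitting task's context, where unmapping the user buffer
and posting the CQE are safe regardless of which ring reaped the
hardware completion:

	static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd,
				       unsigned issue_flags)
	{
		struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);

		if (pdu->bio)
			blk_rq_unmap_user(pdu->bio);
		io_uring_cmd_done(ioucmd, pdu->status, pdu->result, issue_flags);
	}
]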