Commit graph

255 commits

Author SHA1 Message Date
Linus Torvalds
323bbfcf1e Convert 'alloc_flex' family to use the new default GFP_KERNEL argument
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.

As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Filipe Manana
29fb415a6a btrfs: raid56: fix memory leak of btrfs_raid_bio::stripe_uptodate_bitmap
We allocate the bitmap but we never free it in free_raid_bio_pointers().
Fix this by adding a bitmap_free() call against the stripe_uptodate_bitmap
of a raid bio.

Fixes: 1810350b04 ("btrfs: raid56: move sector_ptr::uptodate into a dedicated bitmap")
Reported-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-btrfs/20260126045315.GA31641@lst.de/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 06:37:29 +01:00
Qu Wenruo
1a332a6d70 btrfs: raid56: remove the "_step" infix
The following functions are introduced as a middle step for bs > ps
support:

- rbio_streip_step_paddr()
- rbio_pstripe_step_paddr()
- rbio_qstripe_step_paddr()
- sector_step_paddr_in_rbio()

As there is already an existing function without the infix, and has a
different parameter list.

But the existing functions have been cleaned up, there is no need to
keep the "_step" infix, just remove it completely.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:50:19 +01:00
Qu Wenruo
8870dbeedc btrfs: raid56: enable bs > ps support
The support code for bs > ps is complete, enable it and update
assertions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:48:52 +01:00
Qu Wenruo
89ca1a403e btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases
The function finish_parity_scrub() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce a helper, verify_one_parity_step()
  Since the P/Q generation is always done in a vertical stripe, we have
  to handle the range step by step.

- Only clear the rbio->dbitmap if all steps of an fs block match

- Remove rbio_stripe_paddr() and sector_paddr_in_rbio() helpers
  Now we either use the paddrs version for checksum, or the step version
  for P/Q generation/recovery.

- Make alloc_rbio_essential_pages() to handle bs > ps cases
  Since for bs > ps cases, one fs block needs multiple pages, the
  existing simple check against rbio->stripe_pages[] is not enough.

  Extract a dedicated helper, alloc_rbio_sector_pages(), for the
  existing alloc_rbio_essential_pages(), which is still based on sector
  number.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:48:02 +01:00
Qu Wenruo
ba88278c69 btrfs: raid56: prepare rbio_bio_add_io_paddr() to support bs > ps cases
The function rbio_bio_add_io_paddr() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce a helper bio_add_paddrs()
  Previously we only need to add a single page to a bio for a fs block,
  but now we need to add multiple pages, this means we can fail halfway.

  In that case we need to properly revert the bio (only for its size
  though) for halfway failed cases.

- Rename rbio_add_io_paddr() to rbio_add_io_paddrs()
  And change the @paddr parameter to @paddrs[].

- Change all callers to use the updated rbio_add_io_paddrs()
  For the @paddrs pointer used for the new function, it can be grabbed
  using sector_paddrs_in_rbio() and rbio_stripe_paddrs() helpers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:47:54 +01:00
Qu Wenruo
53474a2ae1 btrfs: raid56: prepare steal_rbio() to support bs > ps cases
The function steal_rbio() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce two helpers to calculate the sector number
  Previously we assume one page will contain at least one fs block, thus
  can use something like "sectors_per_page = PAGE_SIZE / sectorsize;",
  but with bs > ps support that above number will be 0.

  Instead introduce two helpers:

  * page_nr_to_sector_nr()
    Returns the sector number of the first sector covered by the page.

  * page_nr_to_num_sectors()
    Return how many sectors are covered by the page.

  And use the returned values for bitmap operations other than
  open-coded "PAGE_SIZE / sectorsize".
  Those helpers also have extra ASSERT()s to catch weird numbers.

- Use above helpers
  The involved functions are:
  * steal_rbio_page()
  * is_data_stripe_page()
  * full_page_sectors_uptodate()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:47:41 +01:00
Qu Wenruo
05ddf35a5d btrfs: raid56: prepare set_bio_pages_uptodate() to support bs > ps cases
The function set_bio_pages_uptodate() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Update find_stripe_sector_nr() to check only the first step paddr
  We don't need to check each paddr, as the bios are still aligned to fs
  block size, thus checking the first step is enough.

- Use step size to iterate the bio
  This means we only need to find the sector number for the first step
  of each fs block, and skip the remaining part.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:47:31 +01:00
Qu Wenruo
64e7b8c7c5 btrfs: raid56: prepare verify_bio_data_sectors() to support bs > ps cases
The function verify_bio_data_sectors() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Make get_bio_sector_nr() to consider bs > ps cases
  The function is utilized to calculate the sector number of a device
  bio submitted by btrfs raid56 layer.

- Assemble a local paddrs[] for checksum calculation

- Open code btrfs_check_block_csum()
  btrfs_check_block_csum() only supports fs blocks backed by large
  folios.

  But for raid56 we can have fs blocks backed by multiple non-contiguous
  pages, e.g. direct IO, encoded read/write/send.

  So instead of using btrfs_check_block_csum(), open code it to use
  btrfs_calculate_block_csum_pages().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:47:20 +01:00
Qu Wenruo
e0eadfcc95 btrfs: raid56: prepare verify_one_sector() to support bs > ps cases
The function verify_one_sector() assume each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce helpers to get a paddrs pointer
  Thankfully all the higher layer bio should still be aligned to fs
  block size, thus a fs block should still be fully covered by the bio.

  Introduce sector_paddrs_in_rbio() and rbio_stripe_paddrs(), which will
  return a paddrs pointer inside btrfs_raid_bio::bio_paddrs[] or
  stripe_paddrs[].

  The pointer can be directly passed to
  btrfs_calculate_block_csum_pages() to verify the checksum.

- Open code btrfs_check_block_csum()
  btrfs_check_block_csum() only supports fs blocks backed by large
  folios.

  But for raid56 we can have fs blocks backed by multiple non-contiguous
  pages, e.g. direct IO, encoded read/write/send.

  So instead of using btrfs_check_block_csum(), open code it to use
  btrfs_calculate_block_csum_pages().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:47:16 +01:00
Qu Wenruo
9ba67fd616 btrfs: raid56: prepare recover_vertical() to support bs > ps cases
Currently recover_vertical() assumes that every fs block can be mapped
by one page, this is blocking bs > ps support for raid56.

Prepare recover_vertical() to support bs > ps cases by:

- Introduce recover_vertical_step() helper
  Which will recover a full step (min(PAGE_SIZE, sectorsize)).

  Now recover_vertical() will do the error check for the specified
  sector, do the recover step by step, then do the sector verification.

- Fix a spelling error of get_rbio_vertical_errors()
  The old name has a typo: "veritical".

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:47:11 +01:00
Qu Wenruo
826325b6d0 btrfs: raid56: prepare generate_pq_vertical() for bs > ps cases
Unlike btrfs_calculate_block_csum_pages(), we cannot handle multiple
pages at the same time for P/Q generation.

So here we introduce a new @step_nr, and various helpers to grab the
sub-block page from the rbio, and generate the P/Q stripe page by page.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:47:05 +01:00
Qu Wenruo
91cd1b5865 btrfs: raid56: introduce a new parameter to locate a sector
Since we cannot ensure that all bios from the higher layer are backed by
large folios (e.g. direct IO, encoded read/write/send), we need the
ability to locate sub-block (aka, a page) inside a full stripe.

So the existing @stripe_nr + @sector_nr combination is not enough to
locate such page for bs > ps cases.

Introduce a new parameter, @step_nr, to locate the page of a larger fs
block.  The naming is following the conventions used inside btrfs
elsewhere, where one step is min(sectorsize, PAGE_SIZE).

It's still a preparation, only touching the following aspects:

- btrfs_dump_rbio()
  To show the new @sector_nsteps member.

- btrfs_raid_bio::sector_nsteps
  Recording how many steps there are inside a fs block.

- Enlarge btrfs_raid_bio::*_paddrs[] size
  To take @sector_nsteps into consideration.

- index_one_bio()
- index_stripe_sectors()
- memcpy_from_bio_to_stripe()
- cache_rbio_pages()
- need_read_stripe_sectors()
  Those functions are iterating *_paddrs[], which needs to take
  sector_nsteps into consideration.

- Rename rbio_stripe_sector_index() to rbio_sector_index()
  The "stripe" part is not that helpful.

  And an extra ASSERT() before returning the result.

- Add a new rbio_paddr_index() helper
  This will take the extra @step_nr into consideration.

- The comments of btrfs_raid_bio

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:46:58 +01:00
Qu Wenruo
5387bd9581 btrfs: raid56: remove sector_ptr structure
Since sector_ptr structure is now only containing a single paddr, there
is no need to use that structure.

Instead use phys_addr_t array for bio and stripe pointers.

This means several helpers are also needed to accept a paddr instead of
a sector_ptr pointer.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:23 +01:00
Qu Wenruo
1810350b04 btrfs: raid56: move sector_ptr::uptodate into a dedicated bitmap
The uptodate boolean member can be extracted into a bitmap, which will
save us some space (1 bit in a byte vs 8 bits in a byte).

Furthermore we do not need to record the uptodate bitmap for bio
sectors, as if bio_sectors[].paddr is valid it means there is a bio and
will be uptodate.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:22 +01:00
Qu Wenruo
17d552ab9b btrfs: raid56: remove sector_ptr::has_paddr member
We can use paddr -1 as an indicator for unset/uninitialized paddr.

We can not use 0 paddr, unlike virtual address 0 which is never mapped
thus will always trigger a page fault, physical address 0 may be a valid
page.

So here we follow swiotlb to use (paddr)-1 as a special indicator for
invalid/unset physical address.

Even if the PFN may still be valid, our usage of the physical address
should always be aligned to fs block size (or page size for bs > ps
cases), thus such -1 paddr should never be a valid one.

With this special -1 paddr, we can get rid of has_paddr member and save
1 byte for sector_ptr structure.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:22 +01:00
Gladyshev Ilya
1dac8db80c btrfs: don't generate any code from ASSERT() in release builds
The current definition of ASSERT(cond) as (void)(cond) is redundant,
since these checks have no side effects and don't affect code logic.

However, some checks contain READ_ONCE() or other compiler-unfriendly
constructs. For example, ASSERT(list_empty) in btrfs_add_dealloc_inode()
was compiled to a redundant mov instruction due to this issue.

Define ASSERT as BUILD_BUG_ON_INVALID for !CONFIG_BTRFS_ASSERT builds
which uses sizeof(cond) trick.  Also mark full_page_sectors_uptodate()
as __maybe_unused to suppress "unneeded declaration" warning (it's
needed in compile time)

Signed-off-by: Gladyshev Ilya <foxido@foxido.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:22 +01:00
David Sterba
cc53bd2085 btrfs: add unlikely annotations to branches leading to EIO
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EIO is one of them.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:26 +02:00
Qu Wenruo
5fbaae4b85 btrfs: prepare scrub to support bs > ps cases
This involves:

- Migrate scrub_stripe::pages[] to folios[]

- Use btrfs_alloc_folio_array() and folio_put() to alloc above array.

- Migrate scrub_stripe_get_kaddr() and scrub_stripe_get_paddr() to use
  folio interfaces

- Migrate raid56_parity_cache_data_pages() to
  raid56_parity_cache_data_folios()
  Since scrub is the only caller still using pages.

  This helper will copy the folio array contents into rbio::stripe_pages,
  with sector uptodate flags updated.

  And a new ASSERT() to make sure bs > ps cases will not hit this path.

Since most scrub code is based on kaddr/paddr, the migration itself is
pretty straightforward.

And since we're here, also move the loop to set the
stripe_sectors[].uptodate out of the copy loop.
As we always mark all the sectors as uptodate for the data stripe, it's
easier to do in one go, other than doing it inside the copy loop.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:25 +02:00
Qu Wenruo
7425a28940 btrfs: introduce btrfs_bio_for_each_block_all() helper
Currently if we want to iterate all blocks inside a bio, we do something
like this:

	bio_for_each_segment_all(bvec, bio, iter_all) {
		for (off = 0; off < bvec->bv_len; off += sectorsize) {
			/* Iterate blocks using bv + off */
		}
	}

That's fine for now, but it will not handle future bs > ps, as
bio_for_each_segment_all() is a single-page iterator, it will always
return a bvec that's no larger than a page.

But for bs > ps cases, we need a full folio (which covers at least one
block) so that we can work on the block.

To address this problem and handle future bs > ps cases better:

- Introduce a helper btrfs_bio_for_each_block_all()
  This helper will create a local bvec_iter, which has the size of the
  target bio. Then grab the current physical address of the current
  location, then advance the iterator by block size.

- Use btrfs_bio_for_each_block_all() to replace existing call sites
  Including:

  * set_bio_pages_uptodate() in raid56
  * verify_bio_data_sectors() in raid56

  Both will result much easier to read code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:17 +02:00
Qu Wenruo
9afc617265 btrfs: introduce btrfs_bio_for_each_block() helper
Currently if we want to iterate a bio in block unit, we do something
like this:

	while (iter->bi_size) {
		struct bio_vec bv = bio_iter_iovec();

		/* Do something with using the bv */

		bio_advance_iter_single(&bbio->bio, iter, sectorsize);
	}

That's fine for now, but it will not handle future bs > ps, as
bio_iter_iovec() returns a single-page bvec, meaning the bv_len will not
exceed page size.

This means the code using that bv can only handle a block if bs <= ps.

To address this problem and handle future bs > ps cases better:

- Introduce a helper btrfs_bio_for_each_block()
  Instead of bio_vec, which has single and multiple page version and
  multiple page version has quite some limits, use my favorite way to
  represent a block, phys_addr_t.

  For bs <= ps cases, nothing is changed, except we will do a very
  small overhead to convert phys_addr_t to a folio, then use the proper
  folio helpers to handle the possible highmem cases.

  For bs > ps cases, all blocks will be backed by large folios, meaning
  every folio will cover at least one block. And still use proper folio
  helpers to handle highmem cases.

  With phys_addr_t, we will handle both large folio and highmem
  properly. So there is no better single variable to present a btrfs
  block than phys_addr_t.

- Extract the data block csum calculation into a helper
  The new helper, btrfs_calculate_block_csum() will be utilized by
  btrfs_csum_one_bio().

- Use btrfs_bio_for_each_block() to replace existing call sites
  Including:

  * index_one_bio() from raid56.c
    Very straight-forward.

  * btrfs_check_read_bio()
    Also update repair_one_sector() to grab the folio using phys_addr_t,
    and do extra checks to make sure the folio covers at least one
    block.
    We do not need to bother bv_len at all now.

  * btrfs_csum_one_bio()
    Now we can move the highmem handling into a dedicated helper,
    calculate_block_csum(), and use btrfs_bio_for_each_block() helper.

There is one exception in btrfs_decompress_buf2page(), which is copying
decompressed data into the original bio, which is not iterating using
block size thus we don't need to bother.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:17 +02:00
Qu Wenruo
35aff706dc btrfs: concentrate highmem handling for data verification
Currently for btrfs checksum verification, we do it in the following
pattern:

	kaddr = kmap_local_*();
	ret = btrfs_check_csum_csum(kaddr);
	kunmap_local(kaddr);

It's OK for now, but it's still not following the patterns of helpers
inside linux/highmem.h, which never requires a virt memory address.

In those highmem helpers, they mostly accept a folio, some offset/length
inside the folio, and in the implementation they check if the folio
needs partial kmap, and do the handling.

Inspired by those formal highmem helpers, enhance the highmem handling
of data checksum verification by:

- Rename btrfs_check_sector_csum() to btrfs_check_block_csum()
  To follow the more common term "block" used in all other major
  filesystems.

- Pass a physical address into btrfs_check_block_csum() and
  btrfs_data_csum_ok()
  The physical address is always available even for a highmem page.
  Since it's page frame number << PAGE_SHIFT + offset in page.

  And with that physical address, we can grab the folio covering the
  page, and do extra checks to ensure it covers at least one block.

  This also allows us to do the kmap inside btrfs_check_block_csum().
  This means all the extra HIGHMEM handling will be concentrated into
  btrfs_check_block_csum(), and no callers will need to bother highmem
  by themselves.

- Properly zero out the block if csum mismatch
  Since btrfs_data_csum_ok() only got a paddr, we can not and should not
  use memzero_bvec(), which only accepts single page bvec.
  Instead use paddr to grab the folio and call folio_zero_range()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:16 +02:00
Filipe Manana
c5d12d5b62 btrfs: raid56: use list_last_entry() at cache_rbio()
Instead of using list_entry() against the list's prev entry, use
list_last_entry(), which removes the need to know the last member is
accessed through the prev list pointer and the naming makes it easier
to reason about what we are doing.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:54 +02:00
David Sterba
c779b7980c btrfs: raid56: rename parameter err to status in endio helpers
Trivial renames to unify the naming of blk_status_t variables/parameters.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
David Sterba
ae8ce87165 btrfs: drop redundant local variable in raid_wait_write_end_io()
The bio status is read only once, no variable needed for that.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
05a6ec865d btrfs: use unsigned types for constants defined as bit shifts
The unsigned type is a recommended practice (CWE-190, CWE-194) for bit
shifts to avoid problems with potential unwanted sign extensions.
Although there are no such cases in btrfs codebase, follow the
recommendation.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
2d44a15afd btrfs: use list_first_entry() everywhere
Using the helper makes it a bit more clear that we're accessing the
first list entry.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:47 +02:00
Qu Wenruo
cd678925e9 btrfs: raid56: store a physical address in structure sector_ptr
Instead of using a @page + @pg_offset pair inside sector_ptr structure,
use a single physical address instead.

This allows us to grab both the page and offset from a single u64 value.
Although we still need an extra bool value, @has_paddr, to distinguish
if the sector is properly mapped (as the 0 physical address is totally
valid).

This change doesn't change the size of structure sector_ptr, but reduces
the parameters of several functions.

Note: the original idea and patch is from Christoph Hellwig
(https://lore.kernel.org/linux-btrfs/20250409111055.3640328-7-hch@lst.de/)
but the final implementation is different.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[ Use physical addresses instead to handle highmem. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:46 +02:00
Christoph Hellwig
6f3f722df7 btrfs: simplify bvec iteration in index_one_bio()
Flatten the two loops by open coding bio_for_each_segment() and advancing
the iterator one sector at a time.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Fix a bug that @offset is not increased. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:46 +02:00
Christoph Hellwig
959ddf2839 btrfs: move kmapping out of btrfs_check_sector_csum()
Move kmapping the page out of btrfs_check_sector_csum().

This allows using bvec_kmap_local() where suitable and reduces the number
of kmap*() calls in the raid56 code.

This also means btrfs_check_sector_csum() will only accept a properly
kmapped address.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:46 +02:00
Qu Wenruo
c186345a6b btrfs: make assert_rbio() to only check CONFIG_BTRFS_ASSERT
According to the description, CONFIG_BTRFS_DEBUG is only for extra
debug info, meanwhile sanity checks should be managed by
CONFIG_BTRFS_ASSERT.

There is no need to check both to enable assert_rbio().

Just remove the check for CONFIG_BTRFS_DEBUG.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:12 +01:00
Qu Wenruo
0fbf6cbd72 btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array()
There is only one caller utilizing the @extra_gfp parameter,
alloc_eb_folio_array().  And in that case the extra_gfp is only assigned
to __GFP_NOFAIL.

Rename the @extra_gfp parameter to @nofail to indicate that.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:30 +02:00
Qu Wenruo
bbbee460aa btrfs: raid56: do extra dumping for CONFIG_BTRFS_ASSERT
There are several hard-to-hit ASSERT()s hit inside raid56.
Unfortunately the ASSERT() expression is a little complex, and except
the ASSERT(), there is nothing to provide any clue.

Considering if race is involved, it's pretty hard to reproduce.
Meanwhile sometimes the dump of the rbio structure can provide some
pretty good clues, it's worth to do the extra multi-line dump for
btrfs raid56 related code.

The dump looks like this:

  BTRFS critical (device dm-3): bioc logical=4598530048 full_stripe=4598530048 size=0 map_type=0x81 mirror=0 replace_nr_stripes=0 replace_stripe_src=-1 num_stripes=5
  BTRFS critical (device dm-3):     nr=0 devid=1 physical=1166147584
  BTRFS critical (device dm-3):     nr=1 devid=2 physical=1145176064
  BTRFS critical (device dm-3):     nr=2 devid=4 physical=1145176064
  BTRFS critical (device dm-3):     nr=3 devid=5 physical=1145176064
  BTRFS critical (device dm-3):     nr=4 devid=3 physical=1145176064
  BTRFS critical (device dm-3): rbio flags=0x0 nr_sectors=80 nr_data=4 real_stripes=5 stripe_nsectors=16 scrubp=0 dbitmap=0x0
  BTRFS critical (device dm-3): logical=4598530048
  assertion failed: orig_logical >= full_stripe_start && orig_logical + orig_len <= full_stripe_start + rbio->nr_data * BTRFS_STRIPE_LEN, in fs/btrfs/raid56.c:1702

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:16 +02:00
Christoph Hellwig
fa1af65bf8 btrfs use bio_list_merge_init
Use bio_list_merge_init instead of open coding bio_list_merge and
bio_list_init.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240328084147.2954434-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-01 11:53:37 -06:00
Qu Wenruo
b2324e08b8 btrfs: raid56: extra debugging for raid6 syndrome generation
[BUG]
I have got at least two crash report for RAID6 syndrome generation, no
matter if it's AVX2 or SSE2, they all seems to have a similar
calltrace with corrupted RAX:

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP PTI
  Workqueue: btrfs-rmw rmw_rbio_work [btrfs]
  RIP: 0010:raid6_sse21_gen_syndrome+0x9e/0x130 [raid6_pq]
  RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffffa0ff4cfa3248
  RDX: 0000000000000000 RSI: ffffa0f74cfa3238 RDI: 0000000000000000
  Call Trace:
   <TASK>
   rmw_rbio+0x5c8/0xa80 [btrfs]
   process_one_work+0x1c7/0x3d0
   worker_thread+0x4d/0x380
   kthread+0xf3/0x120
   ret_from_fork+0x2c/0x50
   </TASK>

[CAUSE]
The cause is not known.  Recently I also hit this in AVX512 path, and
that's even in v5.15 backport, which doesn't have any of my RAID56
rework.

Furthermore according to the registers:

  RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffffa0ff4cfa3248

The RAX register is showing the number of stripes (including PQ), which
is not correct (0).  But the remaining two registers are all sane.

- RBX is the sectorsize
  For x86_64 it should always be 4K and matches the output.

- RCX is the pointers array
  Which is from rbio->finish_pointers, and it looks like a sane
  kernel address.

[WORKAROUND]
For now, I can only add extra debug ASSERT()s before we call raid6
gen_syndrome() helper and hopes to catch the problem.

The debug requires both CONFIG_BTRFS_DEBUG and CONFIG_BTRFS_ASSERT
enabled.

My current guess is some use-after-free, but every report is only having
corrupted RAX but seemingly valid pointers doesn't make much sense.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:52 +01:00
David Sterba
2b712e3bb2 btrfs: remove unused included headers
With help of neovim, LSP and clangd we can identify header files that
are not actually needed to be included in the .c files. This is focused
only on removal (with minor fixups), further cleanups are possible but
will require doing the header files properly with forward declarations,
minimized includes and include-what-you-use care.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:46 +01:00
Qu Wenruo
09e6cef19c btrfs: refactor alloc_extent_buffer() to allocate-then-attach method
Currently alloc_extent_buffer() utilizes find_or_create_page() to
allocate one page a time for an extent buffer.

This method has the following disadvantages:

- find_or_create_page() is the legacy way of allocating new pages
  With the new folio infrastructure, find_or_create_page() is just
  redirected to filemap_get_folio().

- Lacks the way to support higher order (order >= 1) folios
  As we can not yet let filemap give us a higher order folio.

This patch would change the workflow by the following way:

		Old		   |		new
-----------------------------------+-------------------------------------
                                   | ret = btrfs_alloc_page_array();
for (i = 0; i < num_pages; i++) {  | for (i = 0; i < num_pages; i++) {
    p = find_or_create_page();     |     ret = filemap_add_folio();
    /* Attach page private */      |     /* Reuse page cache if needed */
    /* Reused eb if needed */      |
				   |     /* Attach page private and
				   |        reuse eb if needed */
				   | }

By this we split the page allocation and private attaching into two
parts, allowing future updates to each part more easily, and migrate to
folio interfaces (especially for possible higher order folios).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 23:01:04 +01:00
David Sterba
a0df0a2680 btrfs: raid56: remove unused btrfs_plug_cb::work
The raid56 changes in 6.2 reworked the IO path to RMW, commit
93723095b5 ("btrfs: raid56: switch write path to rmw_rbio()") in
particular removed the last use of the work member so it can be removed
as well. This was found by tool https://github.com/jirislaby/clang-struct .

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
Qu Wenruo
3c771c1944 btrfs: scrub: avoid unnecessary csum tree search preparing stripes
One of the bottleneck of the new scrub code is the extra csum tree
search.

The old code would only do the csum tree search for each scrub bio,
which can be as large as 512KiB, thus they can afford to allocate a new
path each time.

But the new scrub code is doing csum tree search for each stripe, which
is only 64KiB, this means we'd better re-use the same csum path during
each search.

This patch would introduce a per-sctx path for csum tree search, as we
don't need to re-allocate the path every time we need to do a csum tree
search.

With this change we can further improve the queue depth and improve the
scrub read performance:

Before (with regression and cached extent tree path):

 Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
 nvme0n1p3 15875.00 1013328.00    12.00   0.08    0.08    63.83   1.35 100.00

After (with both cached extent/csum tree path):

 nvme0n1p3 17759.00 1133280.00    10.00   0.06    0.08    63.81   1.50 100.00

Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:54:48 +02:00
Qu Wenruo
dbb6ecb328 btrfs: tracepoints: simplify raid56 events
After commit 6bfd0133be ("btrfs: raid56: switch scrub path to use a
single function"), the raid56 implementation no longer uses different
endio functions for RMW/recover/scrub.

All read operations end in submit_read_wait_bio_list(), while all write
operations end in submit_write_bios().  This means quite some trace
events are out-of-date and no longer utilized.

This patch would unify the trace events into just two:

- trace_raid56_read()
  Replaces trace_raid56_read_partial(), trace_raid56_scrub_read() and
  trace_raid56_scrub_read_recover().

- trace_raid56_write()
  Replaces trace_raid56_write_stripe() and
  trace_raid56_scrub_write_stripe().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:12 +02:00
Qu Wenruo
3a3c7a7f65 btrfs: raid56: remove unused BTRFS_RBIO_REBUILD_MISSING
Commit aca43fe839 ("btrfs: remove unused raid56 functions which were
dedicated for scrub") removed the special handling of RAID56 scrub for
missing device.

As scrub goes full mirror_num based recovery, that means if it hits a
missing device in RAID56, it would just try the next mirror, which would
go through the BTRFS_RBIO_READ_REBUILD operation.

This means there is no longer any use of BTRFS_RBIO_REBUILD_MISSING
operation and we can safely remove it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:12 +02:00
Qu Wenruo
486c737f7f btrfs: raid56: always verify the P/Q contents for scrub
[REGRESSION]
Commit 75b4703329 ("btrfs: raid56: migrate recovery and scrub recovery
path to use error_bitmap") changed the behavior of scrub_rbio().

Initially if we have no error reading the raid bio, we will assign
@need_check to true, then finish_parity_scrub() would later verify the
content of P/Q stripes before writeback.

But after that commit we never verify the content of P/Q stripes and
just writeback them.

This can lead to unrepaired P/Q stripes during scrub, or already
corrupted P/Q copied to the dev-replace target.

[FIX]
The situation is more complex than the regression, in fact the initial
behavior is not 100% correct either.

If we have the following rare case, it can still lead to the same
problem using the old behavior:

		0	16K	32K	48K	64K
	Data 1:	|IIIIIII|                       |
	Data 2:	|				|
	Parity:	|	|CCCCCCC|		|

Where "I" means IO error, "C" means corruption.

In the above case, we're scrubbing the parity stripe, then read out all
the contents of Data 1, Data 2, Parity stripes.

But found IO error in Data 1, which leads to rebuild using Data 2 and
Parity and got the correct data.

In that case, we would not verify if the Parity is correct for range
[16K, 32K).

So here we have to always verify the content of Parity no matter if we
did recovery or not.

This patch would remove the @need_check parameter of
finish_parity_scrub() completely, and would always do the P/Q
verification before writeback.

Fixes: 75b4703329 ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap")
CC: stable@vger.kernel.org # 6.2+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18 03:13:15 +02:00
Qu Wenruo
94ead93e63 btrfs: scrub: use recovered data stripes as cache to avoid unnecessary read
For P/Q stripe scrub, we have quite some duplicated read IO:

- Data stripes read for verification
  This is triggered by the scrub_submit_initial_read() inside
  scrub_raid56_parity_stripe().

- Data stripes read (again) for P/Q stripe verification
  This is triggered by scrub_assemble_read_bios() from scrub_rbio().

  Although we can have hit rbio cache and avoid unnecessary read, the
  chance is very low, as scrub would easily flush the whole rbio cache.

This means, even we're just scrubbing a single P/Q stripe, we would read
the data stripes twice for the best case scenario.  If we need to
recover some data stripes, it would cause more reads on the same data
stripes, again and again.

However before we call raid56_parity_submit_scrub_rbio() we already
have all data stripes repaired and their contents ready to use.
But RAID56 cache is unaware about the scrub cache, thus RAID56 layer
itself still needs to re-read the data stripes.

To avoid such cache miss, this patch would:

- Introduce a new helper, raid56_parity_cache_data_pages()
  This function would grab the pages from an array, and copy the content
  to the rbio, marking all the involved sectors uptodate.

  The page copy is unavoidable because of the cache pages of rbio are all
  self managed, thus can not utilize outside pages without screwing up
  the lifespan.

- Use the repaired data stripes as cache inside
  scrub_raid56_parity_stripe()

By this, we ensure all the data sectors of the scrub rbio are already
uptodate, and no need to read them again from disk.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Anand Jain
adbe7e388e btrfs: use SECTOR_SHIFT to convert LBA to physical offset
Using SECTOR_SHIFT to convert LBA to physical address makes it more
readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Anand Jain
29e70be261 btrfs: use SECTOR_SHIFT to convert physical offset to LBA
Use SECTOR_SHIFT while converting a physical address to an LBA, makes
it more readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
aca43fe839 btrfs: remove unused raid56 functions which were dedicated for scrub
Since the scrub rework, the following RAID56 functions are no longer
called:

- raid56_add_scrub_pages()
- raid56_alloc_missing_rbio()
- raid56_submit_missing_rbio()

Those functions are all utilized by scrub to handle missing device cases
for RAID56.

However the new scrub code handle them in a completely different way:

- If it's data stripe, go recovery path through btrfs_submit_bio()
- If it's P/Q stripe, it would be handled through
  raid56_parity_submit_scrub_rbio()
  And that function would handle dev-replace and repair properly.

Thus we can safely remove those functions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:18 +02:00
Qu Wenruo
b979547513 btrfs: scrub: introduce helper to find and fill sector info for a scrub_stripe
The new helper will search the extent tree to find the first extent of a
logical range, then fill the sectors array by two loops:

- Loop 1 to fill common bits and metadata generation

- Loop 2 to fill csum data (only for data bgs)
  This loop will use the new btrfs_lookup_csums_bitmap() to fill
  the full csum buffer, and set scrub_sector_verification::csum.

With all the needed info filled by this function, later we only need to
submit and verify the stripe.

Here we temporarily export the helper to avoid warning on unused static
function.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Johannes Thumshirn
cf32e41fa5 btrfs: use __bio_add_page to add single a page in rbio_add_io_sector
The btrfs raid56 sector submission code uses bio_add_page() to add a
page to a newly created bio. bio_add_page() can fail, but the return
value is never checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:20 +02:00
Qu Wenruo
18d758a2d8 btrfs: replace btrfs_io_context::raid_map with a fixed u64 value
In btrfs_io_context structure, we have a pointer raid_map, which
indicates the logical bytenr for each stripe.

But considering we always call sort_parity_stripes(), the result
raid_map[] is always sorted, thus raid_map[0] is always the logical
bytenr of the full stripe.

So why we waste the space and time (for sorting) for raid_map?

This patch will replace btrfs_io_context::raid_map with a single u64
number, full_stripe_start, by:

- Replace btrfs_io_context::raid_map with full_stripe_start

- Replace call sites using raid_map[0] to use full_stripe_start

- Replace call sites using raid_map[i] to compare with nr_data_stripes.

The benefits are:

- Less memory wasted on raid_map
  It's sizeof(u64) * num_stripes vs sizeof(u64).
  It'll always save at least one u64, and the benefit grows larger with
  num_stripes.

- No more weird alloc_btrfs_io_context() behavior
  As there is only one fixed size + one variable length array.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:14 +02:00