linux

mirror of https://github.com/torvalds/linux.git synced 2026-03-08 04:04:43 +01:00

Author	SHA1	Message	Date
Linus Torvalds	323bbfcf1e	Convert 'alloc_flex' family to use the new default GFP_KERNEL argument This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Filipe Manana	29fb415a6a	btrfs: raid56: fix memory leak of btrfs_raid_bio::stripe_uptodate_bitmap We allocate the bitmap but we never free it in free_raid_bio_pointers(). Fix this by adding a bitmap_free() call against the stripe_uptodate_bitmap of a raid bio. Fixes: `1810350b04` ("btrfs: raid56: move sector_ptr::uptodate into a dedicated bitmap") Reported-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-btrfs/20260126045315.GA31641@lst.de/ Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 06:37:29 +01:00
Qu Wenruo	1a332a6d70	btrfs: raid56: remove the "_step" infix The following functions are introduced as a middle step for bs > ps support: - rbio_streip_step_paddr() - rbio_pstripe_step_paddr() - rbio_qstripe_step_paddr() - sector_step_paddr_in_rbio() As there is already an existing function without the infix, and has a different parameter list. But the existing functions have been cleaned up, there is no need to keep the "_step" infix, just remove it completely. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:50:19 +01:00
Qu Wenruo	8870dbeedc	btrfs: raid56: enable bs > ps support The support code for bs > ps is complete, enable it and update assertions. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:48:52 +01:00
Qu Wenruo	89ca1a403e	btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases The function finish_parity_scrub() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Introduce a helper, verify_one_parity_step() Since the P/Q generation is always done in a vertical stripe, we have to handle the range step by step. - Only clear the rbio->dbitmap if all steps of an fs block match - Remove rbio_stripe_paddr() and sector_paddr_in_rbio() helpers Now we either use the paddrs version for checksum, or the step version for P/Q generation/recovery. - Make alloc_rbio_essential_pages() to handle bs > ps cases Since for bs > ps cases, one fs block needs multiple pages, the existing simple check against rbio->stripe_pages[] is not enough. Extract a dedicated helper, alloc_rbio_sector_pages(), for the existing alloc_rbio_essential_pages(), which is still based on sector number. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:48:02 +01:00
Qu Wenruo	ba88278c69	btrfs: raid56: prepare rbio_bio_add_io_paddr() to support bs > ps cases The function rbio_bio_add_io_paddr() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Introduce a helper bio_add_paddrs() Previously we only need to add a single page to a bio for a fs block, but now we need to add multiple pages, this means we can fail halfway. In that case we need to properly revert the bio (only for its size though) for halfway failed cases. - Rename rbio_add_io_paddr() to rbio_add_io_paddrs() And change the @paddr parameter to @paddrs[]. - Change all callers to use the updated rbio_add_io_paddrs() For the @paddrs pointer used for the new function, it can be grabbed using sector_paddrs_in_rbio() and rbio_stripe_paddrs() helpers. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:54 +01:00
Qu Wenruo	53474a2ae1	btrfs: raid56: prepare steal_rbio() to support bs > ps cases The function steal_rbio() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Introduce two helpers to calculate the sector number Previously we assume one page will contain at least one fs block, thus can use something like "sectors_per_page = PAGE_SIZE / sectorsize;", but with bs > ps support that above number will be 0. Instead introduce two helpers: * page_nr_to_sector_nr() Returns the sector number of the first sector covered by the page. * page_nr_to_num_sectors() Return how many sectors are covered by the page. And use the returned values for bitmap operations other than open-coded "PAGE_SIZE / sectorsize". Those helpers also have extra ASSERT()s to catch weird numbers. - Use above helpers The involved functions are: * steal_rbio_page() * is_data_stripe_page() * full_page_sectors_uptodate() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:41 +01:00
Qu Wenruo	05ddf35a5d	btrfs: raid56: prepare set_bio_pages_uptodate() to support bs > ps cases The function set_bio_pages_uptodate() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Update find_stripe_sector_nr() to check only the first step paddr We don't need to check each paddr, as the bios are still aligned to fs block size, thus checking the first step is enough. - Use step size to iterate the bio This means we only need to find the sector number for the first step of each fs block, and skip the remaining part. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:31 +01:00
Qu Wenruo	64e7b8c7c5	btrfs: raid56: prepare verify_bio_data_sectors() to support bs > ps cases The function verify_bio_data_sectors() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Make get_bio_sector_nr() to consider bs > ps cases The function is utilized to calculate the sector number of a device bio submitted by btrfs raid56 layer. - Assemble a local paddrs[] for checksum calculation - Open code btrfs_check_block_csum() btrfs_check_block_csum() only supports fs blocks backed by large folios. But for raid56 we can have fs blocks backed by multiple non-contiguous pages, e.g. direct IO, encoded read/write/send. So instead of using btrfs_check_block_csum(), open code it to use btrfs_calculate_block_csum_pages(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:20 +01:00
Qu Wenruo	e0eadfcc95	btrfs: raid56: prepare verify_one_sector() to support bs > ps cases The function verify_one_sector() assume each fs block can be mapped by one page, blocking bs > ps support for raid56. Prepare it for bs > ps cases by: - Introduce helpers to get a paddrs pointer Thankfully all the higher layer bio should still be aligned to fs block size, thus a fs block should still be fully covered by the bio. Introduce sector_paddrs_in_rbio() and rbio_stripe_paddrs(), which will return a paddrs pointer inside btrfs_raid_bio::bio_paddrs[] or stripe_paddrs[]. The pointer can be directly passed to btrfs_calculate_block_csum_pages() to verify the checksum. - Open code btrfs_check_block_csum() btrfs_check_block_csum() only supports fs blocks backed by large folios. But for raid56 we can have fs blocks backed by multiple non-contiguous pages, e.g. direct IO, encoded read/write/send. So instead of using btrfs_check_block_csum(), open code it to use btrfs_calculate_block_csum_pages(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:16 +01:00
Qu Wenruo	9ba67fd616	btrfs: raid56: prepare recover_vertical() to support bs > ps cases Currently recover_vertical() assumes that every fs block can be mapped by one page, this is blocking bs > ps support for raid56. Prepare recover_vertical() to support bs > ps cases by: - Introduce recover_vertical_step() helper Which will recover a full step (min(PAGE_SIZE, sectorsize)). Now recover_vertical() will do the error check for the specified sector, do the recover step by step, then do the sector verification. - Fix a spelling error of get_rbio_vertical_errors() The old name has a typo: "veritical". Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:11 +01:00
Qu Wenruo	826325b6d0	btrfs: raid56: prepare generate_pq_vertical() for bs > ps cases Unlike btrfs_calculate_block_csum_pages(), we cannot handle multiple pages at the same time for P/Q generation. So here we introduce a new @step_nr, and various helpers to grab the sub-block page from the rbio, and generate the P/Q stripe page by page. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:47:05 +01:00
Qu Wenruo	91cd1b5865	btrfs: raid56: introduce a new parameter to locate a sector Since we cannot ensure that all bios from the higher layer are backed by large folios (e.g. direct IO, encoded read/write/send), we need the ability to locate sub-block (aka, a page) inside a full stripe. So the existing @stripe_nr + @sector_nr combination is not enough to locate such page for bs > ps cases. Introduce a new parameter, @step_nr, to locate the page of a larger fs block. The naming is following the conventions used inside btrfs elsewhere, where one step is min(sectorsize, PAGE_SIZE). It's still a preparation, only touching the following aspects: - btrfs_dump_rbio() To show the new @sector_nsteps member. - btrfs_raid_bio::sector_nsteps Recording how many steps there are inside a fs block. - Enlarge btrfs_raid_bio::_paddrs[] size To take @sector_nsteps into consideration. - index_one_bio() - index_stripe_sectors() - memcpy_from_bio_to_stripe() - cache_rbio_pages() - need_read_stripe_sectors() Those functions are iterating _paddrs[], which needs to take sector_nsteps into consideration. - Rename rbio_stripe_sector_index() to rbio_sector_index() The "stripe" part is not that helpful. And an extra ASSERT() before returning the result. - Add a new rbio_paddr_index() helper This will take the extra @step_nr into consideration. - The comments of btrfs_raid_bio Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:46:58 +01:00
Qu Wenruo	5387bd9581	btrfs: raid56: remove sector_ptr structure Since sector_ptr structure is now only containing a single paddr, there is no need to use that structure. Instead use phys_addr_t array for bio and stripe pointers. This means several helpers are also needed to accept a paddr instead of a sector_ptr pointer. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:42:23 +01:00
Qu Wenruo	1810350b04	btrfs: raid56: move sector_ptr::uptodate into a dedicated bitmap The uptodate boolean member can be extracted into a bitmap, which will save us some space (1 bit in a byte vs 8 bits in a byte). Furthermore we do not need to record the uptodate bitmap for bio sectors, as if bio_sectors[].paddr is valid it means there is a bio and will be uptodate. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:42:22 +01:00
Qu Wenruo	17d552ab9b	btrfs: raid56: remove sector_ptr::has_paddr member We can use paddr -1 as an indicator for unset/uninitialized paddr. We can not use 0 paddr, unlike virtual address 0 which is never mapped thus will always trigger a page fault, physical address 0 may be a valid page. So here we follow swiotlb to use (paddr)-1 as a special indicator for invalid/unset physical address. Even if the PFN may still be valid, our usage of the physical address should always be aligned to fs block size (or page size for bs > ps cases), thus such -1 paddr should never be a valid one. With this special -1 paddr, we can get rid of has_paddr member and save 1 byte for sector_ptr structure. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:42:22 +01:00
Gladyshev Ilya	1dac8db80c	btrfs: don't generate any code from ASSERT() in release builds The current definition of ASSERT(cond) as (void)(cond) is redundant, since these checks have no side effects and don't affect code logic. However, some checks contain READ_ONCE() or other compiler-unfriendly constructs. For example, ASSERT(list_empty) in btrfs_add_dealloc_inode() was compiled to a redundant mov instruction due to this issue. Define ASSERT as BUILD_BUG_ON_INVALID for !CONFIG_BTRFS_ASSERT builds which uses sizeof(cond) trick. Also mark full_page_sectors_uptodate() as __maybe_unused to suppress "unneeded declaration" warning (it's needed in compile time) Signed-off-by: Gladyshev Ilya <foxido@foxido.dev> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:42:22 +01:00
David Sterba	cc53bd2085	btrfs: add unlikely annotations to branches leading to EIO The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EIO is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:26 +02:00
Qu Wenruo	5fbaae4b85	btrfs: prepare scrub to support bs > ps cases This involves: - Migrate scrub_stripe::pages[] to folios[] - Use btrfs_alloc_folio_array() and folio_put() to alloc above array. - Migrate scrub_stripe_get_kaddr() and scrub_stripe_get_paddr() to use folio interfaces - Migrate raid56_parity_cache_data_pages() to raid56_parity_cache_data_folios() Since scrub is the only caller still using pages. This helper will copy the folio array contents into rbio::stripe_pages, with sector uptodate flags updated. And a new ASSERT() to make sure bs > ps cases will not hit this path. Since most scrub code is based on kaddr/paddr, the migration itself is pretty straightforward. And since we're here, also move the loop to set the stripe_sectors[].uptodate out of the copy loop. As we always mark all the sectors as uptodate for the data stripe, it's easier to do in one go, other than doing it inside the copy loop. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:25 +02:00
Qu Wenruo	7425a28940	btrfs: introduce btrfs_bio_for_each_block_all() helper Currently if we want to iterate all blocks inside a bio, we do something like this: bio_for_each_segment_all(bvec, bio, iter_all) { for (off = 0; off < bvec->bv_len; off += sectorsize) { /* Iterate blocks using bv + off / } } That's fine for now, but it will not handle future bs > ps, as bio_for_each_segment_all() is a single-page iterator, it will always return a bvec that's no larger than a page. But for bs > ps cases, we need a full folio (which covers at least one block) so that we can work on the block. To address this problem and handle future bs > ps cases better: - Introduce a helper btrfs_bio_for_each_block_all() This helper will create a local bvec_iter, which has the size of the target bio. Then grab the current physical address of the current location, then advance the iterator by block size. - Use btrfs_bio_for_each_block_all() to replace existing call sites Including: set_bio_pages_uptodate() in raid56 * verify_bio_data_sectors() in raid56 Both will result much easier to read code. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:17 +02:00
Qu Wenruo	9afc617265	btrfs: introduce btrfs_bio_for_each_block() helper Currently if we want to iterate a bio in block unit, we do something like this: while (iter->bi_size) { struct bio_vec bv = bio_iter_iovec(); /* Do something with using the bv / bio_advance_iter_single(&bbio->bio, iter, sectorsize); } That's fine for now, but it will not handle future bs > ps, as bio_iter_iovec() returns a single-page bvec, meaning the bv_len will not exceed page size. This means the code using that bv can only handle a block if bs <= ps. To address this problem and handle future bs > ps cases better: - Introduce a helper btrfs_bio_for_each_block() Instead of bio_vec, which has single and multiple page version and multiple page version has quite some limits, use my favorite way to represent a block, phys_addr_t. For bs <= ps cases, nothing is changed, except we will do a very small overhead to convert phys_addr_t to a folio, then use the proper folio helpers to handle the possible highmem cases. For bs > ps cases, all blocks will be backed by large folios, meaning every folio will cover at least one block. And still use proper folio helpers to handle highmem cases. With phys_addr_t, we will handle both large folio and highmem properly. So there is no better single variable to present a btrfs block than phys_addr_t. - Extract the data block csum calculation into a helper The new helper, btrfs_calculate_block_csum() will be utilized by btrfs_csum_one_bio(). - Use btrfs_bio_for_each_block() to replace existing call sites Including: index_one_bio() from raid56.c Very straight-forward. * btrfs_check_read_bio() Also update repair_one_sector() to grab the folio using phys_addr_t, and do extra checks to make sure the folio covers at least one block. We do not need to bother bv_len at all now. * btrfs_csum_one_bio() Now we can move the highmem handling into a dedicated helper, calculate_block_csum(), and use btrfs_bio_for_each_block() helper. There is one exception in btrfs_decompress_buf2page(), which is copying decompressed data into the original bio, which is not iterating using block size thus we don't need to bother. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:17 +02:00
Qu Wenruo	35aff706dc	btrfs: concentrate highmem handling for data verification Currently for btrfs checksum verification, we do it in the following pattern: kaddr = kmap_local_*(); ret = btrfs_check_csum_csum(kaddr); kunmap_local(kaddr); It's OK for now, but it's still not following the patterns of helpers inside linux/highmem.h, which never requires a virt memory address. In those highmem helpers, they mostly accept a folio, some offset/length inside the folio, and in the implementation they check if the folio needs partial kmap, and do the handling. Inspired by those formal highmem helpers, enhance the highmem handling of data checksum verification by: - Rename btrfs_check_sector_csum() to btrfs_check_block_csum() To follow the more common term "block" used in all other major filesystems. - Pass a physical address into btrfs_check_block_csum() and btrfs_data_csum_ok() The physical address is always available even for a highmem page. Since it's page frame number << PAGE_SHIFT + offset in page. And with that physical address, we can grab the folio covering the page, and do extra checks to ensure it covers at least one block. This also allows us to do the kmap inside btrfs_check_block_csum(). This means all the extra HIGHMEM handling will be concentrated into btrfs_check_block_csum(), and no callers will need to bother highmem by themselves. - Properly zero out the block if csum mismatch Since btrfs_data_csum_ok() only got a paddr, we can not and should not use memzero_bvec(), which only accepts single page bvec. Instead use paddr to grab the folio and call folio_zero_range() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:16 +02:00
Filipe Manana	c5d12d5b62	btrfs: raid56: use list_last_entry() at cache_rbio() Instead of using list_entry() against the list's prev entry, use list_last_entry(), which removes the need to know the last member is accessed through the prev list pointer and the naming makes it easier to reason about what we are doing. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:54 +02:00
David Sterba	c779b7980c	btrfs: raid56: rename parameter err to status in endio helpers Trivial renames to unify the naming of blk_status_t variables/parameters. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:49 +02:00
David Sterba	ae8ce87165	btrfs: drop redundant local variable in raid_wait_write_end_io() The bio status is read only once, no variable needed for that. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
David Sterba	05a6ec865d	btrfs: use unsigned types for constants defined as bit shifts The unsigned type is a recommended practice (CWE-190, CWE-194) for bit shifts to avoid problems with potential unwanted sign extensions. Although there are no such cases in btrfs codebase, follow the recommendation. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
David Sterba	2d44a15afd	btrfs: use list_first_entry() everywhere Using the helper makes it a bit more clear that we're accessing the first list entry. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:47 +02:00
Qu Wenruo	cd678925e9	btrfs: raid56: store a physical address in structure sector_ptr Instead of using a @page + @pg_offset pair inside sector_ptr structure, use a single physical address instead. This allows us to grab both the page and offset from a single u64 value. Although we still need an extra bool value, @has_paddr, to distinguish if the sector is properly mapped (as the 0 physical address is totally valid). This change doesn't change the size of structure sector_ptr, but reduces the parameters of several functions. Note: the original idea and patch is from Christoph Hellwig (https://lore.kernel.org/linux-btrfs/20250409111055.3640328-7-hch@lst.de/) but the final implementation is different. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [ Use physical addresses instead to handle highmem. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
Christoph Hellwig	6f3f722df7	btrfs: simplify bvec iteration in index_one_bio() Flatten the two loops by open coding bio_for_each_segment() and advancing the iterator one sector at a time. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Fix a bug that @offset is not increased. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
Christoph Hellwig	959ddf2839	btrfs: move kmapping out of btrfs_check_sector_csum() Move kmapping the page out of btrfs_check_sector_csum(). This allows using bvec_kmap_local() where suitable and reduces the number of kmap*() calls in the raid56 code. This also means btrfs_check_sector_csum() will only accept a properly kmapped address. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
Qu Wenruo	c186345a6b	btrfs: make assert_rbio() to only check CONFIG_BTRFS_ASSERT According to the description, CONFIG_BTRFS_DEBUG is only for extra debug info, meanwhile sanity checks should be managed by CONFIG_BTRFS_ASSERT. There is no need to check both to enable assert_rbio(). Just remove the check for CONFIG_BTRFS_DEBUG. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-11-11 14:34:12 +01:00
Qu Wenruo	0fbf6cbd72	btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array() There is only one caller utilizing the @extra_gfp parameter, alloc_eb_folio_array(). And in that case the extra_gfp is only assigned to __GFP_NOFAIL. Rename the @extra_gfp parameter to @nofail to indicate that. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-07-11 15:33:30 +02:00
Qu Wenruo	bbbee460aa	btrfs: raid56: do extra dumping for CONFIG_BTRFS_ASSERT There are several hard-to-hit ASSERT()s hit inside raid56. Unfortunately the ASSERT() expression is a little complex, and except the ASSERT(), there is nothing to provide any clue. Considering if race is involved, it's pretty hard to reproduce. Meanwhile sometimes the dump of the rbio structure can provide some pretty good clues, it's worth to do the extra multi-line dump for btrfs raid56 related code. The dump looks like this: BTRFS critical (device dm-3): bioc logical=4598530048 full_stripe=4598530048 size=0 map_type=0x81 mirror=0 replace_nr_stripes=0 replace_stripe_src=-1 num_stripes=5 BTRFS critical (device dm-3): nr=0 devid=1 physical=1166147584 BTRFS critical (device dm-3): nr=1 devid=2 physical=1145176064 BTRFS critical (device dm-3): nr=2 devid=4 physical=1145176064 BTRFS critical (device dm-3): nr=3 devid=5 physical=1145176064 BTRFS critical (device dm-3): nr=4 devid=3 physical=1145176064 BTRFS critical (device dm-3): rbio flags=0x0 nr_sectors=80 nr_data=4 real_stripes=5 stripe_nsectors=16 scrubp=0 dbitmap=0x0 BTRFS critical (device dm-3): logical=4598530048 assertion failed: orig_logical >= full_stripe_start && orig_logical + orig_len <= full_stripe_start + rbio->nr_data * BTRFS_STRIPE_LEN, in fs/btrfs/raid56.c:1702 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-07-11 15:33:16 +02:00
Christoph Hellwig	fa1af65bf8	btrfs use bio_list_merge_init Use bio_list_merge_init instead of open coding bio_list_merge and bio_list_init. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240328084147.2954434-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-01 11:53:37 -06:00
Qu Wenruo	b2324e08b8	btrfs: raid56: extra debugging for raid6 syndrome generation [BUG] I have got at least two crash report for RAID6 syndrome generation, no matter if it's AVX2 or SSE2, they all seems to have a similar calltrace with corrupted RAX: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI Workqueue: btrfs-rmw rmw_rbio_work [btrfs] RIP: 0010:raid6_sse21_gen_syndrome+0x9e/0x130 [raid6_pq] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffffa0ff4cfa3248 RDX: 0000000000000000 RSI: ffffa0f74cfa3238 RDI: 0000000000000000 Call Trace: <TASK> rmw_rbio+0x5c8/0xa80 [btrfs] process_one_work+0x1c7/0x3d0 worker_thread+0x4d/0x380 kthread+0xf3/0x120 ret_from_fork+0x2c/0x50 </TASK> [CAUSE] The cause is not known. Recently I also hit this in AVX512 path, and that's even in v5.15 backport, which doesn't have any of my RAID56 rework. Furthermore according to the registers: RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffffa0ff4cfa3248 The RAX register is showing the number of stripes (including PQ), which is not correct (0). But the remaining two registers are all sane. - RBX is the sectorsize For x86_64 it should always be 4K and matches the output. - RCX is the pointers array Which is from rbio->finish_pointers, and it looks like a sane kernel address. [WORKAROUND] For now, I can only add extra debug ASSERT()s before we call raid6 gen_syndrome() helper and hopes to catch the problem. The debug requires both CONFIG_BTRFS_DEBUG and CONFIG_BTRFS_ASSERT enabled. My current guess is some use-after-free, but every report is only having corrupted RAX but seemingly valid pointers doesn't make much sense. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-03-04 16:24:52 +01:00
David Sterba	2b712e3bb2	btrfs: remove unused included headers With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-03-04 16:24:46 +01:00
Qu Wenruo	09e6cef19c	btrfs: refactor alloc_extent_buffer() to allocate-then-attach method Currently alloc_extent_buffer() utilizes find_or_create_page() to allocate one page a time for an extent buffer. This method has the following disadvantages: - find_or_create_page() is the legacy way of allocating new pages With the new folio infrastructure, find_or_create_page() is just redirected to filemap_get_folio(). - Lacks the way to support higher order (order >= 1) folios As we can not yet let filemap give us a higher order folio. This patch would change the workflow by the following way: Old \| new -----------------------------------+------------------------------------- \| ret = btrfs_alloc_page_array(); for (i = 0; i < num_pages; i++) { \| for (i = 0; i < num_pages; i++) { p = find_or_create_page(); \| ret = filemap_add_folio(); /* Attach page private / \| / Reuse page cache if needed / / Reused eb if needed / \| \| / Attach page private and \| reuse eb if needed */ \| } By this we split the page allocation and private attaching into two parts, allowing future updates to each part more easily, and migrate to folio interfaces (especially for possible higher order folios). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-12-15 23:01:04 +01:00
David Sterba	a0df0a2680	btrfs: raid56: remove unused btrfs_plug_cb::work The raid56 changes in 6.2 reworked the IO path to RMW, commit `93723095b5` ("btrfs: raid56: switch write path to rmw_rbio()") in particular removed the last use of the work member so it can be removed as well. This was found by tool https://github.com/jirislaby/clang-struct . Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-12-15 20:27:01 +01:00
Qu Wenruo	3c771c1944	btrfs: scrub: avoid unnecessary csum tree search preparing stripes One of the bottleneck of the new scrub code is the extra csum tree search. The old code would only do the csum tree search for each scrub bio, which can be as large as 512KiB, thus they can afford to allocate a new path each time. But the new scrub code is doing csum tree search for each stripe, which is only 64KiB, this means we'd better re-use the same csum path during each search. This patch would introduce a per-sctx path for csum tree search, as we don't need to re-allocate the path every time we need to do a csum tree search. With this change we can further improve the queue depth and improve the scrub read performance: Before (with regression and cached extent tree path): Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util nvme0n1p3 15875.00 1013328.00 12.00 0.08 0.08 63.83 1.35 100.00 After (with both cached extent/csum tree path): nvme0n1p3 17759.00 1133280.00 10.00 0.06 0.08 63.81 1.50 100.00 Fixes: `e02ee89baa` ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure") CC: stable@vger.kernel.org # 6.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:54:48 +02:00
Qu Wenruo	dbb6ecb328	btrfs: tracepoints: simplify raid56 events After commit `6bfd0133be` ("btrfs: raid56: switch scrub path to use a single function"), the raid56 implementation no longer uses different endio functions for RMW/recover/scrub. All read operations end in submit_read_wait_bio_list(), while all write operations end in submit_write_bios(). This means quite some trace events are out-of-date and no longer utilized. This patch would unify the trace events into just two: - trace_raid56_read() Replaces trace_raid56_read_partial(), trace_raid56_scrub_read() and trace_raid56_scrub_read_recover(). - trace_raid56_write() Replaces trace_raid56_write_stripe() and trace_raid56_scrub_write_stripe(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:12 +02:00
Qu Wenruo	3a3c7a7f65	btrfs: raid56: remove unused BTRFS_RBIO_REBUILD_MISSING Commit `aca43fe839` ("btrfs: remove unused raid56 functions which were dedicated for scrub") removed the special handling of RAID56 scrub for missing device. As scrub goes full mirror_num based recovery, that means if it hits a missing device in RAID56, it would just try the next mirror, which would go through the BTRFS_RBIO_READ_REBUILD operation. This means there is no longer any use of BTRFS_RBIO_REBUILD_MISSING operation and we can safely remove it. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:12 +02:00
Qu Wenruo	486c737f7f	btrfs: raid56: always verify the P/Q contents for scrub [REGRESSION] Commit `75b4703329` ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap") changed the behavior of scrub_rbio(). Initially if we have no error reading the raid bio, we will assign @need_check to true, then finish_parity_scrub() would later verify the content of P/Q stripes before writeback. But after that commit we never verify the content of P/Q stripes and just writeback them. This can lead to unrepaired P/Q stripes during scrub, or already corrupted P/Q copied to the dev-replace target. [FIX] The situation is more complex than the regression, in fact the initial behavior is not 100% correct either. If we have the following rare case, it can still lead to the same problem using the old behavior: 0 16K 32K 48K 64K Data 1: \|IIIIIII\| \| Data 2: \| \| Parity: \| \|CCCCCCC\| \| Where "I" means IO error, "C" means corruption. In the above case, we're scrubbing the parity stripe, then read out all the contents of Data 1, Data 2, Parity stripes. But found IO error in Data 1, which leads to rebuild using Data 2 and Parity and got the correct data. In that case, we would not verify if the Parity is correct for range [16K, 32K). So here we have to always verify the content of Parity no matter if we did recovery or not. This patch would remove the @need_check parameter of finish_parity_scrub() completely, and would always do the P/Q verification before writeback. Fixes: `75b4703329` ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap") CC: stable@vger.kernel.org # 6.2+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-07-18 03:13:15 +02:00
Qu Wenruo	94ead93e63	btrfs: scrub: use recovered data stripes as cache to avoid unnecessary read For P/Q stripe scrub, we have quite some duplicated read IO: - Data stripes read for verification This is triggered by the scrub_submit_initial_read() inside scrub_raid56_parity_stripe(). - Data stripes read (again) for P/Q stripe verification This is triggered by scrub_assemble_read_bios() from scrub_rbio(). Although we can have hit rbio cache and avoid unnecessary read, the chance is very low, as scrub would easily flush the whole rbio cache. This means, even we're just scrubbing a single P/Q stripe, we would read the data stripes twice for the best case scenario. If we need to recover some data stripes, it would cause more reads on the same data stripes, again and again. However before we call raid56_parity_submit_scrub_rbio() we already have all data stripes repaired and their contents ready to use. But RAID56 cache is unaware about the scrub cache, thus RAID56 layer itself still needs to re-read the data stripes. To avoid such cache miss, this patch would: - Introduce a new helper, raid56_parity_cache_data_pages() This function would grab the pages from an array, and copy the content to the rbio, marking all the involved sectors uptodate. The page copy is unavoidable because of the cache pages of rbio are all self managed, thus can not utilize outside pages without screwing up the lifespan. - Use the repaired data stripes as cache inside scrub_raid56_parity_stripe() By this, we ensure all the data sectors of the scrub rbio are already uptodate, and no need to read them again from disk. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:24 +02:00
Anand Jain	adbe7e388e	btrfs: use SECTOR_SHIFT to convert LBA to physical offset Using SECTOR_SHIFT to convert LBA to physical address makes it more readable. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:23 +02:00
Anand Jain	29e70be261	btrfs: use SECTOR_SHIFT to convert physical offset to LBA Use SECTOR_SHIFT while converting a physical address to an LBA, makes it more readable. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:23 +02:00
Qu Wenruo	aca43fe839	btrfs: remove unused raid56 functions which were dedicated for scrub Since the scrub rework, the following RAID56 functions are no longer called: - raid56_add_scrub_pages() - raid56_alloc_missing_rbio() - raid56_submit_missing_rbio() Those functions are all utilized by scrub to handle missing device cases for RAID56. However the new scrub code handle them in a completely different way: - If it's data stripe, go recovery path through btrfs_submit_bio() - If it's P/Q stripe, it would be handled through raid56_parity_submit_scrub_rbio() And that function would handle dev-replace and repair properly. Thus we can safely remove those functions. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 19:52:18 +02:00
Qu Wenruo	b979547513	btrfs: scrub: introduce helper to find and fill sector info for a scrub_stripe The new helper will search the extent tree to find the first extent of a logical range, then fill the sectors array by two loops: - Loop 1 to fill common bits and metadata generation - Loop 2 to fill csum data (only for data bgs) This loop will use the new btrfs_lookup_csums_bitmap() to fill the full csum buffer, and set scrub_sector_verification::csum. With all the needed info filled by this function, later we only need to submit and verify the stripe. Here we temporarily export the helper to avoid warning on unused static function. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:23 +02:00
Johannes Thumshirn	cf32e41fa5	btrfs: use __bio_add_page to add single a page in rbio_add_io_sector The btrfs raid56 sector submission code uses bio_add_page() to add a page to a newly created bio. bio_add_page() can fail, but the return value is never checked. Use __bio_add_page() as adding a single page to a newly created bio is guaranteed to succeed. This brings us a step closer to marking bio_add_page() as __must_check. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:20 +02:00
Qu Wenruo	18d758a2d8	btrfs: replace btrfs_io_context::raid_map with a fixed u64 value In btrfs_io_context structure, we have a pointer raid_map, which indicates the logical bytenr for each stripe. But considering we always call sort_parity_stripes(), the result raid_map[] is always sorted, thus raid_map[0] is always the logical bytenr of the full stripe. So why we waste the space and time (for sorting) for raid_map? This patch will replace btrfs_io_context::raid_map with a single u64 number, full_stripe_start, by: - Replace btrfs_io_context::raid_map with full_stripe_start - Replace call sites using raid_map[0] to use full_stripe_start - Replace call sites using raid_map[i] to compare with nr_data_stripes. The benefits are: - Less memory wasted on raid_map It's sizeof(u64) * num_stripes vs sizeof(u64). It'll always save at least one u64, and the benefit grows larger with num_stripes. - No more weird alloc_btrfs_io_context() behavior As there is only one fixed size + one variable length array. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:14 +02:00

1 2 3 4 5 ...

255 commits