Please consider pulling these changes from the signed vfs-7.0-rc2.fixes tag.
Thanks!
Christian
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaZ7xWAAKCRCRxhvAZXjc
onpeAP4qOrTURIAX9M/NGCHywvjI91ZJt20J6vm0X6KbVV/ebQD/eoJ21xzPhG9M
gN7oRcZ9SW3e/AdtdnlqB0PEP+cyGwM=
=9Ji+
-----END PGP SIGNATURE-----
Merge tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix an uninitialized variable in file_getattr().
The flags_valid field wasn't initialized before calling
vfs_fileattr_get(), triggering KMSAN uninit-value reports in fuse
- Fix writeback wakeup and logging timeouts when DETECT_HUNG_TASK is
not enabled.
sysctl_hung_task_timeout_secs is 0 in that case causing spurious
"waiting for writeback completion for more than 1 seconds" warnings
- Fix a null-ptr-deref in do_statmount() when the mount is internal
- Add missing kernel-doc description for the @private parameter in
iomap_readahead()
- Fix mount namespace creation to hold namespace_sem across the mount
copy in create_new_namespace().
The previous drop-and-reacquire pattern was fragile and failed to
clean up mount propagation links if the real rootfs was a shared or
dependent mount
- Fix /proc mount iteration where m->index wasn't updated when
m->show() overflows, causing a restart to repeatedly show the same
mount entry in a rapidly expanding mount table
- Return EFSCORRUPTED instead of ENOSPC in minix_new_inode() when the
inode number is out of range
- Fix unshare(2) when CLONE_NEWNS is set and current->fs isn't shared.
copy_mnt_ns() received the live fs_struct so if a subsequent
namespace creation failed the rollback would leave pwd and root
pointing to detached mounts. Always allocate a new fs_struct when
CLONE_NEWNS is requested
- fserror bug fixes:
- Remove the unused fsnotify_sb_error() helper now that all callers
have been converted to fserror_report_metadata
- Fix a lockdep splat in fserror_report() where igrab() takes
inode::i_lock which can be held in IRQ context.
Replace igrab() with a direct i_count bump since filesystems
should not report inodes that are about to be freed or not yet
exposed
- Handle error pointer in procfs for try_lookup_noperm()
- Fix an integer overflow in ep_loop_check_proc() where recursive calls
returning INT_MAX would overflow when +1 is added, breaking the
recursion depth check
- Fix a misleading break in pidfs
* tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: avoid misleading break
eventpoll: Fix integer overflow in ep_loop_check_proc()
proc: Fix pointer error dereference
fserror: fix lockdep complaint when igrabbing inode
fsnotify: drop unused helper
unshare: fix unshare_fs() handling
minix: Correct errno in minix_new_inode
namespace: fix proc mount iteration
mount: hold namespace_sem across copy in create_new_namespace()
iomap: Describe @private in iomap_readahead()
statmount: Fix the null-ptr-deref in do_statmount()
writeback: Fix wakeup and logging timeouts for !DETECT_HUNG_TASK
fs: init flags_valid before calling vfs_fileattr_get
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
Recent changes of fs-writeback cause such warnings if DETECT_HUNG_TASK
is not enabled:
INFO: The task sync:1342 has been waiting for writeback completion for more than 1 seconds.
The reason is sysctl_hung_task_timeout_secs is 0 when DETECT_HUNG_TASK
is not enabled, then it causes the warning message even if the writeback
lasts for only one second.
Guard the wakeup and logging with "#ifdef CONFIG_DETECT_HUNG_TASK" can
eliminate the warning messages. But on the other hand, it is possible
that sysctl_hung_task_timeout_secs be also 0 when DETECT_HUNG_TASK is
enabled. So let's just check the value of sysctl_hung_task_timeout_secs
to decide whether do wakeup and logging.
Fixes: 1888635532 ("writeback: Wake up waiting tasks when finishing the writeback of a chunk.")
Fixes: d6e6215907 ("writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)")
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Link: https://patch.msgid.link/20260203094014.2273240-1-chenhuacai@loongson.cn
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
- cpuset changes:
- Continue separating v1 and v2 implementations by moving more
v1-specific logic into cpuset-v1.c.
- Improve partition handling. Sibling partitions are no longer
invalidated on cpuset.cpus conflict, cpuset.cpus changes no longer
fail in v2, and effective_xcpus computation is made consistent.
- Fix partition effective CPUs overlap that caused a warning on cpuset
removal when sibling partitions shared CPUs.
- Increase the maximum cgroup subsystem count from 16 to 32 to
accommodate future subsystem additions.
- Misc cleanups and selftest improvements including switching to
css_is_online() helper, removing dead code and stale documentation
references, using lockdep_assert_cpuset_lock_held() consistently,
and adding polling helpers for asynchronously updated cgroup
statistics.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaYozIw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGZQKAQD51KJQz4M79wf2yBhIBLOnM4aakMalhSwZNL4O
JiGutwD+Ir33VzNX8aXBuDin9p4wI15O54PhqSenJbelKRQ3Dws=
=gR7L
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cpuset changes:
- Continue separating v1 and v2 implementations by moving more
v1-specific logic into cpuset-v1.c
- Improve partition handling. Sibling partitions are no longer
invalidated on cpuset.cpus conflict, cpuset.cpus changes no longer
fail in v2, and effective_xcpus computation is made consistent
- Fix partition effective CPUs overlap that caused a warning on
cpuset removal when sibling partitions shared CPUs
- Increase the maximum cgroup subsystem count from 16 to 32 to
accommodate future subsystem additions
- Misc cleanups and selftest improvements including switching to
css_is_online() helper, removing dead code and stale documentation
references, using lockdep_assert_cpuset_lock_held() consistently, and
adding polling helpers for asynchronously updated cgroup statistics
* tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cpuset: fix overlap of partition effective CPUs
cgroup: increase maximum subsystem count from 16 to 32
cgroup: Remove stale cpu.rt.max reference from documentation
cpuset: replace direct lockdep_assert_held() with lockdep_assert_cpuset_lock_held()
cgroup/cpuset: Move the v1 empty cpus/mems check to cpuset1_validate_change()
cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict
cgroup/cpuset: Don't fail cpuset.cpus change in v2
cgroup/cpuset: Consistently compute effective_xcpus in update_cpumasks_hier()
cgroup/cpuset: Streamline rm_siblings_excl_cpus()
cpuset: remove dead code in cpuset-v1.c
cpuset: remove v1-specific code from generate_sched_domains
cpuset: separate generate_sched_domains for v1 and v2
cpuset: move update_domain_attr_tree to cpuset_v1.c
cpuset: add cpuset1_init helper for v1 initialization
cpuset: add cpuset1_online_css helper for v1-specific operations
cpuset: add lockdep_assert_cpuset_lock_held helper
cpuset: Remove unnecessary checks in rebuild_sched_domains_locked
cgroup: switch to css_is_online() helper
selftests: cgroup: Replace sleep with cg_read_key_long_poll() for waiting on nr_dying_descendants
selftests: cgroup: make test_memcg_sock robust against delayed sock stats
...
Please consider pulling these changes from the signed vfs-7.0-rc1.nonblocking_timestamps tag.
Thanks!
Christian
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaYX49gAKCRCRxhvAZXjc
oqNMAQCjHw9iwYDu63n96QAipWopJb8onqc0rTEvi0OOl1zDNwEAufN3EqTzV3uQ
JbNgSwBWD/+ICd2aUOuAX0GgU6teyAQ=
=lJlI
-----END PGP SIGNATURE-----
Merge tag 'vfs-7.0-rc1.nonblocking_timestamps' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs timestamp updates from Christian Brauner:
"This contains the changes to support non-blocking timestamp updates.
Since commit 66fa3cedf1 ("fs: Add async write file modification
handling") file_update_time_flags() unconditionally returns -EAGAIN
when any timestamp needs updating and IOCB_NOWAIT is set. This makes
non-blocking direct writes impossible on file systems with granular
enough timestamps, which in practice means all of them.
This reworks the timestamp update path to propagate IOCB_NOWAIT
through ->update_time so that file systems which can update timestamps
without blocking are no longer penalized.
With that groundwork in place, the core change passes IOCB_NOWAIT into
->update_time and returns -EAGAIN only when the file system indicates
it would block.
XFS implements non-blocking timestamp updates by using the new
->sync_lazytime and open-coding generic_update_time without the
S_NOWAIT check, since the lazytime path through the generic helpers
can never block in XFS"
* tag 'vfs-7.0-rc1.nonblocking_timestamps' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
xfs: enable non-blocking timestamp updates
xfs: implement ->sync_lazytime
fs: refactor file_update_time_flags
fs: add support for non-blocking timestamp updates
fs: add a ->sync_lazytime method
fs: factor out a sync_lazytime helper
fs: refactor ->update_time handling
fat: cleanup the flags for fat_truncate_time
nfs: split nfs_update_timestamps
fs: allow error returns from generic_update_time
fs: remove inode_update_time
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaXc4IwAKCRCRxhvAZXjc
oo0jAQDOV580l4wHiY6eT1QGY2QYa7u8fYDOi6mqfgHa+EH5twD9ETnQ0xQHIKYP
oruFJXLf3ihBBsum+pTpAO2XFVjM7Qs=
=pM8o
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.19-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix the the buggy conversion of fuse_reverse_inval_entry() introduced
during the creation rework
- Disallow nfs delegation requests for directories by setting
simple_nosetlease()
- Require an opt-in for getting readdir flag bits outside of S_DT_MASK
set in d_type
- Fix scheduling delayed writeback work by only scheduling when the
dirty time expiry interval is non-zero and cancel the delayed work if
the interval is set to zero
- Use rounded_jiffies_interval for dirty time work
- Check the return value of sb_set_blocksize() for romfs
- Wait for batched folios to be stable in __iomap_get_folio()
- Use private naming for fuse hash size
- Fix the stale dentry cleanup to prevent a race that causes a UAF
* tag 'vfs-6.19-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
vfs: document d_dispose_if_unused()
fuse: shrink once after all buckets have been scanned
fuse: clean up fuse_dentry_tree_work()
fuse: add need_resched() before unlocking bucket
fuse: make sure dentry is evicted if stale
fuse: fix race when disposing stale dentries
fuse: use private naming for fuse hash size
writeback: use round_jiffies_relative for dirtytime_work
iomap: wait for batched folios to be stable in __iomap_get_folio
romfs: check sb_set_blocksize() return value
docs: clarify that dirtytime_expire_seconds=0 disables writeback
writeback: fix 100% CPU usage when dirtytime_expire_interval is 0
readdir: require opt-in for d_type flags
vboxsf: don't allow delegations to be set on directories
ceph: don't allow delegations to be set on directories
gfs2: don't allow delegations to be set on directories
9p: don't allow delegations to be set on directories
smb/client: properly disallow delegations on directories
nfs: properly disallow delegation requests on directories
fuse: fix conversion of fuse_reverse_inval_entry() to start_removing()
Above the while() loop in wait_sb_inodes(), we document that we must wait
for all pages under writeback for data integrity. Consequently, if a
mapping, like fuse, traditionally does not have data integrity semantics,
there is no need to wait at all; we can simply skip these inodes.
This restores fuse back to prior behavior where syncs are no-ops. This
fixes a user regression where if a system is running a faulty fuse server
that does not reply to issued write requests, this causes wait_sb_inodes()
to wait forever.
Link: https://lkml.kernel.org/r/20260105211737.4105620-2-joannelkoong@gmail.com
Fixes: 0c58a97f91 ("fuse: remove tmp folio for writebacks and internal rb tree")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reported-by: Athul Krishna <athul.krishna.kr@protonmail.com>
Reported-by: J. Neuschäfer <j.neuschaefer@gmx.net>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Tested-by: J. Neuschäfer <j.neuschaefer@gmx.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Bernd Schubert <bschubert@ddn.com>
Cc: Bonaccorso Salvatore <carnil@debian.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The dirtytime_work is a background housekeeping task that flushes dirty
inodes, using round_jiffies_relative() will allow kernel to batch this
work with other aligned system tasks, reducing power consumption.
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Link: https://patch.msgid.link/20260113082614.231580-1-zhaomzhao@126.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Allow the file system to explicitly implement lazytime syncing instead
of pigging back on generic inode dirtying. This allows to simplify
the XFS implementation and prepares for non-blocking lazytime timestamp
updates.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260108141934.2052404-8-hch@lst.de
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Centralize how we synchronize a lazytime update into the actual on-disk
timestamp into a single helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260108141934.2052404-7-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When vm.dirtytime_expire_seconds is set to 0, wakeup_dirtytime_writeback()
schedules delayed work with a delay of 0, causing immediate execution.
The function then reschedules itself with 0 delay again, creating an
infinite busy loop that causes 100% kworker CPU usage.
Fix by:
- Only scheduling delayed work in wakeup_dirtytime_writeback() when
dirtytime_expire_interval is non-zero
- Cancelling the delayed work in dirtytime_interval_handler() when
the interval is set to 0
- Adding a guard in start_dirtytime_writeback() for defensive coding
Tested by booting kernel in QEMU with virtme-ng:
- Before fix: kworker CPU spikes to ~73%
- After fix: CPU remains at normal levels
- Setting interval back to non-zero correctly resumes writeback
Fixes: a2f4870697 ("fs: make sure the timestamps for lazytime inodes eventually get written")
Cc: stable@vger.kernel.org
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220227
Signed-off-by: Laveesh Bansal <laveeshb@laveeshbansal.com>
Link: https://patch.msgid.link/20260106145059.543282-2-laveeshb@laveeshbansal.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Use the new css_is_online() helper that has been introduced to check css
online state, instead of testing the CSS_ONLINE flag directly. This
improves readability and centralizes the state check logic.
No functional changes intended.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc
or4UAP9FbpFsZd0DpsYnKuv7kFepl291PuR0x2dKmseJ/wcf8AEAzI8FR5wd/fey
25ZNdExoUojAOj5wVn+jUep3u54jBws=
=/toi
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull writeback updates from Christian Brauner:
"Features:
- Allow file systems to increase the minimum writeback chunk size.
The relatively low minimal writeback size of 4MiB means that
written back inodes on rotational media are switched a lot. Besides
introducing additional seeks, this also can lead to extreme file
fragmentation on zoned devices when a lot of files are cached
relative to the available writeback bandwidth.
This adds a superblock field that allows the file system to
override the default size, and sets it to the zone size for zoned
XFS.
- Add logging for slow writeback when it exceeds
sysctl_hung_task_timeout_secs. This helps identify tasks waiting
for a long time and pinpoint potential issues. Recording the
starting jiffies is also useful when debugging a crashed vmcore.
- Wake up waiting tasks when finishing the writeback of a chunk
Cleanups:
- filemap_* writeback interface cleanups.
Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
the original btrfs caller should be using better high level
interfaces instead.
This series removes all these low-level interfaces, switches btrfs
to a more specific interface, and cleans up other too low-level
interfaces. With this the writeback_control that is passed to the
writeback code is only initialized in three places.
- Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
filemap_fdatawrite_wbc
- Add filemap_flush_nr helper for btrfs
- Push struct writeback_control into start_delalloc_inodes in btrfs
- Rename filemap_fdatawrite_range_kick to filemap_flush_range
- Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm
- Make wbc_to_tag() inline and use it in fs"
* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Make wbc_to_tag() inline and use it in fs.
xfs: set s_min_writeback_pages for zoned file systems
writeback: allow the file system to override MIN_WRITEBACK_PAGES
writeback: cleanup writeback_chunk_size
mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
mm: remove __filemap_fdatawrite_range
mm: remove filemap_fdatawrite_wbc
mm: remove __filemap_fdatawrite
mm,btrfs: add a filemap_flush_nr helper
btrfs: push struct writeback_control into start_delalloc_inodes
btrfs: use the local tmp_inode variable in start_delalloc_inodes
ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
writeback: Wake up waiting tasks when finishing the writeback of a chunk.
For consistency with sb routines.
ext4 is the only consumer outside of evict(). Damage-controlling it is
outside of the scope of this cleanup.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251103230911.516866-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
1. inode_bit_waitqueue() was somehow placed between __inode_add_lru() and
inode_add_lru(). move it up
2. assert ->i_lock is held in __inode_add_lru instead of just claiming it is
needed
3. s/__inode_add_lru/__inode_lru_list_add/ for consistency with itself
(inode_lru_list_del()) and similar routines for sb and io list
management
4. push list presence check into inode_lru_list_del(), just like sb and
io list
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251029131428.654761-2-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The relatively low minimal writeback size of 4MiB means that written back
inodes on rotational media are switched a lot. Besides introducing
additional seeks, this also can lead to extreme file fragmentation on
zoned devices when a lot of files are cached relative to the available
writeback bandwidth.
Add a superblock field that allows the file system to override the
default size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251017034611.651385-3-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Return the pages directly when calculated instead of first assigning
them back to a variable, and directly return for the data integrity /
tagged case instead of going through an else clause.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251017034611.651385-2-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Replace filemap_fdatawrite_wbc, which exposes a writeback_control to the
callers with a filemap_writeback helper that takes all the possible
arguments and declares the writeback_control itself.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-9-hch@lst.de
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When a writeback work lasts for sysctl_hung_task_timeout_secs, we want
to identify that there are tasks waiting for a long time-this helps us
pinpoint potential issues.
Additionally, recording the starting jiffies is useful when debugging a
crashed vmcore.
Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Writing back a large number of pages can take a lots of time.
This issue is exacerbated when the underlying device is slow or
subject to block layer rate limiting, which in turn triggers
unexpected hung task warnings.
We can trigger a wake-up once a chunk has been written back and the
waiting time for writeback exceeds half of
sysctl_hung_task_timeout_secs.
This action allows the hung task detector to be aware of the writeback
progress, thereby eliminating these unexpected hung task warnings.
This patch has passed the xfstests 'check -g quick' test based on ext4,
with no additional failures introduced.
Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The incomming helpers don't ship with _release/_acquire variants, for
the time being anyway.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQigAKCRCRxhvAZXjc
ojqsAQCI9BOJV953LWPr9fmmFVz78wA4kTxkFFDAxwH/0iYtmQD8DORngPfJERI1
FwTY/gk2N744vUpbbcSM/KOnM3ss8wE=
=B3rD
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.18-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs writeback updates from Christian Brauner:
"This contains work adressing lockups reported by users when a systemd
unit reading lots of files from a filesystem mounted with the lazytime
mount option exits.
With the lazytime mount option enabled we can be switching many dirty
inodes on cgroup exit to the parent cgroup. The numbers observed in
practice when systemd slice of a large cron job exits can easily reach
hundreds of thousands or millions.
The logic in inode_do_switch_wbs() which sorts the inode into
appropriate place in b_dirty list of the target wb however has linear
complexity in the number of dirty inodes thus overall time complexity
of switching all the inodes is quadratic leading to workers being
pegged for hours consuming 100% of the CPU and switching inodes to the
parent wb.
Simple reproducer of the issue:
FILES=10000
# Filesystem mounted with lazytime mount option
MNT=/mnt/
echo "Creating files and switching timestamps"
for (( j = 0; j < 50; j ++ )); do
mkdir $MNT/dir$j
for (( i = 0; i < $FILES; i++ )); do
echo "foo" >$MNT/dir$j/file$i
done
touch -a -t 202501010000 $MNT/dir$j/file*
done
wait
echo "Syncing and flushing"
sync
echo 3 >/proc/sys/vm/drop_caches
echo "Reading all files from a cgroup"
mkdir /sys/fs/cgroup/unified/mycg1 || exit
echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit
for (( j = 0; j < 50; j ++ )); do
cat /mnt/dir$j/file* >/dev/null &
done
wait
echo "Switching wbs"
# Now rmdir the cgroup after the script exits
This can be solved by:
- Avoiding contention on the wb->list_lock when switching inodes by
running a single work item per wb and managing a queue of items
switching to the wb
- Allowing rescheduling when switching inodes over to a different
cgroup to avoid softlockups
- Maintaining the b_dirty list ordering instead of sorting it"
* tag 'vfs-6.18-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
writeback: Add tracepoint to track pending inode switches
writeback: Avoid excessively long inode switching times
writeback: Avoid softlockup when switching many inodes
writeback: Avoid contention on wb->list_lock when switching inodes
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQYgAKCRCRxhvAZXjc
olgGAQDWr4sD7kUt8TxifdAXsQNgyGG8qOUkb/BHHSqJ/5mKvAEAlTwJ+81tgNKT
hYYdPyvWdbgW6CnWeiQLi0JjpFvUPQU=
=uHwG
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs workqueue updates from Christian Brauner:
"This contains various workqueue changes affecting the filesystem
layer.
Currently if a user enqueue a work item using schedule_delayed_work()
the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies
to schedule_work() that is using system_wq and queue_work(), that
makes use again of WORK_CPU_UNBOUND.
This replaces the use of system_wq and system_unbound_wq. system_wq is
a per-CPU workqueue which isn't very obvious from the name and
system_unbound_wq is to be used when locality is not required.
So this renames system_wq to system_percpu_wq, and system_unbound_wq
to system_dfl_wq.
This also adds a new WQ_PERCPU flag to allow the fs subsystem users to
explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and
WQ_PERCPU flags coexist for one release cycle to allow callers to
transition their calls. WQ_UNBOUND will be removed in a next release
cycle"
* tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: WQ_PERCPU added to alloc_workqueue users
fs: replace use of system_wq with system_percpu_wq
fs: replace use of system_unbound_wq with system_dfl_wq
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQQgAKCRCRxhvAZXjc
oud9AQD5IG4sNnzCjsvcTDpQkbX5eZW+LFIiAiiN+nztZ+OcRQEAvC2N7YovfqM3
TWpVoNDKvEPdtDc9ttFMUKqBZYvxvgE=
=sEaL
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"This contains a series I originally wrote and that Eric brought over
the finish line. It moves out the i_crypt_info and i_verity_info
pointers out of 'struct inode' and into the fs-specific part of the
inode.
So now the few filesytems that actually make use of this pay the price
in their own private inode storage instead of forcing it upon every
user of struct inode.
The pointer for the crypt and verity info is simply found by storing
an offset to its address in struct fsverity_operations and struct
fscrypt_operations. This shrinks struct inode by 16 bytes.
I hope to move a lot more out of it in the future so that struct inode
becomes really just about very core stuff that we need, much like
struct dentry and struct file, instead of the dumping ground it has
become over the years.
On top of this are a various changes associated with the ongoing inode
lifetime handling rework that multiple people are pushing forward:
- Stop accessing inode->i_count directly in f2fs and gfs2. They
simply should use the __iget() and iput() helpers
- Make the i_state flags an enum
- Rework the iput() logic
Currently, if we are the last iput, and we have the I_DIRTY_TIME
bit set, we will grab a reference on the inode again and then mark
it dirty and then redo the put. This is to make sure we delay the
time update for as long as possible
We can rework this logic to simply dec i_count if it is not 1, and
if it is do the time update while still holding the i_count
reference
Then we can replace the atomic_dec_and_lock with locking the
->i_lock and doing atomic_dec_and_test, since we did the
atomic_add_unless above
- Add an icount_read() helper and convert everyone that accesses
inode->i_count directly for this purpose to use the helper
- Expand dump_inode() to dump more information about an inode helping
in debugging
- Add some might_sleep() annotations to iput() and associated
helpers"
* tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: add might_sleep() annotation to iput() and more
fs: expand dump_inode()
inode: fix whitespace issues
fs: add an icount_read helper
fs: rework iput logic
fs: make the i_state flags an enum
fs: stop accessing ->i_count directly in f2fs and gfs2
fsverity: check IS_VERITY() in fsverity_cleanup_inode()
fs: remove inode::i_verity_info
btrfs: move verity info pointer to fs-specific part of inode
f2fs: move verity info pointer to fs-specific part of inode
ext4: move verity info pointer to fs-specific part of inode
fsverity: add support for info in fs-specific part of inode
fs: remove inode::i_crypt_info
ceph: move crypt info pointer to fs-specific part of inode
ubifs: move crypt info pointer to fs-specific part of inode
f2fs: move crypt info pointer to fs-specific part of inode
ext4: move crypt info pointer to fs-specific part of inode
fscrypt: add support for info in fs-specific part of inode
fscrypt: replace raw loads of info pointer with helper function
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQMQAKCRCRxhvAZXjc
omNLAQCgrwzd9sa1JTlixweu3OAxQlSEbLuMpEv7Ztm+B7Wz0AD9HtwPC44Kev03
GbMcB2DCFLC4evqYECj6IG7NBmoKsAs=
=1ICf
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add "initramfs_options" parameter to set initramfs mount options.
This allows to add specific mount options to the rootfs to e.g.,
limit the memory size
- Add RWF_NOSIGNAL flag for pwritev2()
Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
signal from being raised when writing on disconnected pipes or
sockets. The flag is handled directly by the pipe filesystem and
converted to the existing MSG_NOSIGNAL flag for sockets
- Allow to pass pid namespace as procfs mount option
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs
mount is auto-selected based on the mounting process's active
pidns, and the pidns itself is basically hidden once the mount has
been constructed)
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns
of a procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes
creates a procfs mount from an empty pidns so that user
namespaced containers can be nested (without this, the nested
containers would fail to mount procfs)
But this requires forking off a helper process because you cannot
just one-shot this using mount(2)
* Container runtimes in general need to fork into a container
before configuring its mounts, which can lead to security issues
in the case of shared-pidns containers (a privileged process in
the pidns can interact with your container runtime process)
While SUID_DUMP_DISABLE and user namespaces make this less of an
issue, the strict need for this due to a minor uAPI wart is kind
of unfortunate
Things would be much easier if there was a way for userspace to
just specify the pidns they want. So this pull request contains
changes to implement a new "pidns" argument which can be set
using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
Cleanups:
- Remove the last references to EXPORT_OP_ASYNC_LOCK
- Make file_remove_privs_flags() static
- Remove redundant __GFP_NOWARN when GFP_NOWAIT is used
- Use try_cmpxchg() in start_dir_add()
- Use try_cmpxchg() in sb_init_done_wq()
- Replace offsetof() with struct_size() in ioctl_file_dedupe_range()
- Remove vfs_ioctl() export
- Replace rwlock() with spinlock in epoll code as rwlock causes
priority inversion on preempt rt kernels
- Make ns_entries in fs/proc/namespaces const
- Use a switch() statement() in init_special_inode() just like we do
in may_open()
- Use struct_size() in dir_add() in the initramfs code
- Use str_plural() in rd_load_image()
- Replace strcpy() with strscpy() in find_link()
- Rename generic_delete_inode() to inode_just_drop() and
generic_drop_inode() to inode_generic_drop()
- Remove unused arguments from fcntl_{g,s}et_rw_hint()
Fixes:
- Document @name parameter for name_contains_dotdot() helper
- Fix spelling mistake
- Always return zero from replace_fd() instead of the file descriptor
number
- Limit the size for copy_file_range() in compat mode to prevent a
signed overflow
- Fix debugfs mount options not being applied
- Verify the inode mode when loading it from disk in minixfs
- Verify the inode mode when loading it from disk in cramfs
- Don't trigger automounts with RESOLVE_NO_XDEV
If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
through automounts, but could still trigger them
- Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
tracepoints
- Fix unused variable warning in rd_load_image() on s390
- Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD
- Use ns_capable_noaudit() when determining net sysctl permissions
- Don't call path_put() under namespace semaphore in listmount() and
statmount()"
* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
fcntl: trim arguments
listmount: don't call path_put() under namespace semaphore
statmount: don't call path_put() under namespace semaphore
pid: use ns_capable_noaudit() when determining net sysctl permissions
fs: rename generic_delete_inode() and generic_drop_inode()
init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
initramfs: Replace strcpy() with strscpy() in find_link()
initrd: Use str_plural() in rd_load_image()
initramfs: Use struct_size() helper to improve dir_add()
initrd: Fix unused variable warning in rd_load_image() on s390
fs: use the switch statement in init_special_inode()
fs/proc/namespaces: make ns_entries const
filelock: add FL_RECLAIM to show_fl_flags() macro
eventpoll: Replace rwlock with spinlock
selftests/proc: add tests for new pidns APIs
procfs: add "pidns" mount option
pidns: move is-ancestor logic to helper
openat2: don't trigger automounts with RESOLVE_NO_XDEV
namei: move cross-device check to __traverse_mounts
namei: remove LOOKUP_NO_XDEV check from handle_mounts
...
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This patch adds a new WQ_PERCPU flag to all the fs subsystem users to
explicitly request the use of the per-CPU behavior. Both flags coexist
for one release cycle to allow callers to transition their calls.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
All existing users have been updated accordingly.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/20250916082906.77439-4-marco.crivellari@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
system_wq is a per-CPU worqueue, yet nothing in its name tells about that
CPU affinity constraint, which is very often not required by users.
Make it clear by adding a system_percpu_wq to all the fs subsystem.
The old wq will be kept for a few release cylces.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/20250916082906.77439-3-marco.crivellari@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add trace_inode_switch_wbs_queue tracepoint to allow insight into how
many inodes are queued to switch their bdi_writeback structure.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
With lazytime mount option enabled we can be switching many dirty inodes
on cgroup exit to the parent cgroup. The numbers observed in practice
when systemd slice of a large cron job exits can easily reach hundreds
of thousands or millions. The logic in inode_do_switch_wbs() which sorts
the inode into appropriate place in b_dirty list of the target wb
however has linear complexity in the number of dirty inodes thus overall
time complexity of switching all the inodes is quadratic leading to
workers being pegged for hours consuming 100% of the CPU and switching
inodes to the parent wb.
Simple reproducer of the issue:
FILES=10000
# Filesystem mounted with lazytime mount option
MNT=/mnt/
echo "Creating files and switching timestamps"
for (( j = 0; j < 50; j ++ )); do
mkdir $MNT/dir$j
for (( i = 0; i < $FILES; i++ )); do
echo "foo" >$MNT/dir$j/file$i
done
touch -a -t 202501010000 $MNT/dir$j/file*
done
wait
echo "Syncing and flushing"
sync
echo 3 >/proc/sys/vm/drop_caches
echo "Reading all files from a cgroup"
mkdir /sys/fs/cgroup/unified/mycg1 || exit
echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit
for (( j = 0; j < 50; j ++ )); do
cat /mnt/dir$j/file* >/dev/null &
done
wait
echo "Switching wbs"
# Now rmdir the cgroup after the script exits
We need to maintain b_dirty list ordering to keep writeback happy so
instead of sorting inode into appropriate place just append it at the
end of the list and clobber dirtied_time_when. This may result in inode
writeback starting later after cgroup switch however cgroup switches are
rare so it shouldn't matter much. Since the cgroup had write access to
the inode, there are no practical concerns of the possible DoS issues.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
process_inode_switch_wbs_work() can be switching over 100 inodes to a
different cgroup. Since switching an inode requires counting all dirty &
under-writeback pages in the address space of each inode, this can take
a significant amount of time. Add a possibility to reschedule after
processing each inode to avoid softlockups.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
There can be multiple inode switch works that are trying to switch
inodes to / from the same wb. This can happen in particular if some
cgroup exits which owns many (thousands) inodes and we need to switch
them all. In this case several inode_switch_wbs_work_fn() instances will
be just spinning on the same wb->list_lock while only one of them makes
forward progress. This wastes CPU cycles and quickly leads to softlockup
reports and unusable system.
Instead of running several inode_switch_wbs_work_fn() instances in
parallel switching to the same wb and contending on wb->list_lock, run
just one work item per wb and manage a queue of isw items switching to
this wb.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
The dirtytime_expire_interval belongs to fs/fs-writeback.c, move it to
fs/fs-writeback.c from /kernel/sysctl.c. And remove the useless extern
variable declaration and the function declaration from
include/linux/writeback.h
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Christoph Hellwig <hch@lst.de> says:
This fixes one (of multiple) sparse warnings in fs-writeback.c, and
then reshuffles the code a bit that only the proper high level API
instead of low-level helpers is exported.
* patches from https://lore.kernel.org/r/20241112054403.1470586-1-hch@lst.de:
writeback: wbc_attach_fdatawrite_inode out of line
writeback: add a __releases annoation to wbc_attach_and_unlock_inode
Link: https://lore.kernel.org/r/20241112054403.1470586-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
This allows exporting this high-level interface only while keeping
wbc_attach_and_unlock_inode private in fs-writeback.c and unexporting
__inode_attach_wb.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241112054403.1470586-3-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This shuts up a sparse lock context tracking warning.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241112054403.1470586-2-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most of the callers of wbc_account_cgroup_owner() are converting a folio
to page before calling the function. wbc_account_cgroup_owner() is
converting the page back to a folio to call mem_cgroup_css_from_folio().
Convert wbc_account_cgroup_owner() to take a folio instead of a page,
and convert all callers to pass a folio directly except f2fs.
Convert the page to folio for all the callers from f2fs as they were the
only callers calling wbc_account_cgroup_owner() with a page. As f2fs is
already in the process of converting to folios, these call sites might
also soon be calling wbc_account_cgroup_owner() with a folio directly in
the future.
No functional changes. Only compile tested.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20240926140121.203821-1-kernel@pankajraghav.com
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most commonly neither I_LRU_ISOLATING nor I_SYNC are set, but the stock
kernel takes a back-to-back relock trip to check for them.
It probably can be avoided altogether, but for now massage things back
to just one lock acquire.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240813143626.1573445-1-mjguzik@gmail.com
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When deactivating any type of superblock, it had to wait for the in-flight
wb switches to be completed. wb switches are executed in inode_switch_wbs_work_fn()
which needs to acquire the wb_switch_rwsem and races against sync_inodes_sb().
If there are too much dirty data in the superblock, the waiting time may increase
significantly.
For superblocks without cgroup writeback such as tmpfs, they have nothing to
do with the wb swithes, so the flushing can be avoided.
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Link: https://lore.kernel.org/r/20240726030525.180330-1-haifeng.xu@shopee.com
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.
This patch has been generated by the following coccinelle script:
```
virtual patch
@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos);
@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }
@r3@
identifier func;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int , void *, size_t *, loff_t *);
@r4@
identifier func, ctl;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int , void *, size_t *, loff_t *);
@r5@
identifier func, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code
conventions. The xfs_stats_clear_proc_handler,
xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified.
This is called from a proc_handler itself and is calling back into
another proc_handler, making it necessary to change it as part of the
proc_handler migration.
Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
writeback_inodes_sb doesn't have return value, just remove unnecessary
return in it.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240228091958.288260-7-shikemeng@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Commit e8e8a0c6c9 ("writeback: move nr_pages == 0 logic to one
location") removed parameter nr_pages of __wakeup_flusher_threads_bdi
and we try to writeback all dirty pages in __wakeup_flusher_threads_bdi
now. Just correct stale comment.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240228091958.288260-6-shikemeng@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The dirtied_before is only used when b_io is not empty, so only calculate
when b_io is not empty.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240228091958.288260-5-shikemeng@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Remove unused parameter wb of finish_writeback_work.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240228091958.288260-4-shikemeng@huaweicloud.com
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
For case there is no more inodes for IO in io list from last wb_writeback,
We may bail out early even there is inode in dirty list should be written
back. Only bail out when we queued once to avoid missing dirtied inode.
This is from code reading...
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240228091958.288260-3-shikemeng@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
[brauner@kernel.org: fold in memory corruption fix from Jan in [1]]
Link: https://lore.kernel.org/r/20240405132346.bid7gibby3lxxhez@quack3 [1]
Signed-off-by: Christian Brauner <brauner@kernel.org>