linux/kernel/time
Waiman Long a84097e625 cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.

The housekeeping cpumask update requires flushing a number of different
workqueues which may not be safe with cpus_read_lock() held as the
workqueue flushing code may acquire cpus_read_lock() or acquiring locks
which have locking dependency with cpus_read_lock() down the chain. Below
is an example of such circular locking problem.

  ======================================================
  WARNING: possible circular locking dependency detected
  6.18.0-test+ #2 Tainted: G S
  ------------------------------------------------------
  test_cpuset_prs/10971 is trying to acquire lock:
  ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180

  but task is already holding lock:
  ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:
  -> #4 (cpuset_mutex){+.+.}-{4:4}:
  -> #3 (cpu_hotplug_lock){++++}-{0:0}:
  -> #2 (rtnl_mutex){+.+.}-{4:4}:
  -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
  -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:

  Chain exists of:
    (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex

  5 locks held by test_cpuset_prs/10971:
   #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
   #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
   #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
   #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
   #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

  Call Trace:
   <TASK>
     :
   touch_wq_lockdep_map+0x93/0x180
   __flush_workqueue+0x111/0x10b0
   housekeeping_update+0x12d/0x2d0
   update_parent_effective_cpumask+0x595/0x2440
   update_prstate+0x89d/0xce0
   cpuset_partition_write+0xc5/0x130
   cgroup_file_write+0x1a5/0x680
   kernfs_fop_write_iter+0x3df/0x5f0
   vfs_write+0x525/0xfd0
   ksys_write+0xf9/0x1d0
   do_syscall_64+0x95/0x520
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

To avoid such a circular locking dependency problem, we have to
call housekeeping_update() without holding the cpus_read_lock() and
cpuset_mutex. The current set of wq's flushed by housekeeping_update()
may not have work functions that call cpus_read_lock() directly,
but we are likely to extend the list of wq's that are flushed in the
future. Moreover, the current set of work functions may hold locks that
may have cpu_hotplug_lock down the dependency chain.

So housekeeping_update() is now called after releasing cpus_read_lock
and cpuset_mutex at the end of a cpuset operation. These two locks are
then re-acquired later before calling rebuild_sched_domains_locked().

To enable mutual exclusion between the housekeeping_update() call and
other cpuset control file write actions, a new top level cpuset_top_mutex
is introduced. This new mutex will be acquired first to allow sharing
variables used by both code paths. However, cpuset update from CPU
hotplug can still happen in parallel with the housekeeping_update()
call, though that should be rare in production environment.

As cpus_read_lock() is now no longer held when
tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
directly.

The lockdep_is_cpuset_held() is also updated to return true if either
cpuset_top_mutex or cpuset_mutex is held.

Fixes: 03ff735101 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23 10:46:49 -10:00
..
alarmtimer.c time: Fix spelling mistakes in comments 2025-09-21 10:02:02 +02:00
clockevents.c treewide: Update email address 2026-01-11 06:09:11 -10:00
clocksource-wdtest.c clocksource/wdtest: Print time values for short udelay(1) 2025-01-15 19:49:13 +01:00
clocksource.c clocksource: Reduce watchdog readout delay limit to prevent false positives 2026-01-21 11:33:11 +01:00
hrtimer.c Updates for the core time subsystem: 2026-02-10 16:41:59 -08:00
itimer.c timers/itimer: Avoid direct access to hrtimer clockbase 2025-09-09 12:27:17 +02:00
jiffies.c sysctl: Move proc_doulongvec_ms_jiffies_minmax to kernel/time/jiffies.c 2025-11-27 15:45:37 +01:00
Kconfig time: Introduce auxiliary POSIX clocks 2025-06-19 14:28:22 +02:00
Makefile time: Build generic update_vsyscall() only with generic time vDSO 2025-09-04 11:23:50 +02:00
namespace.c ns: drop custom reference count initialization for initial namespaces 2025-11-11 10:01:32 +01:00
ntp.c ntp: Use ktime_get_ntp_seconds() 2025-06-19 14:28:24 +02:00
ntp_internal.h ntp: Rename __do_adjtimex() to ntp_adjtimex() 2025-06-19 14:28:23 +02:00
posix-clock.c Networking changes for 6.15. 2025-03-26 21:48:21 -07:00
posix-cpu-timers.c hrtimer: Store time as ktime_t in restart block 2025-11-14 16:31:19 +01:00
posix-stubs.c posix-timers: Get rid of [COMPAT_]SYS_NI() uses 2023-12-20 21:30:27 -08:00
posix-timers.c compiler-context-analysis: Remove __cond_lock() function-like helper 2026-01-05 16:43:33 +01:00
posix-timers.h timekeeping: Add minimal posix-timers support for auxiliary clocks 2025-06-27 20:13:12 +02:00
sched_clock.c time/sched_clock: Use ACCESS_PRIVATE() to evaluate hrtimer::function 2026-01-13 18:08:57 +01:00
sleep_timeout.c treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
test_udelay.c time: Add MODULE_DESCRIPTION() to time test modules 2024-06-03 11:18:50 +02:00
tick-broadcast-hrtimer.c time: Switch to hrtimer_setup() 2025-02-18 10:32:33 +01:00
tick-broadcast.c treewide: Update email address 2026-01-11 06:09:11 -10:00
tick-common.c treewide: Update email address 2026-01-11 06:09:11 -10:00
tick-internal.h cpufreq: ondemand: Simplify idle cputime granularity test 2026-01-28 22:24:58 +01:00
tick-legacy.c timekeeping: remove xtime_update 2020-10-30 21:57:07 +01:00
tick-oneshot.c treewide: Update email address 2026-01-11 06:09:11 -10:00
tick-sched.c Updates for the core time subsystem: 2026-02-10 16:41:59 -08:00
tick-sched.h tick/sched: Fix struct tick_sched doc warnings 2024-04-01 10:36:35 +02:00
time.c time: export timespec64_add_safe() symbol 2025-09-03 16:51:08 -07:00
time_test.c time/kunit: Document handling of negative years of is_leap() 2026-02-02 12:37:54 +01:00
timeconst.bc time: Add SPDX license identifiers 2018-11-23 11:51:20 +01:00
timeconv.c time: Improve performance of time64_to_tm() 2021-06-24 11:51:59 +02:00
timecounter.c time/timecounter: Inline timecounter_cyc2time() 2025-12-15 20:16:49 +01:00
timekeeping.c timekeeping: Adjust the leap state for the correct auxiliary timekeeper 2026-01-20 10:18:53 +01:00
timekeeping.h asm-generic: cross-architecture timer cleanup 2020-12-16 00:07:17 -08:00
timekeeping_debug.c timekeeping: Add percpu counter for tracking floor swap events 2024-10-10 10:20:46 +02:00
timekeeping_internal.h timekeeping: Provide ktime_get_ntp_seconds() 2025-06-19 14:28:23 +02:00
timer.c cpufreq: ondemand: Simplify idle cputime granularity test 2026-01-28 22:24:58 +01:00
timer_list.c hrtimer: Remove hrtimer_clock_base:: Get_time 2025-09-09 12:27:18 +02:00
timer_migration.c cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock 2026-02-23 10:46:49 -10:00
timer_migration.h timers/migration: Rename 'online' bit to 'available' 2025-11-20 20:17:31 +01:00
vsyscall.c vdso/vsyscall: Avoid slow division loop in auxiliary clock update 2025-09-03 11:55:11 +02:00