linux/kernel/sched
Zicheng Qu e34881c84c sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Consider the following sequence on a CPU configured with nohz_full:

1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
   bandwidth control. The gse (cgroup A) to which task P is attached is
   dequeued, and the CPU switches to idle.

2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
   another cgroup B (not throttled).

   During sched_move_task(), the task P is observed as queued but not
   running, and therefore no resched_curr() is triggered.

3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
   explicit scheduling event, i.e., resched_curr().

4) For kernel <= 5.10: Later, cgroup A is unthrottled. However, task P
   has already been migrated out of cgroup A, so unthrottle_cfs_rq()
   may observe load_weight == 0 and return early without calling
   resched_curr(). For kernel >= 6.6: The unthrottling path triggers
   `resched_curr()` in almost all cases, even when no runnable tasks
   remain in the unthrottled cgroup, which prevents the idle stall
   described above. However, if cgroup A is removed before it gets
   unthrottled, the unthrottling path for cgroup A is never executed.
   As a result, `resched_curr()` is never called.

5) At this point, task P is runnable in cgroup B (not throttled), but
   the CPU remains in do_idle() with no pending reschedule point. The
   system stays in this state until an unrelated event (e.g. a new task
   wakeup) triggers a resched_curr() and breaks the nohz_full idle
   state, after which task P finally gets scheduled.

The root cause is that sched_move_task() may classify the task as only
queued, not running, and therefore fails to trigger a resched_curr(),
while the later unthrottling path no longer has visibility of the
migrated task.

Preserve the existing behavior for running tasks by issuing
resched_curr(), and explicitly invoke check_preempt_curr() for tasks
that were queued at the time of migration. This ensures that runnable
tasks are reconsidered for scheduling even when nohz_full suppresses
periodic ticks.
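
The behavioral change described above can be illustrated with a toy model.
This is only a sketch under stated assumptions: the `toy_*` types and
functions are hypothetical stand-ins invented for illustration, not the
kernel's real `struct rq`, `struct task_struct`, or the actual body of
sched_move_task(). It only models which cases end up requesting a
reschedule.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-ins (not kernel types). */
struct toy_rq {
	bool curr_is_idle;   /* the CPU is currently running the idle task */
	bool need_resched;   /* set once a reschedule has been requested   */
};

struct toy_task {
	bool queued;         /* task is on the runqueue                    */
	bool running;        /* task is the one currently on the CPU       */
};

/* Toy stand-in for resched_curr(): just records the request. */
static void toy_resched_curr(struct toy_rq *rq)
{
	rq->need_resched = true;
}

/*
 * Toy stand-in for check_preempt_curr(). Assumption for this model:
 * a queued task always preempts the idle task, and nothing else.
 */
static void toy_check_preempt_curr(struct toy_rq *rq, const struct toy_task *p)
{
	if (rq->curr_is_idle && p->queued)
		toy_resched_curr(rq);
}

/* Behavior before the fix: only a running task triggers a reschedule. */
static void toy_move_task_old(struct toy_rq *rq, const struct toy_task *p)
{
	if (p->running)
		toy_resched_curr(rq);
}

/* Behavior after the fix: queued-but-not-running tasks are re-evaluated. */
static void toy_move_task_new(struct toy_rq *rq, const struct toy_task *p)
{
	if (p->running)
		toy_resched_curr(rq);
	else if (p->queued)
		toy_check_preempt_curr(rq, p);
}
```

In this model, a task that is queued but not running on an idle CPU
leaves `need_resched` unset under `toy_move_task_old()`, which mirrors
the stall in step 5, while `toy_move_task_new()` sets it.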

Fixes: 29f59db3a7 ("sched: group-scheduler core")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20260130083438.1122457-1-quzicheng@huawei.com
2026-02-03 12:04:19 +01:00
autogroup.c cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() 2025-11-03 11:46:18 -10:00
autogroup.h sched: Clean up and standardize #if/#else/#endif markers in sched/autogroup.[ch] 2025-06-13 08:47:14 +02:00
build_policy.c sched_ext: Move internal type and accessor definitions to ext_internal.h 2025-09-03 11:33:28 -10:00
build_utility.c sched/smp: Make SMP unconditional 2025-06-13 08:47:18 +02:00
clock.c sched/clock: Avoid false sharing for sched_clock_irqtime 2026-02-03 12:04:19 +01:00
completion.c sched: Make clangd usable 2025-06-11 11:20:53 +02:00
core.c sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups 2026-02-03 12:04:19 +01:00
core_sched.c sched: Make clangd usable 2025-06-11 11:20:53 +02:00
cpuacct.c sched: Make clangd usable 2025-06-11 11:20:53 +02:00
cpudeadline.c sched/deadline: only set free_cpus for online runqueues 2025-10-16 11:13:49 +02:00
cpudeadline.h sched/deadline: only set free_cpus for online runqueues 2025-10-16 11:13:49 +02:00
cpufreq.c sched: Make clangd usable 2025-06-11 11:20:53 +02:00
cpufreq_schedutil.c sched/cpufreq: Use %pe format for PTR_ERR() printing 2026-02-03 12:04:19 +01:00
cpupri.c sched: Make clangd usable 2025-06-11 11:20:53 +02:00
cpupri.h sched/smp: Make SMP unconditional 2025-06-13 08:47:18 +02:00
cputime.c sched/clock: Avoid false sharing for sched_clock_irqtime 2026-02-03 12:04:19 +01:00
deadline.c sched/debug: Fix dl_server (re)start conditions 2026-02-03 12:04:18 +01:00
debug.c sched/debug: Fix dl_server (re)start conditions 2026-02-03 12:04:18 +01:00
ext.c sched_ext: Add a DL server for sched_ext tasks 2026-02-03 12:04:17 +01:00
ext.h sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations 2025-09-03 11:36:07 -10:00
ext_idle.c sched_ext: Wrap kfunc args in struct to prepare for aux__prog 2025-10-13 08:49:29 -10:00
ext_idle.h sched_ext: Always use SMP versions in kernel/sched/ext_idle.h 2025-06-13 14:47:59 -10:00
ext_internal.h sched_ext: Implement load balancer for bypass mode 2025-11-12 06:43:44 -10:00
fair.c Linux 6.19-rc8 2026-02-03 12:04:13 +01:00
features.h sched/fair: Disable scheduler feature NEXT_BUDDY 2026-01-23 11:53:19 +01:00
idle.c sched_ext: Add a DL server for sched_ext tasks 2026-02-03 12:04:17 +01:00
isolation.c sched/isolation: Force housekeeping if isolcpus and nohz_full don't leave any 2025-11-20 20:17:31 +01:00
loadavg.c Merge branch 'tip/sched/urgent' 2025-07-14 17:16:28 +02:00
Makefile tracing: Disable branch profiling in noinstr code 2025-03-22 09:49:26 +01:00
membarrier.c rseq: Simplify the event notification 2025-11-04 08:30:09 +01:00
pelt.c treewide: Update email address 2026-01-11 06:09:11 -10:00
pelt.h sched/fair: Switch to task based throttle model 2025-09-03 10:03:14 +02:00
psi.c sched/psi: Fix psi_seq initialization 2025-08-04 10:51:22 -07:00
rq-offsets.c sched: Make migrate_{en,dis}able() inline 2025-09-25 09:57:16 +02:00
rt.c sched/rt: Skip currently executing CPU in rto_next_cpu() 2026-02-03 12:04:19 +01:00
sched-pelt.h sched: Make clangd usable 2025-06-11 11:20:53 +02:00
sched.h sched/clock: Avoid false sharing for sched_clock_irqtime 2026-02-03 12:04:19 +01:00
smp.h sched: Make clangd usable 2025-06-11 11:20:53 +02:00
stats.c sched/smp: Use the SMP version of schedstats 2025-06-13 08:47:21 +02:00
stats.h sched/core: Fix psi_dequeue() for Proxy Execution 2025-12-06 10:13:16 +01:00
stop_task.c sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*() 2025-12-17 10:53:25 +01:00
swait.c sched: Make clangd usable 2025-06-11 11:20:53 +02:00
syscalls.c sched: Deadline has dynamic priority 2026-01-15 21:57:53 +01:00
topology.c sched_ext: Add a DL server for sched_ext tasks 2026-02-03 12:04:17 +01:00
wait.c ARM: 2025-07-30 17:14:01 -07:00
wait_bit.c sched: Make clangd usable 2025-06-11 11:20:53 +02:00