linux

mirror of https://github.com/torvalds/linux.git synced 2026-03-08 01:04:41 +01:00

Author	SHA1	Message	Date
Zhao Mengmeng	9af832c0a7	tools/sched_ext: Add -fms-extensions to bpf build flags Similar to commit `835a507535` ("selftests/bpf: Add -fms-extensions to bpf build flags") and commit `639f58a0f4` ("bpftool: Fix build warnings due to MS extensions") The kernel is now built with -fms-extensions, therefore generated vmlinux.h contains types like: struct aes_key { struct aes_enckey; union aes_invkey_arch inv_k; }; struct ns_common { ... union { struct ns_tree; struct callback_head ns_rcu; }; }; Which raise warning like below when building scx scheduler: tools/sched_ext/build/include/vmlinux.h:50533:3: warning: declaration does not declare anything [-Wmissing-declarations] 50533 \| struct ns_tree; \| ^ Fix it by using -fms-extensions and -Wno-microsoft-anon-tag flags to build bpf programs that #include "vmlinux.h" Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-03-02 22:00:23 -10:00
David Carlier	032e084f0d	tools/sched_ext: fix strtoul() misuse in scx_hotplug_seq() scx_hotplug_seq() uses strtoul() but validates the result with a negative check (val < 0), which can never be true for an unsigned return value. Use the endptr mechanism to verify the entire string was consumed, and check errno == ERANGE for overflow detection. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-27 09:17:44 -10:00
Cheng-Yang Chou	ee0ff6690f	tools/sched_ext: Add Kconfig to sync with upstream Add the missing Kconfig file to tools/sched_ext/ as referenced in the README. Ref: https://github.com/sched-ext/scx/blob/main/kernel.config Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-24 07:51:36 -10:00
Cheng-Yang Chou	095f569332	tools/sched_ext: Sync README.md Kconfig with upstream scx Sync the documentation with the upstream scx repository to reflect the current recommended configuration. Ref: https://github.com/sched-ext/scx/blob/main/README.md#build--install Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-24 07:51:29 -10:00
Cheng-Yang Chou	cbb297323d	tools/sched_ext: scx_sdt: Remove unused '-f' option The '-f' option is defined in getopt() but not handled in the switch statement or documented in the help text. Providing '-f' currently triggers the default error path. Remove it to sync the optstring with the actual implementation. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-23 07:45:34 -10:00
Cheng-Yang Chou	0c36a6f6f0	tools/sched_ext: scx_central: Remove unused '-p' option The '-p' option is defined in getopt() but not handled in the switch statement or documented in the help text. Providing '-p' currently triggers the default error path. Remove it to sync the optstring with the actual implementation. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-23 07:45:30 -10:00
David Carlier	640c9dc72f	tools/sched_ext: fix getopt not re-parsed on restart After goto restart, optind retains its advanced position from the previous getopt loop, causing getopt() to immediately return -1. This silently drops all command-line options on the restarted skeleton. Reset optind to 1 at the restart label so options are re-parsed. Affected schedulers: scx_simple, scx_central, scx_flatcg, scx_pair, scx_sdt, scx_cpu0. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-20 17:17:38 -10:00
David Carlier	f892f9f994	tools/sched_ext: scx_userland: fix data races on shared counters The stats thread reads nr_vruntime_enqueues, nr_vruntime_dispatches, nr_vruntime_failed, and nr_curr_enqueued concurrently with the main thread writing them, with no synchronization. Use __atomic builtins with relaxed ordering for all accesses to these counters to eliminate the data races. Only display accuracy is affected, not scheduling correctness. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-20 17:17:31 -10:00
David Carlier	625be3456b	tools/sched_ext: scx_pair: fix stride == 0 crash on single-CPU systems nr_cpu_ids / 2 produces stride 0 on a single-CPU system, which later causes SCX_BUG_ON(i == j) to fire. Validate stride after option parsing to also catch invalid user-supplied values via -S. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-18 07:03:57 -10:00
David Carlier	55a24d9203	tools/sched_ext: scx_central: fix CPU_SET and skeleton leak on early exit Use CPU_SET_S() instead of CPU_SET() on the dynamically allocated cpuset to avoid a potential out-of-bounds write when nr_cpu_ids exceeds CPU_SETSIZE. Also destroy the skeleton before returning on invalid central CPU ID to prevent a resource leak. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-18 07:03:50 -10:00
David Carlier	0767684613	tools/sched_ext: scx_userland: fix stale data on restart Reset all counters, tasks and vruntime_head list on restart. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-16 21:02:16 -10:00
David Carlier	cabd76bbc0	tools/sched_ext: scx_flatcg: fix potential stack overflow from VLA in fcg_read_stats fcg_read_stats() had a VLA allocating 21 * nr_cpus * 8 bytes on the stack, risking stack overflow on large CPU counts (nr_cpus can be up to 512). Fix by using a single heap allocation with the correct size, reusing it across all stat indices, and freeing it at the end. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-16 21:01:18 -10:00
David Carlier	048714d9df	tools/sched_ext: scx_userland: fix restart and stats thread lifecycle bugs Fix three issues in scx_userland's restart path: - exit_req is not reset on restart, causing sched_main_loop() to exit immediately without doing any scheduling work. - stats_printer thread handle is local to spawn_stats_thread(), making it impossible to join from main(). Promote it to file scope. - The stats thread continues reading skel->bss after the skeleton is destroyed on restart, causing a use-after-free. Join the stats thread before destroying the skeleton to ensure it has exited. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-12 11:17:35 -10:00
David Carlier	988369d236	tools/sched_ext: scx_central: fix sched_setaffinity() call with the set size The cpu set is dynamically allocated for nr_cpu_ids using CPU_ALLOC(), so the size passed to sched_setaffinity() should be CPU_ALLOC_SIZE() rather than sizeof(cpu_set_t). Valgrind flagged this as accessing unaddressable bytes past the allocation. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-12 07:30:17 -10:00
David Carlier	11fece49e9	tools/sched_ext: scx_flatcg: zero-initialize stats counter array The local cnts array in read_stats() is not initialized before being accumulated into per-CPU stats, which may lead to reading garbage values. Zero it out with memset alongside the existing stats array initialization. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-12 07:28:02 -10:00
Linus Torvalds	38ef046544	sched_ext: Changes for v6.20 - Move C example schedulers back from the external scx repo to tools/sched_ext as the authoritative source. scx_userland and scx_pair are returning while scx_sdt (BPF arena-based task data management) is new. These schedulers will be dropped from the external repo. - Improve error reporting by adding scx_bpf_error() calls when DSQ creation fails across all in-tree schedulers. - Avoid redundant irq_work_queue() calls in destroy_dsq() by only queueing when llist_add() indicates an empty list. - Fix flaky init_enable_count selftest by properly synchronizing pre-forked children using a pipe instead of sleep(). -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaYo1pQ4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGa5VAP9udiGksQ4bBFQrUD+0yhuvjsSuXzssfdfxHgT6 Hj66wgEAjgbnSyxfcGrB+w7DWUxNLaZlXepibVMPcfvAaSieSgU= =UvLY -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Move C example schedulers back from the external scx repo to tools/sched_ext as the authoritative source. scx_userland and scx_pair are returning while scx_sdt (BPF arena-based task data management) is new. These schedulers will be dropped from the external repo. - Improve error reporting by adding scx_bpf_error() calls when DSQ creation fails across all in-tree schedulers - Avoid redundant irq_work_queue() calls in destroy_dsq() by only queueing when llist_add() indicates an empty list - Fix flaky init_enable_count selftest by properly synchronizing pre-forked children using a pipe instead of sleep() * tag 'sched_ext-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: selftests/sched_ext: Fix init_enable_count flakiness tools/sched_ext: Fix data header access during free in scx_sdt tools/sched_ext: Add error logging for dsq creation failures in remaining schedulers tools/sched_ext: add arena based scheduler tools/sched_ext: add scx_pair scheduler tools/sched_ext: add scx_userland scheduler sched_ext: Add error logging for dsq creation failures sched_ext: Avoid multiple irq_work_queue() calls in destroy_dsq()	2026-02-11 13:35:24 -08:00
Emil Tsalapatis	2e06d54ea9	tools/sched_ext: Fix data header access during free in scx_sdt Fix a pointer arithmetic error in scx_sdt during freeing that causes the allocator to use the wrong memory address for the allocation's data header. Fixes: `36929ebd17` ("tools/sched_ext: add arena based scheduler") Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-02 05:50:14 -10:00
zhidao su	bd4f0822f4	tools/sched_ext: Add error logging for dsq creation failures in remaining schedulers Add scx_bpf_error() calls when scx_bpf_create_dsq() fails in the remaining schedulers to improve debuggability: - scx_simple.bpf.c: simple_init() - scx_sdt.bpf.c: sdt_init() - scx_cpu0.bpf.c: cpu0_init() - scx_flatcg.bpf.c: fcg_init() This follows the same pattern established in commit `2f8d489897` ("sched_ext: Add error logging for dsq creation failures") for other schedulers and ensures consistent error reporting across all schedulers. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-01 07:10:18 -10:00
Emil Tsalapatis	36929ebd17	tools/sched_ext: add arena based scheduler Add a scheduler that uses BPF arenas to manage task context data. Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-27 15:43:34 -10:00
Emil Tsalapatis	f0262b102c	tools/sched_ext: add scx_pair scheduler Add the scx_pair cgroup-based core scheduler. Cc: Tejun Heo <tj@kernel.org> Cc: David Vernet <dvernet@meta.com> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-27 15:43:28 -10:00
Emil Tsalapatis	cc4448d085	tools/sched_ext: add scx_userland scheduler Add in the scx_userland scheduler that does vruntime-based scheduling in userspace code and communicates scheduling decisions to BPF by accessing and modifying globals through the skeleton. Cc: Tejun Heo <tj@kernel.org> Cc: David Vernet <dvernet@meta.com> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-27 15:43:22 -10:00
Alexei Starovoitov	e3d0dbb3b5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5 Cross-merge BPF and other fixes after downstream PR. No conflicts. Adjacent: Auto-merging MAINTAINERS Auto-merging Makefile Auto-merging kernel/bpf/verifier.c Auto-merging kernel/sched/ext.c Auto-merging mm/memcontrol.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-14 15:22:01 -08:00
George Guo	2f8d489897	sched_ext: Add error logging for dsq creation failures Add scx_bpf_error() calls when scx_bpf_create_dsq() fails in multiple schedulers to improve debuggability: - scx_central.bpf.c: central_init() - scx_flatcg.bpf.c: fcg_cgroup_init() and fcg_init() - scx_qmap.bpf.c: qmap_init() Signed-off-by: George Guo <guodongtai@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-12 08:44:11 -10:00
Kohei Enju	c9894e6f01	tools/sched_ext: update scx_show_state.py for scx_aborting change Commit `a69040ed57` ("sched_ext: Simplify breather mechanism with scx_aborting flag") removed scx_in_softlockup and scx_breather_depth, replacing them with scx_aborting. Update the script accordingly. Fixes: `a69040ed57` ("sched_ext: Simplify breather mechanism with scx_aborting flag") Signed-off-by: Kohei Enju <enjuk@amazon.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-28 06:11:26 -10:00
Kohei Enju	f92ff79ba2	tools/sched_ext: fix scx_show_state.py for scx_root change Commit `48e1267773` ("sched_ext: Introduce scx_sched") introduced scx_root and removed scx_ops, causing scx_show_state.py to fail when searching for the 'scx_ops' object. [1] Fix by using 'scx_root' instead, with NULL pointer handling. [1] # drgn -s vmlinux ./tools/sched_ext/scx_show_state.py Traceback (most recent call last): File "/root/.venv/bin/drgn", line 8, in <module> sys.exit(_main()) ~~~~~^^ File "/root/.venv/lib64/python3.14/site-packages/drgn/cli.py", line 625, in _main runpy.run_path( ~~~~~~~~~~~~~~^ script_path, init_globals={"prog": prog}, run_name="__main__" ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ) ^ File "<frozen runpy>", line 287, in run_path File "<frozen runpy>", line 98, in _run_module_code File "<frozen runpy>", line 88, in _run_code File "./tools/sched_ext/scx_show_state.py", line 30, in <module> ops = prog['scx_ops'] ~~~~^^^^^^^^^^^ _drgn.ObjectNotFoundError: could not find 'scx_ops' Fixes: `48e1267773` ("sched_ext: Introduce scx_sched") Signed-off-by: Kohei Enju <enjuk@amazon.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-28 06:11:13 -10:00
Ihor Solodrai	903922cfa0	lib/Kconfig.debug: Set the minimum required pahole version to v1.22 Subsequent patches in the series change vmlinux linking scripts to unconditionally pass --btf_encode_detached to pahole, which was introduced in v1.22 [1][2]. This change allows to remove PAHOLE_HAS_SPLIT_BTF Kconfig option and other checks of older pahole versions. [1] https://github.com/acmel/dwarves/releases/tag/v1.22 [2] https://lore.kernel.org/bpf/cbafbf4e-9073-4383-8ee6-1353f9e5869c@oracle.com/ Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Nicolas Schier <nsc@kernel.org> Link: https://lore.kernel.org/bpf/20251219181825.1289460-1-ihor.solodrai@linux.dev	2025-12-19 10:55:40 -08:00
Rong Tao	06a7415cf2	sched_ext: tools: Removing duplicate targets during non-cross compilation When cross-compilation is not used, BPFOBJ and HOST_BPFOBJ are identical files, libbpf.a, and duplicate libbpf.a files should be removed. Signed-off-by: Rong Tao <rongtao@cestc.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-20 07:00:27 -10:00
Tejun Heo	c948d9f80c	sched_ext: Add scx_cpu0 example scheduler Add scx_cpu0, a simple scheduler that queues all tasks to a single DSQ and only dispatches them from CPU0 in FIFO order. This is useful for testing bypass behavior when many tasks are concentrated on a single CPU. If the load balancer doesn't work, bypass mode can trigger task hangs or RCU stalls as the queue is long and there's only one CPU working on it. v2: Check whether task is on CPU0 at enqueue using scx_bpf_task_cpu() instead of nr_cpus_allowed (Andrea Righi). Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Emil Tsalapatis <etsal@meta.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-11-12 06:43:44 -10:00
Tejun Heo	3442345644	sched_ext/tools: Restore backward compat with v6.12 kernels Commit `111a79800a` ("tools/sched_ext: Strip compatibility macros for cgroup and dispatch APIs") removed the compat layer for v6.12-v6.13 kfunc renaming, but v6.12 is the current LTS kernel and will remain supported through 2026. Restore backward compatibility so schedulers built with v6.19+ headers can run on v6.12 kernels. The restored compat differs from the original in two ways: 1. Uses ___new/___old suffixes instead of ___compat for clarity. The new macros check for v6.13+ names (scx_bpf_dsq_move), fall back to v6.12 names (scx_bpf_dispatch_from_dsq, scx_bpf_consume), then return safe no-ops for missing symbols. 2. Integrates with the args-struct-packing changes added in `c0d630ba34` ("sched_ext: Wrap kfunc args in struct to prepare for aux__prog"). scx_bpf_dsq_insert_vtime() now tries __scx_bpf_dsq_insert_vtime (args struct), then scx_bpf_dsq_insert_vtime___compat (v6.13-v6.18), then scx_bpf_dispatch_vtime___compat (pre-v6.13). Forward compatibility is not restored - binaries built against v6.13 or earlier headers won't run on v6.19+ kernels, as the old kfunc names are not exported. This is acceptable since the priority is new binaries running on older kernels. Also add missing compat checks for ops.cgroup_set_bandwidth() (added v6.17) and ops.cgroup_set_idle() (added v6.19). These need to be NULLed out in userspace on older kernels. Reported-by: Andrea Righi <arighi@nvidia.com> Acked-and-tested-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-29 10:41:21 -10:00
Tejun Heo	a3f5d48222	sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere The ops.cpu_acquire/release() callbacks miss events under multiple conditions. There are two distinct task dispatch gaps that can cause cpu_released flag desynchronization: 1. balance-to-pick_task gap: This is what was originally reported. balance_scx() can enqueue a task, but during consume_remote_task() when the rq lock is released, a higher priority task can be enqueued and ultimately picked while cpu_released remains false. This gap is closeable via RETRY_TASK handling. 2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local DSQ. By the time the sched path runs on the target CPU, higher class tasks may already be queued. In such cases, nothing on sched_ext side will be invoked, and the only solution would be a hook invoked regardless of sched class, which isn't desirable. Rather than adding invasive core hooks, BPF schedulers can use generic BPF mechanisms like tracepoints. From SCX scheduler's perspective, this is congruent with other mechanisms it already uses and doesn't add further friction. The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when a CPU gets preempted by a higher priority scheduling class. However, the old scx_bpf_reenqueue_local() could only be called from cpu_release() context. Add a new version of scx_bpf_reenqueue_local() that can be called from any context by deferring the actual re-enqueue operation. This eliminates the need for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF mechanisms like the sched_switch tracepoint to detect and handle CPU preemption. Update scx_qmap to demonstrate the new approach using sched_switch instead of cpu_release, with compat support for older kernels. Mark cpu_acquire/release() as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in v6.23. Reported-by: Wen-Fang Liu <liuwenfang@honor.com> Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/ Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-29 05:29:04 -10:00
Tejun Heo	dcb938c453	sched_ext: Add ___compat suffix to scx_bpf_dsq_insert___v2 in compat.bpf.h `2dbbdeda77` ("sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibility") renamed the new bool-returning variant to scx_bpf_dsq_insert___v2 in the kernel. However, libbpf currently only strips ___SUFFIX on the BPF side, not on kernel symbols, so the compat wrapper couldn't match the kernel kfunc and would always fall back to the old variant even when the new one was available. Add an extra ___compat suffix as a workaround - libbpf strips one suffix on the BPF side leaving ___v2, which then matches the kernel kfunc directly. In the future when libbpf strips all suffixes on both sides, all suffixes can be dropped. Fixes: `2dbbdeda77` ("sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibility") Cc: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-24 13:37:37 -10:00
Tejun Heo	2dbbdeda77	sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibility `cded46d971` ("sched_ext: Make scx_bpf_dsq_insert() return bool") introduced a new bool-returning scx_bpf_dsq_insert() and renamed the old void-returning version to scx_bpf_dsq_insert___compat, with the expectation that libbpf would match old binaries to the ___compat variant, maintaining backward binary compatibility. However, while libbpf ignores ___suffix on the BPF side when matching symbols, it doesn't do so for kernel-side symbols. Old binaries compiled with the original scx_bpf_dsq_insert() could no longer resolve the symbol. Fix by reversing the naming: Keep scx_bpf_dsq_insert() as the old void-returning interface and add ___v2 to the new bool-returning version. This allows old binaries to continue working while new code can use the ___v2 variant. Once libbpf is updated to ignore kernel-side ___SUFFIX, the ___v2 suffix can be dropped when the compat interface is removed. v2: Use ___v2 instead of ___new. Fixes: `cded46d971` ("sched_ext: Make scx_bpf_dsq_insert() return bool") Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-21 10:40:15 -10:00
Ryan Newton	44f5c8ec5b	sched_ext: Add lockless peek operation for DSQs The builtin DSQ queue data structures are meant to be used by a wide range of different sched_ext schedulers with different demands on these data structures. They might be per-cpu with low-contention, or high-contention shared queues. Unfortunately, DSQs have a coarse-grained lock around the whole data structure. Without going all the way to a lock-free, more scalable implementation, a small step we can take to reduce lock contention is to allow a lockless, small-fixed-cost peek at the head of the queue. This change allows certain custom SCX schedulers to cheaply peek at queues, e.g. during load balancing, before locking them. But it represents a few extra memory operations to update the pointer each time the DSQ is modified, including a memory barrier on ARM so the write appears correctly ordered. This commit adds a first_task pointer field which is updated atomically when the DSQ is modified, and allows any thread to peek at the head of the queue without holding the lock. Signed-off-by: Ryan Newton <newton@meta.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-15 06:46:25 -10:00
Tejun Heo	bd7143e74e	sched_ext/tools: Add compat wrapper for scx_bpf_task_set_slice/dsq_vtime() for sub-scheduler authority checks. Add compat wrappers which fall back to direct p->scx field writes on older kernels. Suggested-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-13 08:49:29 -10:00
Tejun Heo	cded46d971	sched_ext: Make scx_bpf_dsq_insert*() return bool In preparation for hierarchical schedulers, change scx_bpf_dsq_insert() and scx_bpf_dsq_insert_vtime() to return bool instead of void. With sub-schedulers, there will be no reliable way to guarantee a task is still owned by the sub-scheduler at insertion time (e.g., the task may have been migrated to another scheduler). The bool return value will enable sub-schedulers to detect and gracefully handle insertion failures. For the root scheduler, insertion failures will continue to trigger scheduler abort via scx_error(), so existing code doesn't need to check the return value. Backward compatibility is maintained through compat wrappers. Also update scx_bpf_dsq_move() documentation to clarify that it can return false for sub-schedulers when @dsq_id points to a disallowed local DSQ. Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-13 08:49:29 -10:00
Tejun Heo	c0d630ba34	sched_ext: Wrap kfunc args in struct to prepare for aux__prog scx_bpf_dsq_insert_vtime() and scx_bpf_select_cpu_and() currently have 5 parameters. An upcoming change will add aux__prog parameter which will exceed BPF's 5 argument limit. Prepare by adding new kfuncs __scx_bpf_dsq_insert_vtime() and __scx_bpf_select_cpu_and() that take args structs. The existing kfuncs are kept as compatibility wrappers. BPF programs use inline wrappers that detect kernel API version via bpf_core_type_exists() and use the new struct-based kfuncs when available, falling back to compat kfuncs otherwise. This allows BPF programs to work with both old and new kernels. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-13 08:49:29 -10:00
Tejun Heo	3035addfaf	sched_ext: Add scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() With the planned hierarchical scheduler support, sub-schedulers will need to be verified for authority before being allowed to modify task->scx.slice and task->scx.dsq_vtime. Add scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() which will perform the necessary permission checks. Root schedulers can still directly write to these fields, so this doesn't affect existing schedulers. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-13 08:49:29 -10:00
Tejun Heo	111a79800a	tools/sched_ext: Strip compatibility macros for cgroup and dispatch APIs Enough time has passed since the introduction of scx_bpf_task_cgroup() and the scx_bpf_dispatch* -> scx_bpf_dsq* kfunc renaming. Strip the compatibility macros. Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-13 08:49:29 -10:00
Tejun Heo	7852e0fd1d	tools/sched_ext: scx_qmap: Make debug output quieter by default scx_qmap currently outputs verbose debug messages including cgroup operations and CPU online/offline events by default, which can be noisy during normal operation. While the existing -P option controls DSQ dumps and event statistics, there's no way to suppress the other debug messages. Split the debug output controls to make scx_qmap quieter by default. The -P option continues to control DSQ dumps and event statistics (print_dsqs_and_events), while a new -M option controls debug messages like cgroup operations and CPU events (print_msgs). This allows users to run scx_qmap with minimal output and selectively enable debug information as needed. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 09:03:26 -10:00
Tejun Heo	d452972858	sched_ext: Make qmap dump operation non-destructive The qmap dump operation was destructively consuming queue entries while displaying them. As dump can be triggered anytime, this can easily lead to stalls. Add a temporary dump_store queue and modify the dump logic to pop entries, display them, and then restore them back to the original queue. This allows dump operations to be performed without affecting the scheduler's queue state. Note that if racing against new enqueues during dump, ordering can get mixed up, but this is acceptable for debugging purposes. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-23 09:03:26 -10:00
Andrea Righi	a08b4dcad9	tools/sched_ext: Add compat helper for scx_bpf_cpu_curr() Introduce a compatibility helper that allows BPF schedulers to use scx_bpf_cpu_curr() on older kernels. Fixes: `20b158094a` ("sched_ext: Introduce scx_bpf_cpu_curr()") Cc: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-03 20:40:45 -10:00
Christian Loehle	20b158094a	sched_ext: Introduce scx_bpf_cpu_curr() Provide scx_bpf_cpu_curr() as a way for scx schedulers to check the curr task of a remote rq without assuming its lock is held. Many scx schedulers make use of scx_bpf_cpu_rq() to check a remote curr (e.g. to see if it should be preempted). This is problematic because scx_bpf_cpu_rq() provides access to all fields of struct rq, most of which aren't safe to use without holding the associated rq lock. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-03 11:50:42 -10:00
Christian Loehle	e0ca169638	sched_ext: Introduce scx_bpf_locked_rq() Most fields in scx_bpf_cpu_rq() assume that its rq_lock is held. Furthermore they become meaningless without rq lock, too. Make a safer version of scx_bpf_cpu_rq() that only returns a rq if we hold rq lock of that rq. Also mark the new scx_bpf_locked_rq() as returning NULL as scx_bpf_cpu_rq() should've been too. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-09-03 11:50:36 -10:00
Andrea Righi	de68c05189	tools/sched_ext: Receive updates from SCX repo Receive tools/sched_ext updates form https://github.com/sched-ext/scx to sync userspace bits: - basic BPF arena allocator abstractions, - additional process flags definitions, - fixed is_migration_disabled() helper, - separate out user_exit_info BPF and user space code. This also fixes the following warning when building the selftests: tools/sched_ext/include/scx/common.bpf.h:550:9: warning: 'likely' macro redefined [-Wmacro-redefined] 550 \| #define likely(x) __builtin_expect(!!(x), 1) \| ^ Co-developed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-08-11 08:21:57 -10:00
Tejun Heo	ddceadce63	sched_ext: Add support for cgroup bandwidth control interface From 077814f57f8acce13f91dc34bbd2b7e4911fbf25 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Fri, 13 Jun 2025 15:06:47 -1000 - Add CONFIG_GROUP_SCHED_BANDWIDTH which is selected by both CONFIG_CFS_BANDWIDTH and EXT_GROUP_SCHED. - Put bandwidth control interface files for both cgroup v1 and v2 under CONFIG_GROUP_SCHED_BANDWIDTH. - Update tg_bandwidth() to fetch configuration parameters from fair if CONFIG_CFS_BANDWIDTH, SCX otherwise. - Update tg_set_bandwidth() to update the parameters for both fair and SCX. - Add bandwidth control parameters to struct scx_cgroup_init_args. - Add sched_ext_ops.cgroup_set_bandwidth() which is invoked on bandwidth control parameter updates. - Update scx_qmap and maximal selftest to test the new feature. Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-20 17:03:51 -10:00
Honglei Wang	f203683c3e	sched_ext: change the variable name for slice refill event SCX_EV_ENQ_SLICE_DFL gives the impression that the event only occurs when the tasks were enqueued, which seems not accurate. What it actually means is the refilling with defalt slice, and this can occur either when enqueue or pick_task. Let's change the variable to SCX_EV_REFILL_SLICE_DFL. Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-18 17:25:39 -10:00
yangsonghua	6d65f682a9	sched_ext: Improve cross-compilation support in Makefile Modify the tools/sched_ext/Makefile to better handle cross-compilation environments by: 1. Fix host tools build directory structure by separating obj/ from output (HOST_BUILD_DIR now points to $(OBJ_DIR)/host/obj) 2. Properly propagate CROSS_COMPILE to libbpf sub-make invocation 3. Add missing $(HOST_BPFOBJ) build rule with proper host toolchain flags (ARCH=, CROSS_COMPILE=, explicit HOSTCC/HOSTLD) 4. Consistently quote $(HOSTCC) in bpftool build rule 5. Change LDFLAGS assignment to += to allow external extensions The changes ensure proper cross-compilation behavior while maintaining backward compatibility with native builds. Host tools are now correctly built with the host toolchain while target binaries use the cross-toolchain. Signed-off-by: yangsonghua <yangsonghua@lixiang.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-14 06:58:41 -10:00
Tejun Heo	294f5ff474	sched_ext: Merge branch 'for-6.15-fixes' into for-6.16 Pull for-6.15-fixes to receive: `e776b26e37` ("sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings") which conflicts with: `1a7ff7216c` ("sched_ext: Drop "ops" from scx_ops_enable_state and friends") The former removes code updated by the latter. Resolved by removing the updated section. Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-08 08:56:57 -10:00
Tejun Heo	bc08b15b54	sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation SCX_OPS_HAS_CGROUP_WEIGHT was only used to suppress the missing cgroup weight support warnings. Now that the warnings are removed, the flag doesn't do anything. Mark it for deprecation and remove its usage from scx_flatcg. v2: Actually include the scx_flatcg update. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-and-reviewed-by: Andrea Righi <arighi@nvidia.com>	2025-04-08 08:53:52 -10:00
Andrea Righi	683d2d0fab	sched_ext: idle: Introduce scx_bpf_select_cpu_and() Provide a new kfunc, scx_bpf_select_cpu_and(), that can be used to apply the built-in idle CPU selection policy to a subset of allowed CPU. This new helper is basically an extension of scx_bpf_select_cpu_dfl(). However, when an idle CPU can't be found, it returns a negative value instead of @prev_cpu, aligning its behavior more closely with scx_bpf_pick_idle_cpu(). It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce strict selection to @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to request only a full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying the built-in selection logic. With this helper, BPF schedulers can apply the built-in idle CPU selection policy restricted to any arbitrary subset of CPUs. Example usage ============= Possible usage in ops.select_cpu(): s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct p, s32 prev_cpu, u64 wake_flags) { const struct cpumask cpus = task_allowed_cpus(p) ?: p->cpus_ptr; s32 cpu; cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0); if (cpu >= 0) { scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); return cpu; } return prev_cpu; } Results ======= Load distribution on a 4 sockets, 4 cores per socket system, simulated using virtme-ng, running a modified version of scx_bpfland that uses scx_bpf_select_cpu_and() with 0xff00 as the allowed subset of CPUs: $ vng --cpu 16,sockets=4,cores=4,threads=1 ... $ stress-ng -c 16 ... $ htop ... 0[ 0.0%] 8[\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|100.0%] 1[ 0.0%] 9[\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|100.0%] 2[ 0.0%] 10[\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|100.0%] 3[ 0.0%] 11[\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|100.0%] 4[ 0.0%] 12[\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|100.0%] 5[ 0.0%] 13[\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|100.0%] 6[ 0.0%] 14[\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|100.0%] 7[ 0.0%] 15[\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|100.0%] With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across all the available CPUs. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-07 07:13:52 -10:00

1 2 3

119 commits