rseq: Add fields and constants for time slice extension

Aside from a Kconfig knob, add the following items:

   - Two flag bits for the rseq user space ABI, which allow user space to
     query the availability and enablement without a syscall.

   - A new member to the user space ABI struct rseq, which is going to be
     used to communicate request and grant between kernel and user space.

   - A rseq state struct to hold the kernel state of this mechanism

   - Documentation of the new mechanism

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.669472597@linutronix.de
Thomas Gleixner 2025-12-15 17:52:04 +01:00 committed by Peter Zijlstra
parent 4fe82cf302
commit d7a5da7a0f
6 changed files with 220 additions and 1 deletions

@ -21,6 +21,7 @@ System calls
ebpf/index
ioctl/index
mseal
rseq
Security-related interfaces
===========================

@ -0,0 +1,135 @@
=====================
Restartable Sequences
=====================
Restartable Sequences allow a thread to register a per-thread userspace
memory area that serves as an ABI between kernel and userspace for three
purposes:
* userspace restartable sequences
* quick access to the current CPU number and node ID from userspace
* scheduler time slice extensions
Restartable sequences (per-cpu atomics)
---------------------------------------
Restartable sequences allow userspace to perform update operations on
per-cpu data without requiring heavyweight atomic operations. The actual
ABI is unfortunately only available in the code and selftests.
Quick access to CPU number, node ID
-----------------------------------
This allows per-CPU data to be implemented efficiently. Documentation is
in code and selftests. :(
Scheduler time slice extensions
-------------------------------
This allows a thread to request a time slice extension when it enters a
critical section to avoid contention on a resource when the thread is
scheduled out inside of the critical section.
The prerequisites for this functionality are:
* Enabled in Kconfig
* Enabled at boot time (default is enabled)
* A rseq userspace pointer has been registered for the thread
The thread has to enable the functionality via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

prctl() returns 0 on success. On failure it returns -1 and sets errno to
one of the following error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg4 and arg5 must be zero
ENOTSUPP  Functionality was disabled on the kernel command line
ENXIO     Available, but no rseq user struct registered
========= ==============================================================
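
As a rough illustration of the table above, the following sketch maps the
errno value of a failed ``PR_RSEQ_SLICE_EXTENSION_SET`` call to the
documented meaning. The helper ``slice_ext_set_error()`` and its enum are
made up for this example. ``ENOTSUPP`` is a kernel-internal error number
(524) that the userspace ``<errno.h>`` does not define, so the sketch
defines it locally and assumes that the value reaches errno unchanged.

```c
#include <errno.h>

/*
 * ENOTSUPP is kernel-internal (524 in include/linux/errno.h) and absent
 * from the userspace <errno.h>. Assumption: it shows up in errno as is.
 */
#ifndef ENOTSUPP
#define ENOTSUPP 524
#endif

/* Hypothetical classification of a failed SET request. */
enum slice_ext_error {
	SLICE_EXT_ERR_UNAVAILABLE,	/* EINVAL: not available or bad args */
	SLICE_EXT_ERR_BOOT_DISABLED,	/* ENOTSUPP: off on the command line */
	SLICE_EXT_ERR_NO_RSEQ,		/* ENXIO: no rseq struct registered */
	SLICE_EXT_ERR_UNKNOWN,		/* anything not listed in the table */
};

/* Translate the errno value seen after prctl() returned -1. */
static enum slice_ext_error slice_ext_set_error(int err)
{
	switch (err) {
	case EINVAL:
		return SLICE_EXT_ERR_UNAVAILABLE;
	case ENOTSUPP:
		return SLICE_EXT_ERR_BOOT_DISABLED;
	case ENXIO:
		return SLICE_EXT_ERR_NO_RSEQ;
	default:
		return SLICE_EXT_ERR_UNKNOWN;
	}
}
```

A caller would pass the errno value captured right after the failing
prctl() call. Only ``SLICE_EXT_ERR_NO_RSEQ`` can be fixed by the thread
itself, by registering a rseq area first and retrying.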
The state can also be queried via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);

prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
disabled. On failure it returns -1 and sets errno to the following error
code:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg3, arg4 and arg5 must be zero
========= ==============================================================
The availability and status are also exposed in the flags field of the rseq
ABI struct via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT`` bits. These bits are read-only for
user space and only for informational purposes.
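
A minimal sketch of how user space might decode these two bits from a copy
of the rseq flags word. The bit numbers (4 and 5) are taken from the
``enum rseq_cs_flags_bit`` additions of this patch. The mask names, the
struct and ``slice_ext_decode()`` are hypothetical and exist only in this
example. How the flags word is read from the registered rseq area is left
to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the new bits: AVAILABLE is bit 4, ENABLED is bit 5. */
#define SLICE_EXT_AVAILABLE_MASK	(1U << 4)
#define SLICE_EXT_ENABLED_MASK		(1U << 5)

/* Hypothetical result type for this example. */
struct slice_ext_status {
	bool available;		/* kernel was built with the feature */
	bool enabled;		/* feature is enabled for this thread */
};

/* Decode a snapshot of rseq::flags. The bits are read-only for user space. */
static struct slice_ext_status slice_ext_decode(uint32_t rseq_flags)
{
	struct slice_ext_status st = {
		.available = (rseq_flags & SLICE_EXT_AVAILABLE_MASK) != 0,
		.enabled   = (rseq_flags & SLICE_EXT_ENABLED_MASK) != 0,
	};

	return st;
}
```

Since the bits are informational only, a thread would typically check them
once to decide whether setting the request bit is worthwhile, without
issuing a syscall.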
If the mechanism was enabled via prctl(), the thread can request a time
slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
interrupted and the interrupt results in a reschedule request in the
kernel, then the kernel can grant a time slice extension and return to
userspace instead of scheduling out. The length of the extension is
determined by the ``rseq_slice_extension_nsec`` sysctl.
The kernel indicates the grant by clearing rseq::slice_ctrl::request and
setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
thread after granting the extension, the kernel clears the granted bit to
indicate that to userspace.
If the request bit is still set when leaving the critical section,
userspace can clear it and continue.
If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
leaving the critical section to relinquish the CPU. The kernel enforces
this by arming a timer to prevent misbehaving userspace from abusing this
mechanism.
If both the request bit and the granted bit are false when leaving the
critical section, then this indicates that a grant was revoked and no
further action is required by userspace.
The required code flow is as follows::

    rseq->slice_ctrl.request = 1;
    barrier(); // Prevent compiler reordering
    critical_section();
    barrier(); // Prevent compiler reordering
    rseq->slice_ctrl.request = 0;
    if (rseq->slice_ctrl.granted)
            rseq_slice_yield();

As all of this is strictly CPU local, there are no atomicity requirements.
Checking the granted state is racy, but that cannot be avoided at all::

    if (rseq->slice_ctrl.granted)
      -> Interrupt results in schedule and grant revocation
            rseq_slice_yield();

So there is no point in pretending that this might be solved by an atomic
operation.
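
The code flow described above can be tried out without kernel support. The
sketch below mirrors the layout of ``struct rseq_slice_ctrl`` from this
patch in a local ``struct slice_ctrl``. The yield operation is passed in as
a callback because this patch does not show how rseq_slice_yield(2) is
invoked. ``run_with_slice_ext()`` and the ``sim_*()`` helpers are
hypothetical names, and the ``volatile`` qualifier is an assumption of this
example, not a documented requirement.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of struct rseq_slice_ctrl as defined in this patch. */
struct slice_ctrl {
	union {
		uint32_t all;
		struct {
			uint8_t  request;
			uint8_t  granted;
			uint16_t reserved;
		};
	};
};
static_assert(sizeof(struct slice_ctrl) == 4, "must match the __u32 'all'");

/* Compiler-only barrier (GCC/Clang syntax). No CPU fence is emitted. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/*
 * Run @critical_section with a time slice extension requested. @yield
 * stands in for rseq_slice_yield(2). Returns 1 if the yield path was
 * taken and 0 otherwise.
 */
static int run_with_slice_ext(volatile struct slice_ctrl *ctrl,
			      void (*critical_section)(void *), void *arg,
			      void (*yield)(void))
{
	ctrl->request = 1;
	barrier();
	critical_section(arg);
	barrier();
	ctrl->request = 0;	/* Harmless if the kernel cleared it already */
	if (ctrl->granted) {	/* Racy by design, see above */
		yield();
		return 1;
	}
	return 0;
}

/* Simulation helpers, only for exercising the flow in user space. */
static int sim_yield_calls;

static void sim_yield(void)
{
	sim_yield_calls++;
}

/* Critical section during which the "kernel" grants an extension. */
static void sim_grant(void *arg)
{
	volatile struct slice_ctrl *ctrl = arg;

	ctrl->request = 0;
	ctrl->granted = 1;
}

/* Critical section which completes without any interruption. */
static void sim_no_grant(void *arg)
{
	(void)arg;
}
```

With ``sim_no_grant()`` the request bit is still set at the end of the
critical section and is simply cleared. With ``sim_grant()`` the granted
bit is observed and the yield callback runs exactly once.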
If the thread issues a syscall other than rseq_slice_yield(2) within the
granted timeslice extension, the grant is also revoked and the CPU is
relinquished immediately when entering the kernel. This is required as
syscalls might consume arbitrary CPU time until they reach a scheduling
point when the preemption model is either NONE or VOLUNTARY and therefore
might exceed the grant by far.
The preferred solution for user space is to use rseq_slice_yield(2), which
is side effect free. The support for arbitrary syscalls is required to
support applications with an onion-layer architecture, where the code
handling the critical section and requesting the time slice extension has
no control over the code within the critical section.
The kernel enforces flag consistency and terminates the thread with SIGSEGV
if it detects a violation.

@ -72,13 +72,36 @@ struct rseq_ids {
};
};
/**
 * union rseq_slice_state - Status information for rseq time slice extension
 * @state:	Compound to access the overall state
 * @enabled:	Time slice extension is enabled for the task
 * @granted:	Time slice extension was granted to the task
 */
union rseq_slice_state {
	u16 state;
	struct {
		u8 enabled;
		u8 granted;
	};
};

/**
 * struct rseq_slice - Status information for rseq time slice extension
 * @state:	Time slice extension state
 */
struct rseq_slice {
	union rseq_slice_state state;
};

/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
* @sig: Signature of critial section abort IPs
* @sig: Signature of critical section abort IPs
* @event: Storage for event management
* @ids: Storage for cached CPU ID and MM CID
* @slice: Storage for time slice extension data
*/
struct rseq_data {
struct rseq __user *usrptr;
@ -86,6 +109,9 @@ struct rseq_data {
u32 sig;
struct rseq_event event;
struct rseq_ids ids;
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
struct rseq_slice slice;
#endif
};
#else /* CONFIG_RSEQ */

@ -23,9 +23,15 @@ enum rseq_flags {
};
enum rseq_cs_flags_bit {
/* Historical and unsupported bits */
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
/* (3) Intentional gap to put new bits into a separate byte */
/* User read only feature flags */
RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
};
enum rseq_cs_flags {
@ -35,6 +41,11 @@ enum rseq_cs_flags {
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
(1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
(1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
};
/*
@ -53,6 +64,27 @@ struct rseq_cs {
__u64 abort_ip;
} __attribute__((aligned(4 * sizeof(__u64))));
/**
 * rseq_slice_ctrl - Time slice extension control structure
 * @all:	Compound value
 * @request:	Request for a time slice extension
 * @granted:	Granted time slice extension
 *
 * @request is set by user space and can be cleared by user space or kernel
 * space. @granted is set and cleared by the kernel and must only be read
 * by user space.
 */
struct rseq_slice_ctrl {
	union {
		__u32 all;
		struct {
			__u8 request;
			__u8 granted;
			__u16 __reserved;
		};
	};
};

/*
* struct rseq is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line.
@ -141,6 +173,12 @@ struct rseq {
*/
__u32 mm_cid;
/*
* Time slice extension control structure. CPU local updates from
* kernel and user space.
*/
struct rseq_slice_ctrl slice_ctrl;
/*
* Flexible array member at end of structure, after last feature field.
*/

@ -1938,6 +1938,18 @@ config RSEQ
If unsure, say Y.
config RSEQ_SLICE_EXTENSION
bool "Enable rseq-based time slice extension mechanism"
depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
help
Allows userspace to request a limited time slice extension when
returning from an interrupt to user space via the RSEQ shared
data ABI. If granted, that allows to complete a critical section,
so that other threads are not stuck on a conflicted resource,
while the task is scheduled out.
If unsure, say N.
config RSEQ_STATS
default n
bool "Enable lightweight statistics of restartable sequences" if EXPERT

@ -389,6 +389,8 @@ static bool rseq_reset_ids(void)
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
u32 rseqfl = 0;
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@ -440,6 +442,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
if (!access_ok(rseq, rseq_len))
return -EFAULT;
if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
scoped_user_write_access(rseq, efault) {
/*
* If the rseq_cs pointer is non-NULL on registration, clear it to
@ -449,11 +454,13 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
* clearing the fields. Don't bother reading it, just reset it.
*/
unsafe_put_user(0UL, &rseq->rseq_cs, efault);
unsafe_put_user(rseqfl, &rseq->flags, efault);
/* Initialize IDs in user space */
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
unsafe_put_user(0U, &rseq->node_id, efault);
unsafe_put_user(0U, &rseq->mm_cid, efault);
unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
}
/*