mirror of
https://github.com/torvalds/linux.git
synced 2026-03-08 04:04:43 +01:00
rseq: Add fields and constants for time slice extension

Aside of a Kconfig knob add the following items:

  - Two flag bits for the rseq user space ABI, which allow user space to
    query the availability and enablement without a syscall.

  - A new member to the user space ABI struct rseq, which is going to be
    used to communicate request and grant between kernel and user space.

  - A rseq state struct to hold the kernel state of this

  - Documentation of the new mechanism

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.669472597@linutronix.de

parent 4fe82cf302
commit d7a5da7a0f
6 changed files with 220 additions and 1 deletions

@@ -21,6 +21,7 @@ System calls
   ebpf/index
   ioctl/index
   mseal
   rseq

Security-related interfaces
===========================

135
Documentation/userspace-api/rseq.rst
Normal file
@@ -0,0 +1,135 @@
=====================
Restartable Sequences
=====================

Restartable Sequences allow to register a per thread userspace memory area
to be used as an ABI between kernel and userspace for three purposes:

 * userspace restartable sequences

 * quick access to read the current CPU number, node ID from userspace

 * scheduler time slice extensions

Restartable sequences (per-cpu atomics)
---------------------------------------

Restartable sequences allow userspace to perform update operations on
per-cpu data without requiring heavyweight atomic operations. The actual
ABI is unfortunately only available in the code and selftests.

Quick access to CPU number, node ID
-----------------------------------

Allows to implement per CPU data efficiently. Documentation is in code and
selftests. :(

Scheduler time slice extensions
-------------------------------

This allows a thread to request a time slice extension when it enters a
critical section to avoid contention on a resource when the thread is
scheduled out inside of the critical section.

The prerequisites for this functionality are:

    * Enabled in Kconfig

    * Enabled at boot time (default is enabled)

    * A rseq userspace pointer has been registered for the thread

The thread has to enable the functionality via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

prctl() returns 0 on success or otherwise with the following error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg4 and arg5 must be zero
ENOTSUPP  Functionality was disabled on the kernel command line
ENXIO     Available, but no rseq user struct registered
========= ==============================================================
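
The enable call and the error table above could be wrapped roughly as in the following C sketch. The numeric values behind the ``#ifndef`` are placeholders assumed for illustration and are not part of this patch; the helper names are hypothetical as well. ENOTSUPP is a kernel-internal errno value (524) that libc headers do not define, so it is spelled out locally.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * ASSUMPTION: placeholder values, only used when the installed uapi
 * headers lack the constants. Take the real values from <linux/prctl.h>
 * of a kernel which carries this feature.
 */
#ifndef PR_RSEQ_SLICE_EXTENSION
# define PR_RSEQ_SLICE_EXTENSION	79
# define PR_RSEQ_SLICE_EXTENSION_GET	1
# define PR_RSEQ_SLICE_EXTENSION_SET	2
# define PR_RSEQ_SLICE_EXT_ENABLE	0x01
#endif

/* ENOTSUPP is kernel internal; user space only sees the raw number */
#define KERNEL_ENOTSUPP	524

/* Hypothetical helper: map an errno value to the meaning in the table */
static inline const char *slice_ext_strerror(int err)
{
	switch (err) {
	case 0:			return "enabled";
	case EINVAL:		return "not available or invalid arguments";
	case KERNEL_ENOTSUPP:	return "disabled on the kernel command line";
	case ENXIO:		return "available, but no rseq area registered";
	default:		return "unexpected error";
	}
}

/* Hypothetical helper: returns 0 on success, otherwise the errno value */
static inline int slice_ext_enable(void)
{
	if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
		  PR_RSEQ_SLICE_EXT_ENABLE, 0, 0) == 0)
		return 0;
	return errno;
}
```

A caller would typically treat every non-zero result as "run without extensions" and carry on, since the mechanism is an optimization and not a correctness requirement.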

The state can be also queried via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);

prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
disabled. Otherwise it returns with the following error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg3 and arg4 and arg5 must be zero
========= ==============================================================

The availability and status is also exposed via the rseq ABI struct flags
field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
space and only for informational purposes.
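
Decoding the two informational bits from a snapshot of the flags field might look like the sketch below. The bit positions 4 and 5 are the ones added to ``enum rseq_cs_flags_bit`` by this patch; the local macro names, the struct and the helper are made up for the example, and how the thread obtains a pointer to its registered rseq area is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as defined in enum rseq_cs_flags_bit by this patch */
#define SLICE_EXT_AVAILABLE_BIT	4
#define SLICE_EXT_ENABLED_BIT	5

/* Hypothetical result type for the example */
struct slice_ext_status {
	bool available;
	bool enabled;
};

/* Decode a snapshot of rseq::flags. The flags are only read, never written */
static inline struct slice_ext_status slice_ext_decode(uint32_t flags)
{
	struct slice_ext_status st = {
		.available = !!(flags & (1U << SLICE_EXT_AVAILABLE_BIT)),
		.enabled   = !!(flags & (1U << SLICE_EXT_ENABLED_BIT)),
	};

	return st;
}
```

Checking the flags this way avoids the prctl() query, which matches the "without a syscall" goal stated in the commit message.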

If the mechanism was enabled via prctl(), the thread can request a time
slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
interrupted and the interrupt results in a reschedule request in the
kernel, then the kernel can grant a time slice extension and return to
userspace instead of scheduling out. The length of the extension is
determined by the ``rseq_slice_extension_nsec`` sysctl.

The kernel indicates the grant by clearing rseq::slice_ctrl::request and
setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
thread after granting the extension, the kernel clears the granted bit to
indicate that to userspace.

If the request bit is still set when leaving the critical section,
userspace can clear it and continue.

If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
leaving the critical section to relinquish the CPU. The kernel enforces
this by arming a timer to prevent misbehaving userspace from abusing this
mechanism.

If both the request bit and the granted bit are false when leaving the
critical section, then this indicates that a grant was revoked and no
further action is required by userspace.
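
The three cases described above can be condensed into a small decision helper. This is only an illustrative sketch: the enum and function names are hypothetical, and the order of the checks (granted first) is an assumption for the case that both bytes are observed as set, which the kernel is not expected to produce.

```c
#include <stdint.h>

/* What user space does when it leaves the critical section */
enum slice_exit_action {
	SLICE_EXIT_NONE,		/* Grant was revoked, nothing to do */
	SLICE_EXIT_CLEAR_REQUEST,	/* Not granted: clear request, continue */
	SLICE_EXIT_YIELD,		/* Granted: relinquish the CPU */
};

/* Hypothetical helper which takes snapshots of the request/granted bytes */
static inline enum slice_exit_action slice_exit_decide(uint8_t request,
							uint8_t granted)
{
	if (granted)
		return SLICE_EXIT_YIELD;
	if (request)
		return SLICE_EXIT_CLEAR_REQUEST;
	return SLICE_EXIT_NONE;
}
```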

The required code flow is as follows::

    rseq->slice_ctrl.request = 1;
    barrier(); // Prevent compiler reordering
    critical_section();
    barrier(); // Prevent compiler reordering
    rseq->slice_ctrl.request = 0;
    if (rseq->slice_ctrl.granted)
        rseq_slice_yield();
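
The flow above can be exercised in plain user space with a simulated kernel, as in the following sketch. Everything prefixed with ``sim_`` is a stand-in invented for the example; the ``yield`` callback is a hypothetical replacement for rseq_slice_yield(2), and the struct is a local copy of the layout added by this patch using <stdint.h> types.

```c
#include <stdint.h>

/* Local copy of the control structure added to the rseq uapi */
struct rseq_slice_ctrl {
	union {
		uint32_t all;
		struct {
			uint8_t  request;
			uint8_t  granted;
			uint16_t __reserved;
		};
	};
};

/* Compiler-only barrier as in the flow above (GCC/Clang syntax) */
#define barrier() __asm__ __volatile__("" ::: "memory")

/*
 * Run @fn with a pending time slice extension request. @yield is a
 * HYPOTHETICAL stand-in for rseq_slice_yield(2). Returns 1 if @yield
 * was invoked, 0 otherwise.
 */
static inline int run_with_slice_request(volatile struct rseq_slice_ctrl *ctrl,
					 void (*fn)(void *), void *arg,
					 void (*yield)(void))
{
	ctrl->request = 1;
	barrier();
	fn(arg);
	barrier();
	ctrl->request = 0;
	if (ctrl->granted) {
		yield();
		return 1;
	}
	return 0;
}

/* Simulation only: pretend the kernel granted an extension mid-section */
static inline void sim_kernel_grant(void *arg)
{
	struct rseq_slice_ctrl *ctrl = arg;

	ctrl->request = 0;
	ctrl->granted = 1;
}

/* Simulation only: critical section which is never interrupted */
static inline void sim_no_grant(void *arg)
{
	(void)arg;
}

static int sim_yield_calls;

/* Simulation only: count how often the yield stand-in was invoked */
static inline void sim_yield(void)
{
	sim_yield_calls++;
}
```

Passing the yield operation as a callback keeps the sketch independent of the syscall number, which is not defined by this patch.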

As all of this is strictly CPU local, there are no atomicity requirements.
Checking the granted state is racy, but that cannot be avoided at all::

    if (rseq->slice_ctrl.granted)
      -> Interrupt results in schedule and grant revocation
        rseq_slice_yield();

So there is no point in pretending that this might be solved by an atomic
operation.

If the thread issues a syscall other than rseq_slice_yield(2) within the
granted timeslice extension, the grant is also revoked and the CPU is
relinquished immediately when entering the kernel. This is required as
syscalls might consume arbitrary CPU time until they reach a scheduling
point when the preemption model is either NONE or VOLUNTARY and therefore
might exceed the grant by far.

The preferred solution for user space is to use rseq_slice_yield(2) which
is side effect free. The support for arbitrary syscalls is required to
support onion layer architectured applications, where the code handling the
critical section and requesting the time slice extension has no control
over the code within the critical section.

The kernel enforces flag consistency and terminates the thread with SIGSEGV
if it detects a violation.

@@ -72,13 +72,36 @@ struct rseq_ids {
	};
};

/**
 * union rseq_slice_state - Status information for rseq time slice extension
 * @state: Compound to access the overall state
 * @enabled: Time slice extension is enabled for the task
 * @granted: Time slice extension was granted to the task
 */
union rseq_slice_state {
	u16 state;
	struct {
		u8 enabled;
		u8 granted;
	};
};

/**
 * struct rseq_slice - Status information for rseq time slice extension
 * @state: Time slice extension state
 */
struct rseq_slice {
	union rseq_slice_state state;
};

/**
 * struct rseq_data - Storage for all rseq related data
 * @usrptr: Pointer to the registered user space RSEQ memory
 * @len: Length of the RSEQ region
 * @sig: Signature of critial section abort IPs
 * @sig: Signature of critical section abort IPs
 * @event: Storage for event management
 * @ids: Storage for cached CPU ID and MM CID
 * @slice: Storage for time slice extension data
 */
struct rseq_data {
	struct rseq __user *usrptr;
@@ -86,6 +109,9 @@ struct rseq_data {
	u32 sig;
	struct rseq_event event;
	struct rseq_ids ids;
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
	struct rseq_slice slice;
#endif
};

#else /* CONFIG_RSEQ */

@@ -23,9 +23,15 @@ enum rseq_flags {
};

enum rseq_cs_flags_bit {
	/* Historical and unsupported bits */
	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
	/* (3) Intentional gap to put new bits into a separate byte */

	/* User read only feature flags */
	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
	RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
};

enum rseq_cs_flags {
@@ -35,6 +41,11 @@ enum rseq_cs_flags {
		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),

	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
		(1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
	RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
		(1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
};

/*
@@ -53,6 +64,27 @@ struct rseq_cs {
	__u64 abort_ip;
} __attribute__((aligned(4 * sizeof(__u64))));

/**
 * rseq_slice_ctrl - Time slice extension control structure
 * @all: Compound value
 * @request: Request for a time slice extension
 * @granted: Granted time slice extension
 *
 * @request is set by user space and can be cleared by user space or kernel
 * space. @granted is set and cleared by the kernel and must only be read
 * by user space.
 */
struct rseq_slice_ctrl {
	union {
		__u32 all;
		struct {
			__u8 request;
			__u8 granted;
			__u16 __reserved;
		};
	};
};
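
The layout of the control structure can be checked at compile time, as in the sketch below. It uses a local mirror of the structure with <stdint.h> types instead of the uapi header, and ``slice_ctrl_is_idle()`` is a hypothetical read-only helper which relies on the compound ``all`` member covering the request, granted and reserved bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the uapi structure, using <stdint.h> types */
struct rseq_slice_ctrl {
	union {
		uint32_t all;
		struct {
			uint8_t  request;
			uint8_t  granted;
			uint16_t __reserved;
		};
	};
};

/* The byte members exactly overlay the 32-bit compound value */
static_assert(sizeof(struct rseq_slice_ctrl) == sizeof(uint32_t), "size");
static_assert(offsetof(struct rseq_slice_ctrl, request) == 0, "request");
static_assert(offsetof(struct rseq_slice_ctrl, granted) == 1, "granted");

/* True if neither a request nor a grant is pending. Read-only access */
static inline bool slice_ctrl_is_idle(const struct rseq_slice_ctrl *ctrl)
{
	return ctrl->all == 0;
}
```

Comparing ``all`` against zero is independent of byte order, which is why the sketch avoids testing for any other compound value.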

/*
 * struct rseq is aligned on 4 * 8 bytes to ensure it is always
 * contained within a single cache-line.
@@ -141,6 +173,12 @@ struct rseq {
	 */
	__u32 mm_cid;

	/*
	 * Time slice extension control structure. CPU local updates from
	 * kernel and user space.
	 */
	struct rseq_slice_ctrl slice_ctrl;

	/*
	 * Flexible array member at end of structure, after last feature field.
	 */

12
init/Kconfig
@@ -1938,6 +1938,18 @@ config RSEQ

	  If unsure, say Y.

config RSEQ_SLICE_EXTENSION
	bool "Enable rseq-based time slice extension mechanism"
	depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
	help
	  Allows userspace to request a limited time slice extension when
	  returning from an interrupt to user space via the RSEQ shared
	  data ABI. If granted, that allows to complete a critical section,
	  so that other threads are not stuck on a conflicted resource,
	  while the task is scheduled out.

	  If unsure, say N.

config RSEQ_STATS
	default n
	bool "Enable lightweight statistics of restartable sequences" if EXPERT

@@ -389,6 +389,8 @@ static bool rseq_reset_ids(void)
 */
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
	u32 rseqfl = 0;

	if (flags & RSEQ_FLAG_UNREGISTER) {
		if (flags & ~RSEQ_FLAG_UNREGISTER)
			return -EINVAL;
@@ -440,6 +442,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
	if (!access_ok(rseq, rseq_len))
		return -EFAULT;

	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
		rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;

	scoped_user_write_access(rseq, efault) {
		/*
		 * If the rseq_cs pointer is non-NULL on registration, clear it to
@@ -449,11 +454,13 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
		 * clearing the fields. Don't bother reading it, just reset it.
		 */
		unsafe_put_user(0UL, &rseq->rseq_cs, efault);
		unsafe_put_user(rseqfl, &rseq->flags, efault);
		/* Initialize IDs in user space */
		unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
		unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
		unsafe_put_user(0U, &rseq->node_id, efault);
		unsafe_put_user(0U, &rseq->mm_cid, efault);
		unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
	}

	/*