linux/include
Christian Brauner 9b8a0ba682
mount: add OPEN_TREE_NAMESPACE
When creating containers the setup usually involves using CLONE_NEWNS
via clone3() or unshare(). This copies the caller's complete mount
namespace. The runtime will also assemble a new rootfs and then use
pivot_root() to switch the old mount tree with the new rootfs. Afterward
it will recursively umount the old mount tree thereby getting rid of all
mounts.

On a basic system here where the mount table isn't particularly large
this still copies about 30 mounts. Copying all of these mounts only to
get rid of them later is pretty wasteful.

This is exacerbated if intermediary mount namespaces are used that only
exist for a very short amount of time and are immediately destroyed
again causing a ton of mounts to be copied and destroyed needlessly.

With a large mount table and a system where thousands or ten-thousands
of containers are spawned in parallel this quickly becomes a bottleneck
increasing contention on the semaphore.

Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
returning a file descriptor referring to that mount tree
OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
to a new mount namespace. In that new mount namespace the copied mount
tree has been mounted on top of a copy of the real rootfs.

The caller can setns() into that mount namespace and perform any
additionally required setup such as move_mount() detached mounts in
there.

This allows OPEN_TREE_NAMESPACE to function as a combined
unshare(CLONE_NEWNS) and pivot_root().

A caller may for example choose to create an extremely minimal rootfs:

fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);

This will create a mount namespace where "wootwoot" has become the
rootfs mounted on top of the real rootfs. The caller can now setns()
into this new mount namespace and assemble additional mounts.

This also works with user namespaces:

unshare(CLONE_NEWUSER);
fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);

which creates a new mount namespace owned by the earlier created user
namespace with "wootwoot" as the rootfs mounted on top of the real
rootfs.

Link: https://patch.msgid.link/20251229-work-empty-namespace-v1-1-bfb24c7b061f@kernel.org
Tested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Suggested-by: Christian Brauner <brauner@kernel.org>
Suggested-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-16 19:21:40 +01:00
..
acpi Revert "ACPI: processor: idle: Optimize ACPI idle driver registration" 2025-11-25 16:08:06 +01:00
asm-generic hyperv-next for v6.19 2025-12-09 06:10:17 +09:00
clocksource
crypto This update includes the following changes: 2025-12-03 11:28:38 -08:00
cxl
drm Cross-subsystem Changes: 2025-12-05 10:16:25 +10:00
dt-bindings This pull request is entirely SoC clk drivers, not for lack of trying to modify 2025-12-08 09:38:52 +09:00
hyperv mshv: Add definitions for MSHV sleep state configuration 2025-12-05 23:24:57 +00:00
keys
kunit
kvm KVM: arm64: GICv2: Handle deactivation via GICV_DIR traps 2025-11-24 14:29:14 -08:00
linux fs: Remove internal old mount API code 2025-12-15 14:48:33 +01:00
math-emu
media
memory
misc
net - fix a bug with O_APPEND in cached mode causing data to be written multiple times on server 2025-12-07 08:29:09 -08:00
pcmcia
ras Significant patch series in this merge are as follows: 2025-12-05 13:52:43 -08:00
rdma RDMA/core: Add new IB rate for XDR (8x) support 2025-11-24 02:58:30 -05:00
rv rv: Fix compilation if !CONFIG_RV_REACTORS 2025-12-02 12:33:37 -05:00
scsi
soc This pull request is entirely SoC clk drivers, not for lack of trying to modify 2025-12-08 09:38:52 +09:00
sound soundwire updates for 6.19 2025-12-13 16:26:55 +12:00
target
trace We have a patch that adds an initial set of tracepoints to the MDS 2025-12-14 15:24:10 +12:00
uapi mount: add OPEN_TREE_NAMESPACE 2026-01-16 19:21:40 +01:00
ufs
vdso
video
xen
Kbuild