vfs-7.0-rc1.misc

Please consider pulling these changes from the signed vfs-7.0-rc1.misc tag.
 
 Thanks!
 Christian
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaYX49QAKCRCRxhvAZXjc
 ojrZAQD1VJzY46r5FnAVf4jlEHyjIbDnZCP/n+c4x6XnqpU6EQEAgB0yAtAGP6+u
 SBuytElqHoTT5VtmEXTAabCNQ9Ks8wo=
 =JwZz
 -----END PGP SIGNATURE-----

Merge tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains a mix of VFS cleanups, performance improvements, API
  fixes, documentation, and a deprecation notice.

  Scalability and performance:

   - Rework pid allocation to only take pidmap_lock once instead of
     twice during alloc_pid(), improving thread creation/teardown
     throughput by 10-16% depending on false-sharing luck. Pad the
     namespace refcount to reduce false-sharing

   - Track file lock presence via a flag in ->i_opflags instead of
     reading ->i_flctx, avoiding false-sharing with ->i_readcount on
     open/close hot paths. Measured 4-16% improvement on 24-core
     open-in-a-loop benchmarks

   - Use a consume fence in locks_inode_context() to match the
     store-release/load-consume idiom, eliminating a hardware fence on
     some architectures

   - Annotate cdev_lock with __cacheline_aligned_in_smp to prevent
     false-sharing

   - Remove a redundant DCACHE_MANAGED_DENTRY check in
     __follow_mount_rcu() that never fires since the caller already
     verifies it, eliminating a 100% mispredicted branch

   - Fix a 100% mispredicted likely() in devcgroup_inode_permission()
     that became wrong after a prior code reorder
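
   The store-release/load-consume publication idiom mentioned above can
   be sketched in userspace with C11 atomics (a minimal sketch: the
   names here are hypothetical, and the kernel uses smp_store_release()
   and friends rather than <stdatomic.h>):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a lazily allocated context such as
 * inode->i_flctx. */
struct lock_ctx { int nr_locks; };

static _Atomic(struct lock_ctx *) ctx_ptr;

/* Writer: fully initialize the object, then publish the pointer with a
 * release store so the initialization is ordered before publication. */
static void publish_ctx(void)
{
	struct lock_ctx *ctx = malloc(sizeof(*ctx));

	if (!ctx)
		return;
	ctx->nr_locks = 0;
	atomic_store_explicit(&ctx_ptr, ctx, memory_order_release);
}

/* Reader: an acquire load (the portable superset of consume) guarantees
 * any dereference sees the fields written before publication. */
static struct lock_ctx *read_ctx(void)
{
	return atomic_load_explicit(&ctx_ptr, memory_order_acquire);
}
```

   On many architectures the consume-side load compiles to a plain load
   with no explicit fence instruction, which is where the saving in
   locks_inode_context() comes from.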

  Bug fixes and correctness:

    - Make insert_inode_locked() wait for inode destruction instead of
      skipping the dying inode, fixing a corner case where two matching
      inodes could end up in the hash

   - Move f_mode initialization before file_ref_init() in alloc_file()
     to respect the SLAB_TYPESAFE_BY_RCU ordering contract

   - Add a WARN_ON_ONCE guard in try_to_free_buffers() for folios with
     no buffers attached, preventing a null pointer dereference when
     AS_RELEASE_ALWAYS is set but no release_folio op exists

   - Fix select restart_block to store end_time as timespec64, avoiding
     truncation of tv_sec on 32-bit architectures

   - Make dump_inode() use get_kernel_nofault() to safely access inode
     and superblock fields, matching the dump_mapping() pattern
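
   The tv_sec truncation fixed in the restart block can be reproduced in
   plain C (a hedged sketch: the struct names are hypothetical, the real
   fix stores a struct timespec64 in restart_block->poll):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "before" layout: seconds squeezed into a 32-bit slot,
 * as happens when a 64-bit tv_sec is stored in an int-sized field. */
struct restart_before { uint32_t tv_sec; int32_t tv_nsec; };

/* Hypothetical "after" layout: full 64-bit seconds, like timespec64. */
struct restart_after { int64_t tv_sec; int32_t tv_nsec; };

static int64_t roundtrip_before(int64_t sec)
{
	struct restart_before r = { .tv_sec = (uint32_t)sec };

	return r.tv_sec;	/* high bits were already lost on store */
}

static int64_t roundtrip_after(int64_t sec)
{
	struct restart_after r = { .tv_sec = sec };

	return r.tv_sec;	/* preserved */
}
```

   An end time past the 32-bit range comes back wrong from the narrow
   field, which is exactly the class of bug the timespec64 conversion
   removes on 32-bit architectures.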

  API modernization:

   - Make posix_acl_to_xattr() allocate the buffer internally since
     every single caller was doing it anyway. Reduces boilerplate and
     unnecessary error checking across ~15 filesystems

   - Replace deprecated simple_strtoul() with kstrtoul() for the
     ihash_entries, dhash_entries, mhash_entries, and mphash_entries
     boot parameters, adding proper error handling

   - Convert chardev code to use guard(mutex) and __free(kfree) cleanup
     patterns

   - Replace min_t() with min() or umin() in VFS code to avoid silently
     truncating unsigned long to unsigned int

   - Gate LOOKUP_RCU assertions behind CONFIG_DEBUG_VFS since callers
     already check the flag
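
   The min_t() pitfall motivating the min()/umin() conversions can be
   demonstrated with simplified userspace analogues (assumption: these
   macros only mimic the kernel ones, which additionally type-check):

```c
#include <assert.h>

/* Simplified analogue of the kernel's min_t(): both operands are cast
 * to the named type before comparing, so a wide value silently loses
 * its high bits when the type is unsigned int. */
#define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* A plain type-preserving minimum, standing in for min()/umin(). */
#define min_ull(a, b) ((a) < (b) ? (a) : (b))

static unsigned long long clamp_buggy(unsigned long long len,
				      unsigned long long cap)
{
	return min_t(unsigned int, len, cap);	/* truncates len */
}

static unsigned long long clamp_fixed(unsigned long long len,
				      unsigned long long cap)
{
	return min_ull(len, cap);
}
```

   With len = 0x100000001 and cap = 0x100, the buggy variant truncates
   len to 1 and returns it, while the fixed variant returns the cap.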

  Deprecation:

   - Begin deprecating legacy BSD process accounting (acct(2)). The
     interface has numerous footguns and better alternatives exist
     (eBPF)

  Documentation:

   - Fix and complete kernel-doc for struct export_operations, removing
     duplicated documentation between ReST and source

   - Fix kernel-doc warnings for __start_dirop() and ilookup5_nowait()

  Testing:

   - Add a kunit test for initramfs cpio handling of entries with
     filesize > PATH_MAX

  Misc:

   - Add missing <linux/init_task.h> include in fs_struct.c"
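
The posix_acl_to_xattr() change follows an allocate-internally
convention that can be sketched in userspace (hypothetical helper, not
the kernel function: the callee allocates, fills, and hands back the
buffer plus its size, so callers do a single NULL check instead of
kmalloc-then-fill-then-error-check):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical analogue of the new API shape: serialize src into a
 * freshly allocated buffer and report its size through *sizep.
 * Returns NULL on allocation failure; caller frees the buffer. */
static void *serialize_blob(const char *src, size_t *sizep)
{
	size_t size = strlen(src) + 1;
	char *buf = malloc(size);

	if (!buf)
		return NULL;
	memcpy(buf, src, size);
	*sizep = size;
	return buf;
}
```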

* tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
  posix_acl: make posix_acl_to_xattr() alloc the buffer
  fs: make insert_inode_locked() wait for inode destruction
  initramfs_test: kunit test for cpio.filesize > PATH_MAX
  fs: improve dump_inode() to safely access inode fields
  fs: add <linux/init_task.h> for 'init_fs'
  docs: exportfs: Use source code struct documentation
  fs: move initializing f_mode before file_ref_init()
  exportfs: Complete kernel-doc for struct export_operations
  exportfs: Mark struct export_operations functions at kernel-doc
  exportfs: Fix kernel-doc output for get_name()
  acct(2): begin the deprecation of legacy BSD process accounting
  device_cgroup: remove branch hint after code refactor
  VFS: fix __start_dirop() kernel-doc warnings
  fs: Describe @isnew parameter in ilookup5_nowait()
  fs/namei: Remove redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu
  fs: only assert on LOOKUP_RCU when built with CONFIG_DEBUG_VFS
  select: store end_time as timespec64 in restart block
  chardev: Switch to guard(mutex) and __free(kfree)
  namespace: Replace simple_strtoul with kstrtoul to parse boot params
  dcache: Replace simple_strtoul with kstrtoul in set_dhash_entries
  ...
commit 9e355113f0 by Linus Torvalds, 2026-02-09 15:13:05 -08:00
39 changed files with 351 additions and 293 deletions

@@ -119,43 +119,11 @@ For a filesystem to be exportable it must:
A file system implementation declares that instances of the filesystem
are exportable by setting the s_export_op field in the struct
super_block. This field must point to a "struct export_operations"
struct which has the following members:
super_block. This field must point to a struct export_operations
which has the following members:
encode_fh (mandatory)
Takes a dentry and creates a filehandle fragment which may later be used
to find or create a dentry for the same object.
fh_to_dentry (mandatory)
Given a filehandle fragment, this should find the implied object and
create a dentry for it (possibly with d_obtain_alias).
fh_to_parent (optional but strongly recommended)
Given a filehandle fragment, this should find the parent of the
implied object and create a dentry for it (possibly with
d_obtain_alias). May fail if the filehandle fragment is too small.
get_parent (optional but strongly recommended)
When given a dentry for a directory, this should return a dentry for
the parent. Quite possibly the parent dentry will have been allocated
by d_alloc_anon. The default get_parent function just returns an error
so any filehandle lookup that requires finding a parent will fail.
->lookup("..") is *not* used as a default as it can leave ".." entries
in the dcache which are too messy to work with.
get_name (optional)
When given a parent dentry and a child dentry, this should find a name
in the directory identified by the parent dentry, which leads to the
object identified by the child dentry. If no get_name function is
supplied, a default implementation is provided which uses vfs_readdir
to find potential names, and matches inode numbers to find the correct
match.
flags
Some filesystems may need to be handled differently than others. The
export_operations struct also includes a flags field that allows the
filesystem to communicate such information to nfsd. See the Export
Operations Flags section below for more explanation.
.. kernel-doc:: include/linux/exportfs.h
:identifiers: struct export_operations
A filehandle fragment consists of an array of 1 or more 4byte words,
together with a one byte "type".

@@ -167,17 +167,11 @@ int v9fs_iop_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
if (retval)
goto err_out;
size = posix_acl_xattr_size(acl->a_count);
value = kzalloc(size, GFP_NOFS);
value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_NOFS);
if (!value) {
retval = -ENOMEM;
goto err_out;
}
retval = posix_acl_to_xattr(&init_user_ns, acl, value, size);
if (retval < 0)
goto err_out;
}
/*
@@ -257,13 +251,10 @@ static int v9fs_set_acl(struct p9_fid *fid, int type, struct posix_acl *acl)
return 0;
/* Set a setxattr request to server */
size = posix_acl_xattr_size(acl->a_count);
buffer = kmalloc(size, GFP_KERNEL);
buffer = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_KERNEL);
if (!buffer)
return -ENOMEM;
retval = posix_acl_to_xattr(&init_user_ns, acl, buffer, size);
if (retval < 0)
goto err_free_out;
switch (type) {
case ACL_TYPE_ACCESS:
name = XATTR_NAME_POSIX_ACL_ACCESS;
@@ -275,7 +266,6 @@ static int v9fs_set_acl(struct p9_fid *fid, int type, struct posix_acl *acl)
BUG();
}
retval = v9fs_fid_xattr_set(fid, name, buffer, size, 0);
err_free_out:
kfree(buffer);
return retval;
}

@@ -57,7 +57,8 @@ struct posix_acl *btrfs_get_acl(struct inode *inode, int type, bool rcu)
int __btrfs_set_acl(struct btrfs_trans_handle *trans, struct inode *inode,
struct posix_acl *acl, int type)
{
int ret, size = 0;
int ret;
size_t size = 0;
const char *name;
char AUTO_KFREE(value);
@@ -77,20 +78,15 @@ int __btrfs_set_acl(struct btrfs_trans_handle *trans, struct inode *inode,
if (acl) {
unsigned int nofs_flag;
size = posix_acl_xattr_size(acl->a_count);
/*
* We're holding a transaction handle, so use a NOFS memory
* allocation context to avoid deadlock if reclaim happens.
*/
nofs_flag = memalloc_nofs_save();
value = kmalloc(size, GFP_KERNEL);
value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_KERNEL);
memalloc_nofs_restore(nofs_flag);
if (!value)
return -ENOMEM;
ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
if (ret < 0)
return ret;
}
if (trans)

@@ -2354,7 +2354,7 @@ bool block_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
if (!head)
return false;
blocksize = head->b_size;
to = min_t(unsigned, folio_size(folio) - from, count);
to = min(folio_size(folio) - from, count);
to = from + to;
if (from < blocksize && to > folio_size(folio) - blocksize)
return false;
@@ -2948,6 +2948,10 @@ bool try_to_free_buffers(struct folio *folio)
if (folio_test_writeback(folio))
return false;
/* Misconfigured folio check */
if (WARN_ON_ONCE(!folio_buffers(folio)))
return true;
if (mapping == NULL) { /* can this still happen? */
ret = drop_buffers(folio, &buffers_to_free);
goto out;

@@ -90,7 +90,8 @@ retry:
int ceph_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
struct posix_acl *acl, int type)
{
int ret = 0, size = 0;
int ret = 0;
size_t size = 0;
const char *name = NULL;
char *value = NULL;
struct iattr newattrs;
@@ -126,16 +127,11 @@ int ceph_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
}
if (acl) {
size = posix_acl_xattr_size(acl->a_count);
value = kmalloc(size, GFP_NOFS);
value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_NOFS);
if (!value) {
ret = -ENOMEM;
goto out;
}
ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
if (ret < 0)
goto out_free;
}
if (new_mode != old_mode) {
@@ -172,7 +168,7 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
struct posix_acl *acl, *default_acl;
size_t val_size1 = 0, val_size2 = 0;
struct ceph_pagelist *pagelist = NULL;
void *tmp_buf = NULL;
void *tmp_buf1 = NULL, *tmp_buf2 = NULL;
int err;
err = posix_acl_create(dir, mode, &default_acl, &acl);
@@ -192,15 +188,7 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
if (!default_acl && !acl)
return 0;
if (acl)
val_size1 = posix_acl_xattr_size(acl->a_count);
if (default_acl)
val_size2 = posix_acl_xattr_size(default_acl->a_count);
err = -ENOMEM;
tmp_buf = kmalloc(max(val_size1, val_size2), GFP_KERNEL);
if (!tmp_buf)
goto out_err;
pagelist = ceph_pagelist_alloc(GFP_KERNEL);
if (!pagelist)
goto out_err;
@@ -213,34 +201,39 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
if (acl) {
size_t len = strlen(XATTR_NAME_POSIX_ACL_ACCESS);
err = -ENOMEM;
tmp_buf1 = posix_acl_to_xattr(&init_user_ns, acl,
&val_size1, GFP_KERNEL);
if (!tmp_buf1)
goto out_err;
err = ceph_pagelist_reserve(pagelist, len + val_size1 + 8);
if (err)
goto out_err;
ceph_pagelist_encode_string(pagelist, XATTR_NAME_POSIX_ACL_ACCESS,
len);
err = posix_acl_to_xattr(&init_user_ns, acl,
tmp_buf, val_size1);
if (err < 0)
goto out_err;
ceph_pagelist_encode_32(pagelist, val_size1);
ceph_pagelist_append(pagelist, tmp_buf, val_size1);
ceph_pagelist_append(pagelist, tmp_buf1, val_size1);
}
if (default_acl) {
size_t len = strlen(XATTR_NAME_POSIX_ACL_DEFAULT);
err = -ENOMEM;
tmp_buf2 = posix_acl_to_xattr(&init_user_ns, default_acl,
&val_size2, GFP_KERNEL);
if (!tmp_buf2)
goto out_err;
err = ceph_pagelist_reserve(pagelist, len + val_size2 + 8);
if (err)
goto out_err;
ceph_pagelist_encode_string(pagelist,
XATTR_NAME_POSIX_ACL_DEFAULT, len);
err = posix_acl_to_xattr(&init_user_ns, default_acl,
tmp_buf, val_size2);
if (err < 0)
goto out_err;
ceph_pagelist_encode_32(pagelist, val_size2);
ceph_pagelist_append(pagelist, tmp_buf, val_size2);
ceph_pagelist_append(pagelist, tmp_buf2, val_size2);
}
kfree(tmp_buf);
kfree(tmp_buf1);
kfree(tmp_buf2);
as_ctx->acl = acl;
as_ctx->default_acl = default_acl;
@@ -250,7 +243,8 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
out_err:
posix_acl_release(acl);
posix_acl_release(default_acl);
kfree(tmp_buf);
kfree(tmp_buf1);
kfree(tmp_buf2);
if (pagelist)
ceph_pagelist_release(pagelist);
return err;

@@ -10,6 +10,7 @@
#include <linux/kdev_t.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/cleanup.h>
#include <linux/major.h>
#include <linux/errno.h>
@@ -97,7 +98,8 @@ static struct char_device_struct *
__register_chrdev_region(unsigned int major, unsigned int baseminor,
int minorct, const char *name)
{
struct char_device_struct *cd, *curr, *prev = NULL;
struct char_device_struct *cd __free(kfree) = NULL;
struct char_device_struct *curr, *prev = NULL;
int ret;
int i;
@@ -117,14 +119,14 @@ __register_chrdev_region(unsigned int major, unsigned int baseminor,
if (cd == NULL)
return ERR_PTR(-ENOMEM);
mutex_lock(&chrdevs_lock);
guard(mutex)(&chrdevs_lock);
if (major == 0) {
ret = find_dynamic_major();
if (ret < 0) {
pr_err("CHRDEV \"%s\" dynamic allocation region is full\n",
name);
goto out;
return ERR_PTR(ret);
}
major = ret;
}
@@ -144,7 +146,7 @@ __register_chrdev_region(unsigned int major, unsigned int baseminor,
if (curr->baseminor >= baseminor + minorct)
break;
goto out;
return ERR_PTR(ret);
}
cd->major = major;
@@ -160,12 +162,7 @@ __register_chrdev_region(unsigned int major, unsigned int baseminor,
prev->next = cd;
}
mutex_unlock(&chrdevs_lock);
return cd;
out:
mutex_unlock(&chrdevs_lock);
kfree(cd);
return ERR_PTR(ret);
return_ptr(cd);
}
static struct char_device_struct *
@@ -343,7 +340,7 @@ void __unregister_chrdev(unsigned int major, unsigned int baseminor,
kfree(cd);
}
static DEFINE_SPINLOCK(cdev_lock);
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(cdev_lock);
static struct kobject *cdev_get(struct cdev *p)
{

@@ -3237,10 +3237,7 @@ EXPORT_SYMBOL(d_parent_ino);
static __initdata unsigned long dhash_entries;
static int __init set_dhash_entries(char *str)
{
if (!str)
return 0;
dhash_entries = simple_strtoul(str, &str, 0);
return 1;
return kstrtoul(str, 0, &dhash_entries) == 0;
}
__setup("dhash_entries=", set_dhash_entries);

@@ -555,7 +555,7 @@ int copy_string_kernel(const char *arg, struct linux_binprm *bprm)
return -E2BIG;
while (len > 0) {
unsigned int bytes_to_copy = min_t(unsigned int, len,
unsigned int bytes_to_copy = min(len,
min_not_zero(offset_in_page(pos), PAGE_SIZE));
struct page *page;

@@ -4276,8 +4276,7 @@ void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block,
* get the corresponding group metadata to work with.
* For this we have goto again loop.
*/
thisgrp_len = min_t(unsigned int, (unsigned int)len,
EXT4_BLOCKS_PER_GROUP(sb) - EXT4_C2B(sbi, blkoff));
thisgrp_len = min(len, EXT4_BLOCKS_PER_GROUP(sb) - EXT4_C2B(sbi, blkoff));
clen = EXT4_NUM_B2C(sbi, thisgrp_len);
if (!ext4_sb_block_valid(sb, NULL, block, thisgrp_len)) {

@@ -1479,7 +1479,7 @@ static void ext4_update_super(struct super_block *sb,
/* Update the global fs size fields */
sbi->s_groups_count += flex_gd->count;
sbi->s_blockfile_groups = min_t(ext4_group_t, sbi->s_groups_count,
sbi->s_blockfile_groups = min(sbi->s_groups_count,
(EXT4_MAX_BLOCK_FILE_PHYS / EXT4_BLOCKS_PER_GROUP(sb)));
/* Update the reserved block counts only once the new group is

@@ -4837,7 +4837,7 @@ static int ext4_check_geometry(struct super_block *sb,
return -EINVAL;
}
sbi->s_groups_count = blocks_count;
sbi->s_blockfile_groups = min_t(ext4_group_t, sbi->s_groups_count,
sbi->s_blockfile_groups = min(sbi->s_groups_count,
(EXT4_MAX_BLOCK_FILE_PHYS / EXT4_BLOCKS_PER_GROUP(sb)));
if (((u64)sbi->s_groups_count * sbi->s_inodes_per_group) !=
le32_to_cpu(es->s_inodes_count)) {

@@ -1355,7 +1355,7 @@ found:
/* Fill the long name slots. */
for (i = 0; i < long_bhs; i++) {
int copy = min_t(int, sb->s_blocksize - offset, size);
int copy = umin(sb->s_blocksize - offset, size);
memcpy(bhs[i]->b_data + offset, slots, copy);
mark_buffer_dirty_inode(bhs[i], dir);
offset = 0;
@@ -1366,7 +1366,7 @@ found:
err = fat_sync_bhs(bhs, long_bhs);
if (!err && i < nr_bhs) {
/* Fill the short name slot. */
int copy = min_t(int, sb->s_blocksize - offset, size);
int copy = umin(sb->s_blocksize - offset, size);
memcpy(bhs[i]->b_data + offset, slots, copy);
mark_buffer_dirty_inode(bhs[i], dir);
if (IS_DIRSYNC(dir))

@@ -141,8 +141,7 @@ static int fat_ioctl_fitrim(struct inode *inode, unsigned long arg)
if (copy_from_user(&range, user_range, sizeof(range)))
return -EFAULT;
range.minlen = max_t(unsigned int, range.minlen,
bdev_discard_granularity(sb->s_bdev));
range.minlen = max(range.minlen, bdev_discard_granularity(sb->s_bdev));
err = fat_trim_fs(inode, &range);
if (err < 0)

@@ -176,6 +176,11 @@ static int init_file(struct file *f, int flags, const struct cred *cred)
f->f_flags = flags;
f->f_mode = OPEN_FMODE(flags);
/*
* Disable permission and pre-content events for all files by default.
* They may be enabled later by fsnotify_open_perm_and_set_mode().
*/
file_set_fsnotify_mode(f, FMODE_NONOTIFY_PERM);
f->f_op = NULL;
f->f_mapping = NULL;
@@ -197,11 +202,6 @@ static int init_file(struct file *f, int flags, const struct cred *cred)
* refcount bumps we should reinitialize the reused file first.
*/
file_ref_init(&f->f_ref, 1);
/*
* Disable permission and pre-content events for all files by default.
* They may be enabled later by fsnotify_open_perm_and_set_mode().
*/
file_set_fsnotify_mode(f, FMODE_NONOTIFY_PERM);
return 0;
}

@@ -6,6 +6,7 @@
#include <linux/path.h>
#include <linux/slab.h>
#include <linux/fs_struct.h>
#include <linux/init_task.h>
#include "internal.h"
/*

@@ -122,20 +122,16 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
* them to be refreshed the next time they are used,
* and it also updates i_ctime.
*/
size_t size = posix_acl_xattr_size(acl->a_count);
size_t size;
void *value;
if (size > PAGE_SIZE)
return -E2BIG;
value = kmalloc(size, GFP_KERNEL);
value = posix_acl_to_xattr(fc->user_ns, acl, &size, GFP_KERNEL);
if (!value)
return -ENOMEM;
ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
if (ret < 0) {
if (size > PAGE_SIZE) {
kfree(value);
return ret;
return -E2BIG;
}
/*

@@ -1813,7 +1813,7 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
goto out_iput;
folio_offset = ((index - folio->index) << PAGE_SHIFT) + offset;
nr_bytes = min_t(unsigned, num, folio_size(folio) - folio_offset);
nr_bytes = min(num, folio_size(folio) - folio_offset);
nr_pages = (offset + nr_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
err = fuse_copy_folio(cs, &folio, folio_offset, nr_bytes, 0);

@@ -1323,10 +1323,8 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
static inline unsigned int fuse_wr_pages(loff_t pos, size_t len,
unsigned int max_pages)
{
return min_t(unsigned int,
((pos + len - 1) >> PAGE_SHIFT) -
(pos >> PAGE_SHIFT) + 1,
max_pages);
return min(((pos + len - 1) >> PAGE_SHIFT) - (pos >> PAGE_SHIFT) + 1,
max_pages);
}
static ssize_t fuse_perform_write(struct kiocb *iocb, struct iov_iter *ii)
@@ -1607,7 +1605,7 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
struct folio *folio = page_folio(pages[i]);
unsigned int offset = start +
(folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
unsigned int len = min_t(unsigned int, ret, PAGE_SIZE - start);
unsigned int len = umin(ret, PAGE_SIZE - start);
ap->descs[ap->num_folios].offset = offset;
ap->descs[ap->num_folios].length = len;

@@ -83,21 +83,14 @@ struct posix_acl *gfs2_get_acl(struct inode *inode, int type, bool rcu)
int __gfs2_set_acl(struct inode *inode, struct posix_acl *acl, int type)
{
int error;
size_t len;
char *data;
size_t len = 0;
char *data = NULL;
const char *name = gfs2_acl_name(type);
if (acl) {
len = posix_acl_xattr_size(acl->a_count);
data = kmalloc(len, GFP_NOFS);
data = posix_acl_to_xattr(&init_user_ns, acl, &len, GFP_NOFS);
if (data == NULL)
return -ENOMEM;
error = posix_acl_to_xattr(&init_user_ns, acl, data, len);
if (error < 0)
goto out;
} else {
data = NULL;
len = 0;
}
error = __gfs2_xattr_set(inode, name, data, len, 0, GFS2_EATYPE_SYS);

@@ -1028,19 +1028,20 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
return freed;
}
static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked);
static void __wait_on_freeing_inode(struct inode *inode, bool hash_locked, bool rcu_locked);
/*
* Called with the inode lock held.
*/
static struct inode *find_inode(struct super_block *sb,
struct hlist_head *head,
int (*test)(struct inode *, void *),
void *data, bool is_inode_hash_locked,
void *data, bool hash_locked,
bool *isnew)
{
struct inode *inode = NULL;
if (is_inode_hash_locked)
if (hash_locked)
lockdep_assert_held(&inode_hash_lock);
else
lockdep_assert_not_held(&inode_hash_lock);
@@ -1054,7 +1055,7 @@ repeat:
continue;
spin_lock(&inode->i_lock);
if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE)) {
__wait_on_freeing_inode(inode, is_inode_hash_locked);
__wait_on_freeing_inode(inode, hash_locked, true);
goto repeat;
}
if (unlikely(inode_state_read(inode) & I_CREATING)) {
@@ -1078,11 +1079,11 @@
*/
static struct inode *find_inode_fast(struct super_block *sb,
struct hlist_head *head, unsigned long ino,
bool is_inode_hash_locked, bool *isnew)
bool hash_locked, bool *isnew)
{
struct inode *inode = NULL;
if (is_inode_hash_locked)
if (hash_locked)
lockdep_assert_held(&inode_hash_lock);
else
lockdep_assert_not_held(&inode_hash_lock);
@@ -1096,7 +1097,7 @@ repeat:
continue;
spin_lock(&inode->i_lock);
if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE)) {
__wait_on_freeing_inode(inode, is_inode_hash_locked);
__wait_on_freeing_inode(inode, hash_locked, true);
goto repeat;
}
if (unlikely(inode_state_read(inode) & I_CREATING)) {
@ -1832,16 +1833,13 @@ int insert_inode_locked(struct inode *inode)
while (1) {
struct inode *old = NULL;
spin_lock(&inode_hash_lock);
repeat:
hlist_for_each_entry(old, head, i_hash) {
if (old->i_ino != ino)
continue;
if (old->i_sb != sb)
continue;
spin_lock(&old->i_lock);
if (inode_state_read(old) & (I_FREEING | I_WILL_FREE)) {
spin_unlock(&old->i_lock);
continue;
}
break;
}
if (likely(!old)) {
@@ -1852,6 +1850,11 @@ int insert_inode_locked(struct inode *inode)
spin_unlock(&inode_hash_lock);
return 0;
}
if (inode_state_read(old) & (I_FREEING | I_WILL_FREE)) {
__wait_on_freeing_inode(old, true, false);
old = NULL;
goto repeat;
}
if (unlikely(inode_state_read(old) & I_CREATING)) {
spin_unlock(&old->i_lock);
spin_unlock(&inode_hash_lock);
@@ -2522,16 +2525,18 @@ EXPORT_SYMBOL(inode_needs_sync);
* wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list
* will DTRT.
*/
static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked)
static void __wait_on_freeing_inode(struct inode *inode, bool hash_locked, bool rcu_locked)
{
struct wait_bit_queue_entry wqe;
struct wait_queue_head *wq_head;
VFS_BUG_ON(!hash_locked && !rcu_locked);
/*
* Handle racing against evict(), see that routine for more details.
*/
if (unlikely(inode_unhashed(inode))) {
WARN_ON(is_inode_hash_locked);
WARN_ON(hash_locked);
spin_unlock(&inode->i_lock);
return;
}
@@ -2539,23 +2544,22 @@ static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_lock
wq_head = inode_bit_waitqueue(&wqe, inode, __I_NEW);
prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
if (is_inode_hash_locked)
if (rcu_locked)
rcu_read_unlock();
if (hash_locked)
spin_unlock(&inode_hash_lock);
schedule();
finish_wait(wq_head, &wqe.wq_entry);
if (is_inode_hash_locked)
if (hash_locked)
spin_lock(&inode_hash_lock);
rcu_read_lock();
if (rcu_locked)
rcu_read_lock();
}
static __initdata unsigned long ihash_entries;
static int __init set_ihash_entries(char *str)
{
if (!str)
return 0;
ihash_entries = simple_strtoul(str, &str, 0);
return 1;
return kstrtoul(str, 0, &ihash_entries) == 0;
}
__setup("ihash_entries=", set_ihash_entries);
@@ -3005,24 +3009,45 @@ umode_t mode_strip_sgid(struct mnt_idmap *idmap,
EXPORT_SYMBOL(mode_strip_sgid);
#ifdef CONFIG_DEBUG_VFS
/*
* Dump an inode.
/**
* dump_inode - dump an inode.
* @inode: inode to dump
* @reason: reason for dumping
*
* TODO: add a proper inode dumping routine, this is a stub to get debug off the
* ground.
*
* TODO: handle getting to fs type with get_kernel_nofault()?
* See dump_mapping() above.
* If inode is an invalid pointer, we don't want to crash accessing it,
* so probe everything depending on it carefully with get_kernel_nofault().
*/
void dump_inode(struct inode *inode, const char *reason)
{
struct super_block *sb = inode->i_sb;
struct super_block *sb;
struct file_system_type *s_type;
const char *fs_name_ptr;
char fs_name[32] = {};
umode_t mode;
unsigned short opflags;
unsigned int flags;
unsigned int state;
int count;
pr_warn("%s encountered for inode %px\n"
"fs %s mode %ho opflags 0x%hx flags 0x%x state 0x%x count %d\n",
reason, inode, sb->s_type->name, inode->i_mode, inode->i_opflags,
inode->i_flags, inode_state_read_once(inode), atomic_read(&inode->i_count));
if (get_kernel_nofault(sb, &inode->i_sb) ||
get_kernel_nofault(mode, &inode->i_mode) ||
get_kernel_nofault(opflags, &inode->i_opflags) ||
get_kernel_nofault(flags, &inode->i_flags)) {
pr_warn("%s: unreadable inode:%px\n", reason, inode);
return;
}
state = inode_state_read_once(inode);
count = atomic_read(&inode->i_count);
if (!sb ||
get_kernel_nofault(s_type, &sb->s_type) || !s_type ||
get_kernel_nofault(fs_name_ptr, &s_type->name) || !fs_name_ptr ||
strncpy_from_kernel_nofault(fs_name, fs_name_ptr, sizeof(fs_name) - 1) < 0)
strscpy(fs_name, "<unknown, sb unreadable>");
pr_warn("%s: inode:%px fs:%s mode:%ho opflags:%#x flags:%#x state:%#x count:%d\n",
reason, inode, fs_name, mode, opflags, flags, state, count);
}
EXPORT_SYMBOL(dump_inode);
#endif

@@ -61,7 +61,7 @@ static int __jfs_set_acl(tid_t tid, struct inode *inode, int type,
{
char *ea_name;
int rc;
int size = 0;
size_t size = 0;
char *value = NULL;
switch (type) {
@@ -76,16 +76,11 @@ }
}
if (acl) {
size = posix_acl_xattr_size(acl->a_count);
value = kmalloc(size, GFP_KERNEL);
value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_KERNEL);
if (!value)
return -ENOMEM;
rc = posix_acl_to_xattr(&init_user_ns, acl, value, size);
if (rc < 0)
goto out;
}
rc = __jfs_setxattr(tid, inode, ea_name, value, size, 0);
out:
kfree(value);
if (!rc)

@@ -178,7 +178,6 @@ locks_get_lock_context(struct inode *inode, int type)
{
struct file_lock_context *ctx;
/* paired with cmpxchg() below */
ctx = locks_inode_context(inode);
if (likely(ctx) || type == F_UNLCK)
goto out;
@@ -196,7 +195,18 @@ locks_get_lock_context(struct inode *inode, int type)
* Assign the pointer if it's not already assigned. If it is, then
* free the context we just allocated.
*/
if (cmpxchg(&inode->i_flctx, NULL, ctx)) {
spin_lock(&inode->i_lock);
if (!(inode->i_opflags & IOP_FLCTX)) {
VFS_BUG_ON_INODE(inode->i_flctx, inode);
WRITE_ONCE(inode->i_flctx, ctx);
/*
* Paired with locks_inode_context().
*/
smp_store_release(&inode->i_opflags, inode->i_opflags | IOP_FLCTX);
spin_unlock(&inode->i_lock);
} else {
VFS_BUG_ON_INODE(!inode->i_flctx, inode);
spin_unlock(&inode->i_lock);
kmem_cache_free(flctx_cache, ctx);
ctx = locks_inode_context(inode);
}

@@ -879,7 +879,7 @@ static bool try_to_unlazy(struct nameidata *nd)
{
struct dentry *parent = nd->path.dentry;
BUG_ON(!(nd->flags & LOOKUP_RCU));
VFS_BUG_ON(!(nd->flags & LOOKUP_RCU));
if (unlikely(nd->flags & LOOKUP_CACHED)) {
drop_links(nd);
@@ -919,7 +919,8 @@ out:
static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
{
int res;
BUG_ON(!(nd->flags & LOOKUP_RCU));
VFS_BUG_ON(!(nd->flags & LOOKUP_RCU));
if (unlikely(nd->flags & LOOKUP_CACHED)) {
drop_links(nd);
@@ -1631,9 +1632,6 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path)
struct dentry *dentry = path->dentry;
unsigned int flags = dentry->d_flags;
if (likely(!(flags & DCACHE_MANAGED_DENTRY)))
return true;
if (unlikely(nd->flags & LOOKUP_NO_XDEV))
return false;

@@ -49,20 +49,14 @@ static unsigned int mp_hash_shift __ro_after_init;
static __initdata unsigned long mhash_entries;
static int __init set_mhash_entries(char *str)
{
if (!str)
return 0;
mhash_entries = simple_strtoul(str, &str, 0);
return 1;
return kstrtoul(str, 0, &mhash_entries) == 0;
}
__setup("mhash_entries=", set_mhash_entries);
static __initdata unsigned long mphash_entries;
static int __init set_mphash_entries(char *str)
{
if (!str)
return 0;
mphash_entries = simple_strtoul(str, &str, 0);
return 1;
return kstrtoul(str, 0, &mphash_entries) == 0;
}
__setup("mphash_entries=", set_mphash_entries);

@@ -641,13 +641,9 @@ static noinline int ntfs_set_acl_ex(struct mnt_idmap *idmap,
value = NULL;
flags = XATTR_REPLACE;
} else {
size = posix_acl_xattr_size(acl->a_count);
value = kmalloc(size, GFP_NOFS);
value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_NOFS);
if (!value)
return -ENOMEM;
err = posix_acl_to_xattr(&init_user_ns, acl, value, size);
if (err < 0)
goto out;
flags = 0;
}

@@ -90,14 +90,9 @@ int __orangefs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
type);
if (acl) {
size = posix_acl_xattr_size(acl->a_count);
value = kmalloc(size, GFP_KERNEL);
value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_KERNEL);
if (!value)
return -ENOMEM;
error = posix_acl_to_xattr(&init_user_ns, acl, value, size);
if (error < 0)
goto out;
}
gossip_debug(GOSSIP_ACL_DEBUG,
@ -111,7 +106,6 @@ int __orangefs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
*/
error = orangefs_inode_setxattr(inode, name, value, size, 0);
out:
kfree(value);
if (!error)
set_cached_acl(inode, type, acl);

@@ -829,19 +829,19 @@ EXPORT_SYMBOL (posix_acl_from_xattr);
/*
* Convert from in-memory to extended attribute representation.
*/
int
void *
posix_acl_to_xattr(struct user_namespace *user_ns, const struct posix_acl *acl,
void *buffer, size_t size)
size_t *sizep, gfp_t gfp)
{
struct posix_acl_xattr_header *ext_acl = buffer;
struct posix_acl_xattr_header *ext_acl;
struct posix_acl_xattr_entry *ext_entry;
int real_size, n;
size_t size;
int n;
real_size = posix_acl_xattr_size(acl->a_count);
if (!buffer)
return real_size;
if (real_size > size)
return -ERANGE;
size = posix_acl_xattr_size(acl->a_count);
ext_acl = kmalloc(size, gfp);
if (!ext_acl)
return NULL;
ext_entry = (void *)(ext_acl + 1);
ext_acl->a_version = cpu_to_le32(POSIX_ACL_XATTR_VERSION);
@@ -864,7 +864,8 @@ posix_acl_to_xattr(struct user_namespace *user_ns, const struct posix_acl *acl,
break;
}
}
return real_size;
*sizep = size;
return ext_acl;
}
EXPORT_SYMBOL (posix_acl_to_xattr);

@@ -1038,14 +1038,11 @@ static long do_restart_poll(struct restart_block *restart_block)
{
struct pollfd __user *ufds = restart_block->poll.ufds;
int nfds = restart_block->poll.nfds;
struct timespec64 *to = NULL, end_time;
struct timespec64 *to = NULL;
int ret;
if (restart_block->poll.has_timeout) {
end_time.tv_sec = restart_block->poll.tv_sec;
end_time.tv_nsec = restart_block->poll.tv_nsec;
to = &end_time;
}
if (restart_block->poll.has_timeout)
to = &restart_block->poll.end_time;
ret = do_sys_poll(ufds, nfds, to);
@@ -1077,8 +1074,7 @@ SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
restart_block->poll.nfds = nfds;
if (timeout_msecs >= 0) {
restart_block->poll.tv_sec = end_time.tv_sec;
restart_block->poll.tv_nsec = end_time.tv_nsec;
restart_block->poll.end_time = end_time;
restart_block->poll.has_timeout = 1;
} else
restart_block->poll.has_timeout = 0;


@@ -1467,7 +1467,7 @@ static ssize_t iter_to_pipe(struct iov_iter *from,
n = DIV_ROUND_UP(left + start, PAGE_SIZE);
for (i = 0; i < n; i++) {
int size = min_t(int, left, PAGE_SIZE - start);
int size = umin(left, PAGE_SIZE - start);
buf.page = pages[i];
buf.offset = start;


@@ -21,7 +21,7 @@ static inline int devcgroup_inode_permission(struct inode *inode, int mask)
if (likely(!S_ISBLK(inode->i_mode) && !S_ISCHR(inode->i_mode)))
return 0;
if (likely(!inode->i_rdev))
if (!inode->i_rdev)
return 0;
if (S_ISBLK(inode->i_mode))


@@ -201,9 +201,9 @@ struct handle_to_path_ctx {
* @commit_metadata: commit metadata changes to stable storage
*
* See Documentation/filesystems/nfs/exporting.rst for details on how to use
* this interface correctly.
* this interface correctly and the definition of the flags.
*
* encode_fh:
* @encode_fh:
* @encode_fh should store in the file handle fragment @fh (using at most
* @max_len bytes) information that can be used by @decode_fh to recover the
* file referred to by the &struct dentry @de. If @flag has CONNECTABLE bit
@@ -215,7 +215,7 @@ struct handle_to_path_ctx {
* greater than @max_len*4 bytes). On error @max_len contains the minimum
* size(in 4 byte unit) needed to encode the file handle.
*
* fh_to_dentry:
* @fh_to_dentry:
* @fh_to_dentry is given a &struct super_block (@sb) and a file handle
* fragment (@fh, @fh_len). It should return a &struct dentry which refers
* to the same file that the file handle fragment refers to. If it cannot,
@@ -227,31 +227,44 @@ struct handle_to_path_ctx {
* created with d_alloc_root. The caller can then find any other extant
* dentries by following the d_alias links.
*
* fh_to_parent:
* @fh_to_parent:
* Same as @fh_to_dentry, except that it returns a pointer to the parent
* dentry if it was encoded into the filehandle fragment by @encode_fh.
*
* get_name:
* @get_name:
* @get_name should find a name for the given @child in the given @parent
* directory. The name should be stored in the @name (with the
* understanding that it is already pointing to a %NAME_MAX+1 sized
* understanding that it is already pointing to a %NAME_MAX + 1 sized
* buffer. get_name() should return %0 on success, a negative error code
* or error. @get_name will be called without @parent->i_rwsem held.
*
* get_parent:
* @get_parent:
* @get_parent should find the parent directory for the given @child which
* is also a directory. In the event that it cannot be found, or storage
* space cannot be allocated, a %ERR_PTR should be returned.
*
* permission:
* @permission:
* Allow filesystems to specify a custom permission function.
*
* open:
* @open:
* Allow filesystems to specify a custom open function.
*
* commit_metadata:
* @commit_metadata:
* @commit_metadata should commit metadata changes to stable storage.
*
* @get_uuid:
* Get a filesystem unique signature exposed to clients.
*
* @map_blocks:
* Map and, if necessary, allocate blocks for a layout.
*
* @commit_blocks:
* Commit blocks in a layout once the client is done with them.
*
* @flags:
* Allows the filesystem to communicate to nfsd that it may want to do things
* differently when dealing with it.
*
* Locking rules:
* get_parent is called with child->d_inode->i_rwsem down
* get_name is not (which is possibly inconsistent)


@@ -242,7 +242,14 @@ bool locks_owner_has_blockers(struct file_lock_context *flctx,
static inline struct file_lock_context *
locks_inode_context(const struct inode *inode)
{
return smp_load_acquire(&inode->i_flctx);
/*
* Paired with smp_store_release in locks_get_lock_context().
*
* Ensures ->i_flctx will be visible if we spotted the flag.
*/
if (likely(!(smp_load_acquire(&inode->i_opflags) & IOP_FLCTX)))
return NULL;
return READ_ONCE(inode->i_flctx);
}
#else /* !CONFIG_FILE_LOCKING */
@@ -469,7 +476,7 @@ static inline int break_lease(struct inode *inode, unsigned int mode)
* could end up racing with tasks trying to set a new lease on this
* file.
*/
flctx = READ_ONCE(inode->i_flctx);
flctx = locks_inode_context(inode);
if (!flctx)
return 0;
smp_mb();
@@ -488,7 +495,7 @@ static inline int break_deleg(struct inode *inode, unsigned int flags)
* could end up racing with tasks trying to set a new lease on this
* file.
*/
flctx = READ_ONCE(inode->i_flctx);
flctx = locks_inode_context(inode);
if (!flctx)
return 0;
smp_mb();
@@ -533,8 +540,11 @@ static inline int break_deleg_wait(struct delegated_inode *di)
static inline int break_layout(struct inode *inode, bool wait)
{
struct file_lock_context *flctx;
smp_mb();
if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease)) {
flctx = locks_inode_context(inode);
if (flctx && !list_empty_careful(&flctx->flc_lease)) {
unsigned int flags = LEASE_BREAK_LAYOUT;
if (!wait)

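The flag-then-pointer idiom behind the locks_inode_context() change can be mimicked in userspace with C11 atomics. This is only a sketch of the publish/lookup ordering: `FAKE_IOP_FLCTX`, `fake_inode`, and the helpers are made-up names, and the kernel uses smp_store_release()/smp_load_acquire() (with a consume fence on some architectures) rather than these primitives:

```c
#include <stdatomic.h>
#include <stddef.h>

#define FAKE_IOP_FLCTX 0x0100	/* mirrors the new IOP_FLCTX bit */

struct fake_inode {
	_Atomic unsigned int i_opflags;
	void *_Atomic i_flctx;
};

/* Writer: store the context first, then set the flag with release
 * semantics so the pointer is visible to anyone who sees the flag. */
static void fake_set_flctx(struct fake_inode *inode, void *ctx)
{
	atomic_store_explicit(&inode->i_flctx, ctx, memory_order_relaxed);
	atomic_fetch_or_explicit(&inode->i_opflags, FAKE_IOP_FLCTX,
				 memory_order_release);
}

/* Reader: the common case touches only ->i_opflags, which the open/close
 * fast paths already read, instead of the line holding ->i_flctx (which
 * false-shares with ->i_readcount). */
static void *fake_locks_inode_context(struct fake_inode *inode)
{
	if (!(atomic_load_explicit(&inode->i_opflags,
				   memory_order_acquire) & FAKE_IOP_FLCTX))
		return NULL;
	return atomic_load_explicit(&inode->i_flctx, memory_order_relaxed);
}
```

Since the flag is set exactly once and never cleared while the inode lives, the lookup never needs to re-check the pointer after the early return.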

@@ -631,6 +631,7 @@ is_uncached_acl(struct posix_acl *acl)
#define IOP_MGTIME 0x0020
#define IOP_CACHED_LINK 0x0040
#define IOP_FASTPERM_MAY_EXEC 0x0080
#define IOP_FLCTX 0x0100
/*
* Inode state bits. Protected by inode->i_lock


@@ -108,11 +108,13 @@ extern const struct proc_ns_operations utsns_operations;
* @ns_tree: namespace tree nodes and active reference count
*/
struct ns_common {
struct {
refcount_t __ns_ref; /* do not use directly */
} ____cacheline_aligned_in_smp;
u32 ns_type;
struct dentry *stashed;
const struct proc_ns_operations *ops;
unsigned int inum;
refcount_t __ns_ref; /* do not use directly */
union {
struct ns_tree;
struct rcu_head ns_rcu;

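The padding trick can be illustrated with C11 alignment. This is a hypothetical userspace analogue (`fake_ns_common`, a hard-coded 64-byte line); the kernel's ____cacheline_aligned_in_smp uses the architecture's real L1 line size:

```c
#include <stdalign.h>
#include <stdatomic.h>

#define FAKE_CACHELINE 64	/* assumed cache line size */

/* The refcount sits alone in an anonymous struct aligned (and therefore
 * padded) to a cache line, so hot get/put traffic on it cannot
 * false-share with the read-mostly fields that follow. */
struct fake_ns_common {
	struct {
		alignas(FAKE_CACHELINE) _Atomic int ns_ref;
	};
	unsigned int ns_type;
	const void *ops;
	unsigned int inum;
};
```

The cost is memory: each namespace now burns most of a cache line on the refcount, which is why this is worth doing only for objects whose refcount is genuinely contended.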

@@ -44,8 +44,9 @@ posix_acl_from_xattr(struct user_namespace *user_ns, const void *value,
}
#endif
int posix_acl_to_xattr(struct user_namespace *user_ns,
const struct posix_acl *acl, void *buffer, size_t size);
extern void *posix_acl_to_xattr(struct user_namespace *user_ns, const struct posix_acl *acl,
size_t *sizep, gfp_t gfp);
static inline const char *posix_acl_xattr_name(int type)
{
switch (type) {


@@ -6,6 +6,7 @@
#define __LINUX_RESTART_BLOCK_H
#include <linux/compiler.h>
#include <linux/time64.h>
#include <linux/types.h>
struct __kernel_timespec;
@@ -50,8 +51,7 @@ struct restart_block {
struct pollfd __user *ufds;
int nfds;
int has_timeout;
unsigned long tv_sec;
unsigned long tv_nsec;
struct timespec64 end_time;
} poll;
};
};


@@ -624,8 +624,9 @@ config SCHED_HW_PRESSURE
arch_update_hw_pressure() and arch_scale_thermal_pressure().
config BSD_PROCESS_ACCT
bool "BSD Process Accounting"
bool "BSD Process Accounting (DEPRECATED)"
depends on MULTIUSER
default n
help
If you say Y here, a user level program will be able to instruct the
kernel (via a special system call) to write process accounting
@@ -635,7 +636,9 @@ config BSD_PROCESS_ACCT
command name, memory usage, controlling terminal etc. (the complete
list is in the struct acct in <file:include/linux/acct.h>). It is
up to the user level program to do useful things with this
information. This is generally a good idea, so say Y.
information. This mechanism is antiquated and has significant
scalability issues. You probably want to use eBPF instead. Say
N unless you really need this.
config BSD_PROCESS_ACCT_V3
bool "BSD Process Accounting version 3 file format"


@@ -447,6 +447,53 @@ out:
kfree(tbufs);
}
static void __init initramfs_test_fname_path_max(struct kunit *test)
{
char *err;
size_t len;
struct kstat st0, st1;
char fdata[] = "this file data will not be unpacked";
struct test_fname_path_max {
char fname_oversize[PATH_MAX + 1];
char fname_ok[PATH_MAX];
char cpio_src[(CPIO_HDRLEN + PATH_MAX + 3 + sizeof(fdata)) * 2];
} *tbufs = kzalloc(sizeof(struct test_fname_path_max), GFP_KERNEL);
struct initramfs_test_cpio c[] = { {
.magic = "070701",
.ino = 1,
.mode = S_IFDIR | 0777,
.nlink = 1,
.namesize = sizeof(tbufs->fname_oversize),
.fname = tbufs->fname_oversize,
.filesize = sizeof(fdata),
.data = fdata,
}, {
.magic = "070701",
.ino = 2,
.mode = S_IFDIR | 0777,
.nlink = 1,
.namesize = sizeof(tbufs->fname_ok),
.fname = tbufs->fname_ok,
} };
memset(tbufs->fname_oversize, '/', sizeof(tbufs->fname_oversize) - 1);
memset(tbufs->fname_ok, '/', sizeof(tbufs->fname_ok) - 1);
memcpy(tbufs->fname_oversize, "fname_oversize",
sizeof("fname_oversize") - 1);
memcpy(tbufs->fname_ok, "fname_ok", sizeof("fname_ok") - 1);
len = fill_cpio(c, ARRAY_SIZE(c), tbufs->cpio_src);
/* unpack skips over fname_oversize instead of returning an error */
err = unpack_to_rootfs(tbufs->cpio_src, len);
KUNIT_EXPECT_NULL(test, err);
KUNIT_EXPECT_EQ(test, init_stat("fname_oversize", &st0, 0), -ENOENT);
KUNIT_EXPECT_EQ(test, init_stat("fname_ok", &st1, 0), 0);
KUNIT_EXPECT_EQ(test, init_rmdir("fname_ok"), 0);
kfree(tbufs);
}
/*
* The kunit_case/_suite struct cannot be marked as __initdata as this will be
* used in debugfs to retrieve results after test has run.
@@ -459,6 +506,7 @@ static struct kunit_case __refdata initramfs_test_cases[] = {
KUNIT_CASE(initramfs_test_hardlink),
KUNIT_CASE(initramfs_test_many),
KUNIT_CASE(initramfs_test_fname_pad),
KUNIT_CASE(initramfs_test_fname_path_max),
{},
};


@@ -159,58 +159,86 @@ void free_pids(struct pid **pids)
free_pid(pids[tmp]);
}
struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
size_t set_tid_size)
struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
size_t arg_set_tid_size)
{
int set_tid[MAX_PID_NS_LEVEL + 1] = {};
int pid_max[MAX_PID_NS_LEVEL + 1] = {};
struct pid *pid;
enum pid_type type;
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
int retval = -ENOMEM;
bool retried_preload;
/*
* set_tid_size contains the size of the set_tid array. Starting at
* arg_set_tid_size contains the size of the arg_set_tid array. Starting at
* the most nested currently active PID namespace it tells alloc_pid()
* which PID to set for a process in that most nested PID namespace
* up to set_tid_size PID namespaces. It does not have to set the PID
* for a process in all nested PID namespaces but set_tid_size must
* up to arg_set_tid_size PID namespaces. It does not have to set the PID
* for a process in all nested PID namespaces but arg_set_tid_size must
* never be greater than the current ns->level + 1.
*/
if (set_tid_size > ns->level + 1)
if (arg_set_tid_size > ns->level + 1)
return ERR_PTR(-EINVAL);
/*
* Prep before we take locks:
*
* 1. allocate and fill in pid struct
*/
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
return ERR_PTR(retval);
tmp = ns;
get_pid_ns(ns);
pid->level = ns->level;
refcount_set(&pid->count, 1);
spin_lock_init(&pid->lock);
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
init_waitqueue_head(&pid->wait_pidfd);
INIT_HLIST_HEAD(&pid->inodes);
for (i = ns->level; i >= 0; i--) {
int tid = 0;
int pid_max = READ_ONCE(tmp->pid_max);
/*
* 2. validate the requested TIDs and do the
* checkpoint_restore_ns_capable() permission check
*
* The pid_max seen for each level is stored so that later code is
* guaranteed to use the same value it was validated against.
*/
for (tmp = ns, i = ns->level; i >= 0; i--) {
pid_max[ns->level - i] = READ_ONCE(tmp->pid_max);
if (set_tid_size) {
tid = set_tid[ns->level - i];
if (arg_set_tid_size) {
int tid = set_tid[ns->level - i] = arg_set_tid[ns->level - i];
retval = -EINVAL;
if (tid < 1 || tid >= pid_max)
goto out_free;
if (tid < 1 || tid >= pid_max[ns->level - i])
goto out_abort;
/*
* Also fail if a PID != 1 is requested and
* no PID 1 exists.
*/
if (tid != 1 && !tmp->child_reaper)
goto out_free;
goto out_abort;
retval = -EPERM;
if (!checkpoint_restore_ns_capable(tmp->user_ns))
goto out_free;
set_tid_size--;
goto out_abort;
arg_set_tid_size--;
}
idr_preload(GFP_KERNEL);
spin_lock(&pidmap_lock);
tmp = tmp->parent;
}
/*
* Prep is done, id allocation goes here:
*/
retried_preload = false;
idr_preload(GFP_KERNEL);
spin_lock(&pidmap_lock);
for (tmp = ns, i = ns->level; i >= 0;) {
int tid = set_tid[ns->level - i];
if (tid) {
nr = idr_alloc(&tmp->idr, NULL, tid,
@@ -220,6 +248,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
* already in use. Return EEXIST in that case.
*/
if (nr == -ENOSPC)
nr = -EEXIST;
} else {
int pid_min = 1;
@@ -235,19 +264,42 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
* a partially initialized PID (see below).
*/
nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
pid_max, GFP_ATOMIC);
pid_max[ns->level - i], GFP_ATOMIC);
if (nr == -ENOSPC)
nr = -EAGAIN;
}
spin_unlock(&pidmap_lock);
idr_preload_end();
if (nr < 0) {
retval = (nr == -ENOSPC) ? -EAGAIN : nr;
if (unlikely(nr < 0)) {
/*
* Preload more memory if idr_alloc{,cyclic} failed with -ENOMEM.
*
* The IDR API only allows us to preload memory for one call, while we may end
* up doing several under pidmap_lock with GFP_ATOMIC. The situation may be
* salvageable with GFP_KERNEL. But make sure to not loop indefinitely if preload
* did not help (the routine unfortunately returns void, so we have no idea
* if it got anywhere).
*
* The lock can be safely dropped and picked up as historically pid allocation
* for different namespaces was *not* atomic -- we try to hold on to it the
* entire time only for performance reasons.
*/
if (nr == -ENOMEM && !retried_preload) {
spin_unlock(&pidmap_lock);
idr_preload_end();
retried_preload = true;
idr_preload(GFP_KERNEL);
spin_lock(&pidmap_lock);
continue;
}
retval = nr;
goto out_free;
}
pid->numbers[i].nr = nr;
pid->numbers[i].ns = tmp;
tmp = tmp->parent;
i--;
retried_preload = false;
}
/*
@@ -257,25 +309,15 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
* is what we have exposed to userspace for a long time and it is
* documented behavior for pid namespaces. So we can't easily
* change it even if there were an error code better suited.
*
* This can't be done earlier because we need to preserve other
* error conditions.
*/
retval = -ENOMEM;
get_pid_ns(ns);
refcount_set(&pid->count, 1);
spin_lock_init(&pid->lock);
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
init_waitqueue_head(&pid->wait_pidfd);
INIT_HLIST_HEAD(&pid->inodes);
upid = pid->numbers + ns->level;
idr_preload(GFP_KERNEL);
spin_lock(&pidmap_lock);
if (!(ns->pid_allocated & PIDNS_ADDING))
goto out_unlock;
if (unlikely(!(ns->pid_allocated & PIDNS_ADDING)))
goto out_free;
pidfs_add_pid(pid);
for ( ; upid >= pid->numbers; --upid) {
for (upid = pid->numbers + ns->level; upid >= pid->numbers; --upid) {
/* Make the PID visible to find_pid_ns. */
idr_replace(&upid->ns->idr, pid, upid->nr);
upid->ns->pid_allocated++;
@@ -286,13 +328,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
return pid;
out_unlock:
spin_unlock(&pidmap_lock);
idr_preload_end();
put_pid_ns(ns);
out_free:
spin_lock(&pidmap_lock);
while (++i <= ns->level) {
upid = pid->numbers + i;
idr_remove(&upid->ns->idr, upid->nr);
@@ -303,7 +339,10 @@ out_free:
idr_set_cursor(&ns->idr, 0);
spin_unlock(&pidmap_lock);
idr_preload_end();
out_abort:
put_pid_ns(ns);
kmem_cache_free(ns->pid_cachep, pid);
return ERR_PTR(retval);
}
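The preload-retry control flow above can be reduced to a small sketch. Everything here (`fake_preload`, `fake_idr_alloc_atomic`, the id value 42, the commented-out lock calls) is hypothetical and only models the shape: allocate every level under a single lock hold, and on -ENOMEM drop the lock at most once per level to preload with blocking semantics:

```c
#include <stdbool.h>

#define FAKE_ENOMEM (-12)

/* Fake allocator: the GFP_ATOMIC-style attempt succeeds only when a
 * preceding preload stocked memory, mimicking idr_alloc() under
 * pidmap_lock after idr_preload(). The first unstocked call fails. */
static bool fake_preloaded;
static int fake_failures_left = 1;

static void fake_preload(void)
{
	fake_preloaded = true;
}

static int fake_idr_alloc_atomic(void)
{
	if (!fake_preloaded && fake_failures_left > 0) {
		fake_failures_left--;
		return FAKE_ENOMEM;
	}
	fake_preloaded = false;	/* each allocation consumes the preload */
	return 42;		/* some allocated id */
}

/* One pass over all levels under a single (commented-out) lock hold;
 * -ENOMEM triggers at most one unlock/preload/relock per level, so the
 * loop cannot spin forever if preloading does not help. */
static int fake_alloc_ids(int levels)
{
	bool retried = false;
	int i, nr = FAKE_ENOMEM;

	fake_preload();
	/* spin_lock(&pidmap_lock); */
	for (i = 0; i < levels;) {
		nr = fake_idr_alloc_atomic();
		if (nr == FAKE_ENOMEM && !retried) {
			/* spin_unlock(&pidmap_lock); */
			fake_preload();	/* blocking allocation is legal here */
			/* spin_lock(&pidmap_lock); */
			retried = true;
			continue;
		}
		if (nr < 0)
			break;
		i++;
		retried = false;
	}
	/* spin_unlock(&pidmap_lock); */
	return nr;
}
```

Dropping and retaking the lock mid-loop is safe for the same reason the real code gives: allocation across nested namespaces was never atomic, so holding the lock across all levels is purely a performance optimization.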