Merge kmem_cache_create() refactoring by Christian Brauner.
Note this includes a merge of the vfs.file tree that contains the
prerequisite kmem_cache_create_rcu() work.
Merge most of SLUB feature work for 6.12:
- Barrier for pending kfree_rcu() in kmem_cache_destroy() and associated
refactoring of the destroy path (Vlastimil Babka)
- CONFIG_SLUB_RCU_DEBUG to allow KASAN catching UAF bugs in
SLAB_TYPESAFE_BY_RCU caches (Jann Horn)
- kmem_cache_charge() for delayed kmemcg charging (Shakeel Butt)
As kmem_cache_create() became a _Generic() wrapper macro, it currently
has no kerneldoc despite being the main API to use. Add it. Also adjust
kmem_cache_create_usercopy() kerneldoc to indicate it is now a legacy
wrapper.
Also expand the kerneldoc for struct kmem_cache_args, especially for the
freeptr_offset field, where important details were removed with the
removal of kmem_cache_create_rcu().
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Make __kmem_cache_create() a static inline function.
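The wrapper might end up looking roughly like this (a sketch; the
struct kmem_cache_args fields shown are the ones introduced by this
series):

  static inline struct kmem_cache *
  __kmem_cache_create(const char *name, unsigned int size,
                      unsigned int align, slab_flags_t flags,
                      void (*ctor)(void *))
  {
          struct kmem_cache_args args = {
                  .align = align,
                  .ctor  = ctor,
          };

          return __kmem_cache_create_args(name, size, &args, flags);
  }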
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Make kmem_cache_create_usercopy() a static inline function.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Now that we have ported all users of kmem_cache_create_rcu() to struct
kmem_cache_args, the function is unused and can be removed.
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Use _Generic() to create a compatibility layer that type switches on the
third argument to either call __kmem_cache_create() or
__kmem_cache_create_args(). If NULL is passed for the struct
kmem_cache_args argument, use default args, making porting easy for
callers that don't care about additional arguments.
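The dispatch might look roughly like this (a sketch of the described
_Generic() layer, simplified from the actual macro):

  #define kmem_cache_create(__name, __object_size, __args, ...)       \
          _Generic((__args),                                          \
                  struct kmem_cache_args *: __kmem_cache_create_args, \
                  void *: __kmem_cache_default_args,                  \
                  default: __kmem_cache_create)                       \
                  (__name, __object_size, __args, __VA_ARGS__)

Passing a literal NULL matches the void * branch, so
__kmem_cache_default_args() can substitute empty default args.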
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Pass down struct kmem_cache_args to calculate_sizes() so we can use
args->{use}_freeptr_offset directly. This allows us to remove
->rcu_freeptr_offset from struct kmem_cache.
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Pass struct kmem_cache_args to do_kmem_cache_create() and initialize
most things there. In a follow-up patch we'll remove rcu_freeptr_offset
from struct kmem_cache.
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
do_kmem_cache_create() is the only caller and we're going to pass down
struct kmem_cache_args in a follow-up patch.
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Pass struct kmem_cache_args to create_cache() so that we can later
simplify further helpers.
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Currently we have multiple kmem_cache_create*() variants that take up to
seven separate parameters with one of the functions having to grow an
eighth parameter in the future to handle both usercopy and a custom
freelist pointer.
Add a struct kmem_cache_args structure and move less common parameters
into it. Core parameters such as name, object size, and flags continue
to be passed separately.
Add a new function __kmem_cache_create_args() that takes a struct
kmem_cache_args pointer and port do_kmem_cache_create_usercopy() over to
it.
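As a sketch, a caller using the new interface might look like this
(struct foo and its fields are illustrative):

  struct kmem_cache_args args = {
          .align      = __alignof__(struct foo),
          .useroffset = offsetof(struct foo, data),
          .usersize   = sizeof_field(struct foo, data),
  };
  struct kmem_cache *s;

  s = __kmem_cache_create_args("foo", sizeof(struct foo), &args,
                               SLAB_HWCACHE_ALIGN);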
In follow-up patches we will port the other kmem_cache_create*()
variants over to it as well.
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Merge prerequisites from the vfs git tree for the following series that
introduces kmem_cache_args. The vfs.file branch includes the addition of
kmem_cache_create_rcu() which was needed in vfs for the filp cache
optimization. The following series refactors this code.
At the moment, the slab objects are charged to the memcg at the
allocation time. However there are cases where slab objects are
allocated at the time where the right target memcg to charge it to is
not known. One such case is the network sockets for the incoming
connection which are allocated in the softirq context.
A couple hundred thousand connections are very normal on a large loaded
server, and almost all of the sockets underlying those connections are
allocated in the softirq context and thus not charged to any memcg.
However, later at accept() time we know the right target memcg to
charge. Let's add a new API to charge already allocated objects, so we can
have better accounting of the memory usage.
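A hedged sketch of the intended usage (the socket details are
illustrative, and the exact signature is assumed from this
description):

  struct sock *sk;

  /* softirq context: the right memcg is not known yet, so the
   * object is allocated uncharged */
  sk = kmem_cache_alloc(sk_cachep, GFP_ATOMIC);

  ...

  /* later, at accept() time in the right task context, charge the
   * already allocated object to that task's memcg */
  if (!kmem_cache_charge(sk, GFP_KERNEL))
          goto drop;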
To measure the performance impact of this change, tcp_crr is used from
the neper [1] performance suite. Basically it is a network ping-pong
test with a new connection for each ping-pong.
The server and the client are run inside 3 level of cgroup hierarchy
using the following commands:
Server:
$ tcp_crr -6
Client:
$ tcp_crr -6 -c -H ${server_ip}
If the client and server run on different machines with a 50 Gbps NIC,
there is no visible impact of the change.
For the same-machine experiment, with v6.11-rc5 as base:

            base (throughput)   with-patch
  tcp_crr   14545 (+- 80)       14463 (+- 56)

It seems like the performance impact is within the noise.
Link: https://github.com/google/neper [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com> # net
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
We can assess the flags first; if the cache is unmergeable, there's no
need to calculate its size and alignment at all.
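A sketch of the reordering in find_mergeable() (simplified; the exact
checks follow the existing code):

  /* evaluate the cheap mergeability checks first */
  flags = kmem_cache_flags(flags, name);
  if (flags & SLAB_NEVER_MERGE)
          return NULL;

  /* only then do the more expensive size/alignment calculation */
  size = ALIGN(size, sizeof(void *));
  align = calculate_alignment(flags, align, size);
  size = ALIGN(size, align);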
Signed-off-by: Xavier <xavier_qy@163.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Since commit 946fa0dbf2 ("mm/slub: extend redzone check to extra
allocated kmalloc space than requested"), setting orig_size treats
the wasted space (object_size - orig_size) as a redzone. However, with
init_on_free=1 we clear the full object_size, including the redzone.
Additionally we clear the object metadata, including the stored orig_size,
making it zero, which makes check_object() treat the whole object as a
redzone.
These issues lead to the following BUG report with "slub_debug=FUZ
init_on_free=1":
[ 0.000000] =============================================================================
[ 0.000000] BUG kmalloc-8 (Not tainted): kmalloc Redzone overwritten
[ 0.000000] -----------------------------------------------------------------------------
[ 0.000000]
[ 0.000000] 0xffff000010032858-0xffff00001003285f @offset=2136. First byte 0x0 instead of 0xcc
[ 0.000000] FIX kmalloc-8: Restoring kmalloc Redzone 0xffff000010032858-0xffff00001003285f=0xcc
[ 0.000000] Slab 0xfffffdffc0400c80 objects=36 used=23 fp=0xffff000010032a18 flags=0x3fffe0000000200(workingset|node=0|zone=0|lastcpupid=0x1ffff)
[ 0.000000] Object 0xffff000010032858 @offset=2136 fp=0xffff0000100328c8
[ 0.000000]
[ 0.000000] Redzone ffff000010032850: cc cc cc cc cc cc cc cc ........
[ 0.000000] Object ffff000010032858: cc cc cc cc cc cc cc cc ........
[ 0.000000] Redzone ffff000010032860: cc cc cc cc cc cc cc cc ........
[ 0.000000] Padding ffff0000100328b4: 00 00 00 00 00 00 00 00 00 00 00 00 ............
[ 0.000000] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.11.0-rc3-next-20240814-00004-g61844c55c3f4 #144
[ 0.000000] Hardware name: NXP i.MX95 19X19 board (DT)
[ 0.000000] Call trace:
[ 0.000000] dump_backtrace+0x90/0xe8
[ 0.000000] show_stack+0x18/0x24
[ 0.000000] dump_stack_lvl+0x74/0x8c
[ 0.000000] dump_stack+0x18/0x24
[ 0.000000] print_trailer+0x150/0x218
[ 0.000000] check_object+0xe4/0x454
[ 0.000000] free_to_partial_list+0x2f8/0x5ec
To address the issue, use orig_size to clear the used area, and restore
the value of orig_size after clearing the remaining area.
When CONFIG_SLUB_DEBUG is not defined, get_orig_size() directly returns
s->object_size, so when using memset to init the area the size can
simply be orig_size. And orig_size can never be bigger than object_size.
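A sketch of the resulting logic in the free hook (get_orig_size() and
set_orig_size() are the existing mm/slub.c helpers; simplified):

  if (unlikely(init)) {
          unsigned int orig_size = get_orig_size(s, x);

          /* clear only the area the caller actually requested,
           * leaving the (object_size - orig_size) redzone intact */
          memset(kasan_reset_tag(x), 0, orig_size);

          /* the metadata area is cleared separately, which wipes the
           * stored orig_size; put it back so check_object() does not
           * treat the whole object as a redzone */
          set_orig_size(s, x, orig_size);
  }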
Fixes: 946fa0dbf2 ("mm/slub: extend redzone check to extra allocated kmalloc space than requested")
Cc: <stable@vger.kernel.org>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Christian Brauner <brauner@kernel.org> says:
When a kmem cache is created with SLAB_TYPESAFE_BY_RCU the free pointer
must be located outside of the object because we don't know what part of
the memory can safely be overwritten as it may be needed to prevent
object recycling.
That has the consequence that SLAB_TYPESAFE_BY_RCU may end up adding a
new cacheline. This is the case for e.g., struct file. After having it
shrunk down by 40 bytes and having it fit in three cachelines we still
have SLAB_TYPESAFE_BY_RCU adding a fourth cacheline because it needs to
accommodate the free pointer.
Add a new kmem_cache_create_rcu() function that allows the caller to
specify an offset where the free pointer is supposed to be placed.
Before this series cat /proc/slabinfo:

  filp  1198  1248  256  32  2 : tunables 0 0 0 : slabdata 39 39 0
                    ^^^

After this series cat /proc/slabinfo:

  filp  1323  1323  192  21  1 : tunables 0 0 0 : slabdata 63 63 0
                    ^^^
* patches from https://lore.kernel.org/r/20240828-work-kmem_cache-rcu-v3-0-5460bc1f09f6@kernel.org:
fs: use kmem_cache_create_rcu()
mm: add kmem_cache_create_rcu()
mm: remove unused root_cache argument
Link: https://lore.kernel.org/r/20240828-work-kmem_cache-rcu-v3-0-5460bc1f09f6@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Switch to the new kmem_cache_create_rcu() helper which allows us to use
a custom free pointer offset avoiding the need to have an external free
pointer which would grow struct file behind our backs.
Link: https://lore.kernel.org/r/20240828-work-kmem_cache-rcu-v3-3-5460bc1f09f6@kernel.org
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When a kmem cache is created with SLAB_TYPESAFE_BY_RCU the free pointer
must be located outside of the object because we don't know what part of
the memory can safely be overwritten as it may be needed to prevent
object recycling.
That has the consequence that SLAB_TYPESAFE_BY_RCU may end up adding a
new cacheline. This is the case for e.g., struct file. After having it
shrunk down by 40 bytes and having it fit in three cachelines we still
have SLAB_TYPESAFE_BY_RCU adding a fourth cacheline because it needs to
accommodate the free pointer.
Add a new kmem_cache_create_rcu() function that allows the caller to
specify an offset where the free pointer is supposed to be placed.
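The filp conversion mentioned above then uses it roughly like this (a
sketch; f_freeptr is assumed to be the struct file union member
reserved for the free pointer):

  filp_cachep = kmem_cache_create_rcu("filp", sizeof(struct file),
                                      offsetof(struct file, f_freeptr),
                                      SLAB_HWCACHE_ALIGN | SLAB_PANIC |
                                      SLAB_ACCOUNT | SLAB_TYPESAFE_BY_RCU);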
Link: https://lore.kernel.org/r/20240828-work-kmem_cache-rcu-v3-2-5460bc1f09f6@kernel.org
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now that we have shrunk struct file to 192 bytes, aka 3 cachelines,
reorder struct file to not leave any holes or have members cross
cachelines.
Add a short comment to each of the fields and mark the cachelines.
It's possible that we may have to tweak this based on profiling in the
future. So far I had Jens test this comparing io_uring with non-fixed
and fixed files and it improved performance. The layout is a combination
of Jens' and my changes.
Link: https://lore.kernel.org/r/20240824-peinigen-hocken-7384b977c643@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now that we shrank struct file by 24 bytes we still have a 4 byte hole.
If we move struct file_ra_state into the union and f_iocb_flags out of
the union we close that hole and bring down struct file to 192 bytes.
Which means struct file is 3 cachelines and we managed to shrink it by
40 bytes this cycle.
I've tried to audit all codepaths that use f_ra and none of them seem to
rely on it in file->f_op->release() and never have since commit
1da177e4c3 ("Linux-2.6.12-rc2").
Link: https://lore.kernel.org/r/20240823-luftdicht-berappen-d69a2166a0db@brauner
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
We currently embed struct fown_struct into struct file, letting it take
up 32 bytes in total. We could tweak struct fown_struct to be more
compact, but really it shouldn't even be embedded in struct file in the first place.
Instead, actual users of struct fown_struct should allocate the struct
on demand. This frees up 24 bytes in struct file.
That will have some potentially user-visible changes for the ownership
fcntl()s. Some of them can now fail due to allocation failures.
Practically, that probably will almost never happen as the allocations
are small and they only happen once per file.
The fown_struct is used during kill_fasync() which is used by e.g.,
pipes to generate a SIGIO signal. Sending of such signals is conditional
on userspace having set an owner for the file using one of the F_OWNER
fcntl()s. Such users will be unaffected if struct fown_struct is
allocated during the fcntl() call.
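A minimal sketch of that on-demand pattern in the F_SETOWN fcntl()
path (the helper name file_f_owner_allocate() is an assumption based
on this series):

  int ret;

  /* allocate struct fown_struct lazily, once per file */
  ret = file_f_owner_allocate(filp);
  if (ret)
          return ret;     /* the fcntl() may now fail with -ENOMEM */

  __f_setown(filp, pid, type, force);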
There are a few subsystems that call __f_setown() expecting
file->f_owner to be allocated:
(1) tun devices
file->f_op->fasync::tun_chr_fasync()
-> __f_setown()
There are no callers of tun_chr_fasync().
(2) tty devices
file->f_op->fasync::tty_fasync()
-> __tty_fasync()
-> __f_setown()
tty_fasync() has no additional callers but __tty_fasync() has. Note
that __tty_fasync() only calls __f_setown() if the @on argument is
true. It's called from:
file->f_op->release::tty_release()
-> tty_release()
-> __tty_fasync()
-> __f_setown()
tty_release() calls __tty_fasync() with @on false
=> __f_setown() is never called from tty_release().
=> All callers of tty_release() are safe as well.
__tty_hangup()
-> __tty_fasync()
-> __f_setown()
__tty_hangup() calls __tty_fasync() with @on false
=> __f_setown() is never called from __tty_hangup().
=> All callers of __tty_hangup() are safe as well.
From the callchains it's obvious that (1) and (2) end up getting called
via file->f_op->fasync(). That can happen either through the F_SETFL
fcntl() with the FASYNC flag raised or via the FIOASYNC ioctl(). If
FASYNC is requested and the file isn't already FASYNC then
file->f_op->fasync() is called with @on true which ends up causing both
(1) and (2) to call __f_setown().
(1) and (2) are the only subsystems that call __f_setown() from the
file->f_op->fasync() handler. So both (1) and (2) have been updated to
allocate a struct fown_struct prior to calling fasync_helper() to
register with the fasync infrastructure. That's safe as they both call
fasync_helper() which also does allocations if @on is true.
The other interesting case are file leases:
(3) file leases
lease_manager_ops->lm_setup::lease_setup()
-> __f_setown()
Which in turn is called from:
generic_add_lease()
-> lease_manager_ops->lm_setup::lease_setup()
-> __f_setown()
So here again we can simply make generic_add_lease() allocate struct
fown_struct prior to the lease_manager_ops->lm_setup::lease_setup()
which happens under a spinlock.
With that the two remaining subsystems that call __f_setown() are:
(4) dnotify
(5) sockets
Both have their own custom ioctls to set struct fown_struct and both
have been converted to allocate a struct fown_struct on demand from
their respective ioctls.
Interactions with O_PATH are fine as well e.g., when opening a /dev/tty
as O_PATH then no file->f_op->open() happens thus no file->f_owner is
allocated. That's fine as no file operation will be set for those and
the device has never been opened. fcntl()s called on such things will
just allocate a ->f_owner on demand, although I have zero idea why you'd
care about f_owner on an O_PATH fd.
Link: https://lore.kernel.org/r/20240813-work-f_owner-v2-1-4e9343a79f9f@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In kmem_buckets_create(), the kmem_buckets object is allocated by
kmem_cache_alloc() from kmem_buckets_cache, but in the failure case,
it's freed by kfree(). This is not wrong, but using kmem_cache_free()
is the more common pattern, so use it.
Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Currently, KASAN is unable to catch use-after-free in SLAB_TYPESAFE_BY_RCU
slabs because use-after-free is allowed within the RCU grace period by
design.
Add a SLUB debugging feature which RCU-delays every individual
kmem_cache_free() before either actually freeing the object or handing it
off to KASAN, and change KASAN to poison freed objects as normal when this
option is enabled.
For now I've configured Kconfig.debug to default-enable this feature in the
KASAN GENERIC and SW_TAGS modes; I'm not enabling it by default in HW_TAGS
mode because I'm not sure if it might have unwanted performance degradation
effects there.
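For reference, a config fragment enabling the feature explicitly might
look like this (the KASAN options shown are just one valid
combination):

  CONFIG_KASAN=y
  CONFIG_KASAN_GENERIC=y
  CONFIG_SLUB_DEBUG=y
  CONFIG_SLUB_RCU_DEBUG=y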
Note that this is mostly useful with KASAN in the quarantine-based GENERIC
mode; SLAB_TYPESAFE_BY_RCU slabs are basically always also slabs with a
->ctor, and KASAN's assign_tag() currently has to assign fixed tags for
those, reducing the effectiveness of SW_TAGS/HW_TAGS mode.
(A possible future extension of this work would be to also let SLUB call
the ->ctor() on every allocation instead of only when the slab page is
allocated; then tag-based modes would be able to assign new tags on every
reallocation.)
Tested-by: syzbot+263726e59eab6b442723@syzkaller.appspotmail.com
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Acked-by: Marco Elver <elver@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz> #slab
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Currently, when KASAN is combined with init-on-free behavior, the
initialization happens before KASAN's "invalid free" checks.
More importantly, a subsequent commit will want to RCU-delay the actual
SLUB freeing of an object, and we'd like KASAN to still validate
synchronously that freeing the object is permitted. (Otherwise this
change will make the existing testcase kmem_cache_invalid_free fail.)
So add a new KASAN hook that allows KASAN to pre-validate a
kmem_cache_free() operation before SLUB actually starts modifying the
object or its metadata.
Inside KASAN, this:
- moves checks from poison_slab_object() into check_slab_allocation()
- moves kasan_arch_is_ready() up into callers of poison_slab_object()
- removes "ip" argument of poison_slab_object() and __kasan_slab_free()
(since those functions no longer do any reporting)
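On the SLUB side the pre-validation might be used roughly like this (a
sketch; kasan_slab_pre_free() is the hook name assumed from this
description):

  /* in the free path, before any object metadata is modified */
  if (kasan_slab_pre_free(s, object))
          return false;   /* invalid free was reported; skip the free */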
Acked-by: Vlastimil Babka <vbabka@suse.cz> #slub
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Add a test that will create cache, allocate one object, kfree_rcu() it
and attempt to destroy it. As long as the usage of kvfree_rcu_barrier()
in kmem_cache_destroy() works correctly, there should be no warnings in
dmesg and the test should pass.
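A minimal sketch of that first test (the struct and names are
illustrative, not the actual lib/slub_kunit.c code):

  struct test_rcu { int x; struct rcu_head rcu; };

  struct kmem_cache *s = kmem_cache_create("test_kfree_rcu",
                                           sizeof(struct test_rcu),
                                           0, 0, NULL);
  struct test_rcu *p = kmem_cache_alloc(s, GFP_KERNEL);

  kfree_rcu(p, rcu);      /* queue the object for batched freeing */
  kmem_cache_destroy(s);  /* must wait via kvfree_rcu_barrier() */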
Additionally add a test_leak_destroy() test that leaks an object on
purpose and verifies that kmem_cache_destroy() catches it.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
We would like to replace call_rcu() users with kfree_rcu() where the
existing callback is just a kmem_cache_free(). However this causes
issues when the cache can be destroyed (such as due to module unload).
Currently such modules should be issuing rcu_barrier() before
kmem_cache_destroy() to have their call_rcu() callbacks processed first.
This barrier is however not sufficient for kfree_rcu() in flight due
to the batching introduced by a35d16905e ("rcu: Add basic support for
kfree_rcu() batching").
This is not a problem for kmalloc caches which are never destroyed, but
since the removal of SLOB, kfree_rcu() is also allowed for any other
cache, which might be destroyed.
In order not to complicate the API, put the responsibility for handling
outstanding kfree_rcu() in kmem_cache_destroy() itself. Use the newly
introduced kvfree_rcu_barrier() to wait before destroying the cache.
This is similar to how we issue rcu_barrier() for SLAB_TYPESAFE_BY_RCU
caches, but has to be done earlier, as the latter only needs to wait for
the empty slab pages to finish freeing, and not objects from the slab.
Users of call_rcu() with arbitrary callbacks should still issue
rcu_barrier() before destroying the cache and unloading the module, as
kvfree_rcu_barrier() is not a superset of rcu_barrier() and the
callbacks may be invoking module code or performing other actions that
are necessary for a successful unload.
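In module-exit terms, the resulting rules might be sketched like so
(cache and callback names are illustrative):

  static void __exit mymod_exit(void)
  {
          /*
           * call_rcu() callbacks that run module code still need an
           * explicit barrier before the module text can go away.
           */
          rcu_barrier();

          /*
           * Objects queued with kfree_rcu() need nothing extra:
           * kmem_cache_destroy() now waits via kvfree_rcu_barrier().
           */
          kmem_cache_destroy(obj_cache);
  }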
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Add a kvfree_rcu_barrier() function. It waits until all
in-flight pointers are freed over the RCU machinery. It does
not wait for any GP completion and is within its rights to
return immediately if there are no outstanding pointers.
This function is useful when there is a need to guarantee
that memory is fully freed before destroying memory caches.
For example, during unloading a kernel module.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
There used to be a rcu_barrier() for SLAB_TYPESAFE_BY_RCU caches in
kmem_cache_destroy() until commit 657dc2f972 ("slab: remove
synchronous rcu_barrier() call in memcg cache release path") moved it to
an asynchronous work that finishes the destroying of such caches.
The motivation for that commit was the MEMCG_KMEM integration that at
the time created and removed clones of the global slab caches together
with their cgroups, and blocking cgroups removal was unwelcome. The
implementation later changed to per-object memcg tracking using a single
cache, so there should be no more need for a fast non-blocking
kmem_cache_destroy(), which is typically only done when a module is
unloaded etc.
Going back to synchronous barrier has the following advantages:
- simpler implementation
- it's easier to test the result of kmem_cache_destroy() in a kunit test
Thus effectively revert commit 657dc2f972. It is not a 1:1 revert as
the code has changed since. The main part is that kmem_cache_release(s)
is always called from kmem_cache_destroy(), but for SLAB_TYPESAFE_BY_RCU
caches there's a rcu_barrier() first.
Suggested-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
kfence_shutdown_cache() is called under slab_mutex when the cache is
destroyed synchronously, and outside slab_mutex during the delayed
destruction of SLAB_TYPESAFE_BY_RCU caches.
It seems it should always be safe to call it outside of slab_mutex so we
can just move the call to kmem_cache_release(), which is called outside.
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
kmem_cache_destroy() includes removing the associated sysfs and debugfs
directories, and the cache from the list of caches that appears in
/proc/slabinfo. Currently this might not happen immediately when:
- the cache is SLAB_TYPESAFE_BY_RCU and the cleanup is delayed,
including the directories removal
- __kmem_cache_shutdown() fails due to outstanding objects - the
directories remain indefinitely
When a cache is recreated with the same name, such as due to module
unload followed by a load, the directories will fail to be recreated for
the new instance of the cache due to the old directories being present.
The cache will also appear twice in /proc/slabinfo.
While we want to convert the SLAB_TYPESAFE_BY_RCU cleanup to be
synchronous again, the second point remains. So let's fix this first and
have the directories and slabinfo removed immediately in
kmem_cache_destroy() and regardless of __kmem_cache_shutdown() success.
This should not make debugging harder if __kmem_cache_shutdown() fails,
because a detailed report of outstanding objects is printed into dmesg
already due to the failure.
Also simplify kmem_cache_release() sysfs handling by using
__is_defined(SLAB_SUPPORTS_SYSFS).
Note the resulting code in kmem_cache_destroy() is a bit ugly but will
be further simplified - this is in order to make small bisectable steps.
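The sysfs handling might then reduce to something like this (a sketch,
assuming the helpers shown here exist as named):

  static void kmem_cache_release(struct kmem_cache *s)
  {
          if (__is_defined(SLAB_SUPPORTS_SYSFS) && slab_state >= FULL)
                  sysfs_slab_release(s);
          else
                  slab_kmem_cache_release(s);
  }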
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
There's only one caller of shutdown_cache() so move the code into it.
Preparatory patch for further changes, no functional change.
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Revert commit 8014c46ad9 ("slub: use alloc_pages_node() in alloc_slab_page()").
The patch disabled the numa policy support in the slab allocator. It
did not consider that alloc_pages() uses memory policies but
alloc_pages_node() does not.
As a result of this patch slab memory allocations are no longer spread via
interleave policy across all available NUMA nodes on bootup. Instead
all slab memory is allocated close to the boot processor. This leads to
an imbalance of memory accesses on NUMA systems.
Also applications using MPOL_INTERLEAVE as a memory policy will no longer
spread slab allocations over all nodes in the interleave set but allocate
memory locally. This may also result in unbalanced allocations
on a single numa node.
SLUB does not apply memory policies to individual object allocations.
However, it relies on the page allocator's support of memory policies
through alloc_pages() to do the NUMA memory allocations on a per
folio or page level. SLUB also applies memory policies when retrieving
partially allocated slab pages from the partial list.
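The distinction the revert restores in alloc_slab_page() is essentially
this (simplified):

  if (node == NUMA_NO_NODE)
          /* obeys the current task's memory policy */
          folio = (struct folio *)alloc_pages(flags, order);
  else
          /* allocates on the given node, ignoring mempolicy */
          folio = (struct folio *)__alloc_pages_node(node, flags, order);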
Fixes: 8014c46ad9 ("slub: use alloc_pages_node() in alloc_slab_page()")
Signed-off-by: Christoph Lameter <cl@gentwo.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Depending on how remote_node_defrag_ratio is configured, allocations can
end up in this path as a result of the local node being OOM, despite the
allocation overall being unconstrained (node == -1).
When we print a warning, printing the current CPU makes that situation
more clear (i.e., you can immediately see which node's OOM status
matters for the allocation at hand).
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Merge tag 'bcachefs-2024-08-24' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
- assorted syzbot fixes
- some upgrade fixes for old (pre 1.0) filesystems
- fix for moving data off a device that was switched to durability=0
after data had been written to it.
- nocow deadlock fix
- fix for new rebalance_work accounting
* tag 'bcachefs-2024-08-24' of git://evilpiepirate.org/bcachefs: (28 commits)
bcachefs: Fix rebalance_work accounting
bcachefs: Fix failure to flush moves before sleeping in copygc
bcachefs: don't use rht_bucket() in btree_key_cache_scan()
bcachefs: add missing inode_walker_exit()
bcachefs: clear path->should_be_locked in bch2_btree_key_cache_drop()
bcachefs: Fix double assignment in check_dirent_to_subvol()
bcachefs: Fix refcounting in discard path
bcachefs: Fix compat issue with old alloc_v4 keys
bcachefs: Fix warning in bch2_fs_journal_stop()
fs/super.c: improve get_tree() error message
bcachefs: Fix missing validation in bch2_sb_journal_v2_validate()
bcachefs: Fix replay_now_at() assert
bcachefs: Fix locking in bch2_ioc_setlabel()
bcachefs: fix failure to relock in btree_node_fill()
bcachefs: fix failure to relock in bch2_btree_node_mem_alloc()
bcachefs: unlock_long() before resort in journal replay
bcachefs: fix missing bch2_err_str()
bcachefs: fix time_stats_to_text()
bcachefs: Fix bch2_bucket_gens_init()
bcachefs: Fix bch2_trigger_alloc assert
...
Merge tag '6.11-rc5-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- query directory flex array fix
- fix potential null ptr reference in open
- fix error message in some open cases
- two minor cleanups
* tag '6.11-rc5-server-fixes' of git://git.samba.org/ksmbd:
smb/server: update misguided comment of smb2_allocate_rsp_buf()
smb/server: remove useless assignment of 'file_present' in smb2_open()
smb/server: fix potential null-ptr-deref of lease_ctx_info in smb2_open()
smb/server: fix return value of smb2_open()
ksmbd: the buffer of smb2 query dir response has at least 1 byte
Merge tag 's390-6.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Fix KASLR base offset to account for symbol offsets in the vmlinux
ELF file, preventing tool breakages like the drgn debugger
- Fix potential memory corruption of physmem_info during kernel
physical address randomization
- Fix potential memory corruption due to overlap between the relocated
lowcore and identity mapping by correctly reserving lowcore memory
- Fix performance regression and avoid randomizing identity mapping
base by default
- Fix unnecessary delay of AP bus binding complete uevent to prevent
startup lag in KVM guests using AP
* tag 's390-6.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/boot: Fix KASLR base offset off by __START_KERNEL bytes
s390/boot: Avoid possible physmem_info segment corruption
s390/ap: Refine AP bus bindings complete processing
s390/mm: Pin identity mapping base to zero
s390/mm: Prevent lowcore vs identity mapping overlap
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"The important core fix is another tweak to our discard discovery
issues. The off by 512 in logical block count seems bad, but in fact
the inline was only ever used in debug prints, which is why no-one
noticed"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sd: Do not attempt to configure discard unless LBPME is set
scsi: MAINTAINERS: Add header files to SCSI SUBSYSTEM
scsi: ufs: qcom: Add UFSHCD_QUIRK_BROKEN_LSDBS_CAP for SM8550 SoC
scsi: ufs: core: Add a quirk for handling broken LSDBS field in controller capabilities register
scsi: core: Fix the return value of scsi_logical_block_count()
scsi: MAINTAINERS: Update HiSilicon SAS controller driver maintainer