mirror of
https://github.com/torvalds/linux.git
synced 2026-03-08 03:44:45 +01:00
476 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
4cff5c05e0 |
mm.git review status for linus..mm-stable
Everything:
Total patches: 325
Reviews/patch: 1.39
Reviewed rate: 72%
Excluding DAMON:
Total patches: 262
Reviews/patch: 1.63
Reviewed rate: 82%
Excluding DAMON and zram:
Total patches: 248
Reviews/patch: 1.72
Reviewed rate: 86%
- The 14 patch series "powerpc/64s: do not re-activate batched TLB
flush" from Alexander Gordeev makes arch_{enter|leave}_lazy_mmu_mode()
nest properly.
It adds a generic enter/leave layer and switches architectures to use
it. Various hacks were removed in the process.
- The 7 patch series "zram: introduce compressed data writeback" from
Richard Chang and Sergey Senozhatsky implements data compression for
zram writeback.
- The 8 patch series "mm: folio_zero_user: clear page ranges" from David
Hildenbrand adds clearing of contiguous page ranges for hugepages.
Large improvements during demand faulting are demonstrated.
- The 2 patch series "memcg cleanups" from Chen Ridong tideis up some
memcg code.
- The 12 patch series "mm/damon: introduce {,max_}nr_snapshots and
tracepoint for damos stats" from SeongJae Park improves DAMOS stat's
provided information, deterministic control, and readability.
- The 3 patch series "selftests/mm: hugetlb cgroup charging: robustness
fixes" from Li Wang fixes a few issues in the hugetlb cgroup charging
selftests.
- The 5 patch series "Fix va_high_addr_switch.sh test failure - again"
from Chunyu Hu addresses several issues in the va_high_addr_switch test.
- The 5 patch series "mm/damon/tests/core-kunit: extend existing test
scenarios" from Shu Anzai improves the KUnit test coverage for DAMON.
- The 2 patch series "mm/khugepaged: fix dirty page handling for
MADV_COLLAPSE" from Shivank Garg fixes a glitch in khugepaged which was
causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN.
- The 29 patch series "arch, mm: consolidate hugetlb early reservation"
from Mike Rapoport reworks and consolidates a pile of straggly code
related to reservation of hugetlb memory from bootmem and creation of
CMA areas for hugetlb.
- The 9 patch series "mm: clean up anon_vma implementation" from Lorenzo
Stoakes cleans up the anon_vma implementation in various ways.
- The 3 patch series "tweaks for __alloc_pages_slowpath()" from
Vlastimil Babka does a little streamlining of the page allocator's
slowpath code.
- The 8 patch series "memcg: separate private and public ID namespaces"
from Shakeel Butt cleans up the memcg ID code and prevents the
internal-only private IDs from being exposed to userspace.
- The 6 patch series "mm: hugetlb: allocate frozen gigantic folio" from
Kefeng Wang cleans up the allocation of frozen folios and avoids some
atomic refcount operations.
- The 11 patch series "mm/damon: advance DAMOS-based LRU sorting" from
SeongJae Park improves DAMOS's movement of memory betewwn the active and
inactive LRUs and adds auto-tuning of the ratio-based quotas and of
monitoring intervals.
- The 18 patch series "Support page table check on PowerPC" from Andrew
Donnellan makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc.
- The 3 patch series "nodemask: align nodes_and{,not} with underlying
bitmap ops" from Yury Norov makes nodes_and() and nodes_andnot()
propagate the return values from the underlying bit operations, enabling
some cleanup in calling code.
- The 5 patch series "mm/damon: hide kdamond and kdamond_lock from API
callers" from SeongJae Park cleans up some DAMON internal interfaces.
- The 4 patch series "mm/khugepaged: cleanups and scan limit fix" from
Shivank Garg does some cleanup work in khupaged and fixes a scan limit
accounting issue.
- The 24 patch series "mm: balloon infrastructure cleanups" from David
Hildenbrand goes to town on the balloon infrastructure and its page
migration function. Mainly cleanups, also some locking simplification.
- The 2 patch series "mm/vmscan: add tracepoint and reason for
kswapd_failures reset" from Jiayuan Chen adds additional tracepoints to
the page reclaim code.
- The 3 patch series "Replace wq users and add WQ_PERCPU to
alloc_workqueue() users" from Marco Crivellari is part of Marco's
kernel-wide migration from the legacy workqueue APIs over to the
preferred unbound workqueues.
- The 9 patch series "Various mm kselftests improvements/fixes" from
Kevin Brodsky provides various unrelated improvements/fixes for the mm
kselftests.
- The 5 patch series "mm: accelerate gigantic folio allocation" from
Kefeng Wang greatly speeds up gigantic folio allocation, mainly by
avoiding unnecessary work in pfn_range_valid_contig().
- The 5 patch series "selftests/damon: improve leak detection and wss
estimation reliability" from SeongJae Park improves the reliability of
two of the DAMON selftests.
- The 8 patch series "mm/damon: cleanup kdamond, damon_call(), damos
filter and DAMON_MIN_REGION" from SeongJae Park does some cleanup work
in the core DAMON code.
- The 8 patch series "Docs/mm/damon: update intro, modules, maintainer
profile, and misc" from SeongJae Park performs maintenance work on the
DAMON documentation.
- The 10 patch series "mm: add and use vma_assert_stabilised() helper"
from Lorenzo Stoakes refactors and cleans up the core VMA code. The
main aim here is to be able to use the mmap write lock's lockdep state
to perform various assertions regarding the locking which the VMA code
requires.
- The 19 patch series "mm, swap: swap table phase II: unify swapin use"
from Kairui Song removes some old swap code (swap cache bypassing and
swap synchronization) which wasn't working very well. Various other
cleanups and simplifications were made. The end result is a 20% speedup
in one benchmark.
- The 8 patch series "enable PT_RECLAIM on more 64-bit architectures"
from Qi Zheng makes PT_RECLAIM available on 64-bit alpha, loongarch,
mips, parisc, um, Various cleanups were performed along the way.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY1HfAAKCRDdBJ7gKXxA
jqhZAP9H8ZlKKqCEgnr6U5XXmJ63Ep2FDQpl8p35yr9yVuU9+gEAgfyWiJ43l1fP
rT0yjsUW3KQFBi/SEA3R6aYarmoIBgI=
=+HLt
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "powerpc/64s: do not re-activate batched TLB flush" makes
arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev)
It adds a generic enter/leave layer and switches architectures to use
it. Various hacks were removed in the process.
- "zram: introduce compressed data writeback" implements data
compression for zram writeback (Richard Chang and Sergey Senozhatsky)
- "mm: folio_zero_user: clear page ranges" adds clearing of contiguous
page ranges for hugepages. Large improvements during demand faulting
are demonstrated (David Hildenbrand)
- "memcg cleanups" tidies up some memcg code (Chen Ridong)
- "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos
stats" improves DAMOS stat's provided information, deterministic
control, and readability (SeongJae Park)
- "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few
issues in the hugetlb cgroup charging selftests (Li Wang)
- "Fix va_high_addr_switch.sh test failure - again" addresses several
issues in the va_high_addr_switch test (Chunyu Hu)
- "mm/damon/tests/core-kunit: extend existing test scenarios" improves
the KUnit test coverage for DAMON (Shu Anzai)
- "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a
glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to
transiently return -EAGAIN (Shivank Garg)
- "arch, mm: consolidate hugetlb early reservation" reworks and
consolidates a pile of straggly code related to reservation of
hugetlb memory from bootmem and creation of CMA areas for hugetlb
(Mike Rapoport)
- "mm: clean up anon_vma implementation" cleans up the anon_vma
implementation in various ways (Lorenzo Stoakes)
- "tweaks for __alloc_pages_slowpath()" does a little streamlining of
the page allocator's slowpath code (Vlastimil Babka)
- "memcg: separate private and public ID namespaces" cleans up the
memcg ID code and prevents the internal-only private IDs from being
exposed to userspace (Shakeel Butt)
- "mm: hugetlb: allocate frozen gigantic folio" cleans up the
allocation of frozen folios and avoids some atomic refcount
operations (Kefeng Wang)
- "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement
of memory betewwn the active and inactive LRUs and adds auto-tuning
of the ratio-based quotas and of monitoring intervals (SeongJae Park)
- "Support page table check on PowerPC" makes
CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan)
- "nodemask: align nodes_and{,not} with underlying bitmap ops" makes
nodes_and() and nodes_andnot() propagate the return values from the
underlying bit operations, enabling some cleanup in calling code
(Yury Norov)
- "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up
some DAMON internal interfaces (SeongJae Park)
- "mm/khugepaged: cleanups and scan limit fix" does some cleanup work
in khupaged and fixes a scan limit accounting issue (Shivank Garg)
- "mm: balloon infrastructure cleanups" goes to town on the balloon
infrastructure and its page migration function. Mainly cleanups, also
some locking simplification (David Hildenbrand)
- "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds
additional tracepoints to the page reclaim code (Jiayuan Chen)
- "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is
part of Marco's kernel-wide migration from the legacy workqueue APIs
over to the preferred unbound workqueues (Marco Crivellari)
- "Various mm kselftests improvements/fixes" provides various unrelated
improvements/fixes for the mm kselftests (Kevin Brodsky)
- "mm: accelerate gigantic folio allocation" greatly speeds up gigantic
folio allocation, mainly by avoiding unnecessary work in
pfn_range_valid_contig() (Kefeng Wang)
- "selftests/damon: improve leak detection and wss estimation
reliability" improves the reliability of two of the DAMON selftests
(SeongJae Park)
- "mm/damon: cleanup kdamond, damon_call(), damos filter and
DAMON_MIN_REGION" does some cleanup work in the core DAMON code
(SeongJae Park)
- "Docs/mm/damon: update intro, modules, maintainer profile, and misc"
performs maintenance work on the DAMON documentation (SeongJae Park)
- "mm: add and use vma_assert_stabilised() helper" refactors and cleans
up the core VMA code. The main aim here is to be able to use the mmap
write lock's lockdep state to perform various assertions regarding
the locking which the VMA code requires (Lorenzo Stoakes)
- "mm, swap: swap table phase II: unify swapin use" removes some old
swap code (swap cache bypassing and swap synchronization) which
wasn't working very well. Various other cleanups and simplifications
were made. The end result is a 20% speedup in one benchmark (Kairui
Song)
- "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM
available on 64-bit alpha, loongarch, mips, parisc, and um. Various
cleanups were performed along the way (Qi Zheng)
* tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits)
mm/memory: handle non-split locks correctly in zap_empty_pte_table()
mm: move pte table reclaim code to memory.c
mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE
mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config
um: mm: enable MMU_GATHER_RCU_TABLE_FREE
parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE
alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h
mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles
zsmalloc: make common caches global
mm: add SPDX id lines to some mm source files
mm/zswap: use %pe to print error pointers
mm/vmscan: use %pe to print error pointers
mm/readahead: fix typo in comment
mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file()
mm: refactor vma_map_pages to use vm_insert_pages
mm/damon: unify address range representation with damon_addr_range
mm/cma: replace snprintf with strscpy in cma_new_area
...
|
||
|
|
148f95f75c |
slab updates for 7.0
-----BEGIN PGP SIGNATURE----- iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmmK68UbFIAAAAAABAAO bWFudTIsMi41KzEuMTEsMiwyAAoJELvgsHXSRYiamM8H/0eOKSvZG/C/HdTm36cy pVjOjgX9KmlHoeH1dOMjqgL2KfOIBis8j1GY0Q/qF1a86uzQa6uuz4XdmJeTUkEE YfzwOdaLIR0U6R/gIH9YPfyU9h3VBLUNtotDculntSO3ZgwY5QUHQHz+ROnVG5SU MSQ2oshSRkh06LRIlvbd0kLax8vZy8UjfYPonF33+XRya17nIY6V2DvqC0MDuEcM jWvbQfm5HTamTAlSV4bmJw+U/FehEdpC4U0ulsAtQILGpJvHCwqDGCNQRFkzcsaM yhi1JLFCZcoHqbQycZMNAypPERfIp8O5thSU6xU2AP/cNl2scR/7/MSuWOvjKBv4 pKU= =u52A -----END PGP SIGNATURE----- Merge tag 'slab-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - The percpu sheaves caching layer was introduced as opt-in in 6.18 and now we enable it for all caches and remove the previous cpu (partial) slab caching mechanism. Besides the lower locking overhead and much more likely fastpath when freeing, this removes the rather complicated code related to the cpu slab lockless fastpaths (using this_cpu_try_cmpxchg128/64) and all its complications for PREEMPT_RT or kmalloc_nolock(). The lockless slab freelist+counters update operation using try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects, and to allow flushing objects from sheaves to slabs mostly without the node list_lock (Vlastimil Babka) - Eliminate slabobj_ext metadata overhead when possible. Instead of using kmalloc() to allocate the array for memcg and/or allocation profiling tag pointers, use leftover space in a slab or per-object padding due to alignment (Harry Yoo) - Various followup improvements to the above (Hao Li) * tag 'slab-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (39 commits) slub: let need_slab_obj_exts() return false if SLAB_NO_OBJ_EXT is set mm/slab: only allow SLAB_OBJ_EXT_IN_OBJ for unmergeable caches mm/slab: place slabobj_ext metadata in unused space within s->size mm/slab: move [__]ksize and slab_ksize() to mm/slub.c mm/slab: save memory by allocating slabobj_ext array from leftover mm/memcontrol,alloc_tag: handle slabobj_ext access under KASAN poison mm/slab: use stride to access slabobj_ext mm/slab: abstract slabobj_ext access via new slab_obj_ext() helper ext4: specify the free pointer offset for ext4_inode_cache mm/slab: allow specifying free pointer offset when using constructor mm/slab: use unsigned long for orig_size to ensure proper metadata align slub: clarify object field layout comments mm/slab: avoid allocating slabobj_ext array from its own slab slub: avoid list_lock contention from __refill_objects_any() mm/slub: cleanup and repurpose some stat items mm/slub: remove DEACTIVATE_TO_* stat items slab: remove frozen slab checks from __slab_free() slab: update overview comments slab: refill sheaves from all nodes slab: remove unused PREEMPT_RT specific macros ... |
||
|
|
9c8c02df3f |
mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE
The PT_RECLAIM can work on all architectures that support MMU_GATHER_RCU_TABLE_FREE, except for those that have selected HAVE_ARCH_TLB_REMOVE_TABLE,so make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE && !HAVE_ARCH_TLB_REMOVE_TABLE. BTW, change PT_RECLAIM to be enabled by default, since nobody should want to turn it off. Link: https://lkml.kernel.org/r/83b034810935a9ff18e425b085e065bb0acb28f3.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
086498aed3 |
mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config
For architectures that define __HAVE_ARCH_TLB_REMOVE_TABLE, the page tables at the pmd/pud level are generally not of struct ptdesc type, and do not have pt_rcu_head member, thus these architectures cannot support PT_RECLAIM. In preparation for enabling PT_RECLAIM on more architectures, convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config, so that we can make conditional judgments in Kconfig. Link: https://lkml.kernel.org/r/5ebfa3d4b56e63c6906bda5eccaa9f7194d3a86b.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Tested-by: Andreas Larsson <andreas@gaisler.com> [sparc, UP&SMP] Acked-by: Andreas Larsson <andreas@gaisler.com> [sparc] Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
1421758055 |
mm: rename CONFIG_MEMORY_BALLOON -> CONFIG_BALLOON
Let's make it consistent with the naming of the files but also with the naming of CONFIG_BALLOON_MIGRATION. While at it, add a "/* CONFIG_BALLOON */". Link: https://lkml.kernel.org/r/20260119230133.3551867-24-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
cd8e95d80b |
mm: rename CONFIG_BALLOON_COMPACTION to CONFIG_BALLOON_MIGRATION
While compaction depends on migration, the other direction is not the case. So let's make it clearer that this is all about migration of balloon pages. Adjust all comments/docs in the core to talk about "migration" instead of "compaction". While at it add some "/* CONFIG_BALLOON_MIGRATION */". Link: https://lkml.kernel.org/r/20260119230133.3551867-23-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7cf3318a25 |
mm/kconfig: make BALLOON_COMPACTION depend on MIGRATION
Migration support for balloon memory depends on MIGRATION not COMPACTION.
Compaction is simply another user of page migration.
The last dependency on compaction.c was effectively removed with commit
|
||
|
|
e323b52cf0 |
slab: remove SLUB_CPU_PARTIAL
We have removed the partial slab usage from allocation paths. Now remove the whole config option and associated code. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
ee628d9cc8 |
mm: add basic tests for lazy_mmu
Add basic KUnit tests for the generic aspects of the lazy MMU mode: ensure that it appears active when it should, depending on how enable/disable and pause/resume pairs are nested. [akpm@linux-foundation.org: export ppc64_tlb_batch and __flush_tlb_pending to modules] [ritesh.list@gmail.com: use EXPORT_SYMBOL_IF_KUNIT()] Link: https://lkml.kernel.org/r/87a4zhkt6h.ritesh.list@gmail.com [kevin.brodsky@arm.com: move MODULE_IMPORT_NS(), add comment] Link: https://lkml.kernel.org/r/20251217163812.2633648-2-kevin.brodsky@arm.com Link: https://lkml.kernel.org/r/20251215150323.2218608-15-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David S. Miller <davem@davemloft.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7303ecbfe4 |
mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE
Architectures currently opt in for implementing lazy_mmu helpers by defining __HAVE_ARCH_ENTER_LAZY_MMU_MODE. In preparation for introducing a generic lazy_mmu layer that will require storage in task_struct, let's switch to a cleaner approach: instead of defining a macro, select a CONFIG option. This patch introduces CONFIG_ARCH_HAS_LAZY_MMU_MODE and has each arch select it when it implements lazy_mmu helpers. __HAVE_ARCH_ENTER_LAZY_MMU_MODE is removed and <linux/pgtable.h> relies on the new CONFIG instead. On x86, lazy_mmu helpers are only implemented if PARAVIRT_XXL is selected. This creates some complications in arch/x86/boot/, because a few files manually undefine PARAVIRT* options. As a result <asm/paravirt.h> does not define the lazy_mmu helpers, but this breaks the build as <linux/pgtable.h> only defines them if !CONFIG_ARCH_HAS_LAZY_MMU_MODE. There does not seem to be a clean way out of this - let's just undefine that new CONFIG too. Link: https://lkml.kernel.org/r/20251215150323.2218608-7-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Acked-by: Andreas Larsson <andreas@gaisler.com> [sparc] Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
269031b15c |
x86/kaslr: Recognize all ZONE_DEVICE users as physaddr consumers
Commit |
||
|
|
31807483d3 |
mm/memory-failure: remove the selection of RAS
commit
|
||
|
|
2ec4196718 |
mm: handle poisoning of pfn without struct pages
Poison (or ECC) errors can be very common on a large size cluster. The kernel MM currently does not handle ECC errors / poison on a memory region that is not backed by struct pages. If a memory region mapped using remap_pfn_range() for example, but not added to the kernel, MM will not have associated struct pages. Add a new mechanism to handle memory failure on such memory. Make kernel MM expose a function to allow modules managing the device memory to register the device memory SPA and the address space associated it. MM maintains this information as an interval tree. On poison, MM can search for the range that the poisoned PFN belong and use the address_space to determine the mapping VMA. In this implementation, kernel MM follows the following sequence that is largely similar to the memory_failure() handler for struct page backed memory: 1. memory_failure() is triggered on reception of a poison error. An absence of struct page is detected and consequently memory_failure_pfn() is executed. 2. memory_failure_pfn() collects the processes mapped to the PFN. 3. memory_failure_pfn() sends SIGBUS to all the processes mapping the faulty PFN using kill_procs(). Note that there is one primary difference versus the handling of the poison on struct pages, which is to skip unmapping to the faulty PFN. This is done to handle the huge PFNMAP support added recently [1] that enables VM_PFNMAP vmas to map at PMD or PUD level. A poison to a PFN mapped in such as way would need breaking the PMD/PUD mapping into PTEs that will get mirrored into the S2. This can greatly increase the cost of table walks and have a major performance impact. Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1] Link: https://lkml.kernel.org/r/20251102184434.2406-3-ankita@nvidia.com Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Cc: Aniket Agashe <aniketa@nvidia.com> Cc: Borislav Betkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Kirti Wankhede <kwankhede@nvidia.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew R. Ochs <mochs@nvidia.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Neo Jia <cjia@nvidia.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shuai Xue <xueshuai@linux.alibaba.com> Cc: Smita Koralahalli Channabasappa <smita.koralahallichannabasappa@amd.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tarun Gupta <targupta@nvidia.com> Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Cc: Vikram Sethi <vsethi@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhi Wang <zhiw@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e97d7c5165 |
mm: shmem/tmpfs hugepage defaults config choice
Allow to override defaults for shemem and tmpfs at config time. This is consistent with how transparent hugepages can be configured. Same results can be achieved with the existing 'transparent_hugepage_shmem' and 'transparent_hugepage_tmpfs' settings in the kernel command line, but it is more convenient to define basic settings at config time instead of changing kernel command line later. Defaults for shmem and tmpfs were not changed. They are remained the same as before: 'never' for both cases. Options 'deny' and 'force' are omitted intentionally since these are special values and supposed to be used for emergencies or testing and are not expected to be permanent ones. Primary motivation for adding config option is to enable policy enforcement at build time. In large-scale production environments (Meta's for example), the kernel configuration is often maintained centrally close to the kernel code itself and owned by the kernel engineers, while boot parameters are managed independently (e.g. by provisioning systems). In such setups, the kernel build defines the supported and expected behavior in a single place, but there is no reliable or uniform control over the kernel command line options. A build-time default allows kernel integrators to enforce a predictable hugepage policy for shmem/tmpfs on a base layer, ensuring reproducible behavior and avoiding configuration drift caused by possible boot-time differences. In short, primary benefit is mostly operational: it provides a way to codify preferred policy in the kernel configuration, which is versioned, reviewed, and tested as part of the kernel build process, rather than depending on potentially variable boot parameters. [d@ilvokhin.com: v2] Link: https://lkml.kernel.org/r/aQECPpjd-fU_TC79@shell.ilvokhin.com Link: https://lkml.kernel.org/r/aPpv8sAa2sYgNu3L@shell.ilvokhin.com Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Kiryl Shutsemau <kas@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
5ba2f0a155 |
mm: introduce deferred freeing for kernel page tables
This introduces a conditional asynchronous mechanism, enabled by CONFIG_ASYNC_KERNEL_PGTABLE_FREE. When enabled, this mechanism defers the freeing of pages that are used as page tables for kernel address mappings. These pages are now queued to a work struct instead of being freed immediately. This deferred freeing allows for batch-freeing of page tables, providing a safe context for performing a single expensive operation (TLB flush) for a batch of kernel page tables instead of performing that expensive operation for each page table. Link: https://lkml.kernel.org/r/20251022082635.2462433-8-baolu.lu@linux.intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robin Murohy <robin.murphy@arm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yi Lai <yi1.lai@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
900fcf00e1 |
mm: remove the BOUNCE config option
Commit
|
||
|
|
39231e8d6b |
mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb
In the past, CONFIG_ARCH_HAS_GIGANTIC_PAGE indicated that we support runtime allocation of gigantic hugetlb folios. In the meantime it evolved into a generic way for the architecture to state that it supports gigantic hugetlb folios. In commit |
||
|
|
8804d970fa |
Summary of significant series in this pull request:
- The 3 patch series "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation. - The 4 patch series "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs. - The 2 patch series "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters. - The 3 patch series "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps. - The 2 patch series "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code. - The 11 patch series "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code. - The 5 patch series "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero. - The 3 patch series "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature. - The 10 patch series "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs. - The 2 patch series "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code. - The 7 patch series "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code. - The 7 patch series "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations. - The 11 patch series "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc. - The 3 patch series "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path. - The 5 patch series "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code. - The 2 patch series "test that rmap behaves as expected" from Wei Yang adds some rmap selftests. - The 3 patch series "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers. - The 2 patch series "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues. - The 3 patch series "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks. - The 2 patch series "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code. - The 11 patch series "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem. - The 4 patch series "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/. - The 2 patch series "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c. - The 2 patch series "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation. - The 3 patch series "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc). - The 2 patch series "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code. - The 37 patch series "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function. - The 2 patch series "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only. - The 3 patch series "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code. - The 12 patch series "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy. - The 7 patch series "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages(). - The 3 patch series "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver. - The 2 patch series "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code. - The 14 patch series "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations. - The 3 patch series "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little. - The 3 patch series "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code. - The 2 patch series "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature. - The 3 patch series "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work. - The 2 patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem. - The 2 patch series "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code. - The 10 patch series "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources. - The 5 patch series "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON. - The 7 patch series "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix. - The 2 patch series "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information. - The 2 patch series "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma. - The 2 patch series "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems. - The 6 patch series "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate. - The 2 patch series "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters. - The 2 patch series "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaN3cywAKCRDdBJ7gKXxA jtaPAQDmIuIu7+XnVUK5V11hsQ/5QtsUeLHV3OsAn4yW5/3dEQD/UddRU08ePN+1 2VRB0EwkLAdfMWW7TfiNZ+yhuoiL/AA= =4mhY -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ... |
||
|
|
af92793e52 |
slab: Introduce kmalloc_nolock() and kfree_nolock().
kmalloc_nolock() relies on ability of local_trylock_t to detect
the situation when per-cpu kmem_cache is locked.
In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
disables IRQs and marks s->cpu_slab->lock as acquired.
local_lock_is_locked(&s->cpu_slab->lock) returns true when
slab is in the middle of manipulating per-cpu cache
of that specific kmem_cache.
kmalloc_nolock() can be called from any context and can re-enter
into ___slab_alloc():
kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
kmalloc_nolock() -> ___slab_alloc(cache_B)
or
kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
kmalloc_nolock() -> ___slab_alloc(cache_B)
Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
can be acquired without a deadlock before invoking the function.
If that specific per-cpu kmem_cache is busy the kmalloc_nolock()
retries in a different kmalloc bucket. The second attempt will
likely succeed, since this cpu locked different kmem_cache.
Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
per-cpu rt_spin_lock is locked by current _task_. In this case
re-entrance into the same kmalloc bucket is unsafe, and
kmalloc_nolock() tries a different bucket that is most likely is
not locked by the current task. Though it may be locked by a
different task it's safe to rt_spin_lock() and sleep on it.
Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
immediately if called from hard irq or NMI in PREEMPT_RT.
kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
and (in_nmi() or in PREEMPT_RT).
SLUB_TINY config doesn't use local_lock_is_locked() and relies on
spin_trylock_irqsave(&n->list_lock) to allocate,
while kfree_nolock() always defers to irq_work.
Note, kfree_nolock() must be called _only_ for objects allocated
with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
will miss kmemleak/kfence book keeping and will cause false positives.
large_kmalloc is not supported by either kmalloc_nolock()
or kfree_nolock().
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
||
|
|
cde31ecdd1 |
mm/memory-failure: don't select MEMORY_ISOLATION
We added that "select MEMORY_ISOLATION" in commit |
||
|
|
f8f03eb5f0 |
mm: stop making SPARSEMEM_VMEMMAP user-selectable
Patch series "mm: remove nth_page()", v2. As discussed recently with Linus, nth_page() is just nasty and we would like to remove it. To recap, the reason we currently need nth_page() within a folio is because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the memmap is allocated per memory section. While buddy allocations cannot cross memory section boundaries, hugetlb and dax folios can. So crossing a memory section means that "page++" could do the wrong thing. Instead, nth_page() on these problematic configs always goes from page->pfn, to the go from (++pfn)->page, which is rather nasty. Likely, many people have no idea when nth_page() is required and when it might be dropped. We refer to such problematic PFN ranges and "non-contiguous pages". If we only deal with "contiguous pages", there is not need for nth_page(). Besides that "obvious" folio case, we might end up using nth_page() within CMA allocations (again, could span memory sections), and in one corner case (kfence) when processing memblock allocations (again, could span memory sections). So let's handle all that, add sanity checks, and remove nth_page(). Patch #1 -> #5 : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups Patch #6 -> #13 : disallow folios to have non-contiguous pages Patch #14 -> #20 : remove nth_page() usage within folios Patch #22 : disallow CMA allocations of non-contiguous pages Patch #23 -> #33 : sanity+check + remove nth_page() usage within SG entry Patch #34 : sanity-check + remove nth_page() usage in unpin_user_page_range_dirty_lock() Patch #35 : remove nth_page() in kfence Patch #36 : adjust stale comment regarding nth_page Patch #37 : mm: remove nth_page() A lot of this is inspired from the discussion at [1] between Linus, Jason and me, so cudos to them. This patch (of 37): In an ideal world, we wouldn't have to deal with SPARSEMEM without SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is considered too costly and consequently not supported. However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP, let's forbid the user to disable VMEMMAP: just like we already do for arm64, s390 and x86. So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without SPARSEMEM_VMEMMAP. This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone for loongarch, powerpc, riscv and sparc. All architectures only enable SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big downside to using the VMEMMAP (quite the contrary). This is a preparation for not supporting (1) folio sizes that exceed a single memory section (2) CMA allocations of non-contiguous page ranges in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit possible impact as much as possible (e.g., gigantic hugetlb page allocations suddenly fails). Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1] Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexandru Elisei <alexandru.elisei@arm.com> Cc: Alex Dubov <oakad@yahoo.com> Cc: Alex Willamson <alex.williamson@redhat.com> Cc: Bart van Assche <bvanassche@acm.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Brett Creeley <brett.creeley@amd.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Doug Gilbert <dgilbert@interlog.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ingo Molnar <mingo@redhat.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Lars Persson <lars.persson@axis.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Maxim Levitky <maximlevitsky@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Niklas Cassel <cassel@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Robin Murohy <robin.murphy@arm.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
2ccd9fecd9 |
mm: remove unused zpool layer
With zswap using zsmalloc directly, there are no more in-tree users of this code. Remove it. With zpool gone, zsmalloc is now always a simple dependency and no longer something the user needs to configure. Hide CONFIG_ZSMALLOC from the user and have zswap and zram pull it in as needed. Link: https://lkml.kernel.org/r/20250829162212.208258-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: SeongJae Park <sj@kernel.org> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitaly.wool@konsulko.se> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
2d8bd8049e |
mm: add persistent huge zero folio
Many places in the kernel need to zero out larger chunks, but the maximum segment that can be zeroed out at a time by ZERO_PAGE is limited by PAGE_SIZE. This is especially annoying in block devices and filesystems where multiple ZERO_PAGEs are attached to the bio in different bvecs. With multipage bvec support in block layer, it is much more efficient to send out larger zero pages as a part of single bvec. This concern was raised during the review of adding Large Block Size support to XFS[1][2]. Usually huge_zero_folio is allocated on demand, and it will be deallocated by the shrinker if there are no users of it left. At moment, huge_zero_folio infrastructure refcount is tied to the process lifetime that created it. This might not work for bio layer as the completions can be async and the process that created the huge_zero_folio might no longer be alive. And, one of the main points that came up during discussion is to have something bigger than zero page as a drop-in replacement. Add a config option PERSISTENT_HUGE_ZERO_FOLIO that will result in allocating the huge zero folio during early init and never free the memory by disabling the shrinker. This makes using the huge_zero_folio without having to pass any mm struct and does not tie the lifetime of the zero folio to anything, making it a drop-in replacement for ZERO_PAGE. If PERSISTENT_HUGE_ZERO_FOLIO config option is enabled, then mm_get_huge_zero_folio() will simply return the allocated page instead of dynamically allocating a new PMD page. Use this option carefully in resource constrained systems as it uses one full PMD sized page for zeroing purposes. [1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/ [2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/ Link: https://lkml.kernel.org/r/20250811084113.647267-4-kernel@pankajraghav.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Co-developed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Kiryl Shutsemau <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
4c89792ea0 |
mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
... and hide it behind a kconfig option. There is really no need for any
!xen code to perform this check.
The naming is a bit off: we want to find the "normal" page when a PTE was
marked "special". So it's really not "finding a special" page.
Improve the documentation, and add a comment in the code where XEN ends up
performing the pte_mkspecial() through a hypercall. More details can be
found in commit
|
||
|
|
9a4f90e246 |
mm: remove mm/io-mapping.c
This is dead code, which was used from commit |
||
|
|
d438d27341 |
mm: remove devmap related functions and page table bits
Now that DAX and all other reference counts to ZONE_DEVICE pages are managed normally there is no need for the special devmap PTE/PMD/PUD page table bits. So drop all references to these, freeing up a software defined page table bit on architectures supporting it. Link: https://lkml.kernel.org/r/6389398c32cc9daa3dfcaa9f79c7972525d310ce.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Will Deacon <will@kernel.org> # arm64 Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chunyan Zhang <zhang.lyra@gmail.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
59b5ed409d |
mm/percpu: conditionally define _shared_alloc_tag via CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU
Recently discovered this entry while checking kallsyms on ARM64: ffff800083e509c0 D _shared_alloc_tag If ARCH_NEEDS_WEAK_PER_CPU is not defined(it is only defined for s390 and alpha architectures), there's no need to statically define the percpu variable _shared_alloc_tag. Therefore, we need to implement isolation for this purpose. When building the core kernel code for s390 or alpha architectures, ARCH_NEEDS_WEAK_PER_CPU remains undefined (as it is gated by #if defined(MODULE)). However, when building modules for these architectures, the macro is explicitly defined. Therefore, we remove all instances of ARCH_NEEDS_WEAK_PER_CPU from the code and introduced CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU to replace the relevant logic. We can now conditionally define the perpcu variable _shared_alloc_tag based on CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU. This allows architectures (such as s390/alpha) that require weak definitions for percpu variables in modules to include the definition, while others can omit it via compile-time exclusion. Link: https://lkml.kernel.org/r/20250618015809.1235761-1-hao.ge@linux.dev Signed-off-by: Hao Ge <gehao@kylinos.cn> Suggested-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Chistoph Lameter <cl@linux.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
3800d55250 |
mm: rename CONFIG_PAGE_BLOCK_ORDER to CONFIG_PAGE_BLOCK_MAX_ORDER
The config is in fact an additional upper limit of pageblock_order, so rename it to avoid confusion. Link: https://lkml.kernel.org/r/20250604211427.1590859-1-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Juan Yescas <jyescas@google.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Isaac J. Manjarres" <isaacmanjarres@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
bafa31a1ce |
mm: Kconfig: use verb *use* in plural form in description
*workloads* is plural requiring the verb *use* in plural form.
Link: https://lkml.kernel.org/r/20250603061303.479551-2-pmenzel@molgen.mpg.de
Fixes:
|
||
|
|
b7191581a9 |
LoongArch changes for v6.16
1, Adjust the 'make install' operation; 2, Support SCHED_MC (Multi-core scheduler); 3, Enable ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS; 4, Enable HAVE_ARCH_STACKLEAK; 5, Increase max supported CPUs up to 2048; 6, Introduce the numa_memblks conversion; 7, Add PWM controller nodes in dts; 8, Some bug fixes and other small changes. -----BEGIN PGP SIGNATURE----- iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmhEEh8WHGNoZW5odWFj YWlAa2VybmVsLm9yZwAKCRAChivD8uImenIIEACT43zpQD6lApWZ8tk8dVYXK6Lb pKqXLIhAe9g9cSDC6QfwUTwOjv5RZxJBXgKkQfYSSd3ZEZ6hQIqb2JVYMMpNy6WA I5fJ75pr/E6RMtueg5laYxJ/zLJMM5F2qK/Moff7JGeUuDU5BJEG0R7j7KEhtgo0 5Cm+ktS7waJhNuJposa/Eay6IDWH9Of2UoNboPWmMfLvuZbIp9f1HpjwpZF6xnZ/ xeIhVOcJw8Xft/0xrcx1xZt/nYnJflNwLgaa7/CGa2q1X7yxVarfMUcY63x6cZXX 7jLppejqptPqixf+JFRGKBr8Y82Vu/1TW7XlKaywKrjHPZ/FXJsU682x6SQy2uac nOiu38qCc5DA9pKB57zxL/5DqTn+yAPGpnC//HvF9/koWCOI6lA5bYzSbBwmkpm7 DS7mVKtfPTicUs5fZ7DsHRjXuyqs6i1+s7wNaTkgG4Y/gBNAffHa36ON5jM5Wzzz WJGcMEIG0Or6tt/yUVrzY390aYzIOkyAoWJgUQ+tPCXecyjnor9i5P3ZC66dIm2r UTbC7mX4+bBa/MA1xKKMmCHdJtMzO2A+tduZRf/lO8iYs4c6IJBBaXhW9xHxf8tT Av/sXnShJWsrRW40g6Kp85ghDcLBS6SLBm5Ux5OLd2byWIeeF/R2BHdF6qb4V3j6 Y2ZHk1ewvLgB0BQ/ww== =FH8x -----END PGP SIGNATURE----- Merge tag 'loongarch-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson Pull LoongArch updates from Huacai Chen: - Adjust the 'make install' operation - Support SCHED_MC (Multi-core scheduler) - Enable ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS - Enable HAVE_ARCH_STACKLEAK - Increase max supported CPUs up to 2048 - Introduce the numa_memblks conversion - Add PWM controller nodes in dts - Some bug fixes and other small changes * tag 'loongarch-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: platform/loongarch: laptop: Unregister generic_sub_drivers on exit platform/loongarch: laptop: Add backlight power control support platform/loongarch: laptop: Get brightness setting from EC on probe LoongArch: dts: Add PWM support to Loongson-2K2000 LoongArch: dts: Add PWM support to Loongson-2K1000 LoongArch: dts: Add PWM support to Loongson-2K0500 LoongArch: vDSO: Correctly use asm parameters in syscall wrappers LoongArch: Fix panic caused by NULL-PMD in huge_pte_offset() LoongArch: Preserve firmware configuration when desired LoongArch: Avoid using $r0/$r1 as "mask" for csrxchg LoongArch: Introduce the numa_memblks conversion LoongArch: Increase max supported CPUs up to 2048 LoongArch: Enable HAVE_ARCH_STACKLEAK LoongArch: Enable ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS LoongArch: Add SCHED_MC (Multi-core scheduler) support LoongArch: Add some annotations in archhelp LoongArch: Using generic scripts/install.sh in `make install` LoongArch: Add a default install.sh |
||
|
|
e13e7922d0 |
mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order
Problem: On large page size configurations (16KiB, 64KiB), the CMA
alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
and this causes the CMA reservations to be larger than necessary. This
means that system will have less available MIGRATE_UNMOVABLE and
MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fallback to them.
The CMA_MIN_ALIGNMENT_BYTES increases because it depends on MAX_PAGE_ORDER
which depends on ARCH_FORCE_MAX_ORDER. The value of ARCH_FORCE_MAX_ORDER
increases on 16k and 64k kernels.
For example, in ARM, the CMA alignment requirement when:
- CONFIG_ARCH_FORCE_MAX_ORDER default value is used
- CONFIG_TRANSPARENT_HUGEPAGE is set:
PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
-----------------------------------------------------------------------
4KiB | 10 | 9 | 4KiB * (2 ^ 9) = 2MiB
16Kib | 11 | 11 | 16KiB * (2 ^ 11) = 32MiB
64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB
There are some extreme cases for the CMA alignment requirement when:
- CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
- CONFIG_TRANSPARENT_HUGEPAGE is NOT set:
- CONFIG_HUGETLB_PAGE is NOT set
PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
------------------------------------------------------------------------
4KiB | 15 | 15 | 4KiB * (2 ^ 15) = 128MiB
16Kib | 13 | 13 | 16KiB * (2 ^ 13) = 128MiB
64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB
This affects the CMA reservations for the drivers. If a driver in a
4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal
reservation has to be 32MiB due to the alignment requirements:
reserved-memory {
...
cma_test_reserve: cma_test_reserve {
compatible = "shared-dma-pool";
size = <0x0 0x400000>; /* 4 MiB */
...
};
};
reserved-memory {
...
cma_test_reserve: cma_test_reserve {
compatible = "shared-dma-pool";
size = <0x0 0x2000000>; /* 32 MiB */
...
};
};
Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that allows to set the
page block order in all the architectures. The maximum page block order
will be given by ARCH_FORCE_MAX_ORDER.
By default, CONFIG_PAGE_BLOCK_ORDER will have the same value that
ARCH_FORCE_MAX_ORDER. This will make sure that current kernel
configurations won't be affected by this change. It is a opt-in change.
This patch will allow to have the same CMA alignment requirements for
large page sizes (16KiB, 64KiB) as that in 4kb kernels by setting a lower
pageblock_order.
Tests:
- Verified that HugeTLB pages work when pageblock_order is 1, 7, 10 on
4k and 16k kernels.
- Verified that Transparent Huge Pages work when pageblock_order is 1,
7, 10 on 4k and 16k kernels.
- Verified that dma-buf heaps allocations work when pageblock_order is
1, 7, 10 on 4k and 16k kernels.
Benchmarks:
The benchmarks compare 16kb kernels with pageblock_order 10 and 7. The
reason for the pageblock_order 7 is because this value makes the min CMA
alignment requirement the same as that in 4kb kernels (2MB).
- Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of
SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf
(https://developer.android.com/ndk/guides/simpleperf) to measure the #
of instructions and page-faults on 16k kernels. The benchmark was
executed 10 times. The averages are below:
# instructions | #page-faults
order 10 | order 7 | order 10 | order 7
--------------------------------------------------------
13,891,765,770 | 11,425,777,314 | 220 | 217
14,456,293,487 | 12,660,819,302 | 224 | 219
13,924,261,018 | 13,243,970,736 | 217 | 221
13,910,886,504 | 13,845,519,630 | 217 | 221
14,388,071,190 | 13,498,583,098 | 223 | 224
13,656,442,167 | 12,915,831,681 | 216 | 218
13,300,268,343 | 12,930,484,776 | 222 | 218
13,625,470,223 | 14,234,092,777 | 219 | 218
13,508,964,965 | 13,432,689,094 | 225 | 219
13,368,950,667 | 13,683,587,37 | 219 | 225
-------------------------------------------------------------------
13,803,137,433 | 13,131,974,268 | 220 | 220 Averages
There were 4.85% #instructions when order was 7, in comparison with order
10.
13,803,137,433 - 13,131,974,268 = -671,163,166 (-4.86%)
The number of page faults in order 7 and 10 were the same.
These results didn't show any significant regression when the
pageblock_order is set to 7 on 16kb kernels.
- Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
on the 16k kernels with pageblock_order 7 and 10.
order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) %
-------------------------------------------------------------------
15.8 | 16.4 | 0.6 | 3.80%
16.4 | 16.2 | -0.2 | -1.22%
16.6 | 16.3 | -0.3 | -1.81%
16.8 | 16.3 | -0.5 | -2.98%
16.6 | 16.8 | 0.2 | 1.20%
-------------------------------------------------------------------
16.44 16.4 -0.04 -0.24% Averages
The results didn't show any significant regression when the
pageblock_order is set to 7 on 16kb kernels.
Link: https://lkml.kernel.org/r/20250521215807.1860663-1-jyescas@google.com
Signed-off-by: Juan Yescas <jyescas@google.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
a24f2fb70c |
LoongArch: Introduce the numa_memblks conversion
Commit
|
||
|
|
cc79061b8f |
mm: khugepaged: decouple SHMEM and file folios' collapse
Originally, the file pages collapse was intended for tmpfs/shmem to merge into THP in the background. However, now not only tmpfs/shmem can support large folios, but some other file systems (such as XFS, erofs ...) also support large folios. Therefore, it is time to decouple the support of file folios collapse from SHMEM. Link: https://lkml.kernel.org/r/ce5c2314e0368cf34bda26f9bacf01c982d4da17.1747119309.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d59f43b574 |
memblock: add support for scratch memory
With KHO (Kexec HandOver), we need a way to ensure that the new kernel does not allocate memory on top of any memory regions that the previous kernel was handing over. But to know where those are, we need to include them in the memblock.reserved array which may not be big enough to hold all ranges that need to be persisted across kexec. To resize the array, we need to allocate memory. That brings us into a catch 22 situation. The solution to that is limit memblock allocations to the scratch regions: safe regions to operate in the case when there is memory that should remain intact across kexec. KHO provides several "scratch regions" as part of its metadata. These scratch regions are contiguous memory blocks that known not to contain any memory that should be persisted across kexec. These regions should be large enough to accommodate all memblock allocations done by the kexeced kernel. We introduce a new memblock_set_scratch_only() function that allows KHO to indicate that any memblock allocation must happen from the scratch regions. Later, we may want to perform another KHO kexec. For that, we reuse the same scratch regions. To ensure that no eventually handed over data gets allocated inside a scratch region, we flip the semantics of the scratch region with memblock_clear_scratch_only(): After that call, no allocations may happen from scratch memblock regions. We will lift that restriction in the next patch. Link: https://lkml.kernel.org/r/20250509074635.3187114-3-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
6f110a5e4f |
Disable SLUB_TINY for build testing
... and don't error out so hard on missing module descriptions. Before commit |
||
|
|
eb0ece1602 |
- The 6 patch series "Enable strict percpu address space checks" from
Uros Bizjak uses x86 named address space qualifiers to provide compile-time checking of percpu area accesses. This has caused a small amount of fallout - two or three issues were reported. In all cases the calling code was founf to be incorrect. - The 4 patch series "Some cleanup for memcg" from Chen Ridong implements some relatively monir cleanups for the memcontrol code. - The 17 patch series "mm: fixes for device-exclusive entries (hmm)" from David Hildenbrand fixes a boatload of issues which David found then using device-exclusive PTE entries when THP is enabled. More work is needed, but this makes thins better - our own HMM selftests now succeed. - The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed remove the z3fold and zbud implementations. They have been deprecated for half a year and nobody has complained. - The 5 patch series "mm: further simplify VMA merge operation" from Lorenzo Stoakes implements numerous simplifications in this area. No runtime effects are anticipated. - The 4 patch series "mm/madvise: remove redundant mmap_lock operations from process_madvise()" from SeongJae Park rationalizes the locking in the madvise() implementation. Performance gains of 20-25% were observed in one MADV_DONTNEED microbenchmark. - The 12 patch series "Tiny cleanup and improvements about SWAP code" from Baoquan He contains a number of touchups to issues which Baoquan noticed when working on the swap code. - The 2 patch series "mm: kmemleak: Usability improvements" from Catalin Marinas implements a couple of improvements to the kmemleak user-visible output. - The 2 patch series "mm/damon/paddr: fix large folios access and schemes handling" from Usama Arif provides a couple of fixes for DAMON's handling of large folios. - The 3 patch series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors" from SeongJae Park fixes a few issues with the accuracy of kdamond's walking of DAMON regions. - The 3 patch series "expose mapping wrprotect, fix fb_defio use" from Lorenzo Stoakes changes the interaction between framebuffer deferred-io and core MM. No functional changes are anticipated - this is preparatory work for the future removal of page structure fields. - The 4 patch series "mm/damon: add support for hugepage_size DAMOS filter" from Usama Arif adds a DAMOS filter which permits the filtering by huge page sizes. - The 4 patch series "mm: permit guard regions for file-backed/shmem mappings" from Lorenzo Stoakes extends the guard region feature from its present "anon mappings only" state. The feature now covers shmem and file-backed mappings. - The 4 patch series "mm: batched unmap lazyfree large folios during reclamation" from Barry Song cleans up and speeds up the unmapping for pte-mapped large folios. - The 18 patch series "reimplement per-vma lock as a refcount" from Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for pulling it out were largely bogus and that change made the code more messy. This patchset provides small (0-10%) improvements on one microbenchmark. - The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves" from SeongJae Park does some maintenance work on the DAMON docs. - The 27 patch series "hugetlb/CMA improvements for large systems" from Frank van der Linden addresses a pile of issues which have been observed when using CMA on large machines. - The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the page's mapped/unmapped status. - The 19 patch series "zsmalloc/zram: there be preemption" from Sergey Senozhatsky teaches zram to run its compression and decompression operations preemptibly. - The 12 patch series "selftests/mm: Some cleanups from trying to run them" from Brendan Jackman fixes a pile of unrelated issues which Brendan encountered while runnimg our selftests. - The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap" from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to determine whether a particular page is a guard page. - The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song removes the swap slot cache from the allocation path - it simply wasn't being effective. - The 5 patch series "mm: cleanups for device-exclusive entries (hmm)" from David Hildenbrand implements a number of unrelated cleanups in this code. - The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman Khandual implements a number of preparatoty cleanups to the GENERIC_PTDUMP Kconfig logic. - The 8 patch series "mm/damon: auto-tune aggregation interval" from SeongJae Park implements a feedback-driven automatic tuning feature for DAMON's aggregation interval tuning. - The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did this in preparation for implementing lazy mmu mode for arm64 to optimize vmalloc. - The 2 patch series "mm/page_alloc: Some clarifications for migratetype fallback" from Brendan Jackman reworks some commentary to make the code easier to follow. - The 3 patch series "page_counter cleanup and size reduction" from Shakeel Butt cleans up the page_counter code and fixes a size increase which we accidentally added late last year. - The 3 patch series "Add a command line option that enables control of how many threads should be used to allocate huge pages" from Thomas Prescher does that. It allows the careful operator to significantly reduce boot time by tuning the parallalization of huge page initialization. - The 3 patch series "Fix calculations in trace_balance_dirty_pages() for cgwb" from Tang Yizhou fixes the tracing output from the dirty page balancing code. - The 9 patch series "mm/damon: make allow filters after reject filters useful and intuitive" from SeongJae Park improves the handling of allow and reject filters. Behaviour is made more consistent and the documention is updated accordingly. - The 5 patch series "Switch zswap to object read/write APIs" from Yosry Ahmed updates zswap to the new object read/write APIs and thus permits the removal of some legacy code from zpool and zsmalloc. - The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang does as it claims. - The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts" from Alistair Popple regularizes the weird ZONE_DEVICE page refcount handling in DAX, permittig the removal of a number of special-case checks. - The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes is a preparatoty refactoring and cleanup of the mremap() code. - The 20 patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in which we determine whether a large folio is known to be mapped exclusively into a single MM. - The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers" from SeongJae Park adds a couple of new sysfs directories to ease the management of DAMON/DAMOS filters. - The 13 patch series "arch, mm: reduce code duplication in mem_init()" from Mike Rapoport consolidates many per-arch implementations of mem_init() into code generic code, where that is practical. - The 13 patch series "mm/damon/sysfs: commit parameters online via damon_call()" from SeongJae Park continues the cleaning up of sysfs access to DAMON internal data. - The 3 patch series "mm: page_ext: Introduce new iteration API" from Luiz Capitulino reworks the page_ext initialization to fix a boot-time crash which was observed with an unusual combination of compile and cmdline options. - The 8 patch series "Buddy allocator like (or non-uniform) folio split" from Zi Yan reworks the code to split a folio into smaller folios. The main benefit is lessened memory consumption: fewer post-split folios are generated. - The 2 patch series "Minimize xa_node allocation during xarry split" from Zi Yan reduces the number of xarray xa_nodes which are generated during an xarray split. - The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan performs some maintenance work on the drivers/base/memory code. - The 3 patch series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages" from Martin Liu adds some more tracepoints to the page allocator code. - The 4 patch series "mm/madvise: cleanup requests validations and classifications" from SeongJae Park cleans up some warts which SeongJae observed during his earlier madvise work. - The 3 patch series "mm/hwpoison: Fix regressions in memory failure handling" from Shuai Xue addresses two quite serious regressions which Shuai has observed in the memory-failure implementation. - The 5 patch series "mm: reliable huge page allocator" from Johannes Weiner makes huge page allocations cheaper and more reliable by reducing fragmentation. - The 5 patch series "Minor memcg cleanups & prep for memdescs" from Matthew Wilcox is preparatory work for the future implementation of memdescs. - The 4 patch series "track memory used by balloon drivers" from Nico Pache introduces a way to track memory used by our various balloon drivers. - The 2 patch series "mm/damon: introduce DAMOS filter type for active pages" from Nhat Pham permits users to filter for active/inactive pages, separately for file and anon pages. - The 2 patch series "Adding Proactive Memory Reclaim Statistics" from Hao Jia separates the proactive reclaim statistics from the direct reclaim statistics. - The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio" from Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim code. -----BEGIN PGP SIGNATURE----- iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF Ee93qSSLR1BkNdDw+931Pu0mXfbnBw== =Pn2K -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "Enable strict percpu address space checks" from Uros Bizjak uses x86 named address space qualifiers to provide compile-time checking of percpu area accesses. This has caused a small amount of fallout - two or three issues were reported. In all cases the calling code was found to be incorrect. - The series "Some cleanup for memcg" from Chen Ridong implements some relatively monir cleanups for the memcontrol code. - The series "mm: fixes for device-exclusive entries (hmm)" from David Hildenbrand fixes a boatload of issues which David found then using device-exclusive PTE entries when THP is enabled. More work is needed, but this makes thins better - our own HMM selftests now succeed. - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed remove the z3fold and zbud implementations. They have been deprecated for half a year and nobody has complained. - The series "mm: further simplify VMA merge operation" from Lorenzo Stoakes implements numerous simplifications in this area. No runtime effects are anticipated. - The series "mm/madvise: remove redundant mmap_lock operations from process_madvise()" from SeongJae Park rationalizes the locking in the madvise() implementation. Performance gains of 20-25% were observed in one MADV_DONTNEED microbenchmark. - The series "Tiny cleanup and improvements about SWAP code" from Baoquan He contains a number of touchups to issues which Baoquan noticed when working on the swap code. - The series "mm: kmemleak: Usability improvements" from Catalin Marinas implements a couple of improvements to the kmemleak user-visible output. - The series "mm/damon/paddr: fix large folios access and schemes handling" from Usama Arif provides a couple of fixes for DAMON's handling of large folios. - The series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors" from SeongJae Park fixes a few issues with the accuracy of kdamond's walking of DAMON regions. - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo Stoakes changes the interaction between framebuffer deferred-io and core MM. No functional changes are anticipated - this is preparatory work for the future removal of page structure fields. - The series "mm/damon: add support for hugepage_size DAMOS filter" from Usama Arif adds a DAMOS filter which permits the filtering by huge page sizes. - The series "mm: permit guard regions for file-backed/shmem mappings" from Lorenzo Stoakes extends the guard region feature from its present "anon mappings only" state. The feature now covers shmem and file-backed mappings. - The series "mm: batched unmap lazyfree large folios during reclamation" from Barry Song cleans up and speeds up the unmapping for pte-mapped large folios. - The series "reimplement per-vma lock as a refcount" from Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for pulling it out were largely bogus and that change made the code more messy. This patchset provides small (0-10%) improvements on one microbenchmark. - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves" from SeongJae Park does some maintenance work on the DAMON docs. - The series "hugetlb/CMA improvements for large systems" from Frank van der Linden addresses a pile of issues which have been observed when using CMA on large machines. - The series "mm/damon: introduce DAMOS filter type for unmapped pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the page's mapped/unmapped status. - The series "zsmalloc/zram: there be preemption" from Sergey Senozhatsky teaches zram to run its compression and decompression operations preemptibly. - The series "selftests/mm: Some cleanups from trying to run them" from Brendan Jackman fixes a pile of unrelated issues which Brendan encountered while runnimg our selftests. - The series "fs/proc/task_mmu: add guard region bit to pagemap" from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to determine whether a particular page is a guard page. - The series "mm, swap: remove swap slot cache" from Kairui Song removes the swap slot cache from the allocation path - it simply wasn't being effective. - The series "mm: cleanups for device-exclusive entries (hmm)" from David Hildenbrand implements a number of unrelated cleanups in this code. - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual implements a number of preparatoty cleanups to the GENERIC_PTDUMP Kconfig logic. - The series "mm/damon: auto-tune aggregation interval" from SeongJae Park implements a feedback-driven automatic tuning feature for DAMON's aggregation interval tuning. - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did this in preparation for implementing lazy mmu mode for arm64 to optimize vmalloc. - The series "mm/page_alloc: Some clarifications for migratetype fallback" from Brendan Jackman reworks some commentary to make the code easier to follow. - The series "page_counter cleanup and size reduction" from Shakeel Butt cleans up the page_counter code and fixes a size increase which we accidentally added late last year. - The series "Add a command line option that enables control of how many threads should be used to allocate huge pages" from Thomas Prescher does that. It allows the careful operator to significantly reduce boot time by tuning the parallalization of huge page initialization. - The series "Fix calculations in trace_balance_dirty_pages() for cgwb" from Tang Yizhou fixes the tracing output from the dirty page balancing code. - The series "mm/damon: make allow filters after reject filters useful and intuitive" from SeongJae Park improves the handling of allow and reject filters. Behaviour is made more consistent and the documention is updated accordingly. - The series "Switch zswap to object read/write APIs" from Yosry Ahmed updates zswap to the new object read/write APIs and thus permits the removal of some legacy code from zpool and zsmalloc. - The series "Some trivial cleanups for shmem" from Baolin Wang does as it claims. - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from Alistair Popple regularizes the weird ZONE_DEVICE page refcount handling in DAX, permittig the removal of a number of special-case checks. - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a preparatoty refactoring and cleanup of the mremap() code. - The series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in which we determine whether a large folio is known to be mapped exclusively into a single MM. - The series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers" from SeongJae Park adds a couple of new sysfs directories to ease the management of DAMON/DAMOS filters. - The series "arch, mm: reduce code duplication in mem_init()" from Mike Rapoport consolidates many per-arch implementations of mem_init() into code generic code, where that is practical. - The series "mm/damon/sysfs: commit parameters online via damon_call()" from SeongJae Park continues the cleaning up of sysfs access to DAMON internal data. - The series "mm: page_ext: Introduce new iteration API" from Luiz Capitulino reworks the page_ext initialization to fix a boot-time crash which was observed with an unusual combination of compile and cmdline options. - The series "Buddy allocator like (or non-uniform) folio split" from Zi Yan reworks the code to split a folio into smaller folios. The main benefit is lessened memory consumption: fewer post-split folios are generated. - The series "Minimize xa_node allocation during xarry split" from Zi Yan reduces the number of xarray xa_nodes which are generated during an xarray split. - The series "drivers/base/memory: Two cleanups" from Gavin Shan performs some maintenance work on the drivers/base/memory code. - The series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages" from Martin Liu adds some more tracepoints to the page allocator code. - The series "mm/madvise: cleanup requests validations and classifications" from SeongJae Park cleans up some warts which SeongJae observed during his earlier madvise work. - The series "mm/hwpoison: Fix regressions in memory failure handling" from Shuai Xue addresses two quite serious regressions which Shuai has observed in the memory-failure implementation. - The series "mm: reliable huge page allocator" from Johannes Weiner makes huge page allocations cheaper and more reliable by reducing fragmentation. - The series "Minor memcg cleanups & prep for memdescs" from Matthew Wilcox is preparatory work for the future implementation of memdescs. - The series "track memory used by balloon drivers" from Nico Pache introduces a way to track memory used by our various balloon drivers. - The series "mm/damon: introduce DAMOS filter type for active pages" from Nhat Pham permits users to filter for active/inactive pages, separately for file and anon pages. - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia separates the proactive reclaim statistics from the direct reclaim statistics. - The series "mm/vmscan: don't try to reclaim hwpoison folio" from Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim code. * tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits) mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex() x86/mm: restore early initialization of high_memory for 32-bits mm/vmscan: don't try to reclaim hwpoison folio mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper cgroup: docs: add pswpin and pswpout items in cgroup v2 doc mm: vmscan: split proactive reclaim statistics from direct reclaim statistics selftests/mm: speed up split_huge_page_test selftests/mm: uffd-unit-tests support for hugepages > 2M docs/mm/damon/design: document active DAMOS filter type mm/damon: implement a new DAMOS filter type for active pages fs/dax: don't disassociate zero page entries MM documentation: add "Unaccepted" meminfo entry selftests/mm: add commentary about 9pfs bugs fork: use __vmalloc_node() for stack allocation docs/mm: Physical Memory: Populate the "Zones" section xen: balloon: update the NR_BALLOON_PAGES state hv_balloon: update the NR_BALLOON_PAGES state balloon_compaction: update the NR_BALLOON_PAGES state meminfo: add a per node counter for balloon drivers mm: remove references to folio in __memcg_kmem_uncharge_page() ... |
||
|
|
e63ee43e3e |
mm: CONFIG_NO_PAGE_MAPCOUNT to prepare for not maintain per-page mapcounts in large folios
We're close to the finishing line: let's introduce a new CONFIG_NO_PAGE_MAPCOUNT config option where we will incrementally remove any dependencies on per-page mapcounts in large folios. Once that's done, we'll stop maintaining the per-page mapcounts with this config option enabled. CONFIG_NO_PAGE_MAPCOUNT will be EXPERIMENTAL for now, as we'll have to learn about some of the real world impact of some of the implications. As writing "!CONFIG_NO_PAGE_MAPCOUNT" is really nasty, let's introduce a helper config option "CONFIG_PAGE_MAPCOUNT" that expresses the negation. Link: https://lkml.kernel.org/r/20250303163014.1128035-16-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
6af8cb80d3 |
mm/rmap: basic MM owner tracking for large folios (!hugetlb)
For small folios, we traditionally use the mapcount to decide whether it was "certainly mapped exclusively" by a single MM (mapcount == 1) or whether it "maybe mapped shared" by multiple MMs (mapcount > 1). For PMD-sized folios that were PMD-mapped, we were able to use a similar mechanism (single PMD mapping), but for PTE-mapped folios and in the future folios that span multiple PMDs, this does not work. So we need a different mechanism to handle large folios. Let's add a new mechanism to detect whether a large folio is "certainly mapped exclusively", or whether it is "maybe mapped shared". We'll use this information next to optimize CoW reuse for PTE-mapped anonymous THP, and to convert folio_likely_mapped_shared() to folio_maybe_mapped_shared(), independent of per-page mapcounts. For each large folio, we'll have two slots, whereby a slot stores: (1) an MM id: unique id assigned to each MM (2) a per-MM mapcount If a slot is unoccupied, it can be taken by the next MM that maps folio page. In addition, we'll remember the current state -- "mapped exclusively" vs. "maybe mapped shared" -- and use a bit spinlock to sync on updates and to reduce the total number of atomic accesses on updates. In the future, it might be possible to squeeze a proper spinlock into "struct folio". For now, keep it simple, as we require the whole thing with THP only, that is incompatible with RT. As we have to squeeze this information into the "struct folio" of even folios of order-1 (2 pages), and we generally want to reduce the required metadata, we'll assign each MM a unique ID that can fit into an int. In total, we can squeeze everything into 4x int (2x long) on 64bit. 32bit support is a bit challenging, because we only have 2x long == 2x int in order-1 folios. But we can make it work for now, because we neither expect many MMs nor very large folios on 32bit. We will reliably detect folios as "mapped exclusively" vs. "mapped shared" as long as only two MMs map pages of a folio at one point in time -- for example with fork() and short-lived child processes, or with apps that hand over state from one instance to another. As soon as three MMs are involved at the same time, we might detect "maybe mapped shared" although the folio is "mapped exclusively". Example 1: (1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0 (2) App2 faults in a folio page -> Tracked as MM1 (4) App1 unmaps all folio pages -> We will detect "mapped exclusively". Example 2: (1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0 (2) App2 faults in a folio page -> Tracked as MM1 (3) App3 faults in a folio page -> No slot available, tracked as "unknown" (4) App1 and App2 unmap all folio pages -> We will detect "maybe mapped shared". Make use of __always_inline to keep possible performance degradation when (un)mapping large folios to a minimum. Note: by squeezing the two flags into the "unsigned long" that stores the MM ids, we can use non-atomic __bit_spin_unlock() and non-atomic setting/clearing of the "maybe mapped shared" bit, effectively not adding any new atomics on the hot path when updating the large mapcount + new metadata, which further helps reduce the runtime overhead in micro-benchmarks. Link: https://lkml.kernel.org/r/20250303163014.1128035-13-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d65917c423 |
mm/sparse: allow for alternate vmemmap section init at boot
Add functions that are called just before the per-section memmap is initialized and just before the memmap page structures are initialized. They are called sparse_vmemmap_init_nid_early and sparse_vmemmap_init_nid_late, respectively. This allows for mm subsystems to add calls to initialize memmap and page structures in a specific way, if using SPARSEMEM_VMEMMAP. Specifically, hugetlb can pre-HVO bootmem allocated pages that way, so that no time and resources are wasted on allocating vmemmap pages, only to free them later (and possibly unnecessarily running the system out of memory in the process). Refactor some code and export a few convenience functions for external use. In sparse_init_nid, skip any sections that are already initialized, e.g. they have been initialized by sparse_vmemmap_init_nid_early already. The hugetlb code to use these functions will be added in a later commit. Export section_map_size, as any alternate memmap init code will want to use it. The internal config option to enable this is SPARSEMEM_VMEMMAP_PREINIT, which is selected if an architecture-specific option, ARCH_WANT_HUGETLB_VMEMMAP_PREINIT, is set. In the future, if other subsystems want to do preinit too, they can do it in a similar fashion. The internal config option is there because a section flag is used, and the number of flags available is architecture-dependent (see mmzone.h). Architecures can decide if there is room for the flag when enabling options that select SPARSEMEM_VMEMMAP_PREINIT. Fortunately, as of right now, all sparse vmemmap using architectures do have room. Link: https://lkml.kernel.org/r/20250228182928.2645936-11-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
6df8bae8e8 |
mm: zbud: remove zbud
The zbud compressed pages allocator is rarely used, most users use zsmalloc. zbud consumes much more memory (only stores 1 or 2 compressed pages per physical page). The only advantage of zbud is a marginal performance improvement that by no means justify the memory overhead. Historically, zsmalloc had significantly worse latency than zbud and z3fold but offered better memory savings. This is no longer the case as shown by a simple recent analysis [1]. In a kernel build test on tmpfs in a limited cgroup, zbud 2-3% less time than zsmalloc, but at the cost of using ~32% more memory (1.5G vs 1.13G). The tradeoff does not make sense for zbud in any practical scenario. The only alleged advantage of zbud is not having the dependency on CONFIG_MMU, but CONFIG_SWAP already depends on CONFIG_MMU anyway, and zbud is only used by zswap. Remove zbud after z3fold's removal, leaving zsmalloc as the one and only zpool allocator. Leave the removal of the zpool API (and its associated config options) to a followup cleanup after no more allocators show up. Deprecating zbud for a few cycles before removing it was initially proposed [2], like z3fold was marked as deprecated for 2 cycles [3]. However, Johannes rightfully pointed out that the 2 cycles is too short for most downstream consumers, and z3fold was deprecated first only as a courtesy anyway. [1]https://lore.kernel.org/lkml/CAJD7tkbRF6od-2x_L8-A1QL3=2Ww13sCj4S3i4bNndqF+3+_Vg@mail.gmail.com/ [2]https://lore.kernel.org/lkml/Z5gdnSX5Lv-nfjQL@google.com/ [3]https://lore.kernel.org/lkml/20240904233343.933462-1-yosryahmed@google.com/ Link: https://lkml.kernel.org/r/20250129180633.3501650-3-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
58ba73e521 |
mm: z3fold: remove z3fold
Patch series "mm: zswap: remove z3fold and zbud", v2. After 2 cycles of deprecating z3fold, remove it as well as zbud (rationale in specific patches). This patch (of 2): Z3fold has been marked as deprecated for 2 cycles and no one complained, as expected. As there are no known users, remove the code now. Link: https://lkml.kernel.org/r/20250129180633.3501650-1-yosry.ahmed@linux.dev Link: https://lkml.kernel.org/r/20250129180633.3501650-2-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
c9f8f1242a |
slab: don't batch kvfree_rcu() with SLUB_TINY
kvfree_rcu() is batched for better performance except on TINY_RCU, which is a simple implementation for small UP systems. Similarly SLUB_TINY is an option intended for small systems, whether or not used together with TINY_RCU. In case SLUB_TINY is used with !TINY_RCU, it makes arguably sense to not do the batching and limit the memory footprint. It's also suboptimal to have RCU-specific #ifdefs in slab code. With that, add CONFIG_KVFREE_RCU_BATCHED to determine whether batching kvfree_rcu() implementation is used. It is not set by a user prompt, but enabled by default and disabled in case TINY_RCU or SLUB_TINY are enabled. Use the new config for #ifdef's in slab code and extend their scope to cover all code used by the batched kvfree_rcu(). For example there's no need to perform kvfree_rcu_init() if the batching is disabled. Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
44d46b76c3 |
mm: add build-time option for hotplug memory default online type
Memory hotplug presently auto-onlines memory into a zone the kernel deems
appropriate if CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y.
The memhp_default_state boot param enables runtime config, but it's not
possible to do this at build-time.
Remove CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE, and replace it with
CONFIG_MHP_DEFAULT_ONLINE_TYPE_* choices that sync with the boot param.
Selections:
CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
=> mhp_default_online_type = "offline"
Memory will not be onlined automatically.
CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_AUTO
=> mhp_default_online_type = "online"
Memory will be onlined automatically in a zone deemed.
appropriate by the kernel.
CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
=> mhp_default_online_type = "online_kernel"
Memory will be onlined automatically.
The zone may allow kernel data (e.g. ZONE_NORMAL).
CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
=> mhp_default_online_type = "online_movable"
Memory will be onlined automatically.
The zone will be ZONE_MOVABLE.
Default to CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE to match the existing
default CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n behavior.
Existing users of CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y should use
CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_AUTO.
[gourry@gourry.net: update KConfig comments]
Link: https://lkml.kernel.org/r/20241226182918.648799-1-gourry@gourry.net
Link: https://lkml.kernel.org/r/20241220210709.300066-1-gourry@gourry.net
Signed-off-by: Gregory Price <gourry@gourry.net>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
6375e95f38 |
mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
Now in order to pursue high performance, applications mostly use some
high-performance user-mode memory allocators, such as jemalloc or
tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
release page table memory, which may cause huge page table memory usage.
The following are a memory usage snapshot of one process which actually
happened on our server:
VIRT: 55t
RES: 590g
VmPTE: 110g
In this case, most of the page table entries are empty. For such a PTE
page where all entries are empty, we can actually free it back to the
system for others to use.
As a first step, this commit aims to synchronously free the empty PTE
pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
cases other than madvise(MADV_DONTNEED).
Once an empty PTE is detected, we first try to hold the pmd lock within
the pte lock. If successful, we clear the pmd entry directly (fast path).
Otherwise, we wait until the pte lock is released, then re-hold the pmd
and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
whether the PTE page is empty and free it (slow path).
For other cases such as madvise(MADV_FREE), consider scanning and freeing
empty PTE pages asynchronously in the future.
The following code snippet can show the effect of optimization:
mmap 50G
while (1) {
for (; i < 1024 * 25; i++) {
touch 2M memory
madvise MADV_DONTNEED 2M
}
}
As we can see, the memory usage of VmPTE is reduced:
before after
VIRT 50.0 GB 50.0 GB
RES 3.1 MB 3.1 MB
VmPTE 102640 KB 240 KB
[zhengqi.arch@bytedance.com: fix uninitialized symbol 'ptl']
Link: https://lkml.kernel.org/r/20241206112348.51570-1-zhengqi.arch@bytedance.com
Link: https://lore.kernel.org/linux-mm/224e6a4e-43b5-4080-bdd8-b0a6fb2f0853@stanley.mountain/
Link: https://lkml.kernel.org/r/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
ba1f9c8fe3 |
arm64 updates for 6.13:
* Support for running Linux in a protected VM under the Arm Confidential
Compute Architecture (CCA)
* Guarded Control Stack user-space support. Current patches follow the
x86 ABI of implicitly creating a shadow stack on clone(). Subsequent
patches (already on the list) will add support for clone3() allowing
finer-grained control of the shadow stack size and placement from libc
* AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are
getting close with the upcoming dpISA support)
* Other arch features:
- In-kernel use of the memcpy instructions, FEAT_MOPS (previously only
exposed to user; uaccess support not merged yet)
- MTE: hugetlbfs support and the corresponding kselftests
- Optimise CRC32 using the PMULL instructions
- Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG
- Optimise the kernel TLB flushing to use the range operations
- POE/pkey (permission overlays): further cleanups after bringing the
signal handler in line with the x86 behaviour for 6.12
* arm64 perf updates:
- Support for the NXP i.MX91 PMU in the existing IMX driver
- Support for Ampere SoCs in the Designware PCIe PMU driver
- Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC
- Support for Samsung's 'Mongoose' CPU PMU
- Support for PMUv3.9 finer-grained userspace counter access control
- Switch back to platform_driver::remove() now that it returns 'void'
- Add some missing events for the CXL PMU driver
* Miscellaneous arm64 fixes/cleanups:
- Page table accessors cleanup: type updates, drop unused macros,
reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity
check addresses before runtime P4D/PUD folding
- Command line override for ID_AA64MMFR0_EL1.ECV (advertising the
FEAT_ECV for the generic timers) allowing Linux to boot with
firmware deployments that don't set SCTLR_EL3.ECVEn
- ACPI/arm64: tighten the check for the array of platform timer
structures and adjust the error handling procedure in
gtdt_parse_timer_block()
- Optimise the cache flush for the uprobes xol slot (skip if no
change) and other uprobes/kprobes cleanups
- Fix the context switching of tpidrro_el0 when kpti is enabled
- Dynamic shadow call stack fixes
- Sysreg updates
- Various arm64 kselftest improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmc5POIACgkQa9axLQDI
XvEDYA//a3eeNkgMuGdnSCVcLz+zy+oNwAwboG/4X1DqL8jiCbI4npwugPx95RIA
YZOUvo9T2aL3OyefpUHll4gFHqx9OwoZIig2F70TEUmlPsGUbh0KBkdfQF3xZPdl
EwV0kHSGEqMWMBwsGJGwgCYrUaf1MUQzh1GBl7VJ2ts5XsJBaBeOyKkysij26wtZ
V+aHq2IUx7qQS7+HC/4P6IoHxKziFcsCMovaKaynP4cw9xXBQbDMcNlHEwndOMyk
pu2zrv7GG0j3KQuVP/2Alf5FKhmI0GVGP/6Nc/zsOmw96w8Kf7HfzEtkHawr2aRq
rqg/c9ivzDn1p+fUBo4ZYtrRk4IAY+yKu6hdzdLTP5+bQrBTWTO9rjQVBm9FAGYT
sCdEj1NqzvExvNHD7X6ut/GJ05lmce3K+qeSXSEysN9gqiT3eomYWMXrD2V2lxzb
rIDDcb/icfaqjt14Mksh19r/rzNeq7noj9CGSmcqw0BHZfHzl38Lai6pdfYzCNyn
vCM/c4c1D/WWX8/lifO1JZVbhDk1jy82Iphg2KEhL8iKPxDsKBBZLmYuU1oa7tMo
WryGAz9+GQwd+W9chFuaOEtMnzvW2scEJ5Eb2fEf0Qj0aEurkL+C9dZR6o1GN77V
DBUxtU628Ef4PJJGfbNCwZzdd8UPYG3a/mKfQQ3dz0oz2LySlW4=
=wDot
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- Support for running Linux in a protected VM under the Arm
Confidential Compute Architecture (CCA)
- Guarded Control Stack user-space support. Current patches follow the
x86 ABI of implicitly creating a shadow stack on clone(). Subsequent
patches (already on the list) will add support for clone3() allowing
finer-grained control of the shadow stack size and placement from
libc
- AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are
getting close with the upcoming dpISA support)
- Other arch features:
- In-kernel use of the memcpy instructions, FEAT_MOPS (previously
only exposed to user; uaccess support not merged yet)
- MTE: hugetlbfs support and the corresponding kselftests
- Optimise CRC32 using the PMULL instructions
- Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG
- Optimise the kernel TLB flushing to use the range operations
- POE/pkey (permission overlays): further cleanups after bringing
the signal handler in line with the x86 behaviour for 6.12
- arm64 perf updates:
- Support for the NXP i.MX91 PMU in the existing IMX driver
- Support for Ampere SoCs in the Designware PCIe PMU driver
- Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC
- Support for Samsung's 'Mongoose' CPU PMU
- Support for PMUv3.9 finer-grained userspace counter access
control
- Switch back to platform_driver::remove() now that it returns
'void'
- Add some missing events for the CXL PMU driver
- Miscellaneous arm64 fixes/cleanups:
- Page table accessors cleanup: type updates, drop unused macros,
reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity
check addresses before runtime P4D/PUD folding
- Command line override for ID_AA64MMFR0_EL1.ECV (advertising the
FEAT_ECV for the generic timers) allowing Linux to boot with
firmware deployments that don't set SCTLR_EL3.ECVEn
- ACPI/arm64: tighten the check for the array of platform timer
structures and adjust the error handling procedure in
gtdt_parse_timer_block()
- Optimise the cache flush for the uprobes xol slot (skip if no
change) and other uprobes/kprobes cleanups
- Fix the context switching of tpidrro_el0 when kpti is enabled
- Dynamic shadow call stack fixes
- Sysreg updates
- Various arm64 kselftest improvements
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (168 commits)
arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled
kselftest/arm64: Try harder to generate different keys during PAC tests
kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
arm64/ptrace: Clarify documentation of VL configuration via ptrace
kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
acpi/arm64: remove unnecessary cast
arm64/mm: Change protval as 'pteval_t' in map_range()
kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c
kselftest/arm64: Add FPMR coverage to fp-ptrace
kselftest/arm64: Expand the set of ZA writes fp-ptrace does
kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
kselftest/arm64: Enable build of PAC tests with LLVM=1
kselftest/arm64: Check that SVCR is 0 in signal handlers
selftests/mm: Fix unused function warning for aarch64_write_signal_pkey()
kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests
kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
kselftest/arm64: Fix build with stricter assemblers
arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux()
arm64/scs: Deal with 64-bit relative offsets in FDE frames
...
|
||
|
|
b7c5f9a1fb |
resource: remove dependency on SPARSEMEM from GET_FREE_REGION
We want to use the functions (get_free_mem_region()) configured via GET_FREE_REGION in resource kunit tests. However, GET_FREE_REGION depends on SPARSEMEM now. This makes resource kunit tests cannot be built on some architectures lacking SPARSEMEM, or causes config warning as follows, WARNING: unmet direct dependencies detected for GET_FREE_REGION Depends on [n]: SPARSEMEM [=n] Selected by [y]: - RESOURCE_KUNIT_TEST [=y] && RUNTIME_TESTING_MENU [=y] && KUNIT [=y] When get_free_mem_region() was introduced the only consumers were those looking to pass the address range to memremap_pages(). That address range needed to be mindful of the maximum addressable platform physical address which at the time only SPARSMEM defined via MAX_PHYSMEM_BITS. Given that memremap_pages() also depended on SPARSEMEM via ZONE_DEVICE, it was easier to just depend on that definition than invent a general MAX_PHYSMEM_BITS concept outside of SPARSEMEM. Turns out that decision was buggy and did not account for KASAN consumption of physical address space. That problem was resolved recently with commit |
||
|
|
bcc9d04e74 |
mm: Introduce ARCH_HAS_USER_SHADOW_STACK
Since multiple architectures have support for shadow stacks and we need to select support for this feature in several places in the generic code provide a generic config option that the architectures can select. Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Deepak Gupta <debug@rivosinc.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Kees Cook <kees@kernel.org> Tested-by: Kees Cook <kees@kernel.org> Acked-by: Shuah Khan <skhan@linuxfoundation.org> Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org> Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-1-222b78d87eee@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> |
||
|
|
a334407810 |
mm: make SPLIT_PTE_PTLOCKS depend on SMP
SPLIT_PTE_PTLOCKS depends on "NR_CPUS >= 4". Unfortunately, that
evaluates to true if there is no NR_CPUS configuration option. This
results in CONFIG_SPLIT_PTE_PTLOCKS=y for mac_defconfig. This in turn
causes the m68k "q800" and "virt" machines to crash in qemu if debugging
options are enabled.
Making CONFIG_SPLIT_PTE_PTLOCKS dependent on the existence of NR_CPUS does
not work since a dependency on the existence of a numeric Kconfig entry
always evaluates to false. Example:
config HAVE_NO_NR_CPUS
def_bool y
depends on !NR_CPUS
After adding this to a Kconfig file, "make defconfig" includes:
$ grep NR_CPUS .config
CONFIG_NR_CPUS=64
CONFIG_HAVE_NO_NR_CPUS=y
Defining NR_CPUS for m68k does not help either since many architectures
define NR_CPUS only for SMP configurations.
Make SPLIT_PTE_PTLOCKS depend on SMP instead to solve the problem.
Link: https://lkml.kernel.org/r/20240924154205.1491376-1-linux@roeck-us.net
Fixes:
|
||
|
|
7a2369b74a |
mm: z3fold: deprecate CONFIG_Z3FOLD
The z3fold compressed pages allocator is rarely used, most users use zsmalloc. The only disadvantage of zsmalloc in comparison is the dependency on MMU, and zbud is a more common option for !MMU as it was the default zswap allocator for a long time. Historically, zsmalloc had worse latency than zbud and z3fold but offered better memory savings. This is no longer the case as shown by a simple recent analysis [1]. That analysis showed that z3fold does not have any advantage over zsmalloc or zbud considering both performance and memory usage. In a kernel build test on tmpfs in a limited cgroup, z3fold took 3% more time and used 1.8% more memory. The latency of zswap_load() was 7% higher, and that of zswap_store() was 10% higher. Zsmalloc is better in all metrics. Moreover, z3fold apparently has latent bugs, which was made noticeable by a recent soft lockup bug report with z3fold [2]. Switching to zsmalloc not only fixed the problem, but also reduced the swap usage from 6~8G to 1~2G. Other users have also reported being bitten by mistakenly enabling z3fold. Other than hurting users, z3fold is repeatedly causing wasted engineering effort. Apart from investigating the above bug, it came up in multiple development discussions (e.g. [3]) as something we need to handle, when there aren't any legit users (at least not intentionally). The natural course of action is to deprecate z3fold, and remove in a few cycles if no objections are raised from active users. Next on the list should be zbud, as it offers marginal latency gains at the cost of huge memory waste when compared to zsmalloc. That one will need to wait until zsmalloc does not depend on MMU. Rename the user-visible config option from CONFIG_Z3FOLD to CONFIG_Z3FOLD_DEPRECATED so that users with CONFIG_Z3FOLD=y get a new prompt with explanation during make oldconfig. Also, remove CONFIG_Z3FOLD=y from defconfigs. [1]https://lore.kernel.org/lkml/CAJD7tkbRF6od-2x_L8-A1QL3=2Ww13sCj4S3i4bNndqF+3+_Vg@mail.gmail.com/ [2]https://lore.kernel.org/lkml/EF0ABD3E-A239-4111-A8AB-5C442E759CF3@gmail.com/ [3]https://lore.kernel.org/lkml/CAJD7tkbnmeVugfunffSovJf9FAgy9rhBVt_tx=nxUveLUfqVsA@mail.gmail.com/ [arnd@arndb.de: deprecate ZSWAP_ZPOOL_DEFAULT_Z3FOLD as well] Link: https://lkml.kernel.org/r/20240909202625.1054880-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20240904233343.933462-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vitaly Wool <vitaly.wool@konsulko.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
6857be5fec |
mm: introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
Patch series "mm: Support huge pfnmaps", v2.
Overview
========
This series implements huge pfnmaps support for mm in general. Huge
pfnmap allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels,
similar to what we do with dax / thp / hugetlb so far to benefit from TLB
hits. Now we extend that idea to PFN mappings, e.g. PCI MMIO bars where
it can grow as large as 8GB or even bigger.
Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. The last
patch (from Alex Williamson) will be the first user of huge pfnmap, so as
to enable vfio-pci driver to fault in huge pfn mappings.
Implementation
==============
In reality, it's relatively simple to add such support comparing to many
other types of mappings, because of PFNMAP's specialties when there's no
vmemmap backing it, so that most of the kernel routines on huge mappings
should simply already fail for them, like GUPs or old-school follow_page()
(which is recently rewritten to be folio_walk* APIs by David).
One trick here is that we're still unmature on PUDs in generic paths here
and there, as DAX is so far the only user. This patchset will add the 2nd
user of it. Hugetlb can be a 3rd user if the hugetlb unification work can
go on smoothly, but to be discussed later.
The other trick is how to allow gup-fast working for such huge mappings
even if there's no direct sign of knowing whether it's a normal page or
MMIO mapping. This series chose to keep the pte_special solution, so that
it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so
that gup-fast will be able to identify them and fail properly.
Along the way, we'll also notice that the major pgtable pfn walker, aka,
follow_pte(), will need to retire soon due to the fact that it only works
with ptes. A new set of simple API is introduced (follow_pfnmap* API) to
be able to do whatever follow_pte() can already do, plus that it can also
process huge pfnmaps now. Half of this series is about that and
converting all existing pfnmap walkers to use the new API properly.
Hopefully the new API also looks better to avoid exposing e.g. pgtable
lock details into the callers, so that it can be used in an even more
straightforward way.
Here, three more options will be introduced and involved in huge pfnmap:
- ARCH_SUPPORTS_HUGE_PFNMAP
Arch developers will need to select this option when huge pfnmap is
supported in arch's Kconfig. After this patchset applied, both x86_64
and arm64 will start to enable it by default.
- ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP
These options are for driver developers to identify whether current
arch / config supports huge pfnmaps, making decision on whether it can
use the huge pfnmap APIs to inject them. One can refer to the last
vfio-pci patch from Alex on the use of them properly in a device
driver.
So after the whole set applied, and if one would enable some dynamic debug
lines in vfio-pci core files, we should observe things like:
vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100
In this specific case, it says that vfio-pci faults in PMDs properly for a
few BAR0 offsets.
Patch Layout
============
Patch 1: Introduce the new options mentioned above for huge PFNMAPs
Patch 2: A tiny cleanup
Patch 3-8: Preparation patches for huge pfnmap (include introduce
special bit for pmd/pud)
Patch 9-16: Introduce follow_pfnmap*() API, use it everywhere, and
then drop follow_pte() API
Patch 17: Add huge pfnmap support for x86_64
Patch 18: Add huge pfnmap support for arm64
Patch 19: Add vfio-pci support for all kinds of huge pfnmaps (Alex)
TODO
====
More architectures / More page sizes
------------------------------------
Currently only x86_64 (2M+1G) and arm64 (2M) are supported. There seems
to have plan to support arm64 1G later on top of this series [2].
Any arch will need to first support THP / THP_1G, then provide a special
bit in pmds/puds to support huge pfnmaps.
remap_pfn_range() support
-------------------------
Currently, remap_pfn_range() still only maps PTEs. With the new option,
remap_pfn_range() can logically start to inject either PMDs or PUDs when
the alignment requirements match on the VAs.
When the support is there, it should be able to silently benefit all
drivers that is using remap_pfn_range() in its mmap() handler on better
TLB hit rate and overall faster MMIO accesses similar to processor on
hugepages.
More driver support
-------------------
VFIO is so far the only consumer for the huge pfnmaps after this series
applied. Besides above remap_pfn_range() generic optimization, device
driver can also try to optimize its mmap() on a better VA alignment for
either PMD/PUD sizes. This may, iiuc, normally require userspace changes,
as the driver doesn't normally decide the VA to map a bar. But I don't
think I know all the drivers to know the full picture.
Credits all go to Alex on help testing the GPU/NIC use cases above.
[0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
[1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
[2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com
This patch (of 19):
This patch introduces the option to introduce special pte bit into
pmd/puds. Archs can start to define pmd_special / pud_special when
supported by selecting the new option. Per-arch support will be added
later.
Before that, create fallbacks for these helpers so that they are always
available.
Link: https://lkml.kernel.org/r/20240826204353.2228736-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20240826204353.2228736-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|