linux/Documentation
Chris Li 87cc51571a docs/mm: add document for swap table
Patch series "mm, swap: introduce swap table as swap cache (phase I)", v4.

This is the first phase of a bigger series implementing the basic
infrastructure for the Swap Table idea proposed at the LSF/MM/BPF topic
"Integrate swap cache, swap maps with swap allocator" [1].  To give credit
where it is due, this is based on Chris Li's idea and a prototype that uses
cluster-sized atomic arrays to implement the swap cache.

Phase I contains 15 patches.  It introduces the swap table infrastructure
and uses it as the swap cache backend.  Doing so yields a performance gain
of roughly 5-20% in throughput, RPS or build time across benchmark and
workload tests.  The speedup comes from less contention on swap cache
access and a shallower swap cache lookup path.  The cluster size is much
finer-grained than the 64M address space split, which is removed in this
phase I.  It also unifies and cleans up the swap code base.
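
The granularity difference can be illustrated with a little index
arithmetic.  The sketch below is only a model, not kernel code: it assumes
4 KiB pages, so that one 64M address space covers 16384 slots, and a
hypothetical cluster size of 512 slots (the real value depends on the
kernel configuration).  The function names are made up for illustration.

```python
# Illustrative model only; the constants below are assumptions, not values
# taken from this series.
PAGE_SIZE = 4096                            # assumed 4 KiB pages
SLOTS_PER_SPACE = (64 << 20) // PAGE_SIZE   # slots covered by one 64M split
SLOTS_PER_CLUSTER = 512                     # assumed cluster size in slots


def address_space_index(offset: int) -> int:
    """Index of the 64M address space that a swap offset falls into."""
    return offset // SLOTS_PER_SPACE


def cluster_index(offset: int) -> int:
    """Index of the swap cluster that a swap offset falls into."""
    return offset // SLOTS_PER_CLUSTER


def clusters_per_space() -> int:
    """How many clusters fit into one former 64M address space."""
    return SLOTS_PER_SPACE // SLOTS_PER_CLUSTER
```

Under these assumptions, swap offsets 0 and 1000 share one 64M address
space but fall into different clusters, so accesses to them would no longer
go through the same shared structure.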

Each swap cluster will dynamically allocate the swap table, which is an
atomic array that covers every swap slot in the cluster.  It replaces the
swap cache backed by XArray.  In phase I, the statically allocated swap_map
still co-exists with the swap table.  The memory usage is about the same
as the original on average.  A few exceptional test cases show about 1%
higher memory usage.  In the following phases of the series, swap_map
will merge into the swap table without additional memory allocation.  This
will result in a net memory reduction compared to the original swap cache.
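
As a rough illustration of the per-cluster layout, here is a toy model in
which each cluster gets its own table on first use and a swap offset is
split into a cluster index and a slot index.  It is a sketch under stated
assumptions, not the kernel implementation: the class and method names are
hypothetical, the cluster size (512 slots) and entry size (8 bytes) are
assumed, and plain Python lists stand in for the atomic array without
providing any atomicity.

```python
# Toy model of a per-cluster swap table; names and sizes are hypothetical.
from typing import Dict, List, Optional

SLOTS_PER_CLUSTER = 512   # assumed cluster size in slots
ENTRY_BYTES = 8           # assumed pointer-sized entry on a 64-bit machine


class SwapTableModel:
    """Maps a swap offset to a cache entry through per-cluster tables."""

    def __init__(self) -> None:
        # cluster index -> table; a table exists only once the cluster is used
        self.tables: Dict[int, List[Optional[object]]] = {}

    def store(self, offset: int, entry: object) -> None:
        ci, slot = divmod(offset, SLOTS_PER_CLUSTER)
        table = self.tables.get(ci)
        if table is None:
            # Allocate the table on first use of this cluster.
            table = [None] * SLOTS_PER_CLUSTER
            self.tables[ci] = table
        table[slot] = entry

    def lookup(self, offset: int) -> Optional[object]:
        ci, slot = divmod(offset, SLOTS_PER_CLUSTER)
        table = self.tables.get(ci)
        return None if table is None else table[slot]

    def table_bytes(self) -> int:
        """Memory held by allocated tables under the assumed entry size."""
        return len(self.tables) * SLOTS_PER_CLUSTER * ENTRY_BYTES
```

In this model a lookup is a single division plus an array index, and an
unused cluster costs no table memory at all; with the assumed sizes, each
allocated table takes 4096 bytes.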

Testing has shown that phase I brings a significant performance improvement
in many practical workloads, on machines ranging from an 8c/1G ARM machine
to 48c96t/128G x86_64 servers.

The full picture with a summary can be found at [2].  An older, bigger
series of 28 patches was posted at [3].

vm-scalability test:
====================
Test with:
usemem --init-time -O -y -x -n 31 1G (4G memcg, PMEM as swap)
                           Before:         After:
System time:               219.12s         158.16s        (-27.82%)
Sum Throughput:            4767.13 MB/s    6128.59 MB/s   (+28.55%)
Single process Throughput: 150.21 MB/s     196.52 MB/s    (+30.83%)
Free latency:              175047.58 us    131411.87 us   (-24.92%)

usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
PMEM as swap)
                           Before:         After:
System time:               356.16s         284.68s      (-20.06%)
Sum Throughput:            4648.35 MB/s    5453.52 MB/s (+17.32%)
Single process Throughput: 141.63 MB/s     168.35 MB/s  (+18.86%)
Free latency:              499907.71 us    484977.03 us (-2.99%)

This shows an improvement of more than 20% in most readings.
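
The percentages in these tables are relative changes from the "Before"
value to the "After" value.  A minimal sketch of that arithmetic follows;
the helper name is made up for illustration, and the input values are
taken from the first usemem table above.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# System time of the first usemem run: 219.12s before, 158.16s after.
system_time_change = percent_change(219.12, 158.16)

# Single process throughput of the same run: 150.21 MB/s -> 196.52 MB/s.
throughput_change = percent_change(150.21, 196.52)
```

A negative result means the "After" value is lower, which is an
improvement for times and latencies and a regression for throughput.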

Build kernel test:
==================
The following result matrix is from building the kernel with defconfig on
tmpfs with ZSWAP / ZRAM, using different memory pressure and setups.
Sys and real time are measured in seconds; lower is better (user time is
almost identical, as expected):

 -j<NR> / Mem  | Sys before / after  | Real before / after
Using 16G ZRAM with memcg limit:
     6  / 192M | 9686 / 9472  -2.21% | 2130  / 2096   -1.59%
     12 / 256M | 6610 / 6451  -2.41% |  827  /  812   -1.81%
     24 / 384M | 5938 / 5701  -3.37% |  414  /  405   -2.17%
     48 / 768M | 4696 / 4409  -6.11% |  188  /  182   -3.19%
With 64k folio:
     24 / 512M | 4222 / 4162  -1.42% |  326  /  321   -1.53%
     48 / 1G   | 3688 / 3622  -1.79% |  151  /  149   -1.32%
With ZSWAP with 3G memcg (using higher limit due to kmem account):
     48 / 3G   |  603 /  581  -3.65% |  81   /   80   -1.23%

Testing under extremely high global memory and scheduling pressure: using
ZSWAP with 32G NVMEs in a 48c VM that has 4G memory and no memcg limit
(system components already take up about 1.5G), using make -j48 to build
defconfig:

Before:  sys time: 2069.53s            real time: 135.76s
After:   sys time: 2021.13s (-2.34%)   real time: 134.23s (-1.12%)

On another 48c 4G memory VM, using 16G ZRAM as swap, testing make
-j48 with the same config:

Before:  sys time: 1756.96s            real time: 111.01s
After:   sys time: 1715.90s (-2.34%)   real time: 109.51s (-1.35%)

All cases are somewhat faster, and there is no regression even under
extremely heavy global memory pressure.

Redis / Valkey bench:
=====================
The test machine is an ARM64 VM with 1536M memory and 12 cores.  Redis is
set to use 2500M memory, and ZRAM swap size is set to 5G:

Testing with:
redis-benchmark -r 2000000 -n 2000000 -d 1024 -c 12 -P 32 -t get

                no BGSAVE                with BGSAVE
Before:         487576.06 RPS            280016.02 RPS
After:          487541.76 RPS (-0.01%)   300155.32 RPS (+7.19%)

Testing with:
redis-benchmark -r 2500000 -n 2500000 -d 1024 -c 12 -P 32 -t get
                no BGSAVE                with BGSAVE
Before:         466789.59 RPS            281213.92 RPS
After:          466402.89 RPS (-0.08%)   298411.84 RPS (+6.12%)

With BGSAVE enabled, most Redis memory will have a swap count > 1 so swap
cache is heavily in use.  We can see about a 6% performance gain.  No
BGSAVE is very slightly slower (<0.1%) due to the higher memory pressure
from the co-existence of swap_map and swap table.  This will be optimized
into a net gain, and up to a 20% gain in the BGSAVE case, in the following
phases.

HDD swap is also ~40% faster with usemem because we removed an old
contention workaround.


This patch (of 15):

Swap table is the new swap cache.

[chrisl@kernel.org: move swap table document, redo swap table size sentence]
  Link: https://lkml.kernel.org/r/CACePvbXjaUyzB_9RSSSgR6BNvz+L9anvn0vcNf_J0jD7-4Yy6Q@mail.gmail.com
Link: https://lkml.kernel.org/r/20250916160100.31545-1-ryncsn@gmail.com
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]
Link: https://lkml.kernel.org/r/20250916160100.31545-2-ryncsn@gmail.com
Link: https://lore.kernel.org/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com [1]
Link: https://github.com/ryncsn/linux/tree/kasong/devel/swap-table [2]
Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:22 -07:00
ABI Docs/ABI/damon: document addr_unit file 2025-09-13 16:55:24 -07:00
accel
accounting delaytop: enhance error logging and add PSI feature description 2025-08-02 12:01:41 -07:00
admin-guide mm: shmem: fix the strategy for the tmpfs 'huge=' options 2025-09-21 14:22:19 -07:00
arch It has been a relatively busy cycle for docs, especially the build system: 2025-07-31 08:36:51 -07:00
block Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists 2025-06-13 09:25:42 -06:00
bpf Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc3 2025-06-26 09:49:39 -07:00
cdrom cdrom: Call cdrom_mrw_exit from cdrom_release function 2025-07-22 19:10:17 -06:00
core-api mm: remove unused zpool layer 2025-09-21 14:21:59 -07:00
cpu-freq
crypto crypto: engine - remove {prepare,unprepare}_crypt_hardware callbacks 2025-07-18 20:52:00 +10:00
dev-tools kasan/hw-tags: introduce kasan.write_only option 2025-09-21 14:22:10 -07:00
devicetree drm fixes for 6.16-rc4 2025-08-28 19:56:32 -07:00
doc-guide docs: sphinx: add a file with the requirements for lowest version 2025-06-25 12:22:48 -06:00
driver-api mm: remove unused zpool layer 2025-09-21 14:21:59 -07:00
edac
fault-injection docs: fault-injection: drop reference to md-faulty 2025-07-24 08:31:46 -06:00
fb
features
filesystems prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE 2025-09-13 16:55:05 -07:00
firmware-guide Merge branch 'acpi-misc' 2025-07-22 17:12:57 +02:00
firmware_class
fpga
gpu drm for 6.17-rc1 2025-07-30 19:26:49 -07:00
hid HID: intel-thc-hid: Separate max input size control conditional list 2025-06-20 08:55:52 +02:00
hwmon hwmon updates for v6.17 2025-07-31 13:34:06 -07:00
i2c
iio docs: iio: add ADXL313 accelerometer 2025-07-14 19:20:50 +01:00
images
infiniband
input Input: Add and document BTN_GRIP* 2025-07-27 01:41:20 -07:00
isdn
kbuild docs: kconfig: add alldefconfig to the all*configs 2025-07-26 15:31:29 +09:00
kernel-hacking
leds
litmus-tests
livepatch
locking
maintainer
mhi
misc-devices
mm docs/mm: add document for swap table 2025-09-21 14:22:22 -07:00
netlabel
netlink netlink: specs: ethtool: fix module EEPROM input/output arguments 2025-07-31 10:57:02 -07:00
networking mptcp: disable add_addr retransmission when timeout is 0 2025-08-18 17:39:58 -07:00
nvdimm
nvme docs: nvme: fix grammar in nvme-pci-endpoint-target.rst 2025-07-17 13:38:07 +02:00
PCI selftests: pci_endpoint: Add doorbell test case 2025-07-24 16:51:47 -05:00
pcmcia
peci
power Merge branches 'pm-runtime' and 'pm-powercap' 2025-07-22 18:01:15 +02:00
process Documentation: smooth the text flow in the security bug reporting process 2025-08-17 12:23:30 +02:00
RCU rcu: Document concurrent quiescent state reporting for offline CPUs 2025-07-22 17:10:50 +05:30
rust
scheduler sched_ext: Changes for v6.17 2025-07-31 16:29:46 -07:00
scsi scsi: fc_transport: docs: Add documentation for FC Remote Ports 2025-06-09 21:49:26 -04:00
security hardening updates for v6.17-rc1 2025-07-28 17:16:12 -07:00
sound ALSA: docs: Add documents for recently changes in snd-usb-audio 2025-08-29 11:17:35 +02:00
sphinx sphinx: kernel_abi: fix performance regression with O=<dir> 2025-07-24 08:36:17 -06:00
sphinx-static docs: CSS: make cross-reference links more evident 2025-06-09 14:43:39 -06:00
spi
staging
sunrpc/xdr
target
tee
timers
tools tracing tools changes for 6.17: 2025-08-01 10:23:13 -07:00
trace tracing changes for 6.17 2025-08-01 10:29:36 -07:00
translations Summary of significant series in this pull request: 2025-07-31 14:57:54 -07:00
usb It has been a relatively busy cycle for docs, especially the build system: 2025-07-31 08:36:51 -07:00
userspace-api iommufd: Fix spelling errors in iommufd.rst 2025-08-18 11:15:06 -03:00
virt Documentation: KVM: Use unordered list for pre-init VGIC registers 2025-07-29 13:43:50 -04:00
w1
watchdog
wmi platform/x86: Add lenovo-wmi-* driver Documentation 2025-07-03 10:54:24 +03:00
.gitignore
atomic_bitops.txt
atomic_t.txt
Changes
CodingStyle
conf.py docs: conf.py: several coding style fixes 2025-06-25 12:22:48 -06:00
docutils.conf
index.rst
Kconfig
Makefile docs: Makefile: disable check rules on make cleandocs 2025-06-25 12:22:47 -06:00
memory-barriers.txt docs/memory-barriers.txt: Add wait_event_cmd() and wait_event_exclusive_cmd() 2025-07-09 10:08:14 -07:00
SubmittingPatches
subsystem-apis.rst