linux/kernel
Sumanth Korikkar eea5706cb0 resource: improve child resource handling in release_mem_region_adjustable()
When memory block is removed via try_remove_memory(), it eventually
reaches release_mem_region_adjustable().  The current implementation
assumes that when a busy memory resource is split into two, all child
resources remain in the lower address range.

This simplification causes problems when child resources actually belong
to the upper split.  For example:

* Initial memory layout:
lsmem
RANGE                                 SIZE   STATE REMOVABLE  BLOCK
0x0000000000000000-0x00000002ffffffff  12G  online       yes   0-95

* /proc/iomem
00000000-2dfefffff : System RAM
  158834000-1597b3fff : Kernel code
  1597b4000-159f50fff : Kernel data
  15a13c000-15a218fff : Kernel bss
2dff00000-2ffefffff : Crash kernel
2fff00000-2ffffffff : System RAM

* After offlining and removing range
  0x150000000-0x157ffffff
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED
(output according to upcoming lsmem changes with the configured column:
s390)
RANGE                                  SIZE   STATE  BLOCK  CONFIGURED
0x0000000000000000-0x000000014fffffff  5.3G  online   0-41  yes
0x0000000150000000-0x0000000157ffffff  128M offline     42  no
0x0000000158000000-0x00000002ffffffff  6.6G  online  43-95  yes

The iomem resource gets split into two entries, but kernel code, kernel
data, and kernel bss remain attached to the lower resource [0–5376M]
instead of the correct upper resource [5504M–12288M].

As a result, WARN_ON() triggers in release_mem_region_adjustable()
("Usecase: split into two entries - we need a new resource")
------------[ cut here ]------------
WARNING: CPU: 5 PID: 858 at kernel/resource.c:1486
release_mem_region_adjustable+0x210/0x280
Modules linked in:
CPU: 5 UID: 0 PID: 858 Comm: chmem Not tainted 6.17.0-rc2-11707-g2c36aaf3ba4e
Hardware name: IBM 3906 M04 704 (z/VM 7.3.0)
Krnl PSW : 0704d00180000000 0000024ec0dae0e4
           (release_mem_region_adjustable+0x214/0x280)
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
Krnl GPRS: 0000000000000000 00000002ffffafc0 fffffffffffffff0 0000000000000000
           000000014fffffff 0000024ec2257608 0000000000000000 0000024ec2301758
           0000024ec22680d0 00000000902c9140 0000000150000000 00000002ffffafc0
           000003ffa61d8d18 0000024ec21fb478 0000024ec0dae014 000001cec194fbb0
Krnl Code: 0000024ec0dae0d8: af000000            mc      0,0
           0000024ec0dae0dc: a7f4ffc1            brc     15,0000024ec0dae05e
          #0000024ec0dae0e0: af000000            mc      0,0
          >0000024ec0dae0e4: a5defffd            llilh   %r13,65533
           0000024ec0dae0e8: c04000c6064c        larl    %r4,0000024ec266ed80
           0000024ec0dae0ee: eb1d400000f8        laa     %r1,%r13,0(%r4)
           0000024ec0dae0f4: 07e0                bcr     14,%r0
           0000024ec0dae0f6: a7f4ffc0            brc     15,0000024ec0dae076

 [<0000024ec0dae0e4>] release_mem_region_adjustable+0x214/0x280
([<0000024ec0dadf3c>] release_mem_region_adjustable+0x6c/0x280)
 [<0000024ec10a2130>] try_remove_memory+0x100/0x140
 [<0000024ec10a4052>] __remove_memory+0x22/0x40
 [<0000024ec18890f6>] config_mblock_store+0x326/0x3e0
 [<0000024ec11f7056>] kernfs_fop_write_iter+0x136/0x210
 [<0000024ec1121e86>] vfs_write+0x236/0x3c0
 [<0000024ec11221b8>] ksys_write+0x78/0x110
 [<0000024ec1b6bfbe>] __do_syscall+0x12e/0x350
 [<0000024ec1b782ce>] system_call+0x6e/0x90
Last Breaking-Event-Address:
 [<0000024ec0dae014>] release_mem_region_adjustable+0x144/0x280
---[ end trace 0000000000000000 ]---

Also, resource adjustment doesn't happen and stale resources still cover
[0-12288M].  Later, memory re-add fails in register_memory_resource() with
-EBUSY.

i.e: /proc/iomem is still:
00000000-2dfefffff : System RAM
  158834000-1597b3fff : Kernel code
  1597b4000-159f50fff : Kernel data
  15a13c000-15a218fff : Kernel bss
2dff00000-2ffefffff : Crash kernel
2fff00000-2ffffffff : System RAM

Enhance release_mem_region_adjustable() to reassign child resources to the
correct parent after a split.  Children are now assigned based on their
actual range: If they fall within the lower split, keep them in the lower
parent.  If they fall within the upper split, move them to the upper
parent.

Kernel code/data/bss regions are not offlined, so they will always reside
entirely within one parent and never span across both.

Output after the enhancement:
* Initial state /proc/iomem (before removal of memory block):
00000000-2dfefffff : System RAM
  1f94f8000-1fa477fff : Kernel code
  1fa478000-1fac14fff : Kernel data
  1fae00000-1faedcfff : Kernel bss
2dff00000-2ffefffff : Crash kernel
2fff00000-2ffffffff : System RAM

* Offline and remove 0x1e8000000-0x1efffffff memory range
* /proc/iomem
00000000-1e7ffffff : System RAM
1f0000000-2dfefffff : System RAM
  1f94f8000-1fa477fff : Kernel code
  1fa478000-1fac14fff : Kernel data
  1fae00000-1faedcfff : Kernel bss
2dff00000-2ffefffff : Crash kernel
2fff00000-2ffffffff : System RAM

Link: https://lkml.kernel.org/r/20250912123021.3219980-1-sumanthk@linux.ibm.com
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21 14:22:34 -07:00
..
bpf bpf: Fix memory leak of bpf_scc_info objects 2025-08-02 09:04:57 -07:00
cgroup cgroup: avoid null de-ref in css_rstat_exit() 2025-08-09 08:46:32 -10:00
configs configs/hardening: Enable CONFIG_INIT_ON_FREE_DEFAULT_ON 2025-07-21 21:41:57 -07:00
debug TTY/Serial driver updates for 6.15-rc1 2025-04-02 18:17:33 -07:00
dma dma-remap: drop nth_page() in dma_common_contiguous_remap() 2025-09-21 14:22:06 -07:00
entry ARM: 2025-07-30 17:14:01 -07:00
events mm: convert uprobes to mm_flags_*() accessors 2025-09-13 16:54:57 -07:00
futex futex: Use user_write_access_begin/_end() in futex_put_value() 2025-08-11 17:53:21 +02:00
gcov kbuild: require gcc-8 and binutils-2.30 2025-04-30 21:53:35 +02:00
irq genirq/test: Resolve irq lock inversion warnings 2025-08-06 10:29:48 +02:00
kcsan kcsan: test: Initialize dummy variable 2025-07-23 08:51:32 +02:00
livepatch sched,livepatch: Untangle cond_resched() and live-patching 2025-05-14 13:16:24 +02:00
locking - Make sure sanity checks down in the mutex lock path happen on the correct 2025-08-17 05:57:47 -07:00
module Significant patch series in this pull request: 2025-08-05 16:02:07 +03:00
power drm for 6.17-rc1 2025-07-30 19:26:49 -07:00
printk printk changes for 6.17 2025-08-04 10:54:36 -07:00
rcu mm: replace (20 - PAGE_SHIFT) with common macros for pages<->MB conversion 2025-09-13 16:54:42 -07:00
sched mm: replace (20 - PAGE_SHIFT) with common macros for pages<->MB conversion 2025-09-13 16:54:42 -07:00
time bitmap-for-6.17 2025-07-31 16:52:32 -07:00
trace tracing fixes for v6.17-rc2: 2025-08-23 10:11:34 -04:00
unwind unwind: Finish up unwind when a task exits 2025-07-31 10:20:11 -04:00
.gitignore kheaders: rebuild kheaders_data.tar.xz when a file is modified within a minute 2025-06-24 20:30:37 +09:00
acct.c acct: block access to kernel internal filesystems 2025-02-12 12:24:16 +01:00
async.c
audit.c audit: record AUDIT_ANOM_* events regardless of presence of rules 2025-04-11 14:14:41 -04:00
audit.h audit,module: restore audit logging in load failure case 2025-06-16 17:00:06 -04:00
audit_fsnotify.c
audit_tree.c replace collect_mounts()/drop_collected_mounts() with a safer variant 2025-06-23 14:01:49 -04:00
audit_watch.c fs: add kern_path_locked_negative() 2025-04-15 11:32:34 +02:00
auditfilter.c audit: fix suffixed '/' filename matching 2024-12-05 19:22:38 -05:00
auditsc.c audit,module: restore audit logging in load failure case 2025-06-16 17:00:06 -04:00
backtracetest.c backtracetest: add MODULE_DESCRIPTION() 2024-06-24 22:24:55 -07:00
bounds.c bounds: Use the right number of bits for power-of-two CONFIG_NR_CPUS 2024-04-29 08:29:29 -07:00
capability.c capability: Remove unused has_capability 2025-03-07 22:03:09 -06:00
cfi.c cfi: Move BPF CFI types and helpers to generic code 2025-07-31 18:23:53 -07:00
compat.c
configs.c
context_tracking.c context_tracking: Make RCU watch ct_kernel_exit_state() warning 2025-03-04 18:44:29 -08:00
cpu.c cpu: Remove obsolete comment from takedown_cpu() 2025-08-06 22:48:12 +02:00
cpu_pm.c
crash_core.c kdump: wait for DMA to finish when using CMA 2025-07-19 19:08:23 -07:00
crash_dump_dm_crypt.c crash_dump: retrieve dm crypt keys in kdump kernel 2025-05-21 10:48:21 -07:00
crash_reserve.c kdump: implement reserve_crashkernel_cma 2025-07-19 19:08:23 -07:00
cred.c cred: remove old {override,revert}_creds() helpers 2024-12-02 11:25:09 +01:00
delayacct.c delayacct: remove redundant code and adjust indentation 2025-05-27 19:40:33 -07:00
dma.c
elfcorehdr.c crash: remove dependency of FA_DUMP on CRASH_DUMP 2024-02-23 17:48:22 -08:00
exec_domain.c
exit.c task_stack.h: clean-up stack_not_used() implementation 2025-09-21 14:22:00 -07:00
exit.h
extable.c
fail_function.c
fork.c fork: check charging success before zeroing stack 2025-09-21 14:22:00 -07:00
freezer.c sched,freezer: Remove unnecessary warning in __thaw_task 2025-07-17 07:56:50 -10:00
gen_kheaders.sh kheaders: make it possible to override TAR 2025-08-06 10:23:36 +09:00
groups.c
hung_task.c hung_task: extend hung task blocker tracking to rwsems 2025-07-19 19:08:26 -07:00
iomem.c mm/memremap: Pass down MEMREMAP_* flags to arch_memremap_wb() 2025-02-21 15:05:38 +01:00
irq_work.c kasan: make kasan_record_aux_stack_noalloc() the default behaviour 2025-01-13 22:40:36 -08:00
jump_label.c jump_label: Use RCU in all users of __module_text_address(). 2025-03-10 11:54:46 +01:00
kallsyms.c bpf: Clean up individual BTF_ID code 2025-07-16 18:34:42 -07:00
kallsyms_internal.h kallsyms: get rid of code for absolute kallsyms 2024-07-20 16:33:21 +09:00
kallsyms_selftest.c kallsyms: Use kthread_run_on_cpu() 2025-01-02 22:12:12 +01:00
kallsyms_selftest.h
kcmp.c kcmp: improve performance adding an unlikely hint to task comparisons 2025-02-21 10:25:33 +01:00
Kconfig.freezer
Kconfig.hz kernel: Fix "select" wording on HZ_250 description 2025-02-21 09:20:30 +01:00
Kconfig.kexec kho: mm: don't allow deferred struct page with KHO 2025-08-19 16:35:53 -07:00
Kconfig.locks
Kconfig.preempt sched: No PREEMPT_RT=y for all{yes,mod}config 2024-11-07 15:25:05 +01:00
kcov.c kcov: fix typo in comment of kcov_fault_in_area 2025-07-09 22:57:52 -07:00
kexec.c kexec: enable CMA based contiguous allocation 2025-08-02 12:01:38 -07:00
kexec_core.c Significant patch series in this pull request: 2025-08-03 16:23:09 -07:00
kexec_elf.c kexec: initialize ELF lowest address to ULONG_MAX 2025-03-16 22:30:47 -07:00
kexec_file.c Significant patch series in this pull request: 2025-08-03 16:23:09 -07:00
kexec_handover.c kho: make sure kho_scratch argument is fully consumed 2025-09-13 16:55:18 -07:00
kexec_internal.h kexec: enable CMA based contiguous allocation 2025-08-02 12:01:38 -07:00
kheaders.c kheaders: Simplify attribute through __BIN_ATTR_SIMPLE_RO() 2024-12-24 09:46:49 +01:00
kprobes.c kprobes: Add missing kerneldoc for __get_insn_slot 2025-07-15 18:45:34 +09:00
kstack_erase.c stackleak: Rename stackleak_track_stack to __sanitizer_cov_stack_depth 2025-07-21 21:40:39 -07:00
ksyms_common.c
ksysfs.c kernel/ksysfs.c: simplify bin_attribute definition 2025-01-07 16:59:15 +01:00
kthread.c ipvs: Fix estimator kthreads preferred affinity 2025-08-13 08:34:33 +02:00
latencytop.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
Makefile Kbuild updates for v6.17 2025-08-06 07:32:52 +03:00
module_signature.c
notifier.c reboot: move reboot_notifier_list to kernel/reboot.c 2024-11-05 17:12:31 -08:00
nsproxy.c kernel/nsproxy: remove unnecessary guards 2025-05-09 13:13:54 +02:00
padata.c padata: use cpumask_nth() 2025-06-13 17:26:17 +08:00
panic.c Significant patch series in this pull request: 2025-08-03 16:23:09 -07:00
params.c params: Replace deprecated strcpy() with strscpy() and memcpy() 2025-08-16 21:47:25 +02:00
pid.c Summary 2025-07-29 21:43:08 -07:00
pid_namespace.c pid: Do not set pid_max in new pid namespaces 2025-03-06 10:18:36 +01:00
pid_sysctl.h treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
profile.c profiling: remove profile=sleep support 2024-08-04 13:36:28 -07:00
ptrace.c ptrace: introduce PTRACE_SET_SYSCALL_INFO request 2025-05-11 17:48:15 -07:00
range.c
reboot.c - The 7 patch series "powerpc/crash: use generic crashkernel 2025-04-01 10:06:52 -07:00
regset.c regset: use kvzalloc() for regset_get_alloc() 2024-04-25 21:07:03 -07:00
relay.c relayfs: support a counter tracking if data is too big to write 2025-07-09 22:57:52 -07:00
resource.c resource: improve child resource handling in release_mem_region_adjustable() 2025-09-21 14:22:34 -07:00
resource_kunit.c resource, kunit: fix user-after-free in resource_test_region_intersects() 2024-10-09 12:47:19 -07:00
rseq.c rseq: Fix segfault on registration when rseq_cs is non-zero 2025-03-06 22:26:49 +01:00
scftorture.c scftorture: Handle NULL argument passed to scf_add_to_free_list(). 2024-11-14 16:09:51 -08:00
scs.c
seccomp.c seccomp: avoid the lock trip seccomp_filter_release in common case 2025-02-24 11:17:10 -08:00
signal.c signal: Fix memory leak for PIDFD_SELF* sentinels 2025-08-19 13:51:28 +02:00
smp.c smp: Fix spelling in on_each_cpu_cond_mask()'s doc-comment 2025-08-02 14:24:50 +02:00
smpboot.c sched/smp: Use the SMP version of idle_thread_set_boot_cpu() 2025-06-13 08:47:20 +02:00
smpboot.h
softirq.c lockdep: Fix wait context check on softirq for PREEMPT_RT 2025-03-25 10:46:44 +01:00
stacktrace.c
static_call.c
static_call_inline.c Modules changes for 6.15-rc1 2025-03-30 15:44:36 -07:00
stop_machine.c sched/core: Fix migrate_swap() vs. hotplug 2025-07-01 15:02:03 +02:00
sys.c prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE 2025-09-13 16:55:05 -07:00
sys_ni.c Probes updates for v6.11: 2024-07-18 12:19:20 -07:00
sysctl-test.c sysctl: move u8 register test to lib/test_sysctl.c 2025-04-14 14:13:41 +02:00
sysctl.c sysctl: rename kern_table -> sysctl_subsys_table 2025-07-23 11:56:02 +02:00
task_work.c kasan: make kasan_record_aux_stack_noalloc() the default behaviour 2025-01-13 22:40:36 -08:00
taskstats.c fdget(), more trivial conversions 2024-11-03 01:28:06 -05:00
torture.c torture: Add get_torture_init_jiffies() for test-start time 2025-02-05 07:14:24 -08:00
tracepoint.c tracepoint: Print the function symbol when tracepoint_debug is set 2025-03-21 15:30:10 -04:00
tsacct.c tsacct: replace strncpy() with strscpy() 2024-07-12 16:39:53 -07:00
ucount.c ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below() 2025-08-02 12:01:38 -07:00
uid16.c
uid16.h
umh.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
up.c
user-return-notifier.c
user.c uidgid: make sure we fit into one cacheline 2024-09-12 12:16:09 +02:00
user_namespace.c uidgid: add map_id_range_up() 2025-02-12 12:12:27 +01:00
utsname.c
utsname_sysctl.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
vhost_task.c vhost: Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...)) 2025-08-01 09:11:08 -04:00
vmcore_info.c crash: export PAGE_UNACCEPTED_MAPCOUNT_VALUE to vmcoreinfo 2025-05-11 17:54:04 -07:00
watch_queue.c vfs-6.15-rc1.pipe 2025-03-24 09:52:37 -07:00
watchdog.c kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count 2025-05-21 10:48:22 -07:00
watchdog_buddy.c watchdog: fix opencoded cpumask_next_wrap() in watchdog_next_cpu() 2025-07-31 11:28:03 -04:00
watchdog_perf.c watchdog/perf: Provide function for adjusting the event period 2025-07-04 13:17:30 +01:00
workqueue.c workqueue: Changes for v6.17 2025-07-31 15:40:22 -07:00
workqueue_internal.h