The get_sample() function in the hwlat tracer assumes the caller holds
hwlat_data.lock, but this is not actually happening. The result is
unprotected data access to hwlat_data, and in per-cpu mode can result in
false sharing which may show up as false positive latency events.
The specific case of false sharing observed was primarily between
hwlat_data.sample_width and hwlat_data.count. These are separated by
just 8B and are therefore likely to share a cache line. When one thread
modifies count, the cache line is in a modified state so when other
threads read sample_width in the main latency detection loop, they fetch
the modified cache line. On some systems, the fetch itself may be slow
enough to count as a latency event, which could set up a self
reinforcing cycle of latency events as each event increments count which
then causes more latency events, continuing the cycle.
The other result of the unprotected data access is that hwlat_data.count
can end up with duplicate or missed values, which was observed on some
systems in testing.
Convert hwlat_data.count to atomic64_t so it can be safely modified
without locking, and prevent false sharing by pulling sample_width into
a local variable.
One system this was tested on was a dual socket server with 32 CPUs on
each numa node. With settings of 1us threshold, 1000us width, and
2000us window, this change reduced the number of latency events from
500 per second down to approximately 1 event per minute. Some machines
tested did not exhibit measurable latency from the false sharing.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260210074810.6328-1-clord@mykolab.com
Signed-off-by: Colin Lord <clord@mykolab.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
affinity of unbound kthreads (node or custom cpumask) against
housekeeping (CPU isolation) constraints and CPU hotplug events.
One crucial missing piece is the handling of cpuset: when an isolated
partition is created, deleted, or its CPUs updated, all the unbound
kthreads in the top cpuset become indifferently affine to _all_ the
non-isolated CPUs, possibly breaking their preferred affinity along
the way.
Solve this with performing the kthreads affinity update from cpuset to
the kthreads consolidated relevant code instead so that preferred
affinities are honoured and applied against the updated cpuset isolated
partitions.
The dispatch of the new isolated cpumasks to timers, workqueues and
kthreads is performed by housekeeping, as per the nice Tejun's
suggestion.
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from boot defined domain isolation (through isolcpus=) and cpuset
isolated partitions. Housekeeping cpumasks are now modifiable with a
specific RCU based synchronization. A big step toward making nohz_full=
also mutable through cpuset in the future.
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmmE0mYbFIAAAAAABAAO
bWFudTIsMi41KzEuMTEsMiwyAAoJEIUkVEdQjox36eMP/0Ls/ArfYVi/MNAXWlpy
rAt6m9Y/X9GBcDM/VI9BXq1ZX4qEr2XjJ8UUb8cM08uHEAt0ErlmpRxREwJFrKbI
H4jzg5EwO0D0c6MnvgQJEAwkHxQVIjsxG9DovRIjxyW4ycx3aSsRg/f2VKyWoLvY
7ZT7CbLFE+I/MQh2ZgUu/9pnCDQVR2anss2WYIej5mmgFL5pyEv3YvYgKYVyK08z
sXyNxpP976g2d9ECJ9OtFJV9we6mlqxlG0MVCiv/Uxh7DBjxWWPsLvlmLAXggQ03
+0GW+nnutDaKz83pgS7Z4zum/+Oa+I1dTLIN27pARUNcMCYip7njM2KNpJwPdov3
+fAIODH2JVX1xewT+U1cCq6gdI55ejbwdQYGFV075dKBUxKQeIyrghvfC3Ga6aKQ
Gw3y68jdrXOw6iyfHR5k/0Mnu2/FDKUW2fZxLKm55PvNZP5jQFmSlz9wyiwwyb3m
UUSgThj6Ozodxks8hDX41rGVezCcm1ni+qNSiNIs8HPaaZQrwbnvKHQFBBJHQzJP
rJ39VWBx3Hq/ly71BOR6pCzoZsfS1f85YKhJ4vsfjLO6BfhI16nBat89eROSRKcz
XptyWqW0PgAD0teDuMCTPNuUym/viBHALXHKuSO12CIizacvftiGcmaQNPlLiiFZ
/Dr2+aOhwYw3UD6djn3u94M9
=nWGh
-----END PGP SIGNATURE-----
Merge tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthread updates from Frederic Weisbecker:
"The kthread code provides an infrastructure which manages the
preferred affinity of unbound kthreads (node or custom cpumask)
against housekeeping (CPU isolation) constraints and CPU hotplug
events.
One crucial missing piece is the handling of cpuset: when an isolated
partition is created, deleted, or its CPUs updated, all the unbound
kthreads in the top cpuset become indifferently affine to _all_ the
non-isolated CPUs, possibly breaking their preferred affinity along
the way.
Solve this with performing the kthreads affinity update from cpuset to
the kthreads consolidated relevant code instead so that preferred
affinities are honoured and applied against the updated cpuset
isolated partitions.
The dispatch of the new isolated cpumasks to timers, workqueues and
kthreads is performed by housekeeping, as per the nice Tejun's
suggestion.
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from boot defined domain isolation (through isolcpus=) and cpuset
isolated partitions. Housekeeping cpumasks are now modifiable with a
specific RCU based synchronization. A big step toward making
nohz_full= also mutable through cpuset in the future"
* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
doc: Add housekeeping documentation
kthread: Document kthread_affine_preferred()
kthread: Comment on the purpose and placement of kthread_affine_node() call
kthread: Honour kthreads preferred affinity after cpuset changes
sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
kthread: Include kthreadd to the managed affinity list
kthread: Include unbound kthreads in the managed affinity list
kthread: Refine naming of affinity related fields
PCI: Remove superfluous HK_TYPE_WQ check
sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
cpuset: Remove cpuset_cpu_is_isolated()
timers/migration: Remove superfluous cpuset isolation test
cpuset: Propagate cpuset isolation update to timers through housekeeping
cpuset: Propagate cpuset isolation update to workqueue through housekeeping
PCI: Flush PCI probe workqueue on cpuset isolated partition change
sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
sched/isolation: Flush memcg workqueues on cpuset isolated partition change
cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
...
- Remove the unused omap-cpufreq driver (Andreas Kemnade)
- Optimize error handling code in cpufreq_boost_trigger_state() and
make cpufreq_boost_trigger_state() return -EOPNOTSUPP if no policy
supports boost (Lifeng Zheng)
- Update cpufreq-dt-platdev list for tegra, qcom, TI (Aaron Kling,
Dhruva Gole, and Konrad Dybcio)
- Minor improvements to the cpufreq and cpumask rust implementation
(Alexandre Courbot, Alice Ryhl, Tamir Duberstein, and Yilin Chen)
- Add support for AM62L3 SoC to the ti-cpufreq driver (Dhruva Gole)
- Update arch_freq_scale in the CPPC cpufreq driver's frequency
invariance engine (FIE) in scheduler ticks if the related CPPC
registers are not in PCC (Jie Zhan)
- Assorted minor cleanups and improvements in ARM cpufreq drivers (Juan
Martinez, Felix Gu, Luca Weiss, and Sergey Shtylyov)
- Add generic helpers for sysfs show/store to cppc_cpufreq (Sumit
Gupta)
- Make the scaling_setspeed cpufreq sysfs attribute return the actual
requested frequency to avoid confusion (Pengjie Zhang)
- Simplify the idle CPU time granularity test in the ondemand cpufreq
governor (Frederic Weisbecker)
- Enable asym capacity in intel_pstate only when CPU SMT is not
possible (Yaxiong Tian)
- Update the description of rate_limit_us default value in cpufreq
documentation (Yaxiong Tian)
- Add a command line option to adjust the C-states table in the
intel_idle driver, remove the 'preferred_cstates' module parameter
from it, add C-states validation to it and clean it up (Artem
Bityutskiy)
- Make the menu cpuidle governor always check the time till the closest
timer event when the scheduler tick has been stopped to prevent it
from mistakenly selecting the deepest available idle state (Rafael
Wysocki)
- Update the teo cpuidle governor to avoid making suboptimal decisions
in certain corner cases and generally improve idle state selection
accuracy (Rafael Wysocki)
- Remove an unlikely() annotation on the early-return condition in
menu_select() that leads to branch misprediction 100% of the time
on systems with only 1 idle state enabled, like ARM64 servers (Breno
Leitao)
- Add Christian Loehle to MAINTAINERS as a cpuidle reviewer (Christian
Loehle)
- Stop flagging the PM runtime workqueue as freezable to avoid system
suspend and resume deadlocks in subsystems that assume asynchronous
runtime PM to work during system-wide PM transitions (Rafael Wysocki)
- Drop redundant NULL pointer checks before acomp_request_free() from
the hibernation code handling image saving (Rafael Wysocki)
- Update wakeup_sources_walk_start() to handle empty lists of wakeup
sources as appropriate (Samuel Wu)
- Make dev_pm_clear_wake_irq() check the power.wakeirq value under
power.lock to avoid race conditions (Gui-Dong Han)
- Avoid bit field races related to power.work_in_progress in the core
device suspend code (Xuewen Yan)
- Make several drivers discard pm_runtime_put() return value in
preparation for converting that function to a void one (Rafael
Wysocki)
- Add PL4 support for Ice Lake to the Intel RAPL power capping
driver (Daniel Tang)
- Replace sprintf() with sysfs_emit() in power capping sysfs show
functions (Sumeet Pawnikar)
- Make dev_pm_opp_get_level() return value match the documentation
after a previous update of the latter (Aleks Todorov)
- Use scoped for each OF child loop in the OPP code (Krzysztof
Kozlowski)
- Fix a bug in an example code snippet and correct typos in the energy
model management documentation (Patrick Little)
- Fix miscellaneous problems in cpupower (Kaushlendra Kumar):
* idle_monitor: Fix incorrect value logged after stop
* Fix inverted APERF capability check
* Use strcspn() to strip trailing newline
* Reset errno before strtoull()
* Show C0 in idle-info dump
- Improve cpupower installation procedure by making the systemd step
optional and allowing users to disable the installation of systemd's
unit file (João Marcos Costa)
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmmDr5ASHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1Q8oH/0KRqdidHzesIQl6gd5WSS/sWdxODRUt
R9dEGQQ6LXCY0z05RAq29HZQf618fYuRFX4PSrtyCvrcRJK7MJKuzK55MRq0MC3c
c/2pL1PdpHexjLXUP9pcoxrYjetsr7SnD6Y0M3JfOPg1E/bG8sp1DlnE8cdqrL0W
lrdB2cEGewT2SVkNhCIQ2n6bwfQwmLlfQl1vXTM8BA7xCjoslePUJlRphAFVAt/J
5fQxSOH0eSxK5PYQFUDM2D2J3uMAN0pFb6eIjwVYYqjABqV//BPl99Rv2W3ElJq7
K/SICRWlvzyINCgF15QAUtQHWdINxSb0GzovECVxODHOv0N4mKHdpNU=
=QlVe
-----END PGP SIGNATURE-----
Merge tag 'pm-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"By the number of commits, cpufreq is the leading party (again) and the
most visible change there is the removal of the omap-cpufreq driver
that has not been used for a long time (good riddance). There are also
quite a few changes in the cppc_cpufreq driver, mostly related to
fixing its frequency invariance engine in the case when the CPPC
registers used by it are not in PCC. In addition to that, support for
AM62L3 is added to the ti-cpufreq driver and the cpufreq-dt-platdev
list is updated for some platforms. The remaining cpufreq changes are
assorted fixes and cleanups.
Next up is cpuidle and the changes there are dominated by intel_idle
driver updates, mostly related to the new command line facility
allowing users to adjust the list of C-states used by the driver.
There are also a few updates of cpuidle governors, including two menu
governor fixes and some refinements of the teo governor, and a
MAINTAINERS update adding Christian Loehle as a cpuidle reviewer.
[Thanks for stepping up Christian!]
The most significant update related to system suspend and hibernation
is the one to stop freezing the PM runtime workqueue during system PM
transitions which allows some deadlocks to be avoided. There is also a
fix for possible concurrent bit field updates in the core device
suspend code and a few other minor fixes.
Apart from the above, several drivers are updated to discard the
return value of pm_runtime_put() which is going to be converted to a
void function as soon as everybody stops using its return value, PL4
support for Ice Lake is added to the Intel RAPL power capping driver,
and there are assorted cleanups, documentation fixes, and some
cpupower utility improvements.
Specifics:
- Remove the unused omap-cpufreq driver (Andreas Kemnade)
- Optimize error handling code in cpufreq_boost_trigger_state() and
make cpufreq_boost_trigger_state() return -EOPNOTSUPP if no policy
supports boost (Lifeng Zheng)
- Update cpufreq-dt-platdev list for tegra, qcom, TI (Aaron Kling,
Dhruva Gole, and Konrad Dybcio)
- Minor improvements to the cpufreq and cpumask rust implementation
(Alexandre Courbot, Alice Ryhl, Tamir Duberstein, and Yilin Chen)
- Add support for AM62L3 SoC to the ti-cpufreq driver (Dhruva Gole)
- Update arch_freq_scale in the CPPC cpufreq driver's frequency
invariance engine (FIE) in scheduler ticks if the related CPPC
registers are not in PCC (Jie Zhan)
- Assorted minor cleanups and improvements in ARM cpufreq drivers
(Juan Martinez, Felix Gu, Luca Weiss, and Sergey Shtylyov)
- Add generic helpers for sysfs show/store to cppc_cpufreq (Sumit
Gupta)
- Make the scaling_setspeed cpufreq sysfs attribute return the actual
requested frequency to avoid confusion (Pengjie Zhang)
- Simplify the idle CPU time granularity test in the ondemand cpufreq
governor (Frederic Weisbecker)
- Enable asym capacity in intel_pstate only when CPU SMT is not
possible (Yaxiong Tian)
- Update the description of rate_limit_us default value in cpufreq
documentation (Yaxiong Tian)
- Add a command line option to adjust the C-states table in the
intel_idle driver, remove the 'preferred_cstates' module parameter
from it, add C-states validation to it and clean it up (Artem
Bityutskiy)
- Make the menu cpuidle governor always check the time till the
closest timer event when the scheduler tick has been stopped to
prevent it from mistakenly selecting the deepest available idle
state (Rafael Wysocki)
- Update the teo cpuidle governor to avoid making suboptimal
decisions in certain corner cases and generally improve idle state
selection accuracy (Rafael Wysocki)
- Remove an unlikely() annotation on the early-return condition in
menu_select() that leads to branch misprediction 100% of the time
on systems with only 1 idle state enabled, like ARM64 servers
(Breno Leitao)
- Add Christian Loehle to MAINTAINERS as a cpuidle reviewer
(Christian Loehle)
- Stop flagging the PM runtime workqueue as freezable to avoid system
suspend and resume deadlocks in subsystems that assume asynchronous
runtime PM to work during system-wide PM transitions (Rafael
Wysocki)
- Drop redundant NULL pointer checks before acomp_request_free() from
the hibernation code handling image saving (Rafael Wysocki)
- Update wakeup_sources_walk_start() to handle empty lists of wakeup
sources as appropriate (Samuel Wu)
- Make dev_pm_clear_wake_irq() check the power.wakeirq value under
power.lock to avoid race conditions (Gui-Dong Han)
- Avoid bit field races related to power.work_in_progress in the core
device suspend code (Xuewen Yan)
- Make several drivers discard pm_runtime_put() return value in
preparation for converting that function to a void one (Rafael
Wysocki)
- Add PL4 support for Ice Lake to the Intel RAPL power capping driver
(Daniel Tang)
- Replace sprintf() with sysfs_emit() in power capping sysfs show
functions (Sumeet Pawnikar)
- Make dev_pm_opp_get_level() return value match the documentation
after a previous update of the latter (Aleks Todorov)
- Use scoped for each OF child loop in the OPP code (Krzysztof
Kozlowski)
- Fix a bug in an example code snippet and correct typos in the
energy model management documentation (Patrick Little)
- Fix miscellaneous problems in cpupower (Kaushlendra Kumar):
* idle_monitor: Fix incorrect value logged after stop
* Fix inverted APERF capability check
* Use strcspn() to strip trailing newline
* Reset errno before strtoull()
* Show C0 in idle-info dump
- Improve cpupower installation procedure by making the systemd step
optional and allowing users to disable the installation of
systemd's unit file (João Marcos Costa)"
* tag 'pm-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
PM: sleep: core: Avoid bit field races related to work_in_progress
PM: sleep: wakeirq: harden dev_pm_clear_wake_irq() against races
cpufreq: Documentation: Update description of rate_limit_us default value
cpufreq: intel_pstate: Enable asym capacity only when CPU SMT is not possible
PM: wakeup: Handle empty list in wakeup_sources_walk_start()
PM: EM: Documentation: Fix bug in example code snippet
Documentation: Fix typos in energy model documentation
cpuidle: governors: teo: Refine intercepts-based idle state lookup
cpuidle: governors: teo: Adjust the classification of wakeup events
cpufreq: ondemand: Simplify idle cputime granularity test
cpufreq: userspace: make scaling_setspeed return the actual requested frequency
PM: hibernate: Drop NULL pointer checks before acomp_request_free()
cpufreq: CPPC: Add generic helpers for sysfs show/store
cpufreq: scmi: Fix device_node reference leak in scmi_cpu_domain_id()
cpufreq: ti-cpufreq: add support for AM62L3 SoC
cpufreq: dt-platdev: Add ti,am62l3 to blocklist
cpufreq/amd-pstate: Add comment explaining nominal_perf usage for performance policy
cpufreq: scmi: correct SCMI explanation
cpufreq: dt-platdev: Block the driver from probing on more QC platforms
rust: cpumask: rename methods of Cpumask for clarity and consistency
...
- Update the ACPICA code in the kernel to upstream version 20251212
which includes the following changes:
* Add support for new ACPI table DTPR (Michal Camacho Romero)
* Release objects with acpi_ut_delete_object_desc() (Zilin Guan)
* Add UUIDs for Microsoft fan extensions and UUIDs associated with
TPM 2.0 devices (Armin Wolf)
* Fix NULL pointer dereference in acpi_ev_address_space_dispatch()
(Alexey Simakov)
* Add KEYP ACPI table definition (Dave Jiang)
* Add support for the Microsoft display mux _OSI string (Armin Wolf)
* Add definitions for the IOVT ACPI table (Xianglai Li)
* Abort AML bytecode execution on AML_FATAL_OP (Armin Wolf)
* Include all fields in subtable type1 for PPTT (Ben Horgan)
* Add GICv5 MADT structures and Arm IORT IWB node definitions (Jose
Marinho)
* Update Parameter Block structure for RAS2 and add a new flag in
Memory Affinity Structure for SRAT (Pawel Chmielewski)
* Add _VDM (Voltage Domain) object (Pawel Chmielewski)
- Add support for GICv5 ACPI probing on ARM which is based on the
GICv5 MADT structures and ARM IORT IWB node definitions recently
added to ACPICA (Lorenzo Pieralisi)
- Rework ACPI PM notification setup for PCI root buses and modify the
ACPI PM setup for devices to register wakeup source objects under
physical (that is, PCI, platform, etc.) devices instead of doing that
under their ACPI companions (Rafael Wysocki)
- Adjust debug messages regarding postponed ACPI PM printed during
system resume to be more accurate (Rafael Wysocki)
- Remove dead code from lps0_device_attach() (Gergo Koteles)
- Start to invoke Microsoft Function 9 (Turn On Display) of the Low-
Power S0 Idle (LPS0) _DSM in the suspend-to-idle resume flow on
systems with ACPI LPS0 support to address a functional issue on
Lenovo Yoga Slim 7i Aura (15ILL9), where system fans and keyboard
backlights fail to resume after suspend (Jakob Riemenschneider)
- Add sysfs attribute cid for exposing _CID lists under ACPI device
objects (Rafael Wysocki)
- Replace sprintf() with sysfs_emit() in all of the core ACPI sysfs
interface code (Sumeet Pawnikar)
- Use acpi_get_local_u64_address() in the code implementing ACPI
support for PCI to evaluate _ADR instead of evaluating that object
directly (Andy Shevchenko)
- Add JWIPC JVC9100 to irq1_level_low_skip_override[] to unbreak
serial IRQs on that system (Ai Chao)
- Fix handling of _OSC errors in acpi_run_osc() to avoid failures on
systems where _OSC error bits are set even though the _OSC return
buffer contains acknowledged feature bits (Rafael Wysocki)
- Clean up and rearrange \_SB._OSC handling for general platform
features and USB4 features to avoid code duplication and unnecessary
memory management overhead (Rafael Wysocki)
- Make the ACPI core device enumeration code handle PNP0C01 and PNP0C02
("system resource") device objects directly instead of letting the
legacy PNP system driver handle them to avoid device enumeration
issues on systems where PNP0C02 is present in the _CID list under
ACPI device objects with a _HID matching a proper device driver in
Linux (Rafael Wysocki)
- Drop workarounds for the known device enumeration issues related to
_CID lists containing PNP0C02 (Rafael Wysocki)
- Drop outdated comment regarding removed function in the ACPI-based
device enumeration code (Julia Lawall)
- Make PRP0001 device matching work as expected for ACPI device objects
using it as a _HID for board development and similar purposes (Kartik
Rajput)
- Use async schedule function in acpi_scan_clear_dep_fn() to avoid
races with user space initialization on some systems (Yicong Yang)
- Add a piece of documentation explaining why binding drivers directly
to ACPI device objects is not a good idea in general and why it is
desirable to convert drivers doing so into proper platform drivers
that use struct platform_driver for device binding (Rafael Wysocki)
- Convert multiple "core ACPI" drivers, including the NFIT ACPI device
driver, the generic ACPI button drivers, the generic ACPI thermal
zone driver, the ACPI hardware event device (HED) driver, the ACPI EC
driver, the ACPI SMBUS HC driver, the ACPI Smart Battery Subsystem
(SBS) driver, and the ACPI backlight (video) driver to proper platform
drivers that use struct platform_driver for device binding (Rafael
Wysocki)
- Use acpi_get_local_u64_address() in the ACPI backlight (video) driver
to evaluate _ADR instead of evaluating that object directly (Andy
Shevchenko)
- Convert the generic ACPI battery driver to a proper platform driver
using struct platform_driver for device binding (Rafael Wysocki)
- Fix incorrect charging status when current is zero in the generic
ACPI battery driver (Ata İlhan Köktürk)
- Use LIST_HEAD() for initializing a stack-allocated list in the
generic ACPI watchdog device driver (Can Peng)
- Rework the ACPI idle driver initialization to register it directly
from the common initialization code instead of doing that from a
CPU hotplug "online" callback and clean it up (Huisong Li, Rafael
Wysocki)
- Fix a possible NULL pointer dereference in
acpi_processor_errata_piix4() (Tuo Li)
- Make read-only array non_mmio_desc[] static const (Colin Ian King)
- Prevent the APEI GHES support code on ARM from accessing memory out
of bounds or going past the ARM processor CPER record buffer (Mauro
Carvalho Chehab)
- Prevent cper_print_fw_err() from dumping the entire memory on systems
with defective firmware (Mauro Carvalho Chehab)
- Improve ghes_notify_nmi() status check to avoid unnecessary overhead
in the NMI handler by carrying out all of the requisite preparations
and the NMI registration time (Tony Luck)
- Refactor the GHES driver by extracting common functionality into
reusable helper functions to reduce code duplication and improve
the ghes_notify_sea() status check in analogy with the previous
ghes_notify_nmi() status check improvement (Shuai Xue)
- Make ELOG and GHES log and trace consistently and support the CPER
CXL protocol analogously (Fabio De Francesco)
- Disable KASAN instrumentation in the APEI GHES driver when compile
testing with clang < 18 (Nathan Chancellor)
- Let ghes_edac be the preferred driver to load on __ZX__ and _BYO_
systems by extending the platform detection list in the APEI GHES
driver (Tony W Wang-oc)
- Clean up cppc_perf_caps and cppc_perf_ctrls structs and rename EPP
constants for clarity in the ACPI CPPC library (Sumit Gupta)
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmmErOoSHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1wCoH/iWkTc5/kTTmzi5Urh9lsUsOL8Oc9hrx
KlH4XdLf9TUUSGPJtPRrrtXiOotNUSaZpg2ULktJAZHxDWXnA83DO0FjwT3+2WbH
/5YGVEK24Sae+DI38ukEvOI6bXrxxqNPLnFRtvV7tclUJD2OEAdKcUd8Y98s9qNf
64bf2gKoQtbZj0eM33KHiotULYZUNoZ2SWKFAWXzKQr7EOkcQGKcuupeStgQ04Id
uumWr+lIgW5vYPOcPFCufE3+LM0MzDvruxBlMJQEJJFVkY4IEFwwJNGW/lo8hgE3
74pFKqOPhYsIFPM8rtKXS+JmjP22iCI49QjnHIsQdDG03GEKvKFFeCw=
=CG0j
-----END PGP SIGNATURE-----
Merge tag 'acpi-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI updates from Rafael Wysocki:
"This one is significantly larger than previous ACPI support pull
requests because several significant updates have coincided in it.
First, there is a routine ACPICA code update, to upstream version
20251212, but this time it covers new ACPI 6.6 material that has not
been covered yet. Among other things, it includes definitions of a few
new ACPI tables and updates of some others, like the GICv5 MADT
structures and ARM IORT IWB node definitions that are used for adding
GICv5 ACPI probing on ARM (that technically is IRQ subsystem material,
but it depends on the ACPICA changes, so it is included here). The
latter alone adds a few hundred lines of new code.
Second, there is an update of ACPI _OSC handling including a fix that
prevents failures from occurring in some corner cases due to careless
handling of _OSC error bits.
On top of that, the "system resource" ACPI device objects with the
PNP0C01 and PNP0C02 are now going to be handled by the ACPI core
device enumeration code instead of handing them over to the legacy PNP
system driver which causes device enumeration issues to occur. Some of
those issues have been worked around in device drivers and elsewhere
and those workarounds should not be necessary any more, so they are
going away.
Moreover, the time has come to convert all "core ACPI" device drivers
that were still using struct acpi_driver objects for device binding
into proper platform drivers that use struct platform_driver for this
purpose. These updates are accompanied by some requisite core ACPI
device enumeration code changes.
Next, there are ACPI APEI updates, including changes to avoid excess
overhead in the NMI handler and in SEA on the ARM side, changes to
unify ACPI-based HW error tracing and logging, and changes to prevent
APEI code from reaching out of its allocated memory.
There are also some ACPI power management updates, mostly related to
the ACPI cpuidle support in the processor driver, suspend-to-idle
handling on systems with ACPI support and to ACPI PM of devices.
In addition to the above, bugs are fixed and the code is cleaned up in
assorted places all over.
Specifics:
- Update the ACPICA code in the kernel to upstream version 20251212
which includes the following changes:
* Add support for new ACPI table DTPR (Michal Camacho Romero)
* Release objects with acpi_ut_delete_object_desc() (Zilin Guan)
* Add UUIDs for Microsoft fan extensions and UUIDs associated with
TPM 2.0 devices (Armin Wolf)
* Fix NULL pointer dereference in acpi_ev_address_space_dispatch()
(Alexey Simakov)
* Add KEYP ACPI table definition (Dave Jiang)
* Add support for the Microsoft display mux _OSI string (Armin
Wolf)
* Add definitions for the IOVT ACPI table (Xianglai Li)
* Abort AML bytecode execution on AML_FATAL_OP (Armin Wolf)
* Include all fields in subtable type1 for PPTT (Ben Horgan)
* Add GICv5 MADT structures and Arm IORT IWB node definitions
(Jose Marinho)
* Update Parameter Block structure for RAS2 and add a new flag in
Memory Affinity Structure for SRAT (Pawel Chmielewski)
* Add _VDM (Voltage Domain) object (Pawel Chmielewski)
- Add support for GICv5 ACPI probing on ARM which is based on the
GICv5 MADT structures and ARM IORT IWB node definitions recently
added to ACPICA (Lorenzo Pieralisi)
- Rework ACPI PM notification setup for PCI root buses and modify the
ACPI PM setup for devices to register wakeup source objects under
physical (that is, PCI, platform, etc.) devices instead of doing
that under their ACPI companions (Rafael Wysocki)
- Adjust debug messages regarding postponed ACPI PM printed during
system resume to be more accurate (Rafael Wysocki)
- Remove dead code from lps0_device_attach() (Gergo Koteles)
- Start to invoke Microsoft Function 9 (Turn On Display) of the Low-
Power S0 Idle (LPS0) _DSM in the suspend-to-idle resume flow on
systems with ACPI LPS0 support to address a functional issue on
Lenovo Yoga Slim 7i Aura (15ILL9), where system fans and keyboard
backlights fail to resume after suspend (Jakob Riemenschneider)
- Add sysfs attribute cid for exposing _CID lists under ACPI device
objects (Rafael Wysocki)
- Replace sprintf() with sysfs_emit() in all of the core ACPI sysfs
interface code (Sumeet Pawnikar)
- Use acpi_get_local_u64_address() in the code implementing ACPI
support for PCI to evaluate _ADR instead of evaluating that object
directly (Andy Shevchenko)
- Add JWIPC JVC9100 to irq1_level_low_skip_override[] to unbreak
serial IRQs on that system (Ai Chao)
- Fix handling of _OSC errors in acpi_run_osc() to avoid failures on
systems where _OSC error bits are set even though the _OSC return
buffer contains acknowledged feature bits (Rafael Wysocki)
- Clean up and rearrange \_SB._OSC handling for general platform
features and USB4 features to avoid code duplication and
unnecessary memory management overhead (Rafael Wysocki)
- Make the ACPI core device enumeration code handle PNP0C01 and
PNP0C02 ("system resource") device objects directly instead of
letting the legacy PNP system driver handle them to avoid device
enumeration issues on systems where PNP0C02 is present in the _CID
list under ACPI device objects with a _HID matching a proper device
driver in Linux (Rafael Wysocki)
- Drop workarounds for the known device enumeration issues related to
_CID lists containing PNP0C02 (Rafael Wysocki)
- Drop outdated comment regarding removed function in the ACPI-based
device enumeration code (Julia Lawall)
- Make PRP0001 device matching work as expected for ACPI device
objects using it as a _HID for board development and similar
purposes (Kartik Rajput)
- Use async schedule function in acpi_scan_clear_dep_fn() to avoid
races with user space initialization on some systems (Yicong Yang)
- Add a piece of documentation explaining why binding drivers
directly to ACPI device objects is not a good idea in general and
why it is desirable to convert drivers doing so into proper
platform drivers that use struct platform_driver for device binding
(Rafael Wysocki)
- Convert multiple "core ACPI" drivers, including the NFIT ACPI
device driver, the generic ACPI button drivers, the generic ACPI
thermal zone driver, the ACPI hardware event device (HED) driver,
the ACPI EC driver, the ACPI SMBUS HC driver, the ACPI Smart
Battery Subsystem (SBS) driver, and the ACPI backlight (video)
driver to proper platform drivers that use struct platform_driver
for device binding (Rafael Wysocki)
- Use acpi_get_local_u64_address() in the ACPI backlight (video)
driver to evaluate _ADR instead of evaluating that object directly
(Andy Shevchenko)
- Convert the generic ACPI battery driver to a proper platform driver
using struct platform_driver for device binding (Rafael Wysocki)
- Fix incorrect charging status when current is zero in the generic
ACPI battery driver (Ata İlhan Köktürk)
- Use LIST_HEAD() for initializing a stack-allocated list in the
generic ACPI watchdog device driver (Can Peng)
- Rework the ACPI idle driver initialization to register it directly
from the common initialization code instead of doing that from a
CPU hotplug "online" callback and clean it up (Huisong Li, Rafael
Wysocki)
- Fix a possible NULL pointer dereference in
acpi_processor_errata_piix4() (Tuo Li)
- Make read-only array non_mmio_desc[] static const (Colin Ian King)
- Prevent the APEI GHES support code on ARM from accessing memory out
of bounds or going past the ARM processor CPER record buffer (Mauro
Carvalho Chehab)
- Prevent cper_print_fw_err() from dumping the entire memory on
systems with defective firmware (Mauro Carvalho Chehab)
- Improve ghes_notify_nmi() status check to avoid unnecessary
overhead in the NMI handler by carrying out all of the requisite
preparations and the NMI registration time (Tony Luck)
- Refactor the GHES driver by extracting common functionality into
reusable helper functions to reduce code duplication and improve
the ghes_notify_sea() status check in analogy with the previous
ghes_notify_nmi() status check improvement (Shuai Xue)
- Make ELOG and GHES log and trace consistently and support the CPER
CXL protocol analogously (Fabio De Francesco)
- Disable KASAN instrumentation in the APEI GHES driver when compile
testing with clang < 18 (Nathan Chancellor)
- Let ghes_edac be the preferred driver to load on __ZX__ and _BYO_
systems by extending the platform detection list in the APEI GHES
driver (Tony W Wang-oc)
- Clean up cppc_perf_caps and cppc_perf_ctrls structs and rename EPP
constants for clarity in the ACPI CPPC library (Sumit Gupta)"
* tag 'acpi-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (117 commits)
ACPI: battery: fix incorrect charging status when current is zero
ACPI: scan: Use async schedule function in acpi_scan_clear_dep_fn()
ACPI: x86: s2idle: Invoke Microsoft _DSM Function 9 (Turn On Display)
ACPI: APEI: GHES: Add ghes_edac support for __ZX__ and _BYO_ systems
ACPI: APEI: GHES: Disable KASAN instrumentation when compile testing with clang < 18
ACPI: sysfs: Replace sprintf() with sysfs_emit()
ACPI: CPPC: Rename EPP constants for clarity
ACPI: CPPC: Clean up cppc_perf_caps and cppc_perf_ctrls structs
ACPI: processor: idle: Rework the handling of acpi_processor_ffh_lpi_probe()
ACPI: processor: idle: Convert acpi_processor_setup_cpuidle_dev() to void
ACPI: processor: idle: Convert acpi_processor_setup_cpuidle_states() to void
irqchip/gic-v5: Add ACPI IWB probing
irqchip/gic-v5: Add ACPI ITS probing
irqchip/gic-v5: Add ACPI IRS probing
irqchip/gic-v5: Split IRS probing into OF and generic portions
PCI/MSI: Make the pci_msi_map_rid_ctlr_node() interface firmware agnostic
irqdomain: Add parent field to struct irqchip_fwid
ACPI: PCI: simplify code with acpi_get_local_u64_address()
ACPI: video: simplify code with acpi_get_local_u64_address()
ACPI: PM: Adjust messages regarding postponed ACPI PM
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmGLwcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv+TD/48S2HTnMhmW6AtFYWErQ+sEKXpHrxbYe7S
+qR8/g/T+QSfhfqPwZEuagndFKtIP3LJfaXGSP1Lk1RfP9NLQy91v33Ibe4DjHkp
etWSfnMHA9MUAoWKmg8EvncB2G+ZQFiYCpjazj5tKHD9S2+psGMuL8kq6qzMJE83
uhpb8WutUl4aSIXbMSfyGlwBhI1MjjRbbWlIBmg4yC8BWt1sH8Qn2L2GNVylEIcX
U8At3KLgPGn0axSg4yGMAwTqtGhL/jwdDyeczbmRlXuAr4iVL9UX/yADCYkazt6U
ttQ2/H+cxCwfES84COx9EteAatlbZxo6wjGvZ3xOMiMJVTjYe1x6Gkcckq+LrZX6
tjofi2KK78qkrMXk1mZMkZjpyUWgRtCswhDllbQyqFs0SwzQtno2//Rk8HU9dhbt
pkpryDbGFki9X3upcNyEYp5TYflpW6YhAzShYgmE6KXim2fV8SeFLviy0erKOAl+
fwjTE6KQ5QoQv0s3WxkWa4lREm34O6IHrCUmbiPm5CruJnQDhqAN2QZIDgYC4WAf
0gu9cR/O4Vxu7TQXrumPs5q+gCyDU0u0B8C3mG2s+rIo+PI5cVZKs2OIZ8HiPo0F
x73kR/pX3DMe35ZQkQX22ymMuowV+aQouDLY9DTwakP5acdcg7h7GZKABk6VLB06
gUIsnxURiQ==
=jNzW
-----END PGP SIGNATURE-----
Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Support for batch request processing for ublk, improving the
efficiency of the kernel/ublk server communication. This can yield
nice 7-12% performance improvements
- Support for integrity data for ublk
- Various other ublk improvements and additions, including a ton of
selftests additions and updated
- Move the handling of blk-crypto software fallback from below the
block layer to above it. This reduces the complexity of dealing with
bio splitting
- Series fixing a number of potential deadlocks in blk-mq related to
the queue usage counter and writeback throttling and rq-qos debugfs
handling
- Add an async_depth queue attribute, to resolve a performance
regression that's been around for a qhilw related to the scheduler
depth handling
- Only use task_work for IOPOLL completions on NVMe, if it is necessary
to do so. An earlier fix for an issue resulted in all these
completions being punted to task_work, to guarantee that completions
were only run for a given io_uring ring when it was local to that
ring. With the new changes, we can detect if it's necessary to use
task_work or not, and avoid it if possible.
- rnbd fixes:
- Fix refcount underflow in device unmap path
- Handle PREFLUSH and NOUNMAP flags properly in protocol
- Fix server-side bi_size for special IOs
- Zero response buffer before use
- Fix trace format for flags
- Add .release to rnbd_dev_ktype
- MD pull requests via Yu Kuai
- Fix raid5_run() to return error when log_init() fails
- Fix IO hang with degraded array with llbitmap
- Fix percpu_ref not resurrected on suspend timeout in llbitmap
- Fix GPF in write_page caused by resize race
- Fix NULL pointer dereference in process_metadata_update
- Fix hang when stopping arrays with metadata through dm-raid
- Fix any_working flag handling in raid10_sync_request
- Refactor sync/recovery code path, improve error handling for
badblocks, and remove unused recovery_disabled field
- Consolidate mddev boolean fields into mddev_flags
- Use mempool to allocate stripe_request_ctx and make sure
max_sectors is not less than io_opt in raid5
- Fix return value of mddev_trylock
- Fix memory leak in raid1_run()
- Add Li Nan as mdraid reviewer
- Move phys_vec definitions to the kernel types, mostly in preparation
for some VFIO and RDMA changes
- Improve the speed for secure erase for some devices
- Various little rust updates
- Various other minor fixes, improvements, and cleanups
* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
blk-mq: ABI/sysfs-block: fix docs build warnings
selftests: ublk: organize test directories by test ID
block: decouple secure erase size limit from discard size limit
block: remove redundant kill_bdev() call in set_blocksize()
blk-mq: add documentation for new queue attribute async_dpeth
block, bfq: convert to use request_queue->async_depth
mq-deadline: covert to use request_queue->async_depth
kyber: covert to use request_queue->async_depth
blk-mq: add a new queue sysfs attribute async_depth
blk-mq: factor out a helper blk_mq_limit_depth()
blk-mq-sched: unify elevators checking for async requests
block: convert nr_requests to unsigned int
block: don't use strcpy to copy blockdev name
blk-mq-debugfs: warn about possible deadlock
blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
blk-rq-qos: fix possible debugfs_mutex deadlock
blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmGJ1kQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpky8EAChIL3uJ5Vmv+oQTxT4EVb1wpc8U/XzXWU5
Q5F9IpZZCGO7+i015Y7iTTqDRixjblRaWpWzZZP8vflWDUS8LESNZLQdcoEnxaiv
P367KNPUGwxejcKsu8PvZvfnX6JWSQoNstcDmrwkCF0ND2UUfvvMZyn3uKhkbBRY
h5Ehcqkvqc1OJDAWC7+yPzYAmB01uRPQ6sc9/GeujznHPlfbvie4u6gBvvfXeirT
592zbVftINMrm6Twd6zl4n+HNAn+CUoyVMppeeddv5IcyFPm9uz/dLOZBXTz6552
jFYNmB0U4g+SxGXMyqp37YISTALnuY+57y5eXmEAtgkEeE3HrF+F/ZdxQHwXSpo3
T2Lb9IOqFyHtSvq678HZ37JB6aIYbBE/mZdNf8FFFpnPJGb5Ey7d50qPp/ywVq0H
p9CahbpkzGUBMsZ+koew0YHiFdWV9tww+/Bnk5dTtn2197uyaHsLdmbf4C36GWke
Bk5cwNgU+3DMFAfTiL9m+AIXYsJkBayRJn+hViTrF5AL7gcGiBryGF43FOSKoYuq
f0mniDnGSwvn86VZPuZQ6wBRHZPEMR3OlaUXn6XrUU6cYyvMg0pBZV+QHF7zlsSP
2sdfUbPL5TxexF3G8dsxlDIypz9Z6TCoUCfU0WiiUETnCrVNkXfIY846A+w08p0b
ejBjzrwRtQ==
=CqJq
-----END PGP SIGNATURE-----
Merge tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring bpf filters from Jens Axboe:
"This adds support for both cBPF filters for io_uring, as well as task
inherited restrictions and filters.
seccomp and io_uring don't play along nicely, as most of the
interesting data to filter on resides somewhat out-of-band, in the
submission queue ring.
As a result, things like containers and systemd that apply seccomp
filters, can't filter io_uring operations.
That leaves them with just one choice if filtering is critical -
filter the actual io_uring_setup(2) system call to simply disallow
io_uring. That's rather unfortunate, and has limited us because of it.
io_uring already has some filtering support. It requires the ring to
be setup in a disabled state, and then a filter set can be applied.
This filter set is completely bi-modal - an opcode is either enabled
or it's not. Once a filter set is registered, the ring can be enabled.
This is very restrictive, and it's not useful at all to systemd or
containers which really want both broader and more specific control.
This first adds support for cBPF filters for opcodes, which enables
tighter control over what exactly a specific opcode may do. As
examples, specific support is added for IORING_OP_OPENAT/OPENAT2,
allowing filtering on resolve flags. And another example is added for
IORING_OP_SOCKET, allowing filtering on domain/type/protocol. These
are both common use cases. cBPF was chosen rather than eBPF, because
the latter is often restricted in containers as well.
These filters are run post the init phase of the request, which allows
filters to even dip into data that is being passed in struct in user
memory, as the init side of requests make that data stable by bringing
it into the kernel. This allows filtering without needing to copy this
data twice, or have filters etc know about the exact layout of the
user data. The filters get the already copied and sanitized data
passed.
On top of that support is added for per-task filters, meaning that any
ring created with a task that has a per-task filter will get those
filters applied when it's created. These filters are inherited across
fork as well. Once a filter has been registered, any further added
filters may only further restrict what operations are permitted.
Filters cannot change the return value of an operation, they can only
permit or deny it based on the contents"
* tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring: allow registration of per-task restrictions
io_uring: add task fork hook
io_uring/bpf_filter: add ref counts to struct io_bpf_filter
io_uring/bpf_filter: cache lookup table in ctx->bpf_filters
io_uring/bpf_filter: allow filtering on contents of struct open_how
io_uring/net: allow filtering on IORING_OP_SOCKET data
io_uring: add support for BPF filtering for opcode restrictions
[mostly] sanitize struct filename hanling
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaYlcJgAKCRBZ7Krx/gZQ
6xlKAP9c9J13sJ/mcobsj1Ov7nSHISNbnYqvRRCu09Wq3UQvJgEApNQYOEdLtpff
zUnWOAQ0nOKY7w9VMLkRRustXpuGjAc=
=Fld4
-----END PGP SIGNATURE-----
Merge tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs 'struct filename' updates from Al Viro:
"[Mostly] sanitize struct filename handling"
* tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (68 commits)
sysfs(2): fs_index() argument is _not_ a pathname
alpha: switch osf_mount() to strndup_user()
ksmbd: use CLASS(filename_kernel)
mqueue: switch to CLASS(filename)
user_statfs(): switch to CLASS(filename)
statx: switch to CLASS(filename_maybe_null)
quotactl_block(): switch to CLASS(filename)
chroot(2): switch to CLASS(filename)
move_mount(2): switch to CLASS(filename_maybe_null)
namei.c: switch user pathname imports to CLASS(filename{,_flags})
namei.c: convert getname_kernel() callers to CLASS(filename_kernel)
do_f{chmod,chown,access}at(): use CLASS(filename_uflags)
do_readlinkat(): switch to CLASS(filename_flags)
do_sys_truncate(): switch to CLASS(filename)
do_utimes_path(): switch to CLASS(filename_uflags)
chdir(2): unspaghettify a bit...
do_fchownat(): unspaghettify a bit...
fspick(2): use CLASS(filename_flags)
name_to_handle_at(): use CLASS(filename_uflags)
vfs_open_tree(): use CLASS(filename_uflags)
...
The tracing_max_latency shouldn't be limited if CONFIG_FSNOTIFY is defined
or not and it was moved out of that protection to be always available with
CONFIG_TRACER_MAX_TRACE. All was moved out except the dentry descriptor
for it (d_max_latency) and it failed to build on some configs.
Move that out of the CONFIG_FSNOTIFY protection too.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260209194631.788bfc85@fedora
Fixes: ba73713da5 ("tracing: Clean up use of trace_create_maxlat_file()")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602092133.fTdojd95-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Please consider pulling these changes from the signed vfs-7.0-rc1.misc tag.
Thanks!
Christian
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaYX49QAKCRCRxhvAZXjc
ojrZAQD1VJzY46r5FnAVf4jlEHyjIbDnZCP/n+c4x6XnqpU6EQEAgB0yAtAGP6+u
SBuytElqHoTT5VtmEXTAabCNQ9Ks8wo=
=JwZz
-----END PGP SIGNATURE-----
Merge tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains a mix of VFS cleanups, performance improvements, API
fixes, documentation, and a deprecation notice.
Scalability and performance:
- Rework pid allocation to only take pidmap_lock once instead of
twice during alloc_pid(), improving thread creation/teardown
throughput by 10-16% depending on false-sharing luck. Pad the
namespace refcount to reduce false-sharing
- Track file lock presence via a flag in ->i_opflags instead of
reading ->i_flctx, avoiding false-sharing with ->i_readcount on
open/close hot paths. Measured 4-16% improvement on 24-core
open-in-a-loop benchmarks
- Use a consume fence in locks_inode_context() to match the
store-release/load-consume idiom, eliminating a hardware fence on
some architectures
- Annotate cdev_lock with __cacheline_aligned_in_smp to prevent
false-sharing
- Remove a redundant DCACHE_MANAGED_DENTRY check in
__follow_mount_rcu() that never fires since the caller already
verifies it, eliminating a 100% mispredicted branch
- Fix a 100% mispredicted likely() in devcgroup_inode_permission()
that became wrong after a prior code reorder
Bug fixes and correctness:
- Make insert_inode_locked() wait for inode destruction instead of
skipping, fixing a corner case where two matching inodes could
exist in the hash
- Move f_mode initialization before file_ref_init() in alloc_file()
to respect the SLAB_TYPESAFE_BY_RCU ordering contract
- Add a WARN_ON_ONCE guard in try_to_free_buffers() for folios with
no buffers attached, preventing a null pointer dereference when
AS_RELEASE_ALWAYS is set but no release_folio op exists
- Fix select restart_block to store end_time as timespec64, avoiding
truncation of tv_sec on 32-bit architectures
- Make dump_inode() use get_kernel_nofault() to safely access inode
and superblock fields, matching the dump_mapping() pattern
API modernization:
- Make posix_acl_to_xattr() allocate the buffer internally since
every single caller was doing it anyway. Reduces boilerplate and
unnecessary error checking across ~15 filesystems
- Replace deprecated simple_strtoul() with kstrtoul() for the
ihash_entries, dhash_entries, mhash_entries, and mphash_entries
boot parameters, adding proper error handling
- Convert chardev code to use guard(mutex) and __free(kfree) cleanup
patterns
- Replace min_t() with min() or umin() in VFS code to avoid silently
truncating unsigned long to unsigned int
- Gate LOOKUP_RCU assertions behind CONFIG_DEBUG_VFS since callers
already check the flag
Deprecation:
- Begin deprecating legacy BSD process accounting (acct(2)). The
interface has numerous footguns and better alternatives exist
(eBPF)
Documentation:
- Fix and complete kernel-doc for struct export_operations, removing
duplicated documentation between ReST and source
- Fix kernel-doc warnings for __start_dirop() and ilookup5_nowait()
Testing:
- Add a kunit test for initramfs cpio handling of entries with
filesize > PATH_MAX
Misc:
- Add missing <linux/init_task.h> include in fs_struct.c"
* tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
posix_acl: make posix_acl_to_xattr() alloc the buffer
fs: make insert_inode_locked() wait for inode destruction
initramfs_test: kunit test for cpio.filesize > PATH_MAX
fs: improve dump_inode() to safely access inode fields
fs: add <linux/init_task.h> for 'init_fs'
docs: exportfs: Use source code struct documentation
fs: move initializing f_mode before file_ref_init()
exportfs: Complete kernel-doc for struct export_operations
exportfs: Mark struct export_operations functions at kernel-doc
exportfs: Fix kernel-doc output for get_name()
acct(2): begin the deprecation of legacy BSD process accounting
device_cgroup: remove branch hint after code refactor
VFS: fix __start_dirop() kernel-doc warnings
fs: Describe @isnew parameter in ilookup5_nowait()
fs/namei: Remove redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu
fs: only assert on LOOKUP_RCU when built with CONFIG_DEBUG_VFS
select: store end_time as timespec64 in restart block
chardev: Switch to guard(mutex) and __free(kfree)
namespace: Replace simple_strtoul with kstrtoul to parse boot params
dcache: Replace simple_strtoul with kstrtoul in set_dhash_entries
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmmCurkUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNDWA//RZxjjyY1I0GRDepJXJ8UFEVt4Fdr
VsnSKL3o7sf0SAsQj2HCJsJPiwD5fHm2C2gdxh9rFC0bPpMbTVAkwUL7WhP+nkAt
LA+UZKYurrk1XF6OctILoY3JcXmynb1Oe3lg6uVcWX5b1uEriqRgGKNcMYLb5fmr
D1vZ9LMuZe8WwGTScprQID9FMrZ0TDbdI/vqG7si1W/PCFH7630MPJkmzmjPWvnV
xJISKLOG+qbyWoNGLr+VaNjkmA+jPfsXAKWbfNXUGfikP8g/OHpFd70nIzJs8p7J
dxZD7w6/kqSGhauQjcX8ov0zKxn83Z2Xt0+4Ldl5vOCWI3r4T3Y8WdarmULbq65n
jIN8djDgmCJPqa5zuPmik+womaPk2GmSy1viEJdT4W0iHggTC1snOz1J+BbD+nkh
uEZkmcCZbaeEQmfefxIyHDirrFsJvrunWupGrkfxvfFr+QU8H1xNLfMd6CQzvtI4
P5p/KrnP2e58tJqvPxSY315ewUMy73kZU5DUl+Rq6Y4ai415R7vtwwEEkSKWnyja
LMdEumc9IrsiBMcLmsj8QwobCr7XJtdCQV5ohR8CPxxcsI/G0pR99e1pckD7l7Qm
OG461BKHntU3SFWSiZw+rNWlJuyPcSy5nmUxQvxQHP9pShZPu8rTfYX+CBzrHJk2
OFjAwNJn1N/NfYI=
=cCyp
-----END PGP SIGNATURE-----
Merge tag 'lsm-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore:
- Unify the security_inode_listsecurity() calls in NFSv4
While looking at security_inode_listsecurity() with an eye towards
improving the interface, we realized that the NFSv4 code was making
multiple calls to the LSM hook that could be consolidated into one.
- Mark the LSM static branch keys as static - this helps resolve some
sparse warnings
- Add __rust_helper annotations to the LSM and cred wrapper functions
- Remove the unsused set_security_override_from_ctx() function
- Minor fixes to some of the LSM kdoc comment blocks
* tag 'lsm-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
lsm: make keys for static branch static
cred: remove unused set_security_override_from_ctx()
rust: security: add __rust_helper to helpers
rust: cred: add __rust_helper to helpers
nfs: unify security_inode_listsecurity() calls
lsm: fix kernel-doc struct member names
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmmCuoQUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMRFhAAntv4vmqRciFI4oEqxi5X8wmYmzc9
BUQV2XXcfO63IOHdGrXmYHByx3+mZZddAPpYMTqrzA0p2NCqi4svCVwspUHwUcTY
btl+xlppBJpBtUL5pmLiP6Q4u+zURYCwuA/OKfxuKa5Frm8D3kbkd5MpxJS15Mev
qqEhLT0aj6/rjQpYVwOFGMwehKfE7iuyc8XTBaetvUKHW38sj18ANSpLnN5bmiuE
3lz252kCjyDoOsu+vO0Saa8Rv8lVDjlSMn6mYr4L2fVygYwFDg2Gj7+bmB6LGYy9
YyIm6P+b23E8GOltEObpvrz8ItPR7nvKNiDMEeP1eqGzQ/Mc5OqqljVaNMNPmP+s
XN/jZt02XePKXlje+C08620mDVeIYp35TK1bY2/HrYMqySE0wwO1iSyBI4ftPFtu
CteM8XA8oH49pspFWbEKCHmtFFGxDVjfVM7YrHeDc+qw2tJjZ7R1GRk5hadP1Ou7
emxGLb6jfejT6NMNU8rM2RVmQNs1jcFh+8lHvDgqqQmaCXJd3AEgr+Om5w9kZ6fJ
FyEkh0f9HuZdcEn8tqWaIwCAZXTzECOThj6hhxZGiG9xXFXza1eIXxq7VtIE6fdO
ATAJ6cpcj0LIQmE7QWntS1NloPkWD1OSiWLNis0AgCiN97oRTk0oFJ5q7fD0bLUa
HGAQqd4OJKmNciw=
=pFeD
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
- Improve the NETFILTER_PKT audit records
Add source and destination ports to the NETFILTER_PKT audit records
while also consolidating a lot of the code into a new, singular
audit_log_nf_skb() function. This new approach to structuring the
NETFILTER_PKT record generation should eliminate some unnecessary
overhead when audit is not built into the kernel.
- Update the audit syscall classifier code
Add the listxattrat(), getxattrat(), and fchmodat2() syscall to the
audit code which classifies syscalls into categories of operations,
e.g. "read" or "change attributes".
- Move the syscall classifier declarations into audit_arch.h
Shuffle around some header file declarations to resolve some sparse
warnings.
* tag 'audit-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: move the compat_xxx_class[] extern declarations to audit_arch.h
audit: add missing syscalls to read class
audit: include source and destination ports to NETFILTER_PKT
audit: add audit_log_nf_skb helper function
audit: add fchmodat2() to change attributes class
RCU Tasks Trace:
Re-implement RCU tasks trace in term of SRCU-fast, not only more than 500 lines
of code are saved because of the reimplementation, a new set of API,
rcu_read_{,un}lock_tasks_trace(), becomes possible as well. Compared to the
previous rcu_read_{,un}lock_trace(), the new API avoid the task_struct accesses
thanks to the SRCU-fast semantics. As a result, the old
rcu_read{,un}lock_trace() API is now deprecated.
RCU Torture Test:
- Multiple improvements on kvm-series.sh (parallel run and progress showing
metrics)
- Add context checks to rcu_torture_timer().
- Make config2csv.sh properly handle comments in .boot files.
- Include commit discription in testid.txt.
Miscellaneous RCU changes:
- Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early.
- Use suitable gfp_flags for the init_srcu_struct_nodes().
- Fix rcu_read_unlock() deadloop due to softirq.
- Correctly compute probability to invoke ->exp_current() in rcutorture.
- Make expedited RCU CPU stall warnings detect stall-end races.
RCU nocb:
- Remove unnecessary WakeOvfIsDeferred wake path and callback overload
handling.
- Extract nocb_defer_wakeup_cancel() helper.
-----BEGIN PGP SIGNATURE-----
iQFFBAABCAAvFiEEj5IosQTPz8XU1wRHSXnow7UH+rgFAmmARZERHGJvcXVuQGtl
cm5lbC5vcmcACgkQSXnow7UH+rh8SAf+PDIBWAkdbGgs32EfgpFY42RB4CWygH47
YRup/M3+nU0JcBzNnona1srpHBXRySBJQbvRbsOdlM45VoNQ2wPjig/3vFVRUKYx
uqj9Tze00DS74IIGESoTGp0amZde9SS9JakNRoEfTr+Zpj8N6LFERQw0ywUwjR5b
RR6bz7q05TAl3u2BYUAgNdnf3VWWTmj4WYwArlQ+qRFAyGN+TVj8Ezra6+K5TJ7u
SQYrf7WmRGOhHbVVolvVEOVdACccI8dFl3ebJVE2Ky0gp1o3BLPkcDLJ6gBdTCoE
rRrbnkeqs5V7tOkPFDBeUhLPrm1QxrdEDxUQFWjSApbv161sx7AOZA==
=O9QQ
-----END PGP SIGNATURE-----
Merge tag 'rcu.release.v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Boqun Feng:
- RCU Tasks Trace:
Re-implement RCU tasks trace in term of SRCU-fast, not only more than
500 lines of code are saved because of the reimplementation, a new
set of API, rcu_read_{,un}lock_tasks_trace(), becomes possible as
well. Compared to the previous rcu_read_{,un}lock_trace(), the new
API avoid the task_struct accesses thanks to the SRCU-fast semantics.
As a result, the old rcu_read{,un}lock_trace() API is now deprecated.
- RCU Torture Test:
- Multiple improvements on kvm-series.sh (parallel run and
progress showing metrics)
- Add context checks to rcu_torture_timer()
- Make config2csv.sh properly handle comments in .boot files
- Include commit discription in testid.txt
- Miscellaneous RCU changes:
- Reduce synchronize_rcu() latency by reporting GP kthread's
CPU QS early
- Use suitable gfp_flags for the init_srcu_struct_nodes()
- Fix rcu_read_unlock() deadloop due to softirq
- Correctly compute probability to invoke ->exp_current()
in rcutorture
- Make expedited RCU CPU stall warnings detect stall-end races
- RCU nocb:
- Remove unnecessary WakeOvfIsDeferred wake path and callback
overload handling
- Extract nocb_defer_wakeup_cancel() helper
* tag 'rcu.release.v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (25 commits)
rcu/nocb: Extract nocb_defer_wakeup_cancel() helper
rcu/nocb: Remove dead callback overload handling
rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path
rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
srcu: Use suitable gfp_flags for the init_srcu_struct_nodes()
rcu: Fix rcu_read_unlock() deadloop due to softirq
rcutorture: Correctly compute probability to invoke ->exp_current()
rcu: Make expedited RCU CPU stall warnings detect stall-end races
rcutorture: Add --kill-previous option to terminate previous kvm.sh runs
rcutorture: Prevent concurrent kvm.sh runs on same source tree
torture: Include commit discription in testid.txt
torture: Make config2csv.sh properly handle comments in .boot files
torture: Make kvm-series.sh give run numbers and totals
torture: Make kvm-series.sh give build numbers and totals
torture: Parallelize kvm-series.sh guest-OS execution
rcutorture: Add context checks to rcu_torture_timer()
rcutorture: Test rcu_tasks_trace_expedite_current()
srcu: Create an rcu_tasks_trace_expedite_current() function
checkpatch: Deprecate rcu_read_{,un}lock_trace()
rcu: Update Requirements.rst for RCU Tasks Trace
...
The latency tracers (scheduler, irqsoff, etc) were created when tracing
was first added. These tracers required a "snapshot" buffer that was the
same size as the ring buffer being written to. When a new max latency was
hit, the main ring buffer would swap with the snapshot buffer so that the
trace leading up to the latency would be saved in the snapshot buffer (The
snapshot buffer is never written to directly and the data within it can be
viewed without fear of being overwritten).
Later, a new feature was added to allow snapshots to be taken by user
space or even event triggers. This created a "snapshot" file that allowed
users to trigger a snapshot from user space to save the current trace.
The config for this new feature (CONFIG_TRACER_SNAPSHOT) would select the
latency tracer config (CONFIG_TRACER_MAX_LATENCY) as it would need all the
functionality from it as it already existed. But this was incorrect. As
the snapshot feature is really what the latency tracers need and not the
other way around.
Have CONFIG_TRACER_MAX_TRACE select CONFIG_TRACER_SNAPSHOT where the
tracers that needs the max latency buffer selects the TRACE_MAX_TRACE
which will then select TRACER_SNAPSHOT.
Also, go through trace.c and trace.h and make the code that only needs the
TRACER_MAX_TRACE protected by that and the code that always requires the
snapshot to be protected by TRACER_SNAPSHOT.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208183856.767870992@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Instead of having #ifdef CONFIG_TRACER_MAX_TRACE around every access to
the struct tracer's use_max_tr field, add a helper function for that
access and if CONFIG_TRACER_MAX_TRACE is not configured it just returns
false.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208183856.599390238@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When tracing was first added, there were latency tracers that would take a
snapshot of the current trace when a new max latency was hit. This
snapshot buffer was called "max_buffer". Since then, a snapshot feature
was added that allowed user space or event triggers to trigger a snapshot
of the current buffer using the same max_buffer of the trace_array.
As this snapshot buffer now has a more generic use case, calling it
"max_buffer" is confusing. Rename it to snapshot_buffer.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208183856.428446729@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The trace.c file was a dumping ground for most tracing code. Start
organizing it better by moving various functions out into their own files.
Move the PID filtering functions from trace.c into its own trace_pid.c
file.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.998330662@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.
Move the functions associated to the trace_printk operations out of trace.c and
into trace_printk.c.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.828744197@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function trace_printk_init_buffers() is used to expand tha
trace_printk buffers when trace_printk() is used within the kernel or in
modules. On kernel boot up, it holds off from starting the sched switch
cmdline recorder, but will start it immediately when it is added by a
module.
Currently it uses a trick to see if the global_trace buffer has been
allocated or not to know if it was called by module load or not. But this
is more of a hack, and can not be used when this code is moved out of
trace.c. Instead simply look at the system_state and if it is running then
it is know that it could only be called by module load.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.660237094@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The trace.c file has become a dumping ground for all tracing code and has
become quite large. In order to move the trace_printk functions out of it
these functions can not access global_trace directly, as that is something
that needs to stay static in trace.c.
Instead of testing the trace_array tr pointer to &global_trace, test the
tr->flags to see if TRACE_ARRAY_FL_GLOBAL set.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.491116245@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The trace.c file has become a dumping ground for all tracing code and has
become quite large. In order to move the trace_printk functions out of it
these functions can not access global_trace directly, as that is something
that needs to stay static in trace.c.
Have tracing_update_buffers() take NULL for its trace_array to denote it
should work on the global_trace top level trace_array allows that function
to be used outside of trace.c and still update the global_trace
trace_array.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.318864210@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The printk_trace is used to determine which trace_array trace_printk()
writes to. By making it a global variable among the tracing subsystem it
will allow the trace_printk functions to be moved out of trace.c and still
have direct access to that variable.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.144525891@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.
Make ftrace_trace_stack() into a static inline that tests if stack tracing
is enabled and if so to call __ftrace_trace_stack() to do the stack trace.
This keeps the test inlined in the fast paths and only does the function
call if stack tracing is enabled.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.974218132@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.
Move the __always_inline functions __trace_buffer_lock_reserve(),
__trace_buffer_unlock_commit() and trace_event_setup() into trace.h.
The trace.c file will be split up and these functions will be used in more
than one of these files. As they are already __always_inline they can
easily be moved into the trace.h header file.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.813550600@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.
Make the variable tracing_selftest_running global so that it can be used
by other files in the tracing subsystem and trace.c can be split up.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.648932796@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The tracing_disabled variable is set to one on boot up to prevent some
parts of tracing to access the tracing infrastructure before it is set up.
It also can be set after boot if an anomaly is discovered.
It is currently a static variable in trace.c and can be accessed via a
function call trace_is_disabled(). There's really no reason to use a
function call as the tracing subsystem should be able to access it
directly.
By making the variable accessed directly, code can be moved out of trace.c
without adding overhead of a function call to see if tracing is disabled
or not.
Make tracing_disabled global and remove the tracing_is_disabled() helper
function. Also add some "unlikely()"s around tracing_disabled where it's
checked in hot paths.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.483690153@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In trace.c, the function trace_create_maxlat_file() is defined behind the
#ifdef CONFIG_TRACER_MAX_TRACE block. The #else part defines it as:
#define trace_create_maxlat_file(tr, d_tracer) \
trace_create_file("tracing_max_latency", TRACE_MODE_WRITE, \
d_tracer, tr, &tracing_max_lat_fops)
But the one place that it it used has:
#ifdef CONFIG_TRACER_MAX_TRACE
trace_create_maxlat_file(tr, d_tracer);
#endif
Which is pointless and also wrong!
It only gets created when both CONFIG_TRACE_MAX_TRACE and CONFIG_FS_NOTIFY
is defined, but the file itself should not be dependent on
CONFIG_FS_NOTIFY. Always create that file when TRACE_MAX_TRACE is defined
regardless if FS_NOTIFY is or is not.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260207191101.0e014abd@robin
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function tracing_set_filter_buffering() is only used in
trace_events_hist.c. Move it to that file and make it static.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260206195936.617080218@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the triggers were first created, they may not have had a file
parameter passed to them and things needed to be done generically.
But today, all triggers have a file parameter passed to them. Remove the
generic code and add a "if (WARN_ON_ONCE(!file))" to each trigger.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260206101351.609d8906@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Simplify the hardlockup detector's probe path and remove its implicit
dependency on pinned per-cpu execution.
Refactor hardlockup_detector_event_create() to be stateless. Return the
created perf_event pointer to the caller instead of directly modifying the
per-cpu 'watchdog_ev' variable. This allows the probe path to safely
manage a temporary event without the risk of leaving stale pointers should
task migration occur.
Link: https://lkml.kernel.org/r/20260129022629.2201331-1-realwujing@gmail.com
Signed-off-by: Shouxin Sun <sunshx@chinatelecom.cn>
Signed-off-by: Junnan Zhang <zhangjn11@chinatelecom.cn>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: Jinchao Wang <wangjinchao600@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Song Liu <song@kernel.org>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Wang Jinchao <wangjinchao600@gmail.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cpustat_tail indexes cpustat_util[], which is a NUM_SAMPLE_PERIODS-sized
ring buffer. need_counting_irqs() currently wraps the index using
NUM_HARDIRQ_REPORT, which only happens to match NUM_SAMPLE_PERIODS.
Use NUM_SAMPLE_PERIODS for the wrap to keep the ring math correct even if
the NUM_HARDIRQ_REPORT or NUM_SAMPLE_PERIODS changes.
Link: https://lkml.kernel.org/r/tencent_7068189CB6D6689EB353F3D17BF5A5311A07@qq.com
Fixes: e9a9292e23 ("watchdog/softlockup: Report the most frequent interrupts")
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zhang Run <zhang.run@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This function returns NULL if kho_restore_page() returns NULL, which
happens in a couple of corner cases. It never returns an error code.
Link: https://lkml.kernel.org/r/20260123190506.1058669-1-tycho@kernel.org
Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce an in-kernel test module to validate the core logic of the Live
Update Orchestrator's File-Lifecycle-Bound feature. This provides a
low-level, controlled environment to test FLB registration and callback
invocation without requiring userspace interaction or actual kexec
reboots.
The test is enabled by the CONFIG_LIVEUPDATE_TEST Kconfig option.
Link: https://lkml.kernel.org/r/20251218155752.3045808-6-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Gow <davidgow@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce a mechanism for managing global kernel state whose lifecycle is
tied to the preservation of one or more files. This is necessary for
subsystems where multiple preserved file descriptors depend on a single,
shared underlying resource.
An example is HugeTLB, where multiple file descriptors such as memfd and
guest_memfd may rely on the state of a single HugeTLB subsystem.
Preserving this state for each individual file would be redundant and
incorrect. The state should be preserved only once when the first file is
preserved, and restored/finished only once the last file is handled.
This patch introduces File-Lifecycle-Bound (FLB) objects to solve this
problem. An FLB is a global, reference-counted object with a defined set
of operations:
- A file handler (struct liveupdate_file_handler) declares a dependency
on one or more FLBs via a new registration function,
liveupdate_register_flb().
- When the first file depending on an FLB is preserved, the FLB's
.preserve() callback is invoked to save the shared global state. The
reference count is then incremented for each subsequent file.
- Conversely, when the last file is unpreserved (before reboot) or
finished (after reboot), the FLB's .unpreserve() or .finish() callback
is invoked to clean up the global resource.
The implementation includes:
- A new set of ABI definitions (luo_flb_ser, luo_flb_head_ser) and a
corresponding FDT node (luo-flb) to serialize the state of all active
FLBs and pass them via Kexec Handover.
- Core logic in luo_flb.c to manage FLB registration, reference
counting, and the invocation of lifecycle callbacks.
- An API (liveupdate_flb_get/_incoming/_outgoing) for other kernel
subsystems to safely access the live object managed by an FLB, both
before and after the live update.
This framework provides the necessary infrastructure for more complex
subsystems like IOMMU, VFIO, and KVM to integrate with the Live Update
Orchestrator.
Link: https://lkml.kernel.org/r/20251218155752.3045808-5-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Gow <davidgow@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Switch LUO to use the private list iterators.
Link: https://lkml.kernel.org/r/20251218155752.3045808-4-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Gow <davidgow@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The custom definition of 'struct timespec64' is incompatible with both the
kernel's internal definition and the glibc type, at least on big-endian
targets that have the tv_nsec field in a different place, and the
definition clashes with any userspace that also defines a timespec64
structure.
Running the header check with -Wpadding enabled produces this output that
warns about the incorrect padding:
usr/include/linux/taskstats.h:25:1: error: padding struct size to alignment boundary with 4 bytes [-Werror=padded]
Remove the hack and instead use the regular __kernel_timespec type that is
meant to be used in uapi definitions.
Link: https://lkml.kernel.org/r/20260202095906.1344100-1-arnd@kernel.org
Fixes: 29b63f6eff0e ("delayacct: add timestamp of delay max")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Jiang Kun <jiang.kun2@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
performance regressions in the recent rewrite
of the SCHED_MM_CID management code:
- Fix livelock triggered by BPF CI testing
- Fix hard lockup on weakly ordered systems
- Simplify the dropping of CIDs in the exit path
by removing an unintended transition phase.
- Fix performance/scalability regression on a
thread-pool benchmark by optimizing transitional
CIDs when scheduling out.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmHDvQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hdPBAAgnl/L09wF8WCQLSoLrhr71FmS6fZApDB
Rvov2be8tGJR0BsrJF5uOKTNjulqUIr0mfO73fdHZftdFuhm/WLnWjBO62GhCKMg
d8kXOVZ7PudFN+QwL17pOAub8voh9s9/mceE/hZ3M5eNjXlG4sAcpyGvnrTLLYru
rfzO48NOpy5NMfbxU5/f9nojfr2t8fhnpX2QjquOhEPpl/BeYzexTZK7h2IJXqTK
tkU6IY9X8fT7y8LkKbTCIMJvEuWawHj1DSW2EiWNPJZkX+Hk5ZHttg28JjROavEy
orgairCSCT/cOETKugfToFd0Z4WlmemY6Nk5Kyx//WiFQ/u0HHlFVgMJoJfQEovV
MtIxLVygVbEoQyTszZyFUlTQjrnH8uKxXYhh1mX5wSj9lyDfpfJZycFFA2RpE4Rw
/+pvH08BfR4FgpqTfojfgOnuK/575VsomaVghritoNW3bAie1kpnWIeBaXS8lL4O
0pkK7XX8ng6hXuZTMxgXXfkfUB6oM1Yp1OZJAEzUvftsK0FQ5q3e0WxD+pdVza2s
PfQPaA7bT/G7y8k4LIXm59/tPX2QWPwe0yci00NbyfWiOdxHSgS7crQO8E1+VAiq
TcLGZNj/wFL6B5ghaiUIi22Mo+WnLX8fW+aiIjSiUQILmbNZXYmwtfEFsvsahh9W
/RkE/WQ492E=
=/PkF
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Miscellaneous MMCID fixes to address bugs and performance regressions
in the recent rewrite of the SCHED_MM_CID management code:
- Fix livelock triggered by BPF CI testing
- Fix hard lockup on weakly ordered systems
- Simplify the dropping of CIDs in the exit path by removing an
unintended transition phase
- Fix performance/scalability regression on a thread-pool benchmark
by optimizing transitional CIDs when scheduling out"
* tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/mmcid: Optimize transitional CIDs when scheduling out
sched/mmcid: Drop per CPU CID immediately when switching to per task mode
sched/mmcid: Protect transition on weakly ordered systems
sched/mmcid: Prevent live lock on task to CPU mode transition
Replace BUG_ON() with panic() in panic_on_wq_watchdog(). This is not
a bug condition but a deliberate forced panic requested by the user
via module parameters to crash the system for debugging purposes.
Using panic() instead of BUG_ON() makes this intent clearer and provides
more informative output about which threshold was exceeded and the actual
values, making it easier to diagnose the stall condition from crash dumps.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a new module parameter 'panic_on_stall_time' that triggers a panic
when a workqueue stall persists for longer than the specified duration
in seconds.
Unlike 'panic_on_stall' which counts accumulated stall events, this
parameter triggers based on the duration of a single continuous stall.
This is useful for catching truly stuck workqueues rather than
accumulating transient stalls.
Usage:
workqueue.panic_on_stall_time=120
This would panic if any workqueue pool has been stalled for 120 seconds
or more.
The stall duration is measured from the workqueue last progress
(poll_ts) which accounts for legitimate system stalls.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Bump up the Clang minimum version requirements
for livepatch builds, due to Clang assembler
section handling bugs causing silent
miscompilations.
- Strip livepatching symbol artifacts from
non-livepatch modules.
- Fix livepatch build warnings when certain
Clang LTO options are enabled.
- Fix livepatch build error when
CONFIG_MEM_ALLOC_PROFILING_DEBUG=y.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmHCvwRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hkTQ/8DhLI5m9CqhMaouuR5Vm9POKgFOXAe6uz
eHTuKhpJlw+anhNjeUA7PtYbnkrj0j+aNo5SmrfD4Yx8CW7dCt+oO3Y5ziVhkPHw
46Q/9KbmcT11uPbYywp/G4b15FF8YYu9slzhGau/Wa9H+oqU/WPoGapJsMPpUtBo
s0qGxdr2G3WZyD9H/wgCyhMwCOkAYMJ0sHxpGgRajLefDutsRvtlau/ktYU79vI3
nUFteD3YDIAUblBtZPogsCP36QJlx7TWCUNK02vPeOYRh3xPjf3iG+vgf1+sjZHV
P20psekpDwhh1KyeVziUyihUy8TmEVVozRvsrUKVlXmEqLqDtNrysqKTSp5/yRdT
MqgNwDrvv2wW/DKYJhefbuttx3ppAErrnJ3zC9TYSmdn27feKPDJcD1OmdS3BIpH
x9u/eVbOS8xbsOc/t3/Al7CRazvjLU0+OXsMJWbmAaO3tE7SwHq2aOnGbLGjvwWC
Ts1AYfNp4H41CLLFnmKR8q2t/DOBhefW8p3cR5U+cVQ7PdqKRT+TwKWVnCrbrBcJ
71702IrqoghwUrmhtxdZR0jZLtwb80s8zqdbrHojXSbXUFfYwwHNNaW1NSEpdSr4
W8xYfSPrK71OGZ6oh/v1Wce6D+mxqb6kYD8DRLgOdbxrnvLwUhGXxxNDjfXA0f4B
abjGHlAyoCs=
=cdu2
-----END PGP SIGNATURE-----
Merge tag 'objtool-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar::
- Bump up the Clang minimum version requirements for livepatch
builds, due to Clang assembler section handling bugs causing
silent miscompilations
- Strip livepatching symbol artifacts from non-livepatch modules
- Fix livepatch build warnings when certain Clang LTO options
are enabled
- Fix livepatch build error when CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
* tag 'objtool-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool/klp: Fix unexported static call key access for manually built livepatch modules
objtool/klp: Fix symbol correlation for orphaned local symbols
livepatch: Free klp_{object,func}_ext data after initialization
livepatch: Fix having __klp_objects relics in non-livepatch modules
livepatch/klp-build: Require Clang assembler >= 20
- Export irq_domain_free_irqs() to allow PCI/MSI drivers that tear down
MSI domains to be built as modules (Aaron Kling)
- Export tegra_cpuidle_pcie_irqs_in_use(), which disables Tegra CC6 while
PCI IRQs are in use, so pci-tegra can be built as a module (Aaron Kling)
- Allow pci-tegra to be built as a module (Aaron Kling)
* pci/controller/tegra:
PCI: tegra: Allow building as a module
cpuidle: tegra: Export tegra_cpuidle_pcie_irqs_in_use()
irqdomain: Export irq_domain_free_irqs()
Take care of rqspinlock error in bpf_local_storage_{map_free, destroy}()
properly by switching to bpf_selem_unlink_nofail().
Both functions iterate their own RCU-protected list of selems and call
bpf_selem_unlink_nofail(). In map_free(), to prevent infinite loop when
both map_free() and destroy() fail to remove a selem from b->list
(extremely unlikely), switch to hlist_for_each_entry_rcu(). In destroy(),
also switch to hlist_for_each_entry_rcu() since we no longer iterate
local_storage->list under local_storage->lock.
bpf_selem_unlink() now becomes dedicated to helpers and syscalls paths
so reuse_now should always be false. Remove it from the argument and
hardcode it.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-12-ameryhung@gmail.com
Introduce bpf_selem_unlink_nofail() to properly handle errors returned
from rqspinlock in bpf_local_storage_map_free() and
bpf_local_storage_destroy() where the operation must succeeds.
The idea of bpf_selem_unlink_nofail() is to allow an selem to be
partially linked and use atomic operation on a bit field, selem->state,
to determine when and who can free the selem if any unlink under lock
fails. An selem initially is fully linked to a map and a local storage.
Under normal circumstances, bpf_selem_unlink_nofail() will be able to
grab locks and unlink a selem from map and local storage in sequeunce,
just like bpf_selem_unlink(), and then free it after an RCU grace period.
However, if any of the lock attempts fails, it will only clear
SDATA(selem)->smap or selem->local_storage depending on the caller and
set SELEM_MAP_UNLINKED or SELEM_STORAGE_UNLINKED according to the
caller. Then, after both map_free() and destroy() see the selem and the
state becomes SELEM_UNLINKED, one of two racing caller can succeed in
cmpxchg the state from SELEM_UNLINKED to SELEM_TOFREE, ensuring no
double free or memory leak.
To make sure bpf_obj_free_fields() is done only once and when map is
still present, it is called when unlinking an selem from b->list under
b->lock.
To make sure uncharging memory is done only when the owner is still
present in map_free(), block destroy() from returning until there is no
pending map_free().
Since smap may not be valid in destroy(), bpf_selem_unlink_nofail()
skips bpf_selem_unlink_storage_nolock_misc() when called from destroy().
This is okay as bpf_local_storage_destroy() will return the remaining
amount of memory charge tracked by mem_charge to the owner to uncharge.
It is also safe to skip clearing local_storage->owner and owner_storage
as the owner is being freed and no users or bpf programs should be able
to reference the owner and using local_storage.
Finally, access of selem, SDATA(selem)->smap and selem->local_storage
are racy. Callers will protect these fields with RCU.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-11-ameryhung@gmail.com
The next patch will introduce bpf_selem_unlink_nofail() to handle
rqspinlock errors. bpf_selem_unlink_nofail() will allow an selem to be
partially unlinked from map or local storage. Save memory allocation
method in selem so that later an selem can be correctly freed even when
SDATA(selem)->smap is init to NULL.
In addition, keep track of memory charge to the owner in local storage
so that later bpf_selem_unlink_nofail() can return the correct memory
charge to the owner. Updating local_storage->mem_charge is protected by
local_storage->lock.
Finally, extract miscellaneous tasks performed when unlinking an selem
from local_storage into bpf_selem_unlink_storage_nolock_misc(). It will
be reused by bpf_selem_unlink_nofail().
This patch also takes the chance to remove local_storage->smap, which
is no longer used since commit f484f4a3e0 ("bpf: Replace bpf memory
allocator with kmalloc_nolock() in local storage").
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-10-ameryhung@gmail.com
Percpu locks have been removed from cgroup and task local storage. Now
that all local storage no longer use percpu variables as locks preventing
recursion, there is no need to pass them to bpf_local_storage_map_free().
Remove the argument from the function.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-9-ameryhung@gmail.com
The percpu counter in cgroup local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-8-ameryhung@gmail.com
The percpu counter in task local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Since the percpu counter is removed, merge back bpf_task_storage_get()
and bpf_task_storage_get_recur(). This will allow the bpf syscalls and
helpers to run concurrently on the same CPU, removing the spurious
-EBUSY error. bpf_task_storage_get(..., F_CREATE) will now always
succeed with enough free memory unless being called recursively.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-7-ameryhung@gmail.com
Change bpf_local_storage::lock and bpf_local_storage_map_bucket::lock
from raw_spin_lock to rqspinlock.
Finally, propagate errors from raw_res_spin_lock_irqsave() to syscall
return or BPF helper return.
In bpf_local_storage_destroy(), ignore return from
raw_res_spin_lock_irqsave() for now. A later patch will correctly
handle errors correctly in bpf_local_storage_destroy() so that it can
unlink selems even when failing to acquire locks.
For __bpf_local_storage_map_cache(), instead of handling the error,
skip updating the cache.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-6-ameryhung@gmail.com
To prepare changing both bpf_local_storage_map_bucket::lock and
bpf_local_storage::lock to rqspinlock, convert bpf_selem_unlink() to
failable. It still always succeeds and returns 0 until the change
happens. No functional change.
Open code bpf_selem_unlink_storage() in the only caller,
bpf_selem_unlink(), since unlink_map and unlink_storage must be done
together after all the necessary locks are acquired.
For bpf_local_storage_map_free(), ignore the return from
bpf_selem_unlink() for now. A later patch will allow it to unlink selems
even when failing to acquire locks.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-5-ameryhung@gmail.com
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_link_map() to failable. It still always succeeds and
returns 0 until the change happens. No functional change.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-4-ameryhung@gmail.com
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_unlink_map() to failable. It still always succeeds and
returns 0 for now.
Since some operations updating local storage cannot fail in the middle,
open-code bpf_selem_unlink_map() to take the b->lock before the
operation. There are two such locations:
- bpf_local_storage_alloc()
The first selem will be unlinked from smap if cmpxchg owner_storage_ptr
fails, which should not fail. Therefore, hold b->lock when linking
until allocation complete. Helpers that assume b->lock is held by
callers are introduced: bpf_selem_link_map_nolock() and
bpf_selem_unlink_map_nolock().
- bpf_local_storage_update()
The three step update process: link_map(new_selem),
link_storage(new_selem), and unlink_map(old_selem) should not fail in
the middle.
In bpf_selem_unlink(), bpf_selem_unlink_map() and
bpf_selem_unlink_storage() should either all succeed or fail as a whole
instead of failing in the middle. So, return if unlink_map() failed.
Remove the selem_linked_to_map_lockless() check as an selem in the
common paths (not bpf_local_storage_map_free() or
bpf_local_storage_destroy()), will be unlinked under b->lock and
local_storage->lock and therefore no other threads can unlink the selem
from map at the same time.
In bpf_local_storage_destroy(), ignore the return of
bpf_selem_unlink_map() for now. A later patch will allow
bpf_local_storage_destroy() to unlink selems even when failing to
acquire locks.
Note that while this patch removes all callers of selem_linked_to_map(),
a later patch that introduces bpf_selem_unlink_nofail() will use it
again.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-3-ameryhung@gmail.com