- Fix use-after-free in ccp.
- Fix bug when SEV is disabled in ccp.
- Fix tfm_count leak in atmel-sha204a.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmmqQ0cACgkQxycdCkmx
i6eUzg/+PHIyLgwiTcFljyxlJo8g0jDkbaaA+U3E+QImkgCgh+LdhvWp9g1nU2m/
dTcSAue2Vkgvnht7QOm2rQ/CWX+yjs9QOlztpyl7imajFIoTxQ/wRbJ+YO1T07Vr
ZQmEO6B/N/GqIlSGvUXVVy9TZN5DhgXol7jWZtmQU+EYjFrLsWUMZd2Q3fDT6eoF
3QdbJBymblv78WxJ0PtMkEpqCF2ikcDCrX4mokrrXZXyakp263pzH7pbWl1xLXR5
fUZoTsGcyh2Me2Fcsanip8SOqj8DK80UguDZZki8tcHjH4tbVPyjJ0TqQwq8cMMb
rCkGCfbV1g1+YNRuZb4YbMUeSXDqwf7G2FaI/DaxWKPYPik9yUnRt0WXAUXAX9wK
NQ4V+PTlihkT+r4WqjVH1E8EpWSoS0oCWOE+1NwsW1idpo7UHNZPoR5BVrzp5jJ/
iW/hO28djHFebzQ1DZmiDCWvMKAHdUVN9gAFB0uyOoYgdWn1TMbuk83glGhYqHHV
mCP62GuFJrZov4MO4EpiwlJi5pN1uRS69ZOFkHrM9db/3xnBO9zE48d9ePPLu+VM
ng5N1Nwh1VXeR5mwk/OBSvHHN3vI76NU2fcfgLOQy5Z0YD9kk6jyydcKhmZP+BBN
zXiJrsOhxUOpe2/TLgYCL+RNqzGMpuAtftcdi817TwopEXkheE4=
=P2SQ
-----END PGP SIGNATURE-----
Merge tag 'v7.0-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
- Fix use-after-free in ccp
- Fix bug when SEV is disabled in ccp
- Fix tfm_count leak in atmel-sha204a
* tag 'v7.0-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: atmel-sha204a - Fix OOM ->tfm_count leak
crypto: ccp - Fix use-after-free on error path
crypto: ccp - allow callers to use HV-Fixed page API when SEV is disabled
- Fix a problem where the deferred non-NCQ command would incorrectly get
completed as a failed command, if there was another command that timed
out. Found by Gemini. (Guenter)
- The deferred non-NCQ command work is only supposed to run after the
last NCQ command finishes. However, because the work was never canceled
on error (e.g. a timeout), the work could incorrectly run when commands
were still in flight. Found by syzbot. (me)
- Add a quirk to make sure that QEMU harddrives can potentially use up to
32 MiB I/Os. (Pedro)
- Add a quirk to disable LPM on Seagate ST1000DM010-2EP102. (Maximilian)
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRN+ES/c4tHlMch3DzJZDGjmcZNcgUCaarKRAAKCRDJZDGjmcZN
cvqoAQCU9tcNQZcVRtR+QpWmk9ZDqPFpI0qde9TWxVxByx1lUQEAubgAkf0F0Dp3
4eBP43Tniv/7SlPCKjbhEWEq6rvX7gA=
=YBsd
-----END PGP SIGNATURE-----
Merge tag 'ata-7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fixes from Niklas Cassel:
- Fix a problem where the deferred non-NCQ command would incorrectly
get completed as a failed command, if there was another command that
timed out. Found by Gemini (Guenter)
- The deferred non-NCQ command work is only supposed to run after the
last NCQ command finishes. However, because the work was never
canceled on error (e.g. a timeout), the work could incorrectly run
when commands were still in flight. Found by syzbot (me)
- Add a quirk to make sure that QEMU harddrives can potentially use up
to 32 MiB I/Os (Pedro)
- Add a quirk to disable LPM on Seagate ST1000DM010-2EP102 (Maximilian)
* tag 'ata-7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-eh: Fix detection of deferred qc timeouts
ata: libata-core: Add BRIDGE_OK quirk for QEMU drives
ata: libata: cancel pending work after clearing deferred_qc
ata: libata-core: Disable LPM on ST1000DM010-2EP102
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmqPRMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplf5D/9uOsBr+OGXtkLUJtD6MiwoJUsYgYF2dMIx
epcp+8RdMaOGtigtx69QXzTP5aPjA+AvBLAMYM+QDQDAPMWbRPsD7LaCYHy7ekwA
OL68R3QRTMYPPgpuf7pKyhif7olozAvoWAnRaoWlo67rbK+mTzZsTIsgTwF4zUu6
T0dL9thbWqtJMxKSuUk+DywggvGyNZWICJ3rAZ6os2htruH0fPhsJNGVFgNXMnpe
Cy2OvWxBWRQkZnpDEocZUdYyCRVhHr7hu311j6nSLNXufqpgFmWLGO4C3vetOlgx
ulEHfGNINcSLcw9R8pNWRxU14V6iw8Oy4nU9RtZhUpF32Iasvxb4H0w76Dp9Ukq1
/DuoSkWg/Ahn24xSYxJwwZpOEE8L92pn0M2ukCfC6h7ytmDjjEL1AQ2kyFHV4mR3
nc/3FkQ0abe3HHk8Rit6+txe3sSQo5no1z8kFlb9yp2MwAmonxCCQ9N1s7pxeeP+
iLaPbGMaZ7Ra1GswD/vzxFQtkglsxLuM5D0JkjHe99a54ZnF0vF3y9jeDVOQbV1C
H6/bU/2DI3SQ8xqv6tIXQ22reyRen3ao5VKLSrmrT/tDQVoEBV5SMnJFO1J8jBP4
QST03wiu8ShHSyZ98KefwlsndrTX02V9UVD4FVj+TZXwCWltulnIR4dVYFdySWwW
d613iUsWJw==
=NNcQ
-----END PGP SIGNATURE-----
Merge tag 'block-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Improve quirk visibility and configurability (Maurizio)
- Fix runtime user modification to queue setup (Keith)
- Fix multipath leak on try_module_get failure (Keith)
- Ignore ambiguous spec definitions for better atomics support
(John)
- Fix admin queue leak on controller reset (Ming)
- Fix large allocation in persistent reservation read keys
(Sungwoo Kim)
- Fix fcloop callback handling (Justin)
- Securely free DHCHAP secrets (Daniel)
- Various cleanups and typo fixes (John, Wilfred)
- Avoid a circular lock dependency issue in the sysfs nr_requests or
scheduler store handling
- Fix a circular lock dependency with the pcpu mutex and the queue
freeze lock
- Cleanup for bio_copy_kern(), using __bio_add_page() rather than the
bio_add_page(), as adding a page here cannot fail. The exiting code
had broken cleanup for the error condition, so make it clear that the
error condition cannot happen
- Fix for a __this_cpu_read() in preemptible context splat
* tag 'block-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
block: use trylock to avoid lockdep circular dependency in sysfs
nvme: fix memory allocation in nvme_pr_read_keys()
block: use __bio_add_page in bio_copy_kern
block: break pcpu_alloc_mutex dependency on freeze_lock
blktrace: fix __this_cpu_read/write in preemptible context
nvme-multipath: fix leak on try_module_get failure
nvmet-fcloop: Check remoteport port_state before calling done callback
nvme-pci: do not try to add queue maps at runtime
nvme-pci: cap queue creation to used queues
nvme-pci: ensure we're polling a polled queue
nvme: fix memory leak in quirks_param_set()
nvme: correct comment about nvme_ns_remove()
nvme: stop setting namespace gendisk device driver data
nvme: add support for dynamic quirk configuration via module parameter
nvme: fix admin queue leak on controller reset
nvme-fabrics: use kfree_sensitive() for DHCHAP secrets
nvme: stop using AWUPF
nvme: expose active quirks in sysfs
nvme/host: fixup some typos
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmqPTAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpueIEACHF0uws+uZSEsy9LUyC9ha8+5YN9szIJ3K
QUGxa9pmCnQG5K50KpxyEYP6buaJDy1smJGgD2obkeGncC4w6xKK2kQTQV1U+C1C
YA+7B/3HLhz5AWS6GIbRy6VZ599I4evlF8W79dX8BTnF8Y1ddkSuUnKx//q0AoQZ
hr3foglcFlchy8JuQ2/MpxzfOouvNMdMmeUN4O+t8iXDrmePFYIOgrLcT+ObgC5D
SXWx2cc3hMJ35hcSzedMWEBFcXnkX9nfh8Hd/+uPRcKsIwS8kCo6z01GoT/BCPRA
jdrxAfoYSL16HPfq6GU52n6iCaRd+5NK+tt/ECCzTxGL32Hadrr+nxw4O7g3Q96u
07zeaqHSoTGUchtlqrGjALQLP2yxdACEjxMh3rfdStRv3x3bbbVVDdioVEzPukCr
EBA+AbqaaG3LIYXwcY+15zx5NrAfeBAP1RjLgoV0s2ch4ghEqvnZGY4NLBDkcQ2R
97tM9+OdecBrsnlQr5GBoDbwpqc2pDEqSjkYDwoXqvqXs0DrMRq2MQw1Hjjh7Z7G
FZx1KNTiLB/YQ0sSyMcUKnH+qBA0FxwN/C6dDnRjj4dH5YsoeG/GhsS3B00a+0yE
S3MKrsf+uN21OYLVPSTEN6qS+02ZvK6E/Aw7/fk2IV60JMeM5KvCccmxa53dKls8
iyEJ7nVLOg==
=xyKA
-----END PGP SIGNATURE-----
Merge tag 'io_uring-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
- Fix a typo in the mock_file help text
- Fix a comment regarding IORING_SETUP_TASKRUN_FLAG in the
io_uring.h UAPI header
- Use READ_ONCE() for reading refill queue entries
- Reject SEND_VECTORIZED for fixed buffer sends, as it isn't
implemented. Currently this flag is silently ignored
This is in preparation for making these work, but first we
need a fixup so that older kernels will correctly reject them
- Ensure "0" means default for the rx page size
* tag 'io_uring-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/zcrx: use READ_ONCE with user shared RQEs
io_uring/mock: Fix typo in help text
io_uring/net: reject SEND_VECTORIZED when unsupported
io_uring: correct comment for IORING_SETUP_TASKRUN_FLAG
io_uring/zcrx: don't set rx_page_size when not requested
kthread_exit became a macro to do_exit in commit 28aaa9c399
("kthread: consolidate kthread exit paths to prevent use-after-free"),
so there is no kthread_exit function BTF ID to resolve. Remove it from
noreturn_deny to avoid resolve_btfids unresolved symbol warnings.
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the ata_qc_for_each_raw() loop finishes without finding a matching SCSI
command for any QC, the variable qc will hold a pointer to the last element
examined, which has the tag i == ATA_MAX_QUEUE - 1. This qc can match the
port deferred QC (ap->deferred_qc).
If that happens, the condition qc == ap->deferred_qc evaluates to true
despite the loop not breaking with a match on the SCSI command for this QC.
In that case, the error handler mistakenly intercepts a command that has
not been issued yet and that has not timed out, and thus erroneously
returning a timeout error.
Fix the problem by checking for i < ATA_MAX_QUEUE in addition to
qc == ap->deferred_qc.
The problem was found by an experimental code review agent based on
gemini-3.1-pro while reviewing backports into v6.18.y.
Assisted-by: Gemini:gemini-3.1-pro
Fixes: eddb98ad93 ("ata: libata-eh: correctly handle deferred qc timeouts")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
[cassel: modified commit log as suggested by Damien]
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Prevent CONFIG_FS_VERITY from being enabled when the page size is 256K,
since it doesn't work in that case.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCaaj+ThQcZWJpZ2dlcnNA
a2VybmVsLm9yZwAKCRDzXCl4vpKOK9PdAP0bl5KEYBjtUmgg4Olv+GOLegWCZRsq
WvGJgIUDDZQfyAEAta6nN5CZ4LbngtWtPsNWVihyoaME2b7yrF/kz7l07g8=
=/rYw
-----END PGP SIGNATURE-----
Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux
Pull fsverity fix from Eric Biggers:
"Prevent CONFIG_FS_VERITY from being enabled when the page size is
256K, since it doesn't work in that case"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: add dependency on 64K or smaller pages
- Several test fixes:
- Fix flakiness in the interrupt context tests in certain VMs.
- Make the lib/crypto/ KUnit tests depend on the corresponding
library options rather than selecting them. This follows the
standard KUnit convention, and it fixes an issue where enabling
CONFIG_KUNIT_ALL_TESTS pulled in all the crypto library code.
- Add a kunitconfig file for lib/crypto/.
- Fix a couple stale references to "aes-generic" that made it in
concurrently with the rename to "aes-lib".
- Update the help text for several CRYPTO kconfig options to remove
outdated information about users that now use the library instead.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCaaj9YRQcZWJpZ2dlcnNA
a2VybmVsLm9yZwAKCRDzXCl4vpKOKyFZAP0b6KbEjGnQGE00nh6ChNrWs8RgyNx9
y+gtR2EMKnUf1wEA/q/VBPl5UgO3z4TrmJuSrpyMLzz+/XwUGJHKPEqTtwU=
=iJZE
-----END PGP SIGNATURE-----
Merge tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull crypto library fixes from Eric Biggers:
- Several test fixes:
- Fix flakiness in the interrupt context tests in certain VMs
- Make the lib/crypto/ KUnit tests depend on the corresponding
library options rather than selecting them. This follows the
standard KUnit convention, and it fixes an issue where enabling
CONFIG_KUNIT_ALL_TESTS pulled in all the crypto library code
- Add a kunitconfig file for lib/crypto/
- Fix a couple stale references to "aes-generic" that made it in
concurrently with the rename to "aes-lib"
- Update the help text for several CRYPTO kconfig options to remove
outdated information about users that now use the library instead
* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
crypto: testmgr - Fix stale references to aes-generic
crypto: Clean up help text for CRYPTO_CRC32
crypto: Clean up help text for CRYPTO_CRC32C
crypto: Clean up help text for CRYPTO_XXHASH
crypto: Clean up help text for CRYPTO_SHA256
crypto: Clean up help text for CRYPTO_BLAKE2B
lib/crypto: tests: Add a .kunitconfig file
lib/crypto: tests: Depend on library options rather than selecting them
kunit: irq: Ensure timer doesn't fire too frequently
- Revert a commit related to ACPI device power management that was
not supposed to make any functional difference, but it did so and
introduced a regression (Rafael Wysocki)
- Update the _CPC object definition in ACPICA to match ACPI 6.6 and
prevent the kernel from printing a false-positive warning regarding
_CPC output package format on platforms shipping with firmware based
on ACPI 6.6 (Saket Dumbre)
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmmpxGcSHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1JK0H/0/nuSCYO/mtbdS9fwRsXrSd+Yphlkij
F+mF9NQx5HPzW4PbHT4erqQe02Kohyb40t57C+4LiaenbKXftPbmgO+d5RvN1AkK
L78f/pdxhpsN/MisUl466UgFwPiuv4XnMSrF9VpZAF16g4zqwWpyvZ0Z3qp0a1LW
3zBPl3Jag37lnKqCaPti/84R15uxn2wMpTGeaNZGNdG6lJwKCAVhKy3bvSp+6iQ6
QMUao6bWSCT2qq0kNWBxfWhrYhAYz3lCyu1qlSumB7qBt51eEXEmTW7O38DIDXMb
fVjlsNFDntIJl2vXn9zyM5kbOnpZ7EiOytKNli/LcxqtrY/PUWNQU6Y=
=CEPs
-----END PGP SIGNATURE-----
Merge tag 'acpi-7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI support fixes from Rafael Wysocki:
- Revert a commit related to ACPI device power management that was
not supposed to make any functional difference, but it did so and
introduced a regression (Rafael Wysocki)
- Update the _CPC object definition in ACPICA to match ACPI 6.6 and
prevent the kernel from printing a false-positive warning regarding
_CPC output package format on platforms shipping with firmware based
on ACPI 6.6 (Saket Dumbre)
* tag 'acpi-7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Revert "ACPI: PM: Let acpi_dev_pm_attach() skip devices without ACPI PM"
ACPICA: Update the _CPC definition to match ACPI 6.6
Current release - new code bugs:
- sched: cake: fixup cake_mq rate adjustment for diffserv config
- wifi: fix missing ieee80211_eml_params member initialization
Previous releases - regressions:
- tcp: give up on stronger sk_rcvbuf checks (for now)
Previous releases - always broken:
- net: fix rcu_tasks stall in threaded busypoll
- sched: fq: clear q->band_pkt_count[] in fq_reset()
- sched: only allow act_ct to bind to clsact/ingress qdiscs and
shared blocks
- bridge: check relevant per-VLAN options in VLAN range grouping
- xsk: fix fragment node deletion to prevent buffer leak
Misc:
- spring cleanup of inactive maintainers
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmmptYEACgkQMUZtbf5S
Irsraw/+L+L512Sbh1UlmbZjhT+AQkERHNkkfUMfXAeVb4uwHOqaydVdffvqRlTT
zOK8Oqzqf5ojRezDZ02skXnJTh39MF9IFiugF9JHApxwT2ALv0S7PXPFUJeRQeAY
+OiLT5+iy8wMfM6eryL6OtpM9PC8zwzH32oCYd5m4Ixf90Woj5G7x8Vooz7wUg1n
0cAliam8QLIRBrKXqctf7J8n23AK+WcrLcAt58J+qWCGqiiXdJXMvWXv1PjQ7vs/
KZysy0QaGwh3rw+5SquXmXwjhNIvvs58v3NV/4QbBdIKfJ5uYpTpyVgXJBQ6B4Jv
8SATHNwGbuUHuZl8OHn9ysaPCE3ZuD5pMnHbLnbKR6fyic95GxMIx/BNAOVvvwOH
l+GWEqch8hy6r+BVAJsoSEJzIf9aqUAlEhy0wEhVOP15yn5RWfMRQKpAaD6JKQYm
0Q6i+PsdS8xaANcUzi1Ec6aqyaX+iIBY6srE/twU3PW23Uv2ejqAG89x4s7t9LPu
GdMQ+iAEsR8Auph8Y5mshs4e9MrdlD3jzPCiFhkrqncWl/UcPpBgmHlD80vkTa1/
miMyYG5wq3g9pAFT43aAuoE85K6ZdIW0xGp3wGYMiW8Zy6Ea5EdnM2Wg8kbi/om0
W0pjfcI/2FInsZqK0g/PDeccWFKxl8C1SnfNDvy9rJHBwMkZHm4=
=XGBM
-----END PGP SIGNATURE-----
Merge tag 'net-7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from CAN, netfilter and wireless.
Current release - new code bugs:
- sched: cake: fixup cake_mq rate adjustment for diffserv config
- wifi: fix missing ieee80211_eml_params member initialization
Previous releases - regressions:
- tcp: give up on stronger sk_rcvbuf checks (for now)
Previous releases - always broken:
- net: fix rcu_tasks stall in threaded busypoll
- sched:
- fq: clear q->band_pkt_count[] in fq_reset()
- only allow act_ct to bind to clsact/ingress qdiscs and shared
blocks
- bridge: check relevant per-VLAN options in VLAN range grouping
- xsk: fix fragment node deletion to prevent buffer leak
Misc:
- spring cleanup of inactive maintainers"
* tag 'net-7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (138 commits)
xdp: produce a warning when calculated tailroom is negative
net: enetc: use truesize as XDP RxQ info frag_size
libeth, idpf: use truesize as XDP RxQ info frag_size
i40e: use xdp.frame_sz as XDP RxQ info frag_size
i40e: fix registering XDP RxQ info
ice: change XDP RxQ frag_size from DMA write length to xdp.frame_sz
ice: fix rxq info registering in mbuf packets
xsk: introduce helper to determine rxq->frag_size
xdp: use modulo operation to calculate XDP frag tailroom
selftests/tc-testing: Add tests exercising act_ife metalist replace behaviour
net/sched: act_ife: Fix metalist update behavior
selftests: net: add test for IPv4 route with loopback IPv6 nexthop
net: ipv6: fix panic when IPv4 route references loopback IPv6 nexthop
net: vxlan: fix nd_tbl NULL dereference when IPv6 is disabled
net: bridge: fix nd_tbl NULL dereference when IPv6 is disabled
MAINTAINERS: remove Thomas Falcon from IBM ibmvnic
MAINTAINERS: remove Claudiu Manoil and Alexandre Belloni from Ocelot switch
MAINTAINERS: replace Taras Chornyi with Elad Nachman for Marvell Prestera
MAINTAINERS: remove Jonathan Lemon from OpenCompute PTP
MAINTAINERS: replace Clark Wang with Frank Li for Freescale FEC
...
Merge a fix updating the _CPC object definition in ACPICA to avoid
printing a false-positive output package format warning on new
platforms (Saket Dumbre)
* acpica:
ACPICA: Update the _CPC definition to match ACPI 6.6
- Fix thresh_return of function graph tracer
The update to store data on the shadow stack removed the abuse of
using the task recursion word as a way to keep track of what functions
to ignore. The trace_graph_return() was updated to handle this, but
when function_graph tracer is using a threshold (only trace functions
that took longer than a specified time), it uses
trace_graph_thresh_return() instead. This function was still incorrectly
using the task struct recursion word causing the function graph tracer to
permanently set all functions to "notrace"
- Fix thresh_return nosleep accounting
When the calltime was moved to the shadow stack storage instead of being
on the fgraph descriptor, the calculations for the amount of sleep time
was updated. The calculation was done in the trace_graph_thresh_return()
function, which also called the trace_graph_return(), which did the
calculation again, causing the time to be doubled.
Remove the call to trace_graph_return() as what it needed to do wasn't
that much, and just do the work in trace_graph_thresh_return().
- Fix syscall trace event activation on boot up
The syscall trace events are pseudo events attached to the raw_syscall
tracepoints. When the first syscall event is enabled, it enables the
raw_syscall tracepoint and doesn't need to do anything when a second
syscall event is also enabled.
When events are enabled via the kernel command line, syscall events
are partially enabled as the enabling is called before rcu_init.
This is due to allow early events to be enabled immediately. Because
kernel command line events do not distinguish between different
types of events, the syscall events are enabled here but are not fully
functioning. After rcu_init, they are disabled and re-enabled so that
they can be fully enabled. The problem happened is that this
"disable-enable" is done one at a time. If more than one syscall event
is specified on the command line, by disabling them one at a time,
the counter never gets to zero, and the raw_syscall is not disabled and
enabled, keeping the syscall events in their non-fully functional state.
Instead, disable all events and re-enabled them all, as that will ensure
the raw_syscall event is also disabled and re-enabled.
- Disable preemption in ftrace pid filtering
The ftrace pid filtering attaches to the fork and exit tracepoints to
add or remove pids that should be traced. They access variables protected
by RCU (preemption disabled). Now that tracepoint callbacks are called with
preemption enabled, this protection needs to be added explicitly, and
not depend on the functions being called with preemption disabled.
- Disable preemption in event pid filtering
The event pid filtering needs the same preemption disabling guards as
ftrace pid filtering.
- Fix accounting of the memory mapped ring buffer on fork
Memory mapping the ftrace ring buffer sets the vm_flags to DONTCOPY. But
this does not prevent the application from calling madvise(MADVISE_DOFORK).
This causes the mapping to be copied on fork. After the first tasks exits,
the mapping is considered unmapped by everyone. But when he second task
exits, the counter goes below zero and triggers a WARN_ON.
Since nothing prevents two separate tasks from mmapping the ftrace ring
buffer (although two mappings may mess each other up), there's no reason
to stop the memory from being copied on fork.
Update the vm_operations to have an ".open" handler to update the
accounting and let the ring buffer know someone else has it mapped.
- Add all ftrace headers in MAINTAINERS file
The MAINTAINERS file only specifies include/linux/ftrace.h But misses
ftrace_irq.h and ftrace_regs.h. Make the file use wildcards to get all
*ftrace* files.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaamiIBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qulnAP9ZO6iChQL0hX/Xuu2VyRhVz0Svf8Sg
iq2IUHP48twOogEApR4zeelMORxdKqkLR+BajZUVFR1PukVbMaszPr9GoQw=
=H9pj
-----END PGP SIGNATURE-----
Merge tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix thresh_return of function graph tracer
The update to store data on the shadow stack removed the abuse of
using the task recursion word as a way to keep track of what
functions to ignore. The trace_graph_return() was updated to handle
this, but when function_graph tracer is using a threshold (only trace
functions that took longer than a specified time), it uses
trace_graph_thresh_return() instead.
This function was still incorrectly using the task struct recursion
word causing the function graph tracer to permanently set all
functions to "notrace"
- Fix thresh_return nosleep accounting
When the calltime was moved to the shadow stack storage instead of
being on the fgraph descriptor, the calculations for the amount of
sleep time was updated. The calculation was done in the
trace_graph_thresh_return() function, which also called the
trace_graph_return(), which did the calculation again, causing the
time to be doubled.
Remove the call to trace_graph_return() as what it needed to do
wasn't that much, and just do the work in
trace_graph_thresh_return().
- Fix syscall trace event activation on boot up
The syscall trace events are pseudo events attached to the
raw_syscall tracepoints. When the first syscall event is enabled, it
enables the raw_syscall tracepoint and doesn't need to do anything
when a second syscall event is also enabled.
When events are enabled via the kernel command line, syscall events
are partially enabled as the enabling is called before rcu_init. This
is due to allow early events to be enabled immediately. Because
kernel command line events do not distinguish between different types
of events, the syscall events are enabled here but are not fully
functioning. After rcu_init, they are disabled and re-enabled so that
they can be fully enabled.
The problem happened is that this "disable-enable" is done one at a
time. If more than one syscall event is specified on the command
line, by disabling them one at a time, the counter never gets to
zero, and the raw_syscall is not disabled and enabled, keeping the
syscall events in their non-fully functional state.
Instead, disable all events and re-enabled them all, as that will
ensure the raw_syscall event is also disabled and re-enabled.
- Disable preemption in ftrace pid filtering
The ftrace pid filtering attaches to the fork and exit tracepoints to
add or remove pids that should be traced. They access variables
protected by RCU (preemption disabled). Now that tracepoint callbacks
are called with preemption enabled, this protection needs to be added
explicitly, and not depend on the functions being called with
preemption disabled.
- Disable preemption in event pid filtering
The event pid filtering needs the same preemption disabling guards as
ftrace pid filtering.
- Fix accounting of the memory mapped ring buffer on fork
Memory mapping the ftrace ring buffer sets the vm_flags to DONTCOPY.
But this does not prevent the application from calling
madvise(MADVISE_DOFORK). This causes the mapping to be copied on
fork. After the first tasks exits, the mapping is considered unmapped
by everyone. But when he second task exits, the counter goes below
zero and triggers a WARN_ON.
Since nothing prevents two separate tasks from mmapping the ftrace
ring buffer (although two mappings may mess each other up), there's
no reason to stop the memory from being copied on fork.
Update the vm_operations to have an ".open" handler to update the
accounting and let the ring buffer know someone else has it mapped.
- Add all ftrace headers in MAINTAINERS file
The MAINTAINERS file only specifies include/linux/ftrace.h But misses
ftrace_irq.h and ftrace_regs.h. Make the file use wildcards to get
all *ftrace* files.
* tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Add MAINTAINERS entries for all ftrace headers
tracing: Fix WARN_ON in tracing_buffers_mmap_close
tracing: Disable preemption in the tracepoint callbacks handling filtered pids
ftrace: Disable preemption in the tracepoint callbacks handling filtered pids
tracing: Fix syscall events activation by ensuring refcount hits zero
fgraph: Fix thresh_return nosleeptime double-adjust
fgraph: Fix thresh_return clear per-task notrace
Larysa Zaremba says:
====================
Address XDP frags having negative tailroom
Aside from the issue described below, tailroom calculation does not account
for pages being split between frags, e.g. in i40e, enetc and
AF_XDP ZC with smaller chunks. These series address the problem by
calculating modulo (skb_frag_off() % rxq->frag_size) in order to get
data offset within a smaller block of memory. Please note, xskxceiver
tail grow test passes without modulo e.g. in xdpdrv mode on i40e,
because there is not enough descriptors to get to flipped buffers.
Many ethernet drivers report xdp Rx queue frag size as being the same as
DMA write size. However, the only user of this field, namely
bpf_xdp_frags_increase_tail(), clearly expects a truesize.
Such difference leads to unspecific memory corruption issues under certain
circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
all DMA-writable space in 2 buffers. This would be fine, if only
rxq->frag_size was properly set to 4K, but value of 3K results in a
negative tailroom, because there is a non-zero page offset.
We are supposed to return -EINVAL and be done with it in such case,
but due to tailroom being stored as an unsigned int, it is reported to be
somewhere near UINT_MAX, resulting in a tail being grown, even if the
requested offset is too much(it is around 2K in the abovementioned test).
This later leads to all kinds of unspecific calltraces.
[ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
[ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
[ 7340.338179] in libc.so.6[61c9d,7f4161aaf000+160000]
[ 7340.339230] in xskxceiver[42b5,400000+69000]
[ 7340.340300] likely on CPU 6 (core 0, socket 6)
[ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
[ 7340.340888] likely on CPU 3 (core 0, socket 3)
[ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
[ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
[ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
[ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
[ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
[ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
[ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
[ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
[ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
[ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
[ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
[ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
[ 7340.418229] FS: 0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
[ 7340.419489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
[ 7340.421237] PKRU: 55555554
[ 7340.421623] Call Trace:
[ 7340.421987] <TASK>
[ 7340.422309] ? softleaf_from_pte+0x77/0xa0
[ 7340.422855] swap_pte_batch+0xa7/0x290
[ 7340.423363] zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
[ 7340.424102] zap_pte_range+0x281/0x580
[ 7340.424607] zap_pmd_range.isra.0+0xc9/0x240
[ 7340.425177] unmap_page_range+0x24d/0x420
[ 7340.425714] unmap_vmas+0xa1/0x180
[ 7340.426185] exit_mmap+0xe1/0x3b0
[ 7340.426644] __mmput+0x41/0x150
[ 7340.427098] exit_mm+0xb1/0x110
[ 7340.427539] do_exit+0x1b2/0x460
[ 7340.427992] do_group_exit+0x2d/0xc0
[ 7340.428477] get_signal+0x79d/0x7e0
[ 7340.428957] arch_do_signal_or_restart+0x34/0x100
[ 7340.429571] exit_to_user_mode_loop+0x8e/0x4c0
[ 7340.430159] do_syscall_64+0x188/0x6b0
[ 7340.430672] ? __do_sys_clone3+0xd9/0x120
[ 7340.431212] ? switch_fpu_return+0x4e/0xd0
[ 7340.431761] ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
[ 7340.432498] ? do_syscall_64+0xbb/0x6b0
[ 7340.433015] ? __handle_mm_fault+0x445/0x690
[ 7340.433582] ? count_memcg_events+0xd6/0x210
[ 7340.434151] ? handle_mm_fault+0x212/0x340
[ 7340.434697] ? do_user_addr_fault+0x2b4/0x7b0
[ 7340.435271] ? clear_bhb_loop+0x30/0x80
[ 7340.435788] ? clear_bhb_loop+0x30/0x80
[ 7340.436299] ? clear_bhb_loop+0x30/0x80
[ 7340.436812] ? clear_bhb_loop+0x30/0x80
[ 7340.437323] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 7340.437973] RIP: 0033:0x7f4161b14169
[ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
[ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
[ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
[ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
[ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
[ 7340.444586] </TASK>
[ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
[ 7340.449650] ---[ end trace 0000000000000000 ]---
The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
drivers to not do this. Therefore, make tailroom a signed int and produce a
warning when it is negative to prevent such mistakes in the future.
The issue can also be easily reproduced with ice driver, by applying
the following diff to xskxceiver and enjoying a kernel panic in xdpdrv mode:
diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index 5af28f359cfd..042d587fa7ef 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -2541,8 +2541,8 @@ int testapp_adjust_tail_grow_mb(struct test_spec *test)
{
test->mtu = MAX_ETH_JUMBO_SIZE;
/* Grow by (frag_size - last_frag_Size) - 1 to stay inside the last fragment */
- return testapp_adjust_tail(test, (XSK_UMEM__MAX_FRAME_SIZE / 2) - 1,
- XSK_UMEM__LARGE_FRAME_SIZE * 2);
+ return testapp_adjust_tail(test, XSK_UMEM__MAX_FRAME_SIZE * 100,
+ 6912);
}
int testapp_tx_queue_consumer(struct test_spec *test)
If we print out the values involved in the tailroom calculation:
tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
4294967040 = 3456 - 3456 - 256
I personally reproduced and verified the issue in ice and i40e,
aside from WiP ixgbevf implementation.
====================
Link: https://patch.msgid.link/20260305111253.2317394-1-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects truesize instead of DMA
write size. Different assumptions in enetc driver configuration lead to
negative tailroom.
Set frag_size to the same value as frame_sz.
Fixes: 2768b2e2f7 ("net: enetc: register XDP RX queues with frag_size")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-9-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size instead
of DMA write size. Different assumptions in idpf driver configuration lead
to negative tailroom.
To make it worse, buffer sizes are not actually uniform in idpf when
splitq is enabled, as there are several buffer queues, so rxq->rx_buf_size
is meaningless in this case.
Use truesize of the first bufq in AF_XDP ZC, as there is only one. Disable
growing tail for regular splitq.
Fixes: ac8a861f63 ("idpf: prepare structures to support XDP")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-8-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size instead
of DMA write size. Different assumptions in i40e driver configuration lead
to negative tailroom.
Set frag_size to the same value as frame_sz in shared pages mode, use new
helper to set frag_size when AF_XDP ZC is active.
Fixes: a045d2f2d0 ("i40e: set xdp_rxq_info::frag_size")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-7-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current way of handling XDP RxQ info in i40e has a problem, where frag_size
is not updated when xsk_buff_pool is detached or when MTU is changed, this
leads to growing tail always failing for multi-buffer packets.
Couple XDP RxQ info registering with buffer allocations and unregistering
with cleaning the ring.
Fixes: a045d2f2d0 ("i40e: set xdp_rxq_info::frag_size")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-6-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects whole buff size instead
of DMA write size. Different assumptions in ice driver configuration lead
to negative tailroom.
This allows to trigger kernel panic, when using
XDP_ADJUST_TAIL_GROW_MULTI_BUFF xskxceiver test and changing packet size to
6912 and the requested offset to a huge value, e.g.
XSK_UMEM__MAX_FRAME_SIZE * 100.
Due to other quirks of the ZC configuration in ice, panic is not observed
in ZC mode, but tailroom growing still fails when it should not.
Use fill queue buffer truesize instead of DMA write size in XDP RxQ info.
Fix ZC mode too by using the new helper.
Fixes: 2fba7dc515 ("ice: Add support for XDP multi-buffer on Rx side")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-5-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
XDP RxQ info contains frag_size, which depends on the MTU. This makes the
old way of registering RxQ info before calculating new buffer sizes
invalid. Currently, it leads to frag_size being outdated, making it
sometimes impossible to grow tailroom in a mbuf packet. E.g. fragments are
actually 3K+, but frag size is still as if MTU was 1500.
Always register new XDP RxQ info after reconfiguring memory pools.
Fixes: 2fba7dc515 ("ice: Add support for XDP multi-buffer on Rx side")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-4-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rxq->frag_size is basically a step between consecutive strictly aligned
frames. In ZC mode, chunk size fits exactly, but if chunks are unaligned,
there is no safe way to determine accessible space to grow tailroom.
Report frag_size to be zero, if chunks are unaligned, chunk_size otherwise.
Fixes: 24ea50127e ("xsk: support mbuf on ZC RX")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-3-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current formula for calculating XDP tailroom in mbuf packets works only
if each frag has its own page (if rxq->frag_size is PAGE_SIZE), this
defeats the purpose of the parameter overall and without any indication
leads to negative calculated tailroom on at least half of frags, if shared
pages are used.
There are not many drivers that set rxq->frag_size. Among them:
* i40e and enetc always split page uniformly between frags, use shared
pages
* ice uses page_pool frags via libeth, those are power-of-2 and uniformly
distributed across page
* idpf has variable frag_size with XDP on, so current API is not applicable
* mlx5, mtk and mvneta use PAGE_SIZE or 0 as frag_size for page_pool
As for AF_XDP ZC, only ice, i40e and idpf declare frag_size for it. Modulo
operation yields good results for aligned chunks, they are all power-of-2,
between 2K and PAGE_SIZE. Formula without modulo fails when chunk_size is
2K. Buffers in unaligned mode are not distributed uniformly, so modulo
operation would not work.
To accommodate unaligned buffers, we could define frag_size as
data + tailroom, and hence do not subtract offset when calculating
tailroom, but this would necessitate more changes in the drivers.
Define rxq->frag_size as an even portion of a page that fully belongs to a
single frag. When calculating tailroom, locate the data start within such
portion by performing a modulo operation on page offset.
Fixes: bf25146a55 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-2-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add 2 test cases to exercise fix in act_ife's internal metalist
behaviour.
- Update decode ife action into encode with tcindex metadata
- Update decode ife action into encode with multiple metadata
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260304140603.76500-2-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiayuan Chen says:
====================
net: ipv6: fix panic when IPv4 route references loopback IPv6 nexthop and add selftest
syzbot reported a kernel panic [1] when an IPv4 route references
a loopback IPv6 nexthop object:
BUG: unable to handle page fault for address: ffff8d069e7aa000
PF: supervisor read access in kernel mode
PF: error_code(0x0000) - not-present page
PGD 6aa01067 P4D 6aa01067 PUD 0
Oops: Oops: 0000 [#1] SMP PTI
CPU: 2 UID: 0 PID: 530 Comm: ping Not tainted 6.19.0+ #193 PREEMPT
RIP: 0010:ip_route_output_key_hash_rcu+0x578/0x9e0
RSP: 0018:ffffd2ffc1573918 EFLAGS: 00010286
RAX: ffff8d069e7aa000 RBX: ffffd2ffc1573988 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffd2ffc1573978 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d060d496000
R13: 0000000000000000 R14: ffff8d060399a600 R15: ffff8d06019a6ab8
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8d069e7aa000 CR3: 0000000106eb0001 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
ip_route_output_key_hash+0x86/0x1a0
__ip4_datagram_connect+0x2b5/0x4e0
udp_connect+0x2c/0x60
inet_dgram_connect+0x88/0xd0
__sys_connect_file+0x56/0x90
__sys_connect+0xa8/0xe0
__x64_sys_connect+0x18/0x30
x64_sys_call+0xfb9/0x26e0
do_syscall_64+0xd3/0x1510
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Reproduction:
ip -6 nexthop add id 100 dev lo
ip route add 172.20.20.0/24 nhid 100
ping -c1 172.20.20.1 # kernel crash
Problem Description
When a standalone IPv6 nexthop object is created with a loopback device,
fib6_nh_init() misclassifies it as a reject route. Nexthop objects have
no destination prefix (fc_dst=::), so fib6_is_reject() always matches
any loopback nexthop. The reject path skips fib_nh_common_init(), leaving
nhc_pcpu_rth_output unallocated. When an IPv4 route later references
this nexthop and triggers a route lookup, __mkroute_output() calls
raw_cpu_ptr(nhc->nhc_pcpu_rth_output) on a NULL pointer, causing a page
fault.
The reject classification was designed for regular IPv6 routes to prevent
kernel routing loops, but nexthop objects should not be subject to this
check since they carry no destination information. Loop prevention is
handled separately when the route itself is created.
[1] https://syzkaller.appspot.com/bug?extid=334190e097a98a1b81bb
====================
Link: https://patch.msgid.link/20260304113817.294966-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a regression test for a kernel panic that occurs when an IPv4 route
references an IPv6 nexthop object created on the loopback device.
The test creates an IPv6 nexthop on lo, binds an IPv4 route to it, then
triggers a route lookup via ping to verify the kernel does not crash.
./fib_nexthops.sh
Tests passed: 249
Tests failed: 0
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260304113817.294966-3-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a standalone IPv6 nexthop object is created with a loopback device
(e.g., "ip -6 nexthop add id 100 dev lo"), fib6_nh_init() misclassifies
it as a reject route. This is because nexthop objects have no destination
prefix (fc_dst=::), causing fib6_is_reject() to match any loopback
nexthop. The reject path skips fib_nh_common_init(), leaving
nhc_pcpu_rth_output unallocated. If an IPv4 route later references this
nexthop, __mkroute_output() dereferences NULL nhc_pcpu_rth_output and
panics.
Simplify the check in fib6_nh_init() to only match explicit reject
routes (RTF_REJECT) instead of using fib6_is_reject(). The loopback
promotion heuristic in fib6_is_reject() is handled separately by
ip6_route_info_create_nh(). After this change, the three cases behave
as follows:
1. Explicit reject route ("ip -6 route add unreachable 2001:db8::/64"):
RTF_REJECT is set, enters reject path, skips fib_nh_common_init().
No behavior change.
2. Implicit loopback reject route ("ip -6 route add 2001:db8::/32 dev lo"):
RTF_REJECT is not set, takes normal path, fib_nh_common_init() is
called. ip6_route_info_create_nh() still promotes it to reject
afterward. nhc_pcpu_rth_output is allocated but unused, which is
harmless.
3. Standalone nexthop object ("ip -6 nexthop add id 100 dev lo"):
RTF_REJECT is not set, takes normal path, fib_nh_common_init() is
called. nhc_pcpu_rth_output is properly allocated, fixing the crash
when IPv4 routes reference this nexthop.
Suggested-by: Ido Schimmel <idosch@nvidia.com>
Fixes: 493ced1ac4 ("ipv4: Allow routes to use nexthop objects")
Reported-by: syzbot+334190e097a98a1b81bb@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/698f8482.a70a0220.2c38d7.00ca.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260304113817.294966-2-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When booting with the 'ipv6.disable=1' parameter, the nd_tbl is never
initialized because inet6_init() exits before ndisc_init() is called
which initializes it. If an IPv6 packet is injected into the interface,
route_shortcircuit() is called and a NULL pointer dereference happens on
neigh_lookup().
BUG: kernel NULL pointer dereference, address: 0000000000000380
Oops: Oops: 0000 [#1] SMP NOPTI
[...]
RIP: 0010:neigh_lookup+0x20/0x270
[...]
Call Trace:
<TASK>
vxlan_xmit+0x638/0x1ef0 [vxlan]
dev_hard_start_xmit+0x9e/0x2e0
__dev_queue_xmit+0xbee/0x14e0
packet_sendmsg+0x116f/0x1930
__sys_sendto+0x1f5/0x200
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x12f/0x1590
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix this by adding an early check on route_shortcircuit() when protocol
is ETH_P_IPV6. Note that ipv6_mod_enabled() cannot be used here because
VXLAN can be built-in even when IPv6 is built as a module.
Fixes: e15a00aafa ("vxlan: add ipv6 route short circuit support")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260304120357.9778-2-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When booting with the 'ipv6.disable=1' parameter, the nd_tbl is never
initialized because inet6_init() exits before ndisc_init() is called
which initializes it. Then, if neigh_suppress is enabled and an ICMPv6
Neighbor Discovery packet reaches the bridge, br_do_suppress_nd() will
dereference ipv6_stub->nd_tbl which is NULL, passing it to
neigh_lookup(). This causes a kernel NULL pointer dereference.
BUG: kernel NULL pointer dereference, address: 0000000000000268
Oops: 0000 [#1] PREEMPT SMP NOPTI
[...]
RIP: 0010:neigh_lookup+0x16/0xe0
[...]
Call Trace:
<IRQ>
? neigh_lookup+0x16/0xe0
br_do_suppress_nd+0x160/0x290 [bridge]
br_handle_frame_finish+0x500/0x620 [bridge]
br_handle_frame+0x353/0x440 [bridge]
__netif_receive_skb_core.constprop.0+0x298/0x1110
__netif_receive_skb_one_core+0x3d/0xa0
process_backlog+0xa0/0x140
__napi_poll+0x2c/0x170
net_rx_action+0x2c4/0x3a0
handle_softirqs+0xd0/0x270
do_softirq+0x3f/0x60
Fix this by replacing IS_ENABLED(IPV6) call with ipv6_mod_enabled() in
the callers. This is in essence disabling NS/NA suppression when IPv6 is
disabled.
Fixes: ed842faeb2 ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports")
Reported-by: Guruprasad C P <gurucp2005@gmail.com>
Closes: https://lore.kernel.org/netdev/CAHXs0ORzd62QOG-Fttqa2Cx_A_VFp=utE2H2VTX5nqfgs7LDxQ@mail.gmail.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260304120357.9778-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, whenever you boot with a QEMU drive over an AHCI interface,
you get:
[ 1.632121] ata1.00: applying bridge limits
This happens due to the kernel not believing the given drive is SATA,
since word 93 of IDENTIFY (ATA_ID_HW_CONFIG) is non-zero. The result is
a pretty severe limit in max_hw_sectors_kb, which limits our IO sizes.
QEMU has set word 93 erroneously for SATA drives but does not, in any
way, emulate any of these real hardware details. There is no PATA
drive and no SATA cable.
As such, add a BRIDGE_OK quirk for QEMU HARDDISK. Special care is taken
to limit this quirk to "2.5+", to allow for fixed future versions.
This results in the max_hw_sectors being limited solely by the
controller interface's limits. Which, for AHCI controllers, takes it
from 128KB to 32767KB.
Cc: stable@vger.kernel.org
Signed-off-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Jakub Kicinski says:
====================
MAINTAINERS: annual cleanup of inactive maintainers
Annual cleanup of inactive maintainers under networking.
The goal is to make sure MAINTAINERS reflect reality for
code which is relatively actively changed (at least 70 commits
in the last 2 years or at least 120 commits in the last 5 years).
Those who either:
- were the initial author / "upstreamer" of the driver; or
- authored at least 1/3rd of the exiting code base (per git blame); or
- authored at least 25% of commits before becoming inactive
are moved to CREDITS.
The discovery of inactive maintainers was done using gitdm tools,
with a bunch of ad-hoc scripts on top to do the rest. I tried to
double check the results but this is mostly a scripted cleanup
so please report inaccuracies if any.
====================
Link: https://patch.msgid.link/20260303215339.2333548-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have not seen emails or tags from Thomas's IBM address
(tlfalcon@linux.ibm.com) in over 5 years. Looks like Thomas
is active in perf tooling at Intel (thomas.falcon@intel.com).
Subsystem IBM Power SRIOV Virtual NIC Device Driver
Changes 49 / 134 (36%)
Last activity: 2025-08-26
Haren Myneni <haren@linux.ibm.com>:
Tags 3c14917953 2025-08-26 00:00:00 2
Rick Lindsley <ricklind@linux.ibm.com>:
Nick Child <nnac123@linux.ibm.com>:
Author d93a6caab5 2025-03-25 00:00:00 14
Tags d93a6caab5 2025-03-25 00:00:00 16
Thomas Falcon <tlfalcon@linux.ibm.com>:
Top reviewers:
[22]: drt@linux.ibm.com
[13]: horms@kernel.org
[9]: ricklind@linux.vnet.ibm.com
[3]: davemarq@linux.ibm.com
INACTIVE MAINTAINER Thomas Falcon <tlfalcon@linux.ibm.com>
Move Thomas to CREDITS as the initial author of ibmvnic.
Acked-by: Thomas Falcon <thomas.falcon@intel.com>
Link: https://patch.msgid.link/20260303215339.2333548-12-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have not seen tags from Claudiu for the Ocelot switch driver
in over 5 years. He is active upstream in other NXP subsystems
(ENETC, gianfar), with 46 emails on lore since 2024.
We have not seen tags from Alexandre for the Ocelot switch driver
in over 5 years. He is very active upstream in other subsystems
(RTC, I3C, Atmel/Microchip SoC), with over 1,200 emails on lore
since 2024.
Vladimir Oltean is active.
Subsystem OCELOT ETHERNET SWITCH DRIVER
Changes 180 / 494 (36%)
Last activity: 2026-02-12
Vladimir Oltean <vladimir.oltean@nxp.com>:
Author c22ba07c82 2026-02-10 00:00:00 33
Tags 026f6513c5 2026-02-12 00:00:00 39
Claudiu Manoil <claudiu.manoil@nxp.com>:
Alexandre Belloni <alexandre.belloni@bootlin.com>:
Top reviewers:
[49]: f.fainelli@gmail.com
[19]: horms@kernel.org
[10]: richardcochran@gmail.com
[9]: jacob.e.keller@intel.com
[8]: colin.foster@in-advantage.com
INACTIVE MAINTAINER Claudiu Manoil <claudiu.manoil@nxp.com>
Acked-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://patch.msgid.link/20260303215339.2333548-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have not seen emails or tags from Jonathan in over 5 years,
and there is no recent mailing list activity.
Vadim Fedorenko is active.
Subsystem OPENCOMPUTE PTP CLOCK DRIVER
Changes 49 / 130 (37%)
Last activity: 2025-11-25
Jonathan Lemon <jonathan.lemon@gmail.com>:
Vadim Fedorenko <vadim.fedorenko@linux.dev>:
Author d3ca2ef0c9 2025-09-19 00:00:00 5
Tags 648282e2d1 2025-11-25 00:00:00 20
Top reviewers:
[7]: horms@kernel.org
[4]: jiri@nvidia.com
[3]: richardcochran@gmail.com
[2]: aleksandr.loktionov@intel.com
INACTIVE MAINTAINER Jonathan Lemon <jonathan.lemon@gmail.com>
Add Jonathan to CREDITS as the initial author of ptp_ocp.
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20260303215339.2333548-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have not seen tags from Clark for FEC in over 5 years.
He has some limited recent activity on the mailing list in other
NXP subsystems (stmmac, phy). Wei Fang and Shenwei Wang are active,
with decent review coverage (61%).
Frank Li has been reviewing code actively more recenty, let's
make it official.
Subsystem FREESCALE IMX / MXC FEC DRIVER
Changes 57 / 92 (61%)
Last activity: 2026-02-10
Wei Fang <wei.fang@nxp.com>:
Author 25eb3058eb 2026-02-10 00:00:00 33
Tags 25eb3058eb 2026-02-10 00:00:00 61
Shenwei Wang <shenwei.wang@nxp.com>:
Author d466c16026 2025-09-14 00:00:00 6
Tags d466c16026 2025-09-14 00:00:00 6
Clark Wang <xiaoning.wang@nxp.com>:
Top reviewers:
[23]: Frank.Li@nxp.com
[17]: andrew@lunn.ch
[4]: csokas.bence@prolan.hu
[3]: horms@kernel.org
[2]: maxime.chevallier@bootlin.com
INACTIVE MAINTAINER Clark Wang <xiaoning.wang@nxp.com>
Reviewed-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260303215339.2333548-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have not seen tags from DENG Qingfang for the MediaTek
switch driver in over 5 years. He is active upstream with
PPP/PPPoE patches in net-next. Chester and Daniel are active.
Subsystem MEDIATEK SWITCH DRIVER
Changes 26 / 70 (37%)
Last activity: 2025-12-01
Chester A. Unal <chester.a.unal@arinc9.com>:
Tags 585943b7ad 2025-12-01 00:00:00 7
Daniel Golle <daniel@makrotopia.org>:
Author 497041d763 2025-04-23 00:00:00 2
Tags 3b87e60d21 2025-12-01 00:00:00 14
DENG Qingfang <dqfext@gmail.com>:
Sean Wang <sean.wang@mediatek.com>:
Top reviewers:
[4]: andrew@lunn.ch
[4]: florian.fainelli@broadcom.com
[4]: arinc.unal@arinc9.com
[2]: olteanv@gmail.com
INACTIVE MAINTAINER DENG Qingfang <dqfext@gmail.com>
Acked-by: Chester A. Unal <chester.a.unal@arinc9.com>
Link: https://patch.msgid.link/20260303215339.2333548-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have not seen tags from Sean in over 5 years,
with only one mailing list post since 2024.
Felix and Lorenzo are active for the Ethernet driver,
and Chester, Daniel and DENG Qingfang are active for
the switch driver.
Subsystem MEDIATEK ETHERNET DRIVER
Changes 55 / 113 (48%)
Last activity: 2025-10-12
Felix Fietkau <nbd@nbd.name>:
Author d473673711 2025-09-02 00:00:00 3
Tags d473673711 2025-09-02 00:00:00 4
Sean Wang <sean.wang@mediatek.com>:
Lorenzo Bianconi <lorenzo@kernel.org>:
Author 96326447d4 2025-08-13 00:00:00 35
Tags 3abc0e55ea 2025-10-12 00:00:00 40
Top reviewers:
[26]: horms@kernel.org
[5]: andrew@lunn.ch
[4]: jacob.e.keller@intel.com
[3]: shannon.nelson@amd.com
[3]: michal.swiatkowski@linux.intel.com
INACTIVE MAINTAINER Sean Wang <sean.wang@mediatek.com>
Subsystem MEDIATEK SWITCH DRIVER
Changes 26 / 70 (37%)
Last activity: 2025-12-01
Chester A. Unal <chester.a.unal@arinc9.com>:
Tags 585943b7ad 2025-12-01 00:00:00 7
Daniel Golle <daniel@makrotopia.org>:
Author 497041d763 2025-04-23 00:00:00 2
Tags 3b87e60d21 2025-12-01 00:00:00 14
DENG Qingfang <dqfext@gmail.com>:
Sean Wang <sean.wang@mediatek.com>:
Top reviewers:
[4]: andrew@lunn.ch
[4]: florian.fainelli@broadcom.com
[4]: arinc.unal@arinc9.com
[2]: olteanv@gmail.com
INACTIVE MAINTAINER Sean Wang <sean.wang@mediatek.com>
Acked-by: Chester A. Unal <chester.a.unal@arinc9.com>
Link: https://patch.msgid.link/20260303215339.2333548-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have not seen emails or tags from Johan in over 5 years,
and there is no recent mailing list activity.
Marcel Holtmann hasn't provided any tags in the Bluetooth
subsystem in over 5 years, but he is active on the Bluetooth
mailing list, providing informal review.
Luiz Augusto von Dentz is very active, handling essentially
all commits and reviews (12% coverage, but Luiz is the sole
active committer).
Subsystem BLUETOOTH SUBSYSTEM
Changes 50 / 411 (12%)
Last activity: 2026-02-23
Marcel Holtmann <marcel@holtmann.org>:
Johan Hedberg <johan.hedberg@gmail.com>:
Luiz Augusto von Dentz <luiz.dentz@gmail.com>:
Author 138d7eca44 2026-02-23 00:00:00 164
Committer 138d7eca44 2026-02-23 00:00:00 361
Tags 138d7eca44 2026-02-23 00:00:00 362
Top reviewers:
[15]: pmenzel@molgen.mpg.de
[8]: keescook@chromium.org
[5]: willemb@google.com
[4]: horms@kernel.org
[3]: kuniyu@amazon.com
[3]: luiz.von.dentz@intel.com
INACTIVE MAINTAINER Johan Hedberg <johan.hedberg@gmail.com>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Link: https://patch.msgid.link/20260303215339.2333548-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The tun UDP tunnel GSO fixture contains XFAIL-marked variants intended to
exercise failure paths (e.g. EMSGSIZE / "Message too long").
Using ASSERT_EQ() in these tests aborts the subtest, which prevents the
harness from classifying them as XFAIL and can make the overall net: tun
test fail.
Switch the relevant ASSERT_EQ() checks to EXPECT_EQ() so the subtests
continue running and the failures are correctly reported and accounted
as XFAIL where applicable.
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
Link: https://patch.msgid.link/20260225111451.347923-2-sun.jian.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TEST_F() allocates and registers its struct __test_metadata via mmap()
inside its constructor, and only then assigns the
_##fixture_##test##_object pointer.
XFAIL_ADD() runs in a constructor too and reads
_##fixture_##test##_object to initialize xfail->test. If XFAIL_ADD runs
first, xfail->test can be NULL and the expected failure will be reported
as FAIL.
Use constructor priorities to ensure TEST_F registration runs before
XFAIL_ADD, without adding extra state or runtime lookups.
Fixes: 2709473c93 ("selftests: kselftest_harness: support using xfail")
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
Link: https://patch.msgid.link/20260225111451.347923-1-sun.jian.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJdBAABCABHFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmmpdZAbFIAAAAAABAAO
bWFudTIsMi41KzEuMTEsMiwyDRxmd0BzdHJsZW4uZGUACgkQcJGo2a1f9gB5Bw//
QIK9Lop0FLMMIHnDdYmfsJiYnuV7aFKvys5XHvTdJEtjN8yS8ImcG8MknWHiFTbC
iEDHSNMaM75RfGFeMnWKZGe7OJ+3tliCCefDscngxKFug0DctwSY0ogYWnkgOh/U
v6RwoO5tZb4w8cEaqh6sprTH9wp3uld4ShvlW1zm18uo1ytIlgi6i70Spenfmel3
WsnxYq9O17ewSNlzGPZU16ktEQy5mhWVFFeq/29vr/nC0g9PWzhjbzO+Jvb6Pg1o
RpXzVkoOEle0tPdPDLBIYuu5VI3sSg0qBPeW/pAU2K8llFEl8SP2vuNOWhRe/cZ2
bzcDOKRriWMZ9/sDr2bYRE9KglrdZHJhzJwQNR/LakP1wbB07vV29ENo7tnDWRve
6+WxlB/ZOfqJ3V20M+0tHA2gOvKrQ0PqGLP6i4EaLfBB/xWNdLfKLBln+YnR2EE9
8zW5U+j8WbsGVfmt81FuoWU7keD/8VOfHwN8aE0dPy+Spez84dQHxPMDc+jhiqkP
/DObV5F0uNHAHZiA+T3d95ys8UmxRm3E/zeqJ5d+3/+fP+RsqWVQmZ/QY3wJvZsd
B+Qfrv2uK/coZwGs/3BlpzBVZMCV2ep7KDVTAezItP1CcVHkS54TFEC8myAeQQii
7SAhEeLrh6w3xVERdXbkF6wKph53DC2W8iFcdjbFB4w=
=3hB9
-----END PGP SIGNATURE-----
Merge tag 'nf-26-03-05' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
netfilter: updates for net
1) Inseo An reported a bug with the set element handling in nf_tables:
When set cannot accept more elements, we unlink and immediately free
an element that was inserted into a public data structure, freeing it
without waiting for RCU grace period. Fix this by doing the
increment earlier and by deferring possible unlink-and-free to the
existing abort path, which performs the needed synchronize_rcu before
free. From Pablo Neira Ayuso. This is an ancient bug, dating back to
kernel 4.10.
2) syzbot reported WARN_ON() splat in nf_tables that occurs on memory
allocation failure. Fix this by a new iterator annotation:
The affected walker does not need to clone the data structure and
can just use the live version if no clone exists yet.
Also from Pablo. This bug existed since 6.10 days.
3) Ancient forever bug in nft_pipapo data structure:
The garbage collection logic to remove expired elements is broken.
We must unlink from data structure and can only hand the freeing
to call_rcu after the clone/live pointers of the data structures
have been swapped. Else, readers can observe the free'd element.
Reported by Yiming Qian.
* tag 'nf-26-03-05' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_set_pipapo: split gc into unlink and reclaim phase
netfilter: nf_tables: clone set on flush only
netfilter: nf_tables: unconditionally bump set->nelems before insertion
====================
Link: https://patch.msgid.link/20260305122635.23525-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is currently no entry for ftrace_irq.h and ftrace_regs.h. Add a
generic entry for all *ftrace* headers to include them and prevent
overlooking future ftrace headers.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://patch.msgid.link/20260305093117.853700-1-jmarchan@redhat.com
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reset eBPF program pointer to old_prog and do not decrease its ref-count
if mtk_open routine in mtk_xdp_setup() fails.
Fixes: 7c26c20da5 ("net: ethernet: mtk_eth_soc: add basic XDP support")
Suggested-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260303-mtk-xdp-prog-ptr-fix-v2-1-97b6dbbe240f@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Yiming Qian reports Use-after-free in the pipapo set type:
Under a large number of expired elements, commit-time GC can run for a very
long time in a non-preemptible context, triggering soft lockup warnings and
RCU stall reports (local denial of service).
We must split GC in an unlink and a reclaim phase.
We cannot queue elements for freeing until pointers have been swapped.
Expired elements are still exposed to both the packet path and userspace
dumpers via the live copy of the data structure.
call_rcu() does not protect us: dump operations or element lookups starting
after call_rcu has fired can still observe the free'd element, unless the
commit phase has made enough progress to swap the clone and live pointers
before any new reader has picked up the old version.
This a similar approach as done recently for the rbtree backend in commit
35f83a7552 ("netfilter: nft_set_rbtree: don't gc elements on insert").
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Syzbot with fault injection triggered a failing memory allocation with
GFP_KERNEL which results in a WARN splat:
iter.err
WARNING: net/netfilter/nf_tables_api.c:845 at nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845, CPU#0: syz.0.17/5992
Modules linked in:
CPU: 0 UID: 0 PID: 5992 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026
RIP: 0010:nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845
Code: 8b 05 86 5a 4e 09 48 3b 84 24 a0 00 00 00 75 62 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc e8 63 6d fa f7 90 <0f> 0b 90 43
+80 7c 35 00 00 0f 85 23 fe ff ff e9 26 fe ff ff 89 d9
RSP: 0018:ffffc900045af780 EFLAGS: 00010293
RAX: ffffffff89ca45bd RBX: 00000000fffffff4 RCX: ffff888028111e40
RDX: 0000000000000000 RSI: 00000000fffffff4 RDI: 0000000000000000
RBP: ffffc900045af870 R08: 0000000000400dc0 R09: 00000000ffffffff
R10: dffffc0000000000 R11: fffffbfff1d141db R12: ffffc900045af7e0
R13: 1ffff920008b5f24 R14: dffffc0000000000 R15: ffffc900045af920
FS: 000055557a6a5500(0000) GS:ffff888125496000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb5ea271fc0 CR3: 000000003269e000 CR4: 00000000003526f0
Call Trace:
<TASK>
__nft_release_table+0xceb/0x11f0 net/netfilter/nf_tables_api.c:12115
nft_rcv_nl_event+0xc25/0xdb0 net/netfilter/nf_tables_api.c:12187
notifier_call_chain+0x19d/0x3a0 kernel/notifier.c:85
blocking_notifier_call_chain+0x6a/0x90 kernel/notifier.c:380
netlink_release+0x123b/0x1ad0 net/netlink/af_netlink.c:761
__sock_release net/socket.c:662 [inline]
sock_close+0xc3/0x240 net/socket.c:1455
Restrict set clone to the flush set command in the preparation phase.
Add NFT_ITER_UPDATE_CLONE and use it for this purpose, update the rbtree
and pipapo backends to only clone the set when this iteration type is
used.
As for the existing NFT_ITER_UPDATE type, update the pipapo backend to
use the existing set clone if available, otherwise use the existing set
representation. After this update, there is no need to clone a set that
is being deleted, this includes bound anonymous set.
An alternative approach to NFT_ITER_UPDATE_CLONE is to add a .clone
interface and call it from the flush set path.
Reported-by: syzbot+4924a0edc148e8b4b342@syzkaller.appspotmail.com
Fixes: 3f1d886cc7 ("netfilter: nft_set_pipapo: move cloning of match info to insert/removal path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
In case that the set is full, a new element gets published then removed
without waiting for the RCU grace period, while RCU reader can be
walking over it already.
To address this issue, add the element transaction even if set is full,
but toggle the set_full flag to report -ENFILE so the abort path safely
unwinds the set to its previous state.
As for element updates, decrement set->nelems to restore it.
A simpler fix is to call synchronize_rcu() in the error path.
However, with a large batch adding elements to already maxed-out set,
this could cause noticeable slowdown of such batches.
Fixes: 35d0ac9070 ("netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL")
Reported-by: Inseo An <y0un9sa@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
After acquiring netdev_queue::_xmit_lock the number of the CPU owning
the lock is recorded in netdev_queue::xmit_lock_owner. This works as
long as the BH context is not preemptible.
On PREEMPT_RT the softirq context is preemptible and without the
softirq-lock it is possible to have multiple user in __dev_queue_xmit()
submitting a skb on the same CPU. This is fine in general but this means
also that the current CPU is recorded as netdev_queue::xmit_lock_owner.
This in turn leads to the recursion alert and the skb is dropped.
Instead checking the for CPU number, that owns the lock, PREEMPT_RT can
check if the lockowner matches the current task.
Add netif_tx_owned() which returns true if the current context owns the
lock by comparing the provided CPU number with the recorded number. This
resembles the current check by negating the condition (the current check
returns true if the lock is not owned).
On PREEMPT_RT use rt_mutex_owner() to return the lock owner and compare
the current task against it.
Use the new helper in __dev_queue_xmit() and netif_local_xmit_active()
which provides a similar check.
Update comments regarding pairing READ_ONCE().
Reported-by: Bert Karwatzki <spasswolf@web.de>
Closes: https://lore.kernel.org/all/20260216134333.412332-1-spasswolf@web.de
Fixes: 3253cb49cb ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reported-by: Bert Karwatzki <spasswolf@web.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20260302162631.uGUyIqDT@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>