Add 2 test cases to exercise fix in act_ife's internal metalist
behaviour.
- Update decode ife action into encode with tcindex metadata
- Update decode ife action into encode with multiple metadata
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260304140603.76500-2-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiayuan Chen says:
====================
net: ipv6: fix panic when IPv4 route references loopback IPv6 nexthop and add selftest
syzbot reported a kernel panic [1] when an IPv4 route references
a loopback IPv6 nexthop object:
BUG: unable to handle page fault for address: ffff8d069e7aa000
PF: supervisor read access in kernel mode
PF: error_code(0x0000) - not-present page
PGD 6aa01067 P4D 6aa01067 PUD 0
Oops: Oops: 0000 [#1] SMP PTI
CPU: 2 UID: 0 PID: 530 Comm: ping Not tainted 6.19.0+ #193 PREEMPT
RIP: 0010:ip_route_output_key_hash_rcu+0x578/0x9e0
RSP: 0018:ffffd2ffc1573918 EFLAGS: 00010286
RAX: ffff8d069e7aa000 RBX: ffffd2ffc1573988 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffd2ffc1573978 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d060d496000
R13: 0000000000000000 R14: ffff8d060399a600 R15: ffff8d06019a6ab8
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8d069e7aa000 CR3: 0000000106eb0001 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
ip_route_output_key_hash+0x86/0x1a0
__ip4_datagram_connect+0x2b5/0x4e0
udp_connect+0x2c/0x60
inet_dgram_connect+0x88/0xd0
__sys_connect_file+0x56/0x90
__sys_connect+0xa8/0xe0
__x64_sys_connect+0x18/0x30
x64_sys_call+0xfb9/0x26e0
do_syscall_64+0xd3/0x1510
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Reproduction:
ip -6 nexthop add id 100 dev lo
ip route add 172.20.20.0/24 nhid 100
ping -c1 172.20.20.1 # kernel crash
Problem Description
When a standalone IPv6 nexthop object is created with a loopback device,
fib6_nh_init() misclassifies it as a reject route. Nexthop objects have
no destination prefix (fc_dst=::), so fib6_is_reject() always matches
any loopback nexthop. The reject path skips fib_nh_common_init(), leaving
nhc_pcpu_rth_output unallocated. When an IPv4 route later references
this nexthop and triggers a route lookup, __mkroute_output() calls
raw_cpu_ptr(nhc->nhc_pcpu_rth_output) on a NULL pointer, causing a page
fault.
The reject classification was designed for regular IPv6 routes to prevent
kernel routing loops, but nexthop objects should not be subject to this
check since they carry no destination information. Loop prevention is
handled separately when the route itself is created.
[1] https://syzkaller.appspot.com/bug?extid=334190e097a98a1b81bb
====================
Link: https://patch.msgid.link/20260304113817.294966-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a regression test for a kernel panic that occurs when an IPv4 route
references an IPv6 nexthop object created on the loopback device.
The test creates an IPv6 nexthop on lo, binds an IPv4 route to it, then
triggers a route lookup via ping to verify the kernel does not crash.
./fib_nexthops.sh
Tests passed: 249
Tests failed: 0
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260304113817.294966-3-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a standalone IPv6 nexthop object is created with a loopback device
(e.g., "ip -6 nexthop add id 100 dev lo"), fib6_nh_init() misclassifies
it as a reject route. This is because nexthop objects have no destination
prefix (fc_dst=::), causing fib6_is_reject() to match any loopback
nexthop. The reject path skips fib_nh_common_init(), leaving
nhc_pcpu_rth_output unallocated. If an IPv4 route later references this
nexthop, __mkroute_output() dereferences NULL nhc_pcpu_rth_output and
panics.
Simplify the check in fib6_nh_init() to only match explicit reject
routes (RTF_REJECT) instead of using fib6_is_reject(). The loopback
promotion heuristic in fib6_is_reject() is handled separately by
ip6_route_info_create_nh(). After this change, the three cases behave
as follows:
1. Explicit reject route ("ip -6 route add unreachable 2001:db8::/64"):
RTF_REJECT is set, enters reject path, skips fib_nh_common_init().
No behavior change.
2. Implicit loopback reject route ("ip -6 route add 2001:db8::/32 dev lo"):
RTF_REJECT is not set, takes normal path, fib_nh_common_init() is
called. ip6_route_info_create_nh() still promotes it to reject
afterward. nhc_pcpu_rth_output is allocated but unused, which is
harmless.
3. Standalone nexthop object ("ip -6 nexthop add id 100 dev lo"):
RTF_REJECT is not set, takes normal path, fib_nh_common_init() is
called. nhc_pcpu_rth_output is properly allocated, fixing the crash
when IPv4 routes reference this nexthop.
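A rough userspace model of the three cases above, for illustration only.
All names prefixed with sketch_ are hypothetical, the flag value is
arbitrary, and the "old" predicate is a simplification of
fib6_is_reject() that ignores the destination-address part of the
heuristic:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for RTF_REJECT; the value is illustrative only. */
#define SKETCH_RTF_REJECT 0x0200u

struct sketch_cfg {
	unsigned int flags;    /* route flags requested by user space */
	bool dev_is_loopback;  /* true for "dev lo" */
};

struct sketch_nh {
	bool pcpu_allocated;   /* models nhc_pcpu_rth_output != NULL */
};

/* Simplified old check: any loopback nexthop counts as reject. */
static bool sketch_old_is_reject(const struct sketch_cfg *cfg)
{
	return (cfg->flags & SKETCH_RTF_REJECT) || cfg->dev_is_loopback;
}

/* New check: only an explicit reject flag matches. */
static bool sketch_new_is_reject(const struct sketch_cfg *cfg)
{
	return (cfg->flags & SKETCH_RTF_REJECT) != 0;
}

static void sketch_nh_init(struct sketch_nh *nh, const struct sketch_cfg *cfg,
			   bool (*is_reject)(const struct sketch_cfg *))
{
	nh->pcpu_allocated = false;
	if (is_reject(cfg))
		return;                /* reject path: common init is skipped */
	nh->pcpu_allocated = true;     /* models fib_nh_common_init() */
}
```

With the new predicate, a loopback nexthop object without the reject
flag ends up with its per-CPU cache allocated, while an explicit reject
route still skips the allocation.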
Suggested-by: Ido Schimmel <idosch@nvidia.com>
Fixes: 493ced1ac4 ("ipv4: Allow routes to use nexthop objects")
Reported-by: syzbot+334190e097a98a1b81bb@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/698f8482.a70a0220.2c38d7.00ca.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260304113817.294966-2-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When booting with the 'ipv6.disable=1' parameter, the nd_tbl is never
initialized because inet6_init() exits before ndisc_init() is called
which initializes it. If an IPv6 packet is injected into the interface,
route_shortcircuit() is called and a NULL pointer dereference happens on
neigh_lookup().
BUG: kernel NULL pointer dereference, address: 0000000000000380
Oops: Oops: 0000 [#1] SMP NOPTI
[...]
RIP: 0010:neigh_lookup+0x20/0x270
[...]
Call Trace:
<TASK>
vxlan_xmit+0x638/0x1ef0 [vxlan]
dev_hard_start_xmit+0x9e/0x2e0
__dev_queue_xmit+0xbee/0x14e0
packet_sendmsg+0x116f/0x1930
__sys_sendto+0x1f5/0x200
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x12f/0x1590
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix this by adding an early check on route_shortcircuit() when protocol
is ETH_P_IPV6. Note that ipv6_mod_enabled() cannot be used here because
VXLAN can be built-in even when IPv6 is built as a module.
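The shape of such an early check can be sketched in userspace as below.
The exact condition used by the patch is not spelled out in this
message, so the NULL test on the table pointer is an assumption, and
every sketch_ name is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the neighbour table. */
struct sketch_neigh_table {
	int entries;
};

/*
 * A NULL table models nd_tbl never being initialized when booting with
 * ipv6.disable=1. Bail out before any lookup can touch it.
 */
static bool sketch_shortcircuit_v6(const struct sketch_neigh_table *nd_tbl)
{
	if (!nd_tbl)
		return false;

	/* Placeholder for the real neighbour lookup. */
	return nd_tbl->entries > 0;
}
```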
Fixes: e15a00aafa ("vxlan: add ipv6 route short circuit support")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260304120357.9778-2-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When booting with the 'ipv6.disable=1' parameter, the nd_tbl is never
initialized because inet6_init() exits before ndisc_init() is called
which initializes it. Then, if neigh_suppress is enabled and an ICMPv6
Neighbor Discovery packet reaches the bridge, br_do_suppress_nd() will
dereference ipv6_stub->nd_tbl which is NULL, passing it to
neigh_lookup(). This causes a kernel NULL pointer dereference.
BUG: kernel NULL pointer dereference, address: 0000000000000268
Oops: 0000 [#1] PREEMPT SMP NOPTI
[...]
RIP: 0010:neigh_lookup+0x16/0xe0
[...]
Call Trace:
<IRQ>
? neigh_lookup+0x16/0xe0
br_do_suppress_nd+0x160/0x290 [bridge]
br_handle_frame_finish+0x500/0x620 [bridge]
br_handle_frame+0x353/0x440 [bridge]
__netif_receive_skb_core.constprop.0+0x298/0x1110
__netif_receive_skb_one_core+0x3d/0xa0
process_backlog+0xa0/0x140
__napi_poll+0x2c/0x170
net_rx_action+0x2c4/0x3a0
handle_softirqs+0xd0/0x270
do_softirq+0x3f/0x60
Fix this by replacing IS_ENABLED(IPV6) call with ipv6_mod_enabled() in
the callers. This is in essence disabling NS/NA suppression when IPv6 is
disabled.
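The difference between a build-time and a runtime check can be modelled
as follows. This is a userspace sketch, not the bridge code: the macro
and both helpers are hypothetical stand-ins for IS_ENABLED() and
ipv6_mod_enabled():

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a build-time option: fixed when the kernel is compiled. */
#define SKETCH_IPV6_BUILT 1

/* Old gate: true whenever IPv6 support was compiled in. */
static bool sketch_old_suppress_nd(bool neigh_suppress)
{
	return SKETCH_IPV6_BUILT && neigh_suppress;
}

/* New gate: also requires IPv6 to be enabled at runtime. */
static bool sketch_new_suppress_nd(bool neigh_suppress, bool ipv6_enabled)
{
	return ipv6_enabled && neigh_suppress;
}
```

With IPv6 disabled at boot, the old gate still lets the suppression path
run against an uninitialized table; the new gate skips it.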
Fixes: ed842faeb2 ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports")
Reported-by: Guruprasad C P <gurucp2005@gmail.com>
Closes: https://lore.kernel.org/netdev/CAHXs0ORzd62QOG-Fttqa2Cx_A_VFp=utE2H2VTX5nqfgs7LDxQ@mail.gmail.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260304120357.9778-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski says:
====================
MAINTAINERS: annual cleanup of inactive maintainers
Annual cleanup of inactive maintainers under networking.
The goal is to make sure MAINTAINERS reflect reality for
code which is relatively actively changed (at least 70 commits
in the last 2 years or at least 120 commits in the last 5 years).
Those who either:
- were the initial author / "upstreamer" of the driver; or
- authored at least 1/3rd of the existing code base (per git blame); or
- authored at least 25% of commits before becoming inactive
are moved to CREDITS.
The discovery of inactive maintainers was done using gitdm tools,
with a bunch of ad-hoc scripts on top to do the rest. I tried to
double check the results but this is mostly a scripted cleanup
so please report inaccuracies if any.
====================
Link: https://patch.msgid.link/20260303215339.2333548-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have not seen emails or tags from Thomas's IBM address
(tlfalcon@linux.ibm.com) in over 5 years. Looks like Thomas
is active in perf tooling at Intel (thomas.falcon@intel.com).
Subsystem IBM Power SRIOV Virtual NIC Device Driver
Changes 49 / 134 (36%)
Last activity: 2025-08-26
Haren Myneni <haren@linux.ibm.com>:
Tags 3c14917953 2025-08-26 00:00:00 2
Rick Lindsley <ricklind@linux.ibm.com>:
Nick Child <nnac123@linux.ibm.com>:
Author d93a6caab5 2025-03-25 00:00:00 14
Tags d93a6caab5 2025-03-25 00:00:00 16
Thomas Falcon <tlfalcon@linux.ibm.com>:
Top reviewers:
[22]: drt@linux.ibm.com
[13]: horms@kernel.org
[9]: ricklind@linux.vnet.ibm.com
[3]: davemarq@linux.ibm.com
INACTIVE MAINTAINER Thomas Falcon <tlfalcon@linux.ibm.com>
Move Thomas to CREDITS as the initial author of ibmvnic.
Acked-by: Thomas Falcon <thomas.falcon@intel.com>
Link: https://patch.msgid.link/20260303215339.2333548-12-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have not seen tags from Claudiu for the Ocelot switch driver
in over 5 years. He is active upstream in other NXP subsystems
(ENETC, gianfar), with 46 emails on lore since 2024.
We have not seen tags from Alexandre for the Ocelot switch driver
in over 5 years. He is very active upstream in other subsystems
(RTC, I3C, Atmel/Microchip SoC), with over 1,200 emails on lore
since 2024.
Vladimir Oltean is active.
Subsystem OCELOT ETHERNET SWITCH DRIVER
Changes 180 / 494 (36%)
Last activity: 2026-02-12
Vladimir Oltean <vladimir.oltean@nxp.com>:
Author c22ba07c82 2026-02-10 00:00:00 33
Tags 026f6513c5 2026-02-12 00:00:00 39
Claudiu Manoil <claudiu.manoil@nxp.com>:
Alexandre Belloni <alexandre.belloni@bootlin.com>:
Top reviewers:
[49]: f.fainelli@gmail.com
[19]: horms@kernel.org
[10]: richardcochran@gmail.com
[9]: jacob.e.keller@intel.com
[8]: colin.foster@in-advantage.com
INACTIVE MAINTAINER Claudiu Manoil <claudiu.manoil@nxp.com>
Acked-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://patch.msgid.link/20260303215339.2333548-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have not seen emails or tags from Jonathan in over 5 years,
and there is no recent mailing list activity.
Vadim Fedorenko is active.
Subsystem OPENCOMPUTE PTP CLOCK DRIVER
Changes 49 / 130 (37%)
Last activity: 2025-11-25
Jonathan Lemon <jonathan.lemon@gmail.com>:
Vadim Fedorenko <vadim.fedorenko@linux.dev>:
Author d3ca2ef0c9 2025-09-19 00:00:00 5
Tags 648282e2d1 2025-11-25 00:00:00 20
Top reviewers:
[7]: horms@kernel.org
[4]: jiri@nvidia.com
[3]: richardcochran@gmail.com
[2]: aleksandr.loktionov@intel.com
INACTIVE MAINTAINER Jonathan Lemon <jonathan.lemon@gmail.com>
Add Jonathan to CREDITS as the initial author of ptp_ocp.
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20260303215339.2333548-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have not seen tags from Clark for FEC in over 5 years.
He has some limited recent activity on the mailing list in other
NXP subsystems (stmmac, phy). Wei Fang and Shenwei Wang are active,
with decent review coverage (61%).
Frank Li has been reviewing code actively more recently, let's
make it official.
Subsystem FREESCALE IMX / MXC FEC DRIVER
Changes 57 / 92 (61%)
Last activity: 2026-02-10
Wei Fang <wei.fang@nxp.com>:
Author 25eb3058eb 2026-02-10 00:00:00 33
Tags 25eb3058eb 2026-02-10 00:00:00 61
Shenwei Wang <shenwei.wang@nxp.com>:
Author d466c16026 2025-09-14 00:00:00 6
Tags d466c16026 2025-09-14 00:00:00 6
Clark Wang <xiaoning.wang@nxp.com>:
Top reviewers:
[23]: Frank.Li@nxp.com
[17]: andrew@lunn.ch
[4]: csokas.bence@prolan.hu
[3]: horms@kernel.org
[2]: maxime.chevallier@bootlin.com
INACTIVE MAINTAINER Clark Wang <xiaoning.wang@nxp.com>
Reviewed-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260303215339.2333548-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have not seen tags from DENG Qingfang for the MediaTek
switch driver in over 5 years. He is active upstream with
PPP/PPPoE patches in net-next. Chester and Daniel are active.
Subsystem MEDIATEK SWITCH DRIVER
Changes 26 / 70 (37%)
Last activity: 2025-12-01
Chester A. Unal <chester.a.unal@arinc9.com>:
Tags 585943b7ad 2025-12-01 00:00:00 7
Daniel Golle <daniel@makrotopia.org>:
Author 497041d763 2025-04-23 00:00:00 2
Tags 3b87e60d21 2025-12-01 00:00:00 14
DENG Qingfang <dqfext@gmail.com>:
Sean Wang <sean.wang@mediatek.com>:
Top reviewers:
[4]: andrew@lunn.ch
[4]: florian.fainelli@broadcom.com
[4]: arinc.unal@arinc9.com
[2]: olteanv@gmail.com
INACTIVE MAINTAINER DENG Qingfang <dqfext@gmail.com>
Acked-by: Chester A. Unal <chester.a.unal@arinc9.com>
Link: https://patch.msgid.link/20260303215339.2333548-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have not seen tags from Sean in over 5 years,
with only one mailing list post since 2024.
Felix and Lorenzo are active for the Ethernet driver,
and Chester, Daniel and DENG Qingfang are active for
the switch driver.
Subsystem MEDIATEK ETHERNET DRIVER
Changes 55 / 113 (48%)
Last activity: 2025-10-12
Felix Fietkau <nbd@nbd.name>:
Author d473673711 2025-09-02 00:00:00 3
Tags d473673711 2025-09-02 00:00:00 4
Sean Wang <sean.wang@mediatek.com>:
Lorenzo Bianconi <lorenzo@kernel.org>:
Author 96326447d4 2025-08-13 00:00:00 35
Tags 3abc0e55ea 2025-10-12 00:00:00 40
Top reviewers:
[26]: horms@kernel.org
[5]: andrew@lunn.ch
[4]: jacob.e.keller@intel.com
[3]: shannon.nelson@amd.com
[3]: michal.swiatkowski@linux.intel.com
INACTIVE MAINTAINER Sean Wang <sean.wang@mediatek.com>
Subsystem MEDIATEK SWITCH DRIVER
Changes 26 / 70 (37%)
Last activity: 2025-12-01
Chester A. Unal <chester.a.unal@arinc9.com>:
Tags 585943b7ad 2025-12-01 00:00:00 7
Daniel Golle <daniel@makrotopia.org>:
Author 497041d763 2025-04-23 00:00:00 2
Tags 3b87e60d21 2025-12-01 00:00:00 14
DENG Qingfang <dqfext@gmail.com>:
Sean Wang <sean.wang@mediatek.com>:
Top reviewers:
[4]: andrew@lunn.ch
[4]: florian.fainelli@broadcom.com
[4]: arinc.unal@arinc9.com
[2]: olteanv@gmail.com
INACTIVE MAINTAINER Sean Wang <sean.wang@mediatek.com>
Acked-by: Chester A. Unal <chester.a.unal@arinc9.com>
Link: https://patch.msgid.link/20260303215339.2333548-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have not seen emails or tags from Johan in over 5 years,
and there is no recent mailing list activity.
Marcel Holtmann hasn't provided any tags in the Bluetooth
subsystem in over 5 years, but he is active on the Bluetooth
mailing list, providing informal review.
Luiz Augusto von Dentz is very active, handling essentially
all commits and reviews (12% coverage, but Luiz is the sole
active committer).
Subsystem BLUETOOTH SUBSYSTEM
Changes 50 / 411 (12%)
Last activity: 2026-02-23
Marcel Holtmann <marcel@holtmann.org>:
Johan Hedberg <johan.hedberg@gmail.com>:
Luiz Augusto von Dentz <luiz.dentz@gmail.com>:
Author 138d7eca44 2026-02-23 00:00:00 164
Committer 138d7eca44 2026-02-23 00:00:00 361
Tags 138d7eca44 2026-02-23 00:00:00 362
Top reviewers:
[15]: pmenzel@molgen.mpg.de
[8]: keescook@chromium.org
[5]: willemb@google.com
[4]: horms@kernel.org
[3]: kuniyu@amazon.com
[3]: luiz.von.dentz@intel.com
INACTIVE MAINTAINER Johan Hedberg <johan.hedberg@gmail.com>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Link: https://patch.msgid.link/20260303215339.2333548-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The tun UDP tunnel GSO fixture contains XFAIL-marked variants intended to
exercise failure paths (e.g. EMSGSIZE / "Message too long").
Using ASSERT_EQ() in these tests aborts the subtest, which prevents the
harness from classifying them as XFAIL and can make the overall net: tun
test fail.
Switch the relevant ASSERT_EQ() checks to EXPECT_EQ() so the subtests
continue running and the failures are correctly reported and accounted
as XFAIL where applicable.
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
Link: https://patch.msgid.link/20260225111451.347923-2-sun.jian.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TEST_F() allocates and registers its struct __test_metadata via mmap()
inside its constructor, and only then assigns the
_##fixture_##test##_object pointer.
XFAIL_ADD() runs in a constructor too and reads
_##fixture_##test##_object to initialize xfail->test. If XFAIL_ADD runs
first, xfail->test can be NULL and the expected failure will be reported
as FAIL.
Use constructor priorities to ensure TEST_F registration runs before
XFAIL_ADD, without adding extra state or runtime lookups.
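A minimal sketch of the ordering technique, assuming GCC or Clang. The
priority numbers and all sketch_ names are made up for illustration and
are not the values used by the harness:

```c
#include <assert.h>
#include <stddef.h>

struct sketch_test {
	const char *name;
};

struct sketch_xfail {
	struct sketch_test *test;
};

/* Set by the registration constructor, read by the xfail constructor. */
static struct sketch_test *sketch_test_object;
static struct sketch_xfail sketch_xfail_entry;

/*
 * Defined first in the source on purpose. Lower priority numbers run
 * earlier, so this runs after sketch_register_test() regardless of
 * definition order. Priorities 0-100 are reserved for the
 * implementation.
 */
__attribute__((constructor(20001)))
static void sketch_register_xfail(void)
{
	sketch_xfail_entry.test = sketch_test_object;
}

__attribute__((constructor(20000)))
static void sketch_register_test(void)
{
	static struct sketch_test t = { "demo" };

	sketch_test_object = &t;
}
```

By the time main() starts, sketch_xfail_entry.test points at the
registered test object instead of being NULL.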
Fixes: 2709473c93 ("selftests: kselftest_harness: support using xfail")
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
Link: https://patch.msgid.link/20260225111451.347923-1-sun.jian.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJdBAABCABHFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmmpdZAbFIAAAAAABAAO
bWFudTIsMi41KzEuMTEsMiwyDRxmd0BzdHJsZW4uZGUACgkQcJGo2a1f9gB5Bw//
QIK9Lop0FLMMIHnDdYmfsJiYnuV7aFKvys5XHvTdJEtjN8yS8ImcG8MknWHiFTbC
iEDHSNMaM75RfGFeMnWKZGe7OJ+3tliCCefDscngxKFug0DctwSY0ogYWnkgOh/U
v6RwoO5tZb4w8cEaqh6sprTH9wp3uld4ShvlW1zm18uo1ytIlgi6i70Spenfmel3
WsnxYq9O17ewSNlzGPZU16ktEQy5mhWVFFeq/29vr/nC0g9PWzhjbzO+Jvb6Pg1o
RpXzVkoOEle0tPdPDLBIYuu5VI3sSg0qBPeW/pAU2K8llFEl8SP2vuNOWhRe/cZ2
bzcDOKRriWMZ9/sDr2bYRE9KglrdZHJhzJwQNR/LakP1wbB07vV29ENo7tnDWRve
6+WxlB/ZOfqJ3V20M+0tHA2gOvKrQ0PqGLP6i4EaLfBB/xWNdLfKLBln+YnR2EE9
8zW5U+j8WbsGVfmt81FuoWU7keD/8VOfHwN8aE0dPy+Spez84dQHxPMDc+jhiqkP
/DObV5F0uNHAHZiA+T3d95ys8UmxRm3E/zeqJ5d+3/+fP+RsqWVQmZ/QY3wJvZsd
B+Qfrv2uK/coZwGs/3BlpzBVZMCV2ep7KDVTAezItP1CcVHkS54TFEC8myAeQQii
7SAhEeLrh6w3xVERdXbkF6wKph53DC2W8iFcdjbFB4w=
=3hB9
-----END PGP SIGNATURE-----
Merge tag 'nf-26-03-05' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
netfilter: updates for net
1) Inseo An reported a bug with the set element handling in nf_tables:
When set cannot accept more elements, we unlink and immediately free
an element that was inserted into a public data structure, freeing it
without waiting for RCU grace period. Fix this by doing the
increment earlier and by deferring possible unlink-and-free to the
existing abort path, which performs the needed synchronize_rcu before
free. From Pablo Neira Ayuso. This is an ancient bug, dating back to
kernel 4.10.
2) syzbot reported WARN_ON() splat in nf_tables that occurs on memory
allocation failure. Fix this by a new iterator annotation:
The affected walker does not need to clone the data structure and
can just use the live version if no clone exists yet.
Also from Pablo. This bug existed since 6.10 days.
3) Ancient forever bug in nft_pipapo data structure:
The garbage collection logic to remove expired elements is broken.
We must unlink from data structure and can only hand the freeing
to call_rcu after the clone/live pointers of the data structures
have been swapped. Else, readers can observe the free'd element.
Reported by Yiming Qian.
* tag 'nf-26-03-05' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_set_pipapo: split gc into unlink and reclaim phase
netfilter: nf_tables: clone set on flush only
netfilter: nf_tables: unconditionally bump set->nelems before insertion
====================
Link: https://patch.msgid.link/20260305122635.23525-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reset eBPF program pointer to old_prog and do not decrease its ref-count
if mtk_open routine in mtk_xdp_setup() fails.
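The intended error handling can be modelled in userspace as below. The
structures, the callback, and the plain integer refcount are
hypothetical simplifications, not the driver's types:

```c
#include <assert.h>

struct sketch_prog {
	int refcnt;
};

struct sketch_eth {
	struct sketch_prog *prog;
};

/* Callbacks standing in for a reopen that succeeds or fails. */
static int sketch_open_ok(void)   { return 0; }
static int sketch_open_fail(void) { return -1; }

static int sketch_xdp_setup(struct sketch_eth *eth, struct sketch_prog *prog,
			    int (*reopen)(void))
{
	struct sketch_prog *old_prog = eth->prog;
	int err;

	eth->prog = prog;
	err = reopen();
	if (err) {
		/* Put the old program back; its reference is untouched. */
		eth->prog = old_prog;
		return err;
	}

	/* Only drop the old program's reference once the switch succeeded. */
	if (old_prog)
		old_prog->refcnt--;
	return 0;
}
```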
Fixes: 7c26c20da5 ("net: ethernet: mtk_eth_soc: add basic XDP support")
Suggested-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260303-mtk-xdp-prog-ptr-fix-v2-1-97b6dbbe240f@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Yiming Qian reports Use-after-free in the pipapo set type:
Under a large number of expired elements, commit-time GC can run for a very
long time in a non-preemptible context, triggering soft lockup warnings and
RCU stall reports (local denial of service).
We must split GC in an unlink and a reclaim phase.
We cannot queue elements for freeing until pointers have been swapped.
Expired elements are still exposed to both the packet path and userspace
dumpers via the live copy of the data structure.
call_rcu() does not protect us: dump operations or element lookups starting
after call_rcu has fired can still observe the free'd element, unless the
commit phase has made enough progress to swap the clone and live pointers
before any new reader has picked up the old version.
This is a similar approach as done recently for the rbtree backend in commit
35f83a7552 ("netfilter: nft_set_rbtree: don't gc elements on insert").
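The two-phase idea can be sketched in userspace as below. Fixed-size
arrays stand in for the live and clone representations, a flag stands in
for freeing, and all sketch_ names are hypothetical; real code relies on
RCU for publication and deferred freeing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX 4

struct sketch_elem {
	bool expired;
	bool freed;
};

struct sketch_set {
	struct sketch_elem *live[SKETCH_MAX];    /* what readers see */
	struct sketch_elem *clone[SKETCH_MAX];   /* private working copy */
	struct sketch_elem *gc_list[SKETCH_MAX]; /* unlinked, not yet freed */
	size_t gc_count;
};

/* Phase 1: unlink expired elements from the clone only; free nothing. */
static void sketch_gc_unlink(struct sketch_set *s)
{
	for (size_t i = 0; i < SKETCH_MAX; i++) {
		if (s->clone[i] && s->clone[i]->expired) {
			s->gc_list[s->gc_count++] = s->clone[i];
			s->clone[i] = NULL;
		}
	}
}

/* Commit: publish the clone so new readers no longer see old entries. */
static void sketch_commit_swap(struct sketch_set *s)
{
	for (size_t i = 0; i < SKETCH_MAX; i++)
		s->live[i] = s->clone[i];
}

/* Phase 2: runs only after the swap; stands in for deferred freeing. */
static void sketch_gc_reclaim(struct sketch_set *s)
{
	for (size_t i = 0; i < s->gc_count; i++)
		s->gc_list[i]->freed = true;
	s->gc_count = 0;
}
```

Between unlink and swap the expired element is still reachable through
the live copy, which is why it must not be queued for freeing yet.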
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Syzbot with fault injection triggered a failing memory allocation with
GFP_KERNEL which results in a WARN splat:
iter.err
WARNING: net/netfilter/nf_tables_api.c:845 at nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845, CPU#0: syz.0.17/5992
Modules linked in:
CPU: 0 UID: 0 PID: 5992 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026
RIP: 0010:nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845
Code: 8b 05 86 5a 4e 09 48 3b 84 24 a0 00 00 00 75 62 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc e8 63 6d fa f7 90 <0f> 0b 90 43 80 7c 35 00 00 0f 85 23 fe ff ff e9 26 fe ff ff 89 d9
RSP: 0018:ffffc900045af780 EFLAGS: 00010293
RAX: ffffffff89ca45bd RBX: 00000000fffffff4 RCX: ffff888028111e40
RDX: 0000000000000000 RSI: 00000000fffffff4 RDI: 0000000000000000
RBP: ffffc900045af870 R08: 0000000000400dc0 R09: 00000000ffffffff
R10: dffffc0000000000 R11: fffffbfff1d141db R12: ffffc900045af7e0
R13: 1ffff920008b5f24 R14: dffffc0000000000 R15: ffffc900045af920
FS: 000055557a6a5500(0000) GS:ffff888125496000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb5ea271fc0 CR3: 000000003269e000 CR4: 00000000003526f0
Call Trace:
<TASK>
__nft_release_table+0xceb/0x11f0 net/netfilter/nf_tables_api.c:12115
nft_rcv_nl_event+0xc25/0xdb0 net/netfilter/nf_tables_api.c:12187
notifier_call_chain+0x19d/0x3a0 kernel/notifier.c:85
blocking_notifier_call_chain+0x6a/0x90 kernel/notifier.c:380
netlink_release+0x123b/0x1ad0 net/netlink/af_netlink.c:761
__sock_release net/socket.c:662 [inline]
sock_close+0xc3/0x240 net/socket.c:1455
Restrict set clone to the flush set command in the preparation phase.
Add NFT_ITER_UPDATE_CLONE and use it for this purpose, update the rbtree
and pipapo backends to only clone the set when this iteration type is
used.
As for the existing NFT_ITER_UPDATE type, update the pipapo backend to
use the existing set clone if available, otherwise use the existing set
representation. After this update, there is no need to clone a set that
is being deleted, this includes bound anonymous set.
An alternative approach to NFT_ITER_UPDATE_CLONE is to add a .clone
interface and call it from the flush set path.
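How a backend might pick the representation to walk for each iteration
type can be sketched as follows. The enum values mirror the names used
above, but the structures and the helper are hypothetical, and the copy
into preallocated storage stands in for an allocation that can fail in
the kernel:

```c
#include <assert.h>
#include <stddef.h>

enum sketch_iter_type {
	SKETCH_ITER_READ,
	SKETCH_ITER_UPDATE,
	SKETCH_ITER_UPDATE_CLONE,
};

struct sketch_repr {
	int generation;
};

struct sketch_pset {
	struct sketch_repr live;
	struct sketch_repr clone_storage;
	struct sketch_repr *clone;       /* NULL until a clone exists */
};

static struct sketch_repr *sketch_pick(struct sketch_pset *s,
				       enum sketch_iter_type type)
{
	switch (type) {
	case SKETCH_ITER_UPDATE_CLONE:
		/* Flush path: create the clone on demand. */
		if (!s->clone) {
			s->clone_storage = s->live;
			s->clone = &s->clone_storage;
		}
		return s->clone;
	case SKETCH_ITER_UPDATE:
		/* Reuse a clone if one exists, else walk the live copy. */
		return s->clone ? s->clone : &s->live;
	default:
		return &s->live;
	}
}
```

In this model the plain update walk never needs to allocate, so it has
no failure case to warn about.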
Reported-by: syzbot+4924a0edc148e8b4b342@syzkaller.appspotmail.com
Fixes: 3f1d886cc7 ("netfilter: nft_set_pipapo: move cloning of match info to insert/removal path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
In case that the set is full, a new element gets published then removed
without waiting for the RCU grace period, while RCU reader can be
walking over it already.
To address this issue, add the element transaction even if set is full,
but toggle the set_full flag to report -ENFILE so the abort path safely
unwinds the set to its previous state.
As for element updates, decrement set->nelems to restore it.
A simpler fix is to call synchronize_rcu() in the error path.
However, with a large batch adding elements to already maxed-out set,
this could cause noticeable slowdown of such batches.
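The accounting can be modelled in userspace as below. The structure, the
pending counter, and both helpers are hypothetical; in the kernel the
abort path also waits for an RCU grace period before freeing, which this
sketch does not model:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct sketch_nft_set {
	unsigned int nelems;   /* elements accounted to the set */
	unsigned int size;     /* 0 means unlimited */
	unsigned int pending;  /* elements added by the current batch */
};

static int sketch_add_elem(struct sketch_nft_set *set)
{
	bool set_full = false;

	/* Bump the counter unconditionally, before insertion. */
	set->nelems++;
	if (set->size && set->nelems > set->size)
		set_full = true;

	/* The element is tracked by a transaction even when the set is full. */
	set->pending++;

	return set_full ? -ENFILE : 0;
}

/* Abort path: unwind everything the failed batch added. */
static void sketch_abort(struct sketch_nft_set *set)
{
	set->nelems -= set->pending;
	set->pending = 0;
}
```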
Fixes: 35d0ac9070 ("netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL")
Reported-by: Inseo An <y0un9sa@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
After acquiring netdev_queue::_xmit_lock the number of the CPU owning
the lock is recorded in netdev_queue::xmit_lock_owner. This works as
long as the BH context is not preemptible.
On PREEMPT_RT the softirq context is preemptible and without the
softirq-lock it is possible to have multiple users in __dev_queue_xmit()
submitting a skb on the same CPU. This is fine in general but this means
also that the current CPU is recorded as netdev_queue::xmit_lock_owner.
This in turn leads to the recursion alert and the skb is dropped.
Instead of checking for the CPU number that owns the lock, PREEMPT_RT can
check if the lock owner matches the current task.
Add netif_tx_owned() which returns true if the current context owns the
lock by comparing the provided CPU number with the recorded number. This
resembles the current check by negating the condition (the current check
returns true if the lock is not owned).
On PREEMPT_RT use rt_mutex_owner() to return the lock owner and compare
the current task against it.
Use the new helper in __dev_queue_xmit() and netif_local_xmit_active()
which provides a similar check.
Update comments regarding pairing READ_ONCE().
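The two ownership checks can be compared in a small userspace model.
The structure and helpers are hypothetical; an opaque pointer stands in
for the task returned by rt_mutex_owner():

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_txq {
	int xmit_lock_owner;    /* CPU number of the holder, -1 if unlocked */
	const void *rt_owner;   /* task holding the lock (PREEMPT_RT model) */
};

/* Non-RT model: ownership is decided by CPU number. */
static bool sketch_tx_owned_cpu(const struct sketch_txq *q, int cpu)
{
	return q->xmit_lock_owner == cpu;
}

/* PREEMPT_RT model: ownership is decided by the owning task. */
static bool sketch_tx_owned_task(const struct sketch_txq *q,
				 const void *current_task)
{
	return q->rt_owner == current_task;
}
```

If task A holds the lock on CPU 0 and is preempted by task B on the same
CPU, the CPU-based check reports B as the owner, while the task-based
check does not.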
Reported-by: Bert Karwatzki <spasswolf@web.de>
Closes: https://lore.kernel.org/all/20260216134333.412332-1-spasswolf@web.de
Fixes: 3253cb49cb ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20260302162631.uGUyIqDT@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ovidiu Panait says:
====================
net: stmmac: Fix VLAN handling when interface is down
VLAN register accesses on the MAC side require the PHY RX clock to be
active. When the network interface is down, the PHY is suspended and
the RX clock is unavailable, causing VLAN operations to fail with
timeouts.
The VLAN core automatically removes VID 0 after the interface goes down
and re-adds it when it comes back up, so these timeouts happen during
normal interface down/up:
# ip link set end1 down
renesas-gbeth 15c40000.ethernet end1: Timeout accessing MAC_VLAN_Tag_Filter
renesas-gbeth 15c40000.ethernet end1: failed to kill vid 0081/0
Adding VLANs while the interface is down also fails:
# ip link add link end1 name end1.10 type vlan id 10
renesas-gbeth 15c40000.ethernet end1: Timeout accessing MAC_VLAN_Tag_Filter
RTNETLINK answers: Device or resource busy
Patch 4 fixes this by adding checks in the VLAN paths for netif_running(),
and skipping register accesses if the interface is down. Only the software
state is updated in this case. When the interface is brought up, the VLAN
state is restored to hardware.
Patches 1-3 fix some issues in the existing VLAN implementation.
====================
Link: https://patch.msgid.link/20260303145828.7845-1-ovidiu.panait.rb@renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
VLAN register accesses on the MAC side require the PHY RX clock to be
active. When the network interface is down, the PHY is suspended and
the RX clock is unavailable, causing VLAN operations to fail with
timeouts.
The VLAN core automatically removes VID 0 after the interface goes down
and re-adds it when it comes back up, so these timeouts happen during
normal interface down/up:
# ip link set end1 down
renesas-gbeth 15c40000.ethernet end1: Timeout accessing MAC_VLAN_Tag_Filter
renesas-gbeth 15c40000.ethernet end1: failed to kill vid 0081/0
Adding VLANs while the interface is down also fails:
# ip link add link end1 name end1.10 type vlan id 10
renesas-gbeth 15c40000.ethernet end1: Timeout accessing MAC_VLAN_Tag_Filter
RTNETLINK answers: Device or resource busy
To fix this, check if the interface is up before accessing VLAN registers.
The software state is always kept up to date regardless of interface state.
When the interface is brought up, stmmac_vlan_restore() is called
to write the VLAN state to hardware.
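A userspace model of the deferred hardware update, for illustration.
Bitmaps stand in for the software state and the hardware registers, VIDs
are assumed to be below 32 so they fit in one word, and all sketch_
names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_priv {
	bool running;                /* models netif_running() */
	unsigned long active_vlans;  /* software state, always updated */
	unsigned long hw_vlans;      /* models the hardware filter */
};

static void sketch_vlan_add(struct sketch_priv *p, unsigned int vid)
{
	p->active_vlans |= 1UL << vid;

	/* Skip register access while the interface is down. */
	if (p->running)
		p->hw_vlans |= 1UL << vid;
}

static void sketch_open(struct sketch_priv *p)
{
	p->running = true;

	/* Models stmmac_vlan_restore(): replay software state to hardware. */
	p->hw_vlans = p->active_vlans;
}
```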
Fixes: ed64639bc1 ("net: stmmac: Add support for VLAN Rx filtering")
Signed-off-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com>
Link: https://patch.msgid.link/20260303145828.7845-5-ovidiu.panait.rb@renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the network interface is opened or resumed, a DMA reset is performed,
which resets all hardware state, including VLAN state. Currently, only
the resume path is restoring the VLAN state via
stmmac_restore_hw_vlan_rx_fltr(), but that is incomplete: the VLAN hash
table and the VLAN_TAG control bits are not restored.
Therefore, add stmmac_vlan_restore(), which restores the full VLAN
state by updating both the HW filter entries and the hash table, and
call it from both the open and resume paths.
The VLAN restore is moved outside of phylink_rx_clk_stop_block/unblock
in the resume path because receive clock stop is already disabled when
stmmac supports VLAN.
Also, remove the hash readback code in vlan_restore_hw_rx_fltr() that
attempts to restore VTHM by reading VLAN_HASH_TABLE, as it always reads
zero after DMA reset, making it dead code.
Fixes: 3cd1cfcba2 ("net: stmmac: Implement VLAN Hash Filtering in XGMAC")
Fixes: ed64639bc1 ("net: stmmac: Add support for VLAN Rx filtering")
Signed-off-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com>
Link: https://patch.msgid.link/20260303145828.7845-4-ovidiu.panait.rb@renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The double VLAN bits (EDVLP, ESVL, DOVLTC) are handled inconsistently
between the two vlan_update_hash() implementations:
- dwxgmac2_update_vlan_hash() explicitly clears the double VLAN bits when
is_double is false, meaning that adding a 802.1Q VLAN will disable
double VLAN mode:
$ ip link add link eth0 name eth0.200 type vlan id 200 protocol 802.1ad
$ ip link add link eth0 name eth0.100 type vlan id 100
# Double VLAN bits no longer set
- vlan_update_hash() sets these bits and only clears them when the last
VLAN has been removed, so double VLAN mode remains enabled even after all
802.1AD VLANs are removed.
Address both issues by tracking the number of active 802.1AD VLANs in
priv->num_double_vlans. Pass this count to stmmac_vlan_update() so both
implementations correctly set the double VLAN bits when any 802.1AD
VLAN is active, and clear them only when none remain.
Also update vlan_update_hash() to explicitly clear the double VLAN bits
when is_double is false, matching the dwxgmac2 behavior.
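The counting scheme can be sketched as below. The counter mirrors
priv->num_double_vlans; the structure, the single boolean standing in
for the three register bits, and the helpers are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_vlan_state {
	unsigned int num_double_vlans;  /* active 802.1AD VLANs */
	bool double_bits;               /* models EDVLP/ESVL/DOVLTC */
};

/* Set the bits while any 802.1AD VLAN is active, clear them otherwise. */
static void sketch_vlan_update(struct sketch_vlan_state *s)
{
	s->double_bits = s->num_double_vlans > 0;
}

static void sketch_vlan_add_vid(struct sketch_vlan_state *s, bool is_double)
{
	if (is_double)
		s->num_double_vlans++;
	sketch_vlan_update(s);
}

static void sketch_vlan_del_vid(struct sketch_vlan_state *s, bool is_double)
{
	if (is_double && s->num_double_vlans)
		s->num_double_vlans--;
	sketch_vlan_update(s);
}
```

Adding an 802.1Q VLAN after an 802.1AD one leaves the bits set, and
removing the last 802.1AD VLAN clears them even if 802.1Q VLANs remain.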
Signed-off-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com>
Link: https://patch.msgid.link/20260303145828.7845-3-ovidiu.panait.rb@renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
stmmac_vlan_rx_add_vid() updates active_vlans and the VLAN hash
register before writing the HW filter entry. If the filter write
fails, it leaves a stale VID in active_vlans and the hash register.
stmmac_vlan_rx_kill_vid() has the reverse problem: it clears
active_vlans before removing the HW filter. On failure, the VID is
gone from active_vlans but still present in the HW filter table.
To fix this, reorder the operations to update the hash table first,
then attempt the HW filter operation. If the HW filter fails, roll
back both the active_vlans bitmap and the hash table by calling
stmmac_vlan_update() again.
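A minimal userspace sketch of the update-then-roll-back ordering
described above, under stated assumptions: struct vlan_model, the
model_* helpers, the 64-entry VID space and the hw_filter_fails knob are
all hypothetical, and the "hash" is simply a copy of the bitmap rather
than the real hash computation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_VID 64 /* small hypothetical VID space for the sketch */

/* Hypothetical stand-in for the driver state touched by the fix. */
struct vlan_model {
	uint64_t active_vlans; /* software bitmap of configured VIDs */
	uint64_t hash_table;   /* models the VLAN hash register */
	uint64_t hw_filter;    /* models the per-VID HW filter entries */
	bool hw_filter_fails;  /* test knob: force the HW filter op to fail */
};

/* Recompute the modelled hash register from the bitmap. */
static void model_vlan_update(struct vlan_model *m)
{
	m->hash_table = m->active_vlans;
}

/* Add or remove one HW filter entry; fails when the knob is set. */
static int model_hw_filter_set(struct vlan_model *m, unsigned int vid, bool on)
{
	if (m->hw_filter_fails)
		return -1;
	if (on)
		m->hw_filter |= UINT64_C(1) << vid;
	else
		m->hw_filter &= ~(UINT64_C(1) << vid);
	return 0;
}

static int model_add_vid(struct vlan_model *m, unsigned int vid)
{
	int ret;

	assert(vid < MODEL_MAX_VID);
	m->active_vlans |= UINT64_C(1) << vid;
	model_vlan_update(m);

	ret = model_hw_filter_set(m, vid, true);
	if (ret) {
		/* Roll back both the bitmap and the hash. */
		m->active_vlans &= ~(UINT64_C(1) << vid);
		model_vlan_update(m);
	}
	return ret;
}

static int model_kill_vid(struct vlan_model *m, unsigned int vid)
{
	int ret;

	assert(vid < MODEL_MAX_VID);
	m->active_vlans &= ~(UINT64_C(1) << vid);
	model_vlan_update(m);

	ret = model_hw_filter_set(m, vid, false);
	if (ret) {
		/* Roll back: the VID is still present in the HW filter. */
		m->active_vlans |= UINT64_C(1) << vid;
		model_vlan_update(m);
	}
	return ret;
}
```

In this model, after a failed add or kill the bitmap, the hash and the
HW filter entries agree again, which is the invariant the fix aims for.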
Fixes: ed64639bc1 ("net: stmmac: Add support for VLAN Rx filtering")
Signed-off-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com>
Link: https://patch.msgid.link/20260303145828.7845-2-ovidiu.panait.rb@renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2026-03-03 (ice, libie, iavf, igb, igc)
Larysa removes VF restriction for LLDP filters on ice to allow for LLDP
traffic to reach the correct destination.
Jakub adds retry mechanism for AdminQ Read/Write SFF EEPROM call to

follow hardware specification on ice.
Zilin Guan adds cleanup path to free XDP rings on failure in
ice_set_ringparam().
Michal bypasses firmware logging unroll in libie when it isn't supported.
Kohei Enju fixes iavf to take into account hardware MTU support when
setting max MTU values.
Vivek Behera fixes issues on igb and igc using incorrect IRQs when Tx/Rx
queues do not share the same IRQ.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
igc: Fix trigger of incorrect irq in igc_xsk_wakeup function
igb: Fix trigger of incorrect irq in igb_xsk_wakeup
iavf: fix netdev->max_mtu to respect actual hardware limit
libie: don't unroll if fwlog isn't supported
ice: Fix memory leak in ice_set_ringparam()
ice: fix retry for AQ command 0x06EE
ice: reintroduce retry mechanism for indirect AQ
ice: fix adding AQ LLDP filter for VF
====================
Link: https://patch.msgid.link/20260303231155.2895065-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthieu Baerts says:
====================
mptcp: misc fixes for v7.0-rc2
Here are various unrelated fixes:
- Patch 1: avoid bufferbloat in simult_flows selftest which can cause
instabilities. A fix for v5.10.
- Patches 2-3: reduce RM_ADDR lost by not sending it over the same
subflow as the one being removed, if possible. A fix for v5.13.
- Patches 4-5: avoid a WARN when using signal + subflow endpoints with a
subflow limit of 0, and removing such endpoints during an active
connection. A fix for v5.17.
====================
Link: https://patch.msgid.link/20260303-net-mptcp-misc-fixes-7-0-rc2-v1-0-4b5462b6f016@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This validates the previous commit: endpoints with both the signal and
subflow flags should always be marked as used even if it was not
possible to create new subflows due to the MPTCP PM limits.
For this test, an extra endpoint is created with both the signal and the
subflow flags, and limits are set not to create extra subflows. In this
case, an ADD_ADDR is sent, but no subflows are created. Still, the local
endpoint is marked as used, and no warning is fired when removing the
endpoint, after having sent a RM_ADDR.
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it validates the previous fix for an issue introduced by this commit
ID.
Fixes: 85df533a78 ("mptcp: pm: do not ignore 'subflow' if 'signal' flag is also set")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260303-net-mptcp-misc-fixes-7-0-rc2-v1-5-4b5462b6f016@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzkaller managed to find a combination of actions that was generating
this warning:
msk->pm.local_addr_used == 0
WARNING: net/mptcp/pm_kernel.c:1071 at __mark_subflow_endp_available net/mptcp/pm_kernel.c:1071 [inline], CPU#1: syz.2.17/961
WARNING: net/mptcp/pm_kernel.c:1071 at mptcp_nl_remove_subflow_and_signal_addr net/mptcp/pm_kernel.c:1103 [inline], CPU#1: syz.2.17/961
WARNING: net/mptcp/pm_kernel.c:1071 at mptcp_pm_nl_del_addr_doit+0x81d/0x8f0 net/mptcp/pm_kernel.c:1210, CPU#1: syz.2.17/961
Modules linked in:
CPU: 1 UID: 0 PID: 961 Comm: syz.2.17 Not tainted 6.19.0-08368-gfafda3b4b06b #22 PREEMPT(full)
Hardware name: QEMU Ubuntu 25.10 PC v2 (i440FX + PIIX, + 10.1 machine, 1996), BIOS 1.17.0-debian-1.17.0-1build1 04/01/2014
RIP: 0010:__mark_subflow_endp_available net/mptcp/pm_kernel.c:1071 [inline]
RIP: 0010:mptcp_nl_remove_subflow_and_signal_addr net/mptcp/pm_kernel.c:1103 [inline]
RIP: 0010:mptcp_pm_nl_del_addr_doit+0x81d/0x8f0 net/mptcp/pm_kernel.c:1210
Code: 89 c5 e8 46 30 6f fe e9 21 fd ff ff 49 83 ed 80 e8 38 30 6f fe 4c 89 ef be 03 00 00 00 e8 db 49 df fe eb ac e8 24 30 6f fe 90 <0f> 0b 90 e9 1d ff ff ff e8 16 30 6f fe eb 05 e8 0f 30 6f fe e8 9a
RSP: 0018:ffffc90001663880 EFLAGS: 00010293
RAX: ffffffff82de1a6c RBX: 0000000000000000 RCX: ffff88800722b500
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff8880158b22d0 R08: 0000000000010425 R09: ffffffffffffffff
R10: ffffffff82de18ba R11: 0000000000000000 R12: ffff88800641a640
R13: ffff8880158b1880 R14: ffff88801ec3c900 R15: ffff88800641a650
FS: 00005555722c3500(0000) GS:ffff8880f909d000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f66346e0f60 CR3: 000000001607c000 CR4: 0000000000350ef0
Call Trace:
<TASK>
genl_family_rcv_msg_doit+0x117/0x180 net/netlink/genetlink.c:1115
genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
genl_rcv_msg+0x3a8/0x3f0 net/netlink/genetlink.c:1210
netlink_rcv_skb+0x16d/0x240 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x3e9/0x4c0 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x4aa/0x5b0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg+0xc9/0xf0 net/socket.c:742
____sys_sendmsg+0x272/0x3b0 net/socket.c:2592
___sys_sendmsg+0x2de/0x320 net/socket.c:2646
__sys_sendmsg net/socket.c:2678 [inline]
__do_sys_sendmsg net/socket.c:2683 [inline]
__se_sys_sendmsg net/socket.c:2681 [inline]
__x64_sys_sendmsg+0x110/0x1a0 net/socket.c:2681
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x143/0x440 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f66346f826d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc83d8bdc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f6634985fa0 RCX: 00007f66346f826d
RDX: 00000000040000b0 RSI: 0000200000000740 RDI: 0000000000000007
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6634985fa8
R13: 00007f6634985fac R14: 0000000000000000 R15: 0000000000001770
</TASK>
The actions that caused that seem to be:
- Set the MPTCP subflows limit to 0
- Create an MPTCP endpoint with both the 'signal' and 'subflow' flags
- Create a new MPTCP connection from a different address: an ADD_ADDR
linked to the MPTCP endpoint will be sent ('signal' flag), but no
subflow is initiated ('subflow' flag)
- Remove the MPTCP endpoint
In this case, msk->pm.local_addr_used has been kept to 0 -- because no
subflows have been created -- but the corresponding bit in
msk->pm.id_avail_bitmap has been cleared when the ADD_ADDR has been
sent. This later causes a splat when removing the MPTCP endpoint because
msk->pm.local_addr_used has been kept to 0.
Now, if an endpoint has both the signal and subflow flags, but it is not
possible to create subflows because of the limits or the c-flag case,
then the local endpoint counter is still incremented: the endpoint is
used at the end. This avoids issues later when removing the endpoint and
calling __mark_subflow_endp_available(), which expects
msk->pm.local_addr_used to have been previously incremented if the
endpoint was marked as used according to msk->pm.id_avail_bitmap.
Note that the signal_and_subflow variable is reset to false when the
limits and the c-flag case allow subflow creation. Also, local_addr_used is
only incremented for non ID0 subflows.
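The accounting problem and the fix can be illustrated with a simplified
userspace model. It is a sketch only: struct pm_model, model_announce(),
model_remove() and the warnings counter are hypothetical and only mirror
the relationship between id_avail_bitmap and local_addr_used described
above, not the real path manager code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the PM fields discussed in this patch. */
struct pm_model {
	uint64_t id_avail_bitmap;     /* bit set: endpoint ID still available */
	unsigned int local_addr_used; /* endpoints counted as used */
	unsigned int subflows_max;    /* models the subflows limit */
	unsigned int subflows;        /* subflows created so far */
	unsigned int warnings;        /* counts would-be WARN splats */
};

/* Use an endpoint that has both the 'signal' and 'subflow' flags. */
static void model_announce(struct pm_model *pm, unsigned int id, bool fixed)
{
	bool can_create = pm->subflows < pm->subflows_max;

	assert(id > 0 && id < 64); /* ID 0 is not counted */

	/* Sending the ADD_ADDR marks the endpoint ID as used. */
	pm->id_avail_bitmap &= ~(UINT64_C(1) << id);

	if (can_create) {
		pm->subflows++;
		pm->local_addr_used++;
	} else if (fixed) {
		/* Fix: still count the endpoint, it has been used. */
		pm->local_addr_used++;
	}
}

/* Remove the endpoint, making its ID available again. */
static void model_remove(struct pm_model *pm, unsigned int id)
{
	uint64_t bit = UINT64_C(1) << id;

	assert(id > 0 && id < 64);
	if (pm->id_avail_bitmap & bit)
		return; /* the endpoint was never marked as used */

	pm->id_avail_bitmap |= bit;
	if (pm->local_addr_used == 0) {
		pm->warnings++; /* models the splat reported by syzkaller */
		return;
	}
	pm->local_addr_used--;
}
```

With subflows_max set to 0, calling model_announce() with fixed set to
false and then model_remove() increments warnings; with fixed set to
true the counter is balanced and no warning is recorded.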
Fixes: 85df533a78 ("mptcp: pm: do not ignore 'subflow' if 'signal' flag is also set")
Cc: stable@vger.kernel.org
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/613
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260303-net-mptcp-misc-fixes-7-0-rc2-v1-4-4b5462b6f016@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This validates the previous commit: RM_ADDR were sent over the first
found active subflow which could be the same as the one being removed.
It is more likely to lose this notification.
For this check, RM_ADDR are explicitly dropped when trying to send them
over the initial subflow, when removing the endpoint attached to it. If
it is dropped, the test will complain because some RM_ADDR have not been
received.
Note that only the RM_ADDR are dropped, to allow the linked subflow to
be quickly and cleanly closed. To only drop those RM_ADDR, a cBPF byte
code is used. If the IPTables commands fail, that's OK, the tests will
continue to pass, but not validate this part. This can be ignored:
another subtest fully depends on such command, and will be marked as
skipped.
The 'Fixes' tag here below is the same as the one from the previous
commit: this patch here is not fixing anything wrong in the selftests,
but it validates the previous fix for an issue introduced by this commit
ID.
Fixes: 8dd5efb1f9 ("mptcp: send ack for rm_addr")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260303-net-mptcp-misc-fixes-7-0-rc2-v1-3-4b5462b6f016@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RM_ADDR are sent over an active subflow, the first one in the subflows
list. There is then a high chance the initial subflow is picked. With
the in-kernel PM, when an endpoint is removed, a RM_ADDR is sent, then
linked subflows are closed. This is done for each active MPTCP
connection.
MPTCP endpoints are likely removed because the attached network is no
longer available or usable. In this case, it is better to avoid sending
this RM_ADDR over the subflow that is going to be removed, but prefer
sending it over another active and non stale subflow, if any.
This modification avoids situations where the other end is not notified
when a subflow is no longer usable: typically when the endpoint linked
to the initial subflow is removed, especially on the server side.
Fixes: 8dd5efb1f9 ("mptcp: send ack for rm_addr")
Cc: stable@vger.kernel.org
Reported-by: Frank Lorenz <lorenz-frank@web.de>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/612
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260303-net-mptcp-misc-fixes-7-0-rc2-v1-2-4b5462b6f016@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
By default, the netem qdisc can keep up to 1000 packets under its belly
to deal with the configured rate and delay. The simult flows test-case
simulates very low speed links to avoid problems due to slow CPUs, and
the TCP stack tends to transmit at a slightly higher rate than the
(virtual) link constraints.
All the above causes a relatively large number of packets to be enqueued
in the netem qdiscs - the longer the transfer, the longer the queue -
producing increasingly high TCP RTT samples and consequently increasingly
larger receive buffer size due to DRS.
When the receive buffer size becomes considerably larger than the needed
size, the test results can flake, e.g. because a minimal inaccuracy in the
pacing rate can lead to a single subflow usage towards the end of the
connection for a considerable amount of data.
Address the issue by explicitly setting netem limits suitable for the
configured link speeds and unflake all the affected tests.
Fixes: 1a418cb8e8 ("mptcp: simult flow self-tests")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260303-net-mptcp-misc-fixes-7-0-rc2-v1-1-4b5462b6f016@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski says:
====================
nfc: fix leaks and races surfaced by NIPA
I recently added the nci test to NIPA. Somewhat surprisingly it runs
without much setup but hits kmemleaks fairly often. Fix a handful of
issues to make the test pass in a stable way.
====================
Link: https://patch.msgid.link/20260303162346.2071888-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In rawsock_release(), cancel any pending tx_work and purge the write
queue before orphaning the socket. rawsock_tx_work runs on the system
workqueue and calls nfc_data_exchange which dereferences the NCI
device. Without synchronization, tx_work can race with socket and
device teardown when a process is killed (e.g. by SIGKILL), leading
to use-after-free or leaked references.
Set SEND_SHUTDOWN first so that if tx_work is already running it will
see the flag and skip transmitting, then use cancel_work_sync to wait
for any in-progress execution to finish, and finally purge any
remaining queued skbs.
Fixes: 23b7869c0f ("NFC: add the NFC socket raw protocol")
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260303162346.2071888-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move clear_bit(NCI_DATA_EXCHANGE) before invoking the data exchange
callback in nci_data_exchange_complete().
The callback (e.g. rawsock_data_exchange_complete) may immediately
schedule another data exchange via schedule_work(tx_work). On a
multi-CPU system, tx_work can run and reach nci_transceive() before
the current nci_data_exchange_complete() clears the flag, causing
test_and_set_bit(NCI_DATA_EXCHANGE) to return -EBUSY and the new
transfer to fail.
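The ordering issue can be reproduced in a single-threaded model that
collapses the race into its worst case, i.e. the follow-up transfer runs
as soon as the callback asks for it. Everything here is hypothetical
(struct nci_model, the model_* helpers, the clear_first switch); the
real code uses bit operations and a workqueue rather than direct calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one NCI device's data exchange state. */
struct nci_model {
	bool data_exchange_busy; /* models the NCI_DATA_EXCHANGE bit */
	int transfers_started;
	int transfers_rejected;  /* attempts that hit the busy flag */
	int remaining;           /* follow-up transfers the callback wants */
};

/* Models nci_transceive(): refuse to start while an exchange is pending. */
static int model_transceive(struct nci_model *m)
{
	if (m->data_exchange_busy) {
		m->transfers_rejected++;
		return -1; /* stands in for -EBUSY */
	}
	m->data_exchange_busy = true;
	m->transfers_started++;
	return 0;
}

/* Models a completion callback that immediately queues the next transfer. */
static void model_callback(struct nci_model *m)
{
	if (m->remaining > 0) {
		m->remaining--;
		model_transceive(m);
	}
}

/* clear_first == false models the old ordering, true the fixed one. */
static void model_complete(struct nci_model *m, bool clear_first)
{
	assert(m->data_exchange_busy);

	if (clear_first) {
		m->data_exchange_busy = false;
		model_callback(m);
	} else {
		model_callback(m);
		m->data_exchange_busy = false;
	}
}
```

With the old ordering the follow-up transfer is rejected because the
flag is still set when the callback runs; with the flag cleared first,
the follow-up transfer starts and owns the flag afterwards.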
This causes intermittent flakes in nci/nci_dev in NIPA:
# # RUN NCI.NCI1_0.t4t_tag_read ...
# # t4t_tag_read: Test terminated by timeout
# # FAIL NCI.NCI1_0.t4t_tag_read
# not ok 3 NCI.NCI1_0.t4t_tag_read
Fixes: 38f04c6b1b ("NFC: protect nci_data_exchange transactions")
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260303162346.2071888-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
digital_in_send() takes ownership of the skb passed by the caller
(nfc_data_exchange), make sure it's freed on all error paths.
Found looking around the real driver for similar bugs to the one
just fixed in nci.
Fixes: 2c66daecc4 ("NFC Digital: Add NFC-A technology support")
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260303162346.2071888-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
binding->dev is protected on the write-side in
mp_dmabuf_devmem_uninstall() against concurrent writes, but due to the
concurrent bare reads in net_devmem_get_binding() and
validate_xmit_unreadable_skb() it should be wrapped in a
READ_ONCE/WRITE_ONCE pair to make sure no compiler optimizations play
with the underlying register in unforeseen ways.
Doesn't present a critical bug because the known compiler optimizations
don't result in bad behavior. There is no tearing on u64, and load
omissions/invented loads would only break if additional binding->dev
references were inlined together (they aren't right now).
This just more strictly follows the linux memory model (i.e.,
"Lock-Protected Writes With Lockless Reads" in
tools/memory-model/Documentation/access-marking.txt).
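As a rough userspace illustration of the marked-access pattern, the
sketch below approximates READ_ONCE()/WRITE_ONCE() with volatile
accesses. The MODEL_* macros, struct model_dev and struct model_binding
are hypothetical, the macros rely on the GCC/Clang __typeof__ extension,
and the writer-side lock is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace approximations of the kernel's READ_ONCE()/WRITE_ONCE():
 * a single volatile load or store of the given lvalue.
 */
#define MODEL_READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define MODEL_WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct model_dev {
	int ifindex;
};

struct model_binding {
	struct model_dev *dev; /* written under a lock, read locklessly */
};

/* Writer side: in the kernel this would run with the lock held. */
static void model_uninstall(struct model_binding *b)
{
	assert(b != NULL);
	MODEL_WRITE_ONCE(b->dev, NULL);
}

/* Lockless reader: load the pointer once, then use only the local copy. */
static int model_binding_ifindex(struct model_binding *b)
{
	struct model_dev *dev;

	assert(b != NULL);
	dev = MODEL_READ_ONCE(b->dev);
	if (!dev)
		return -1; /* binding no longer attached to a device */
	return dev->ifindex;
}
```

The point of the local copy is that the NULL check and the dereference
are guaranteed to use the same loaded value, whatever the writer does in
between.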
Fixes: bd61848900 ("net: devmem: Implement TX path")
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260302-devmem-membar-fix-v2-1-5b33c9cbc28b@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When/if a NIC resets, queues are deactivated by dev_deactivate_many(),
then reactivated when the reset operation completes.
fq_reset() removes all the skbs from various queues.
If we do not clear q->band_pkt_count[], these counters keep growing
and can eventually reach sch->limit, preventing new packets from being queued.
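The effect of the stale counters can be shown with a small model. It is
a sketch under assumptions: struct fq_model, model_enqueue() and
model_reset() are hypothetical, and the enqueue check (per-band count
compared against the limit) is a simplification of what the qdisc does.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_BANDS 3

/* Hypothetical model of the qdisc state relevant to this fix. */
struct fq_model {
	unsigned int limit; /* models sch->limit */
	unsigned int qlen;  /* packets currently queued */
	unsigned int band_pkt_count[MODEL_BANDS];
};

/* Refuse the packet once the band counter has reached the limit. */
static int model_enqueue(struct fq_model *q, unsigned int band)
{
	assert(band < MODEL_BANDS);
	if (q->band_pkt_count[band] >= q->limit)
		return -1; /* packet dropped */
	q->band_pkt_count[band]++;
	q->qlen++;
	return 0;
}

/* Drop all queued packets; the fix also clears the per-band counters. */
static void model_reset(struct fq_model *q, bool clear_band_counts)
{
	q->qlen = 0;
	if (clear_band_counts)
		memset(q->band_pkt_count, 0, sizeof(q->band_pkt_count));
}
```

In this model, repeated resets without clearing the counters end with an
empty queue that still rejects every packet on the affected band.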
Many thanks to Praveen for discovering the root cause.
Fixes: 29f834aa32 ("net_sched: sch_fq: add 3 bands and WRR scheduling")
Diagnosed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260304015640.961780-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts 28ee1b746f ("secure_seq: downgrade to per-host timestamp offsets")
tcp_tw_recycle went away in 2017.
Zhouyan Deng reported off-path TCP source port leakage via
SYN cookie side-channel that can be fixed in multiple ways.
One of them is to bring back TCP ports in TS offset randomization.
As a bonus, we perform a single siphash() computation
to provide both an ISN and a TS offset.
Fixes: 28ee1b746f ("secure_seq: downgrade to per-host timestamp offsets")
Reported-by: Zhouyan Deng <dengzhouyan_nwpu@163.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260302205527.1982836-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When shrinking the number of real tx queues,
netif_set_real_num_tx_queues() calls qdisc_reset_all_tx_gt() to flush
qdiscs for queues which will no longer be used.
qdisc_reset_all_tx_gt() currently serializes qdisc_reset() with
qdisc_lock(). However, for lockless qdiscs, the dequeue path is
serialized by qdisc_run_begin/end() using qdisc->seqlock instead, so
qdisc_reset() can run concurrently with __qdisc_run() and free skbs
while they are still being dequeued, leading to UAF.
This can easily be reproduced on e.g. virtio-net by imposing heavy
traffic while frequently changing the number of queue pairs:
iperf3 -ub0 -c $peer -t 0 &
while :; do
ethtool -L eth0 combined 1
ethtool -L eth0 combined 2
done
With KASAN enabled, this leads to reports like:
BUG: KASAN: slab-use-after-free in __qdisc_run+0x133f/0x1760
...
Call Trace:
<TASK>
...
__qdisc_run+0x133f/0x1760
__dev_queue_xmit+0x248f/0x3550
ip_finish_output2+0xa42/0x2110
ip_output+0x1a7/0x410
ip_send_skb+0x2e6/0x480
udp_send_skb+0xb0a/0x1590
udp_sendmsg+0x13c9/0x1fc0
...
</TASK>
Allocated by task 1270 on cpu 5 at 44.558414s:
...
alloc_skb_with_frags+0x84/0x7c0
sock_alloc_send_pskb+0x69a/0x830
__ip_append_data+0x1b86/0x48c0
ip_make_skb+0x1e8/0x2b0
udp_sendmsg+0x13a6/0x1fc0
...
Freed by task 1306 on cpu 3 at 44.558445s:
...
kmem_cache_free+0x117/0x5e0
pfifo_fast_reset+0x14d/0x580
qdisc_reset+0x9e/0x5f0
netif_set_real_num_tx_queues+0x303/0x840
virtnet_set_channels+0x1bf/0x260 [virtio_net]
ethnl_set_channels+0x684/0xae0
ethnl_default_set_doit+0x31a/0x890
...
Serialize qdisc_reset_all_tx_gt() against the lockless dequeue path by
taking qdisc->seqlock for TCQ_F_NOLOCK qdiscs, matching the
serialization model already used by dev_reset_queue().
Additionally clear QDISC_STATE_NON_EMPTY after reset so the qdisc state
reflects an empty queue, avoiding needless re-scheduling.
Fixes: 6b3ba9146f ("net: sched: allow qdiscs to handle locking")
Signed-off-by: Koichiro Den <den@valinux.co.jp>
Link: https://patch.msgid.link/20260228145307.3955532-1-den@valinux.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'linux-can-fixes-for-7.0-20260302' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2026-03-02
The first 2 patches are by Oliver Hartkopp. The first fixes the
locking for CAN Broadcast Manager op runtime updates, the second fixes
the packet statistics for the CAN dummy driver.
Alban Bedel's patch fixes a potential problem in the error path of the
mcp251x's ndo_open callback.
A patch by Ziyi Guo adds USB endpoint type validation to the esd_usb
driver.
The next 6 patches are by Greg Kroah-Hartman and fix URB data parsing
for the ems_usb and ucan driver, fix URB anchoring in the etas_es58x,
and in the f81604 driver fix URB data parsing, add URB error handling
and fix URB anchoring.
A patch by me targets the gs_usb driver and fixes interoperability
with the CANable-2.5 firmware by always configuring the bit rate
before starting the device.
The last patch is by Frank Li and fixes a CHECK_DTBS warning for the
nxp,sja1000 dt-binding.
* tag 'linux-can-fixes-for-7.0-20260302' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
dt-bindings: net: can: nxp,sja1000: add reference to mc-peripheral-props.yaml
can: gs_usb: gs_can_open(): always configure bitrates before starting device
can: usb: f81604: correctly anchor the urb in the read bulk callback
can: usb: f81604: handle bulk write errors properly
can: usb: f81604: handle short interrupt urb messages properly
can: usb: etas_es58x: correctly anchor the urb in the read bulk callback
can: ucan: Fix infinite loop from zero-length messages
can: ems_usb: ems_usb_read_bulk_callback(): check the proper length of a message
can: esd_usb: add endpoint type validation
can: mcp251x: fix deadlock in error path of mcp251x_open
can: dummy_can: dummy_can_init(): fix packet statistics
can: bcm: fix locking for bcm_op runtime updates
====================
Link: https://patch.msgid.link/20260302152755.1700177-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>