linux

mirror of https://github.com/torvalds/linux.git synced 2026-03-08 04:04:43 +01:00

Author	SHA1	Message	Date
Naveen Anandhan	fbdfa8da05	selftests: tc-testing: fix list_categories() crash on list type list_categories() builds a set directly from the 'category' field of each test case. Since 'category' is a list, set(map(...)) attempts to insert lists into a set, which raises: TypeError: unhashable type: 'list' Flatten category lists and collect unique category names using set.update() instead. Signed-off-by: Naveen Anandhan <mr.navi8680@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2026-03-04 05:42:57 +00:00
Eric Biggers	46d0d6f50d	net/tcp-md5: Fix MAC comparison to be constant-time To prevent timing attacks, MACs need to be compared in constant time. Use the appropriate helper function for this. Fixes: `cfb6eeb4c8` ("[TCP]: MD5 Signature Option (RFC2385) support.") Fixes: `658ddaaf66` ("tcp: md5: RST: getting md5 key from listener") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20260302203409.13388-1-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-03 18:39:43 -08:00
Stephen Hemminger	7f5d8e63f3	MAINTAINERS: update the skge/sky2 maintainers Mark the skge and sky2 drivers as orphan. I no longer have any Marvell/SysKonnect boards to test with and mail to Mirko Lindner bounced because Marvell sold off that divsion. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://patch.msgid.link/20260302195120.187183-1-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-03 18:39:27 -08:00
Raju Rangoju	e2f27363aa	amd-xgbe: fix sleep while atomic on suspend/resume The xgbe_powerdown() and xgbe_powerup() functions use spinlocks (spin_lock_irqsave) while calling functions that may sleep: - napi_disable() can sleep waiting for NAPI polling to complete - flush_workqueue() can sleep waiting for pending work items This causes a "BUG: scheduling while atomic" error during suspend/resume cycles on systems using the AMD XGBE Ethernet controller. The spinlock protection in these functions is unnecessary as these functions are called from suspend/resume paths which are already serialized by the PM core Fix this by removing the spinlock. Since only code that takes this lock is xgbe_powerdown() and xgbe_powerup(), remove it completely. Fixes: `c5aa9e3b81` ("amd-xgbe: Initial AMD 10GbE platform driver") Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Link: https://patch.msgid.link/20260302042124.1386445-1-Raju.Rangoju@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-03 17:24:38 -08:00
Breno Leitao	5af6e8b549	netconsole: fix sysdata_release_enabled_show checking wrong flag sysdata_release_enabled_show() checks SYSDATA_TASKNAME instead of SYSDATA_RELEASE, causing the configfs release_enabled attribute to reflect the taskname feature state rather than the release feature state. This is a copy-paste error from the adjacent sysdata_taskname_enabled_show() function. The corresponding _store function already uses the correct SYSDATA_RELEASE flag. Fixes: `343f902270` ("netconsole: implement configfs for release_enabled") Signed-off-by: Breno Leitao <leitao@debian.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260302-sysdata_release_fix-v1-1-e5090f677c7c@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-03 17:23:50 -08:00
Yung Chih Su	4ee7fa6cf7	net: ipv4: fix ARM64 alignment fault in multipath hash seed `struct sysctl_fib_multipath_hash_seed` contains two u32 fields (user_seed and mp_seed), making it an 8-byte structure with a 4-byte alignment requirement. In `fib_multipath_hash_from_keys()`, the code evaluates the entire struct atomically via `READ_ONCE()`: mp_seed = READ_ONCE(net->ipv4.sysctl_fib_multipath_hash_seed).mp_seed; While this silently works on GCC by falling back to unaligned regular loads which the ARM64 kernel tolerates, it causes a fatal kernel panic when compiled with Clang and LTO enabled. Commit `e35123d83e` ("arm64: lto: Strengthen READ_ONCE() to acquire when CONFIG_LTO=y") strengthens `READ_ONCE()` to use Load-Acquire instructions (`ldar` / `ldapr`) to prevent compiler reordering bugs under Clang LTO. Since the macro evaluates the full 8-byte struct, Clang emits a 64-bit `ldar` instruction. ARM64 architecture strictly requires `ldar` to be naturally aligned, thus executing it on a 4-byte aligned address triggers a strict Alignment Fault (FSC = 0x21). Fix the read side by moving the `READ_ONCE()` directly to the `u32` member, which emits a safe 32-bit `ldar Wn`. Furthermore, Eric Dumazet pointed out that `WRITE_ONCE()` on the entire struct in `proc_fib_multipath_hash_set_seed()` is also flawed. Analysis shows that Clang splits this 8-byte write into two separate 32-bit `str` instructions. While this avoids an alignment fault, it destroys atomicity and exposes a tear-write vulnerability. Fix this by explicitly splitting the write into two 32-bit `WRITE_ONCE()` operations. Finally, add the missing `READ_ONCE()` when reading `user_seed` in `proc_fib_multipath_hash_seed()` to ensure proper pairing and concurrency safety. Fixes: `4ee2a8cace` ("net: ipv4: Add a sysctl to set multipath hash seed") Signed-off-by: Yung Chih Su <yuuchihsu@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260302060247.7066-1-yuuchihsu@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-03 17:20:37 -08:00
Eric Biggers	67edfec516	net/tcp-ao: Fix MAC comparison to be constant-time To prevent timing attacks, MACs need to be compared in constant time. Use the appropriate helper function for this. Fixes: `0a3a809089` ("net/tcp: Verify inbound TCP-AO signed segments") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20260302203600.13561-1-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-03 17:16:54 -08:00
Jakub Kicinski	2ffb4f5c2c	ipv6: fix NULL pointer deref in ip6_rt_get_dev_rcu() l3mdev_master_dev_rcu() can return NULL when the slave device is being un-slaved from a VRF. All other callers deal with this, but we lost the fallback to loopback in ip6_rt_pcpu_alloc() -> ip6_rt_get_dev_rcu() with commit `4832c30d54` ("net: ipv6: put host and anycast routes on device with address"). KASAN: null-ptr-deref in range [0x0000000000000108-0x000000000000010f] RIP: 0010:ip6_rt_pcpu_alloc (net/ipv6/route.c:1418) Call Trace: ip6_pol_route (net/ipv6/route.c:2318) fib6_rule_lookup (net/ipv6/fib6_rules.c:115) ip6_route_output_flags (net/ipv6/route.c:2607) vrf_process_v6_outbound (drivers/net/vrf.c:437) I was tempted to rework the un-slaving code to clear the flag first and insert synchronize_rcu() before we remove the upper. But looks like the explicit fallback to loopback_dev is an established pattern. And I guess avoiding the synchronize_rcu() is nice, too. Fixes: `4832c30d54` ("net: ipv6: put host and anycast routes on device with address") Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260301194548.927324-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-03 17:14:48 -08:00
YiFei Zhu	1a86a1f7d8	net: Fix rcu_tasks stall in threaded busypoll I was debugging a NIC driver when I noticed that when I enable threaded busypoll, bpftrace hangs when starting up. dmesg showed: rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 10658 jiffies old. rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 40793 jiffies old. rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 131273 jiffies old. rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 402058 jiffies old. INFO: rcu_tasks detected stalls on tasks: 00000000769f52cd: .N nvcsw: 2/2 holdout: 1 idle_cpu: -1/64 task:napi/eth2-8265 state:R running task stack:0 pid:48300 tgid:48300 ppid:2 task_flags:0x208040 flags:0x00004000 Call Trace: <TASK> ? napi_threaded_poll_loop+0x27c/0x2c0 ? __pfx_napi_threaded_poll+0x10/0x10 ? napi_threaded_poll+0x26/0x80 ? kthread+0xfa/0x240 ? __pfx_kthread+0x10/0x10 ? ret_from_fork+0x31/0x50 ? __pfx_kthread+0x10/0x10 ? ret_from_fork_asm+0x1a/0x30 </TASK> The cause is that in threaded busypoll, the main loop is in napi_threaded_poll rather than napi_threaded_poll_loop, where the latter rarely iterates more than once within its loop. For rcu_softirq_qs_periodic inside napi_threaded_poll_loop to report its qs state, the last_qs must be 100ms behind, and this can't happen because napi_threaded_poll_loop rarely iterates in threaded busypoll, and each time napi_threaded_poll_loop is called last_qs is reset to latest jiffies. This patch changes so that in threaded busypoll, last_qs is saved in the outer napi_threaded_poll, and whether busy_poll_last_qs is NULL indicates whether napi_threaded_poll_loop is called for busypoll. This way last_qs would not reset to latest jiffies on each invocation of napi_threaded_poll_loop. Fixes: `c18d4b190a` ("net: Extend NAPI threaded polling to allow kthread based busy polling") Cc: stable@vger.kernel.org Signed-off-by: YiFei Zhu <zhuyifei@google.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Link: https://patch.msgid.link/20260227221937.1060857-1-zhuyifei@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 13:44:28 +01:00
Allison Henderson	6a877ececd	net/rds: Fix circular locking dependency in rds_tcp_tune syzbot reported a circular locking dependency in rds_tcp_tune() where sk_net_refcnt_upgrade() is called while holding the socket lock: ====================================================== WARNING: possible circular locking dependency detected ====================================================== kworker/u10:8/15040 is trying to acquire lock: ffffffff8e9aaf80 (fs_reclaim){+.+.}-{0:0}, at: __kmalloc_cache_noprof+0x4b/0x6f0 but task is already holding lock: ffff88805a3c1ce0 (k-sk_lock-AF_INET6){+.+.}-{0:0}, at: rds_tcp_tune+0xd7/0x930 The issue occurs because sk_net_refcnt_upgrade() performs memory allocation (via get_net_track() -> ref_tracker_alloc()) while the socket lock is held, creating a circular dependency with fs_reclaim. Fix this by moving sk_net_refcnt_upgrade() outside the socket lock critical section. This is safe because the fields modified by the sk_net_refcnt_upgrade() call (sk_net_refcnt, ns_tracker) are not accessed by any concurrent code path at this point. v2: - Corrected fixes tag - check patch line wrap nits - ai commentary nits Reported-by: syzbot+2e2cf5331207053b8106@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2e2cf5331207053b8106 Fixes: `3a58f13a88` ("net: rds: acquire refcount on TCP sockets") Signed-off-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260227202336.167757-1-achender@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 12:57:06 +01:00
Eric Dumazet	710f5c7658	indirect_call_wrapper: do not reevaluate function pointer We have an increasing number of READ_ONCE(xxx->function) combined with INDIRECT_CALL_[1234]() helpers. Unfortunately this forces INDIRECT_CALL_[1234]() to read xxx->function many times, which is not what we wanted. Fix these macros so that xxx->function value is not reloaded. $ scripts/bloat-o-meter -t vmlinux.0 vmlinux add/remove: 0/0 grow/shrink: 1/65 up/down: 122/-1084 (-962) Function old new delta ip_push_pending_frames 59 181 +122 ip6_finish_output 687 681 -6 __udp_enqueue_schedule_skb 1078 1072 -6 ioam6_output 2319 2312 -7 xfrm4_rcv_encap_finish2 64 56 -8 xfrm4_output 297 289 -8 vrf_ip_local_out 278 270 -8 vrf_ip6_local_out 278 270 -8 seg6_input_finish 64 56 -8 rpl_output 700 692 -8 ipmr_forward_finish 124 116 -8 ip_forward_finish 143 135 -8 ip6mr_forward2_finish 100 92 -8 ip6_forward_finish 73 65 -8 input_action_end_bpf 1091 1083 -8 dst_input 52 44 -8 __xfrm6_output 801 793 -8 __xfrm4_output 83 75 -8 bpf_input 500 491 -9 __tcp_check_space 530 521 -9 input_action_end_dt6 291 280 -11 vti6_tnl_xmit 1634 1622 -12 bpf_xmit 1203 1191 -12 rpl_input 497 483 -14 rawv6_send_hdrinc 1355 1341 -14 ndisc_send_skb 1030 1016 -14 ipv6_srh_rcv 1377 1363 -14 ip_send_unicast_reply 1253 1239 -14 ip_rcv_finish 226 212 -14 ip6_rcv_finish 300 286 -14 input_action_end_x_core 205 191 -14 input_action_end_x 355 341 -14 input_action_end_t 205 191 -14 input_action_end_dx6_finish 127 113 -14 input_action_end_dx4_finish 373 359 -14 input_action_end_dt4 426 412 -14 input_action_end_core 186 172 -14 input_action_end_b6_encap 292 278 -14 input_action_end_b6 198 184 -14 igmp6_send 1332 1318 -14 ip_sublist_rcv 864 848 -16 ip6_sublist_rcv 1091 1075 -16 ipv6_rpl_srh_rcv 1937 1920 -17 xfrm_policy_queue_process 1246 1228 -18 seg6_output_core 903 885 -18 mld_sendpack 856 836 -20 NF_HOOK 756 736 -20 vti_tunnel_xmit 1447 1426 -21 input_action_end_dx6 664 642 -22 input_action_end 1502 1480 -22 sock_sendmsg_nosec 134 111 -23 ip6mr_forward2 388 364 -24 sock_recvmsg_nosec 134 109 -25 seg6_input_core 836 810 -26 ip_send_skb 172 146 -26 ip_local_out 140 114 -26 ip6_local_out 140 114 -26 __sock_sendmsg 162 136 -26 __ip_queue_xmit 1196 1170 -26 __ip_finish_output 405 379 -26 ipmr_queue_fwd_xmit 373 346 -27 sock_recvmsg 173 145 -28 ip6_xmit 1635 1607 -28 xfrm_output_resume 1418 1389 -29 ip_build_and_send_pkt 625 591 -34 dst_output 504 432 -72 Total: Before=25217686, After=25216724, chg -0.00% Fixes: `283c16a2df` ("indirect call wrappers: helpers to speed-up indirect calls of builtin") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260227172603.1700433-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 12:41:29 +01:00
Paolo Abeni	699f3b2e51	Merge branch 'avoid-compiler-and-iq-oq-reordering' Vimlesh Kumar says: ==================== avoid compiler and IQ/OQ reordering Utilize READ_ONCE and WRITE_ONCE APIs to prevent compiler optimization and reordering. Ensure IO queue OUT/IN_CNT registers are flushed. Relocate IQ/OQ IN/OUT_CNTS updates to occur before NAPI completion, and replace napi_complete with napi_complete_done. ==================== Link: https://patch.msgid.link/20260227091402.1773833-1-vimleshk@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 11:34:22 +01:00
Vimlesh Kumar	6c73126ecd	octeon_ep_vf: avoid compiler and IQ/OQ reordering Utilize READ_ONCE and WRITE_ONCE APIs for IO queue Tx/Rx variable access to prevent compiler optimization and reordering. Additionally, ensure IO queue OUT/IN_CNT registers are flushed by performing a read-back after writing. The compiler could reorder reads/writes to pkts_pending, last_pkt_count, etc., causing stale values to be used when calculating packets to process or register updates to send to hardware. The Octeon hardware requires a read-back after writing to OUT_CNT/IN_CNT registers to ensure the write has been flushed through any posted write buffers before the interrupt resend bit is set. Without this, we have observed cases where the hardware didn't properly update its internal state. wmb/rmb only provides ordering guarantees but doesn't prevent the compiler from performing optimizations like caching in registers, load tearing etc. Fixes: `1cd3b40797` ("octeon_ep_vf: add Tx/Rx processing and interrupt support") Signed-off-by: Sathesh Edara <sedara@marvell.com> Signed-off-by: Shinas Rasheed <srasheed@marvell.com> Signed-off-by: Vimlesh Kumar <vimleshk@marvell.com> Link: https://patch.msgid.link/20260227091402.1773833-5-vimleshk@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 11:34:20 +01:00
Vimlesh Kumar	2ae7d20fb2	octeon_ep_vf: Relocate counter updates before NAPI Relocate IQ/OQ IN/OUT_CNTS updates to occur before NAPI completion. Moving the IQ/OQ counter updates before napi_complete_done ensures 1. Counter registers are updated before re-enabling interrupts. 2. Prevents a race where new packets arrive but counters aren't properly synchronized. Fixes: `1cd3b40797` ("octeon_ep_vf: add Tx/Rx processing and interrupt support") Signed-off-by: Sathesh Edara <sedara@marvell.com> Signed-off-by: Shinas Rasheed <srasheed@marvell.com> Signed-off-by: Vimlesh Kumar <vimleshk@marvell.com> Link: https://patch.msgid.link/20260227091402.1773833-4-vimleshk@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 11:34:20 +01:00
Vimlesh Kumar	43b3160cb6	octeon_ep: avoid compiler and IQ/OQ reordering Utilize READ_ONCE and WRITE_ONCE APIs for IO queue Tx/Rx variable access to prevent compiler optimization and reordering. Additionally, ensure IO queue OUT/IN_CNT registers are flushed by performing a read-back after writing. The compiler could reorder reads/writes to pkts_pending, last_pkt_count, etc., causing stale values to be used when calculating packets to process or register updates to send to hardware. The Octeon hardware requires a read-back after writing to OUT_CNT/IN_CNT registers to ensure the write has been flushed through any posted write buffers before the interrupt resend bit is set. Without this, we have observed cases where the hardware didn't properly update its internal state. wmb/rmb only provides ordering guarantees but doesn't prevent the compiler from performing optimizations like caching in registers, load tearing etc. Fixes: `37d79d0596` ("octeon_ep: add Tx/Rx processing and interrupt support") Signed-off-by: Sathesh Edara <sedara@marvell.com> Signed-off-by: Shinas Rasheed <srasheed@marvell.com> Signed-off-by: Vimlesh Kumar <vimleshk@marvell.com> Link: https://patch.msgid.link/20260227091402.1773833-3-vimleshk@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 11:34:20 +01:00
Vimlesh Kumar	18c04a808c	octeon_ep: Relocate counter updates before NAPI Relocate IQ/OQ IN/OUT_CNTS updates to occur before NAPI completion, and replace napi_complete with napi_complete_done. Moving the IQ/OQ counter updates before napi_complete_done ensures 1. Counter registers are updated before re-enabling interrupts. 2. Prevents a race where new packets arrive but counters aren't properly synchronized. napi_complete_done (vs napi_complete) allows for better interrupt coalescing. Fixes: `37d79d0596` ("octeon_ep: add Tx/Rx processing and interrupt support") Signed-off-by: Sathesh Edara <sedara@marvell.com> Signed-off-by: Shinas Rasheed <srasheed@marvell.com> Signed-off-by: Vimlesh Kumar <vimleshk@marvell.com> Link: https://patch.msgid.link/20260227091402.1773833-2-vimleshk@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 11:34:20 +01:00
Paolo Abeni	210fd8f408	Merge branch 'bonding-fix-missing-xdp-compat-check-on-xmit_hash_policy-change' Jiayuan Chen says: ==================== bonding: fix missing XDP compat check on xmit_hash_policy change syzkaller reported a bug https://syzkaller.appspot.com/bug?extid=5a287bcdc08104bc3132 When a bond device is in 802.3ad or balance-xor mode, XDP is supported only when xmit_hash_policy != vlan+srcmac. This constraint is enforced in bond_option_mode_set() via bond_xdp_check(), which prevents switching to an XDP-incompatible mode while a program is loaded. However, the symmetric path -- changing xmit_hash_policy while XDP is loaded -- had no such guard in bond_option_xmit_hash_policy_set(). This means the following sequence silently creates an inconsistent state: 1. Create a bond in 802.3ad mode with xmit_hash_policy=layer2+3. 2. Attach a native XDP program to the bond. 3. Change xmit_hash_policy to vlan+srcmac (no error, not checked). Now bond->xdp_prog is set but bond_xdp_check() returns false for the same device. When the bond is later torn down (e.g. netns deletion), dev_xdp_uninstall() calls bond_xdp_set(dev, NULL) to remove the program, which hits the bond_xdp_check() guard and returns -EOPNOTSUPP, triggering a kernel WARNING: bond1 (unregistering): Error: No native XDP support for the current bonding mode ------------[ cut here ]------------ dev_xdp_install(dev, mode, bpf_op, NULL, 0, NULL) WARNING: net/core/dev.c:10361 at dev_xdp_uninstall net/core/dev.c:10361 [inline], CPU#0: kworker/u8:22/11031 Modules linked in: CPU: 0 UID: 0 PID: 11031 Comm: kworker/u8:22 Not tainted syzkaller #0 PREEMPT(full) Workqueue: netns cleanup_net RIP: 0010:dev_xdp_uninstall net/core/dev.c:10361 [inline] RIP: 0010:unregister_netdevice_many_notify+0x1efd/0x2370 net/core/dev.c:12393 RSP: 0018:ffffc90003b2f7c0 EFLAGS: 00010293 RAX: ffffffff8971e99c RBX: ffff888052f84c40 RCX: ffff88807896bc80 RDX: 0000000000000000 RSI: 00000000ffffffa1 RDI: 0000000000000000 RBP: ffffc90003b2f930 R08: ffffc90003b2f207 R09: 1ffff92000765e40 R10: dffffc0000000000 R11: fffff52000765e41 R12: 00000000ffffffa1 R13: ffff888052f84c38 R14: 1ffff1100a5f0988 R15: ffffc9000df67000 FS: 0000000000000000(0000) GS:ffff8881254ae000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f60871d5d58 CR3: 000000006c41c000 CR4: 00000000003526f0 Call Trace: <TASK> ops_exit_rtnl_list net/core/net_namespace.c:187 [inline] ops_undo_list+0x3d3/0x940 net/core/net_namespace.c:248 cleanup_net+0x56b/0x800 net/core/net_namespace.c:704 process_one_work kernel/workqueue.c:3275 [inline] process_scheduled_works+0xaec/0x17a0 kernel/workqueue.c:3358 worker_thread+0xa50/0xfc0 kernel/workqueue.c:3439 kthread+0x388/0x470 kernel/kthread.c:467 ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> Beyond the WARNING itself, when dev_xdp_install() fails during dev_xdp_uninstall(), bond_xdp_set() returns early without calling bpf_prog_put() on the old program. dev_xdp_uninstall() then releases only the reference held by dev->xdp_state[], while the reference held by bond->xdp_prog is never dropped, leaking the struct bpf_prog. The fix refactors the core logic of bond_xdp_check() into a new helper __bond_xdp_check_mode(mode, xmit_policy) that takes both parameters explicitly, avoiding the need to read them from the bond struct. bond_xdp_check() becomes a thin wrapper around it. bond_option_xmit_hash_policy_set() then uses __bond_xdp_check_mode() directly, passing the candidate xmit_policy before it is committed, mirroring exactly what bond_option_mode_set() already does for mode changes. Patch 1 adds the kernel fix. Patch 2 adds a selftest that reproduces the WARNING by attaching native XDP to a bond in 802.3ad mode, then attempting to change xmit_hash_policy to vlan+srcmac -- verifying the change is rejected with the fix applied. ==================== Link: https://patch.msgid.link/20260226080306.98766-1-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 10:47:44 +01:00
Jiayuan Chen	181cafbd8a	selftests/bpf: add test for xdp_bonding xmit_hash_policy compat Add a selftest to verify that changing xmit_hash_policy to vlan+srcmac is rejected when a native XDP program is loaded on a bond in 802.3ad mode. Without the fix in bond_option_xmit_hash_policy_set(), the change succeeds silently, creating an inconsistent state that triggers a kernel WARNING in dev_xdp_uninstall() when the bond is torn down. The test attaches native XDP to a bond0 (802.3ad, layer2+3), then attempts to switch xmit_hash_policy to vlan+srcmac and asserts the operation fails. It also verifies the change succeeds after XDP is detached, confirming the rejection is specific to the XDP-loaded state. Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Link: https://patch.msgid.link/20260226080306.98766-3-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 10:47:38 +01:00
Jiayuan Chen	479d589b40	bpf/bonding: reject vlan+srcmac xmit_hash_policy change when XDP is loaded bond_option_mode_set() already rejects mode changes that would make a loaded XDP program incompatible via bond_xdp_check(). However, bond_option_xmit_hash_policy_set() has no such guard. For 802.3ad and balance-xor modes, bond_xdp_check() returns false when xmit_hash_policy is vlan+srcmac, because the 802.1q payload is usually absent due to hardware offload. This means a user can: 1. Attach a native XDP program to a bond in 802.3ad/balance-xor mode with a compatible xmit_hash_policy (e.g. layer2+3). 2. Change xmit_hash_policy to vlan+srcmac while XDP remains loaded. This leaves bond->xdp_prog set but bond_xdp_check() now returning false for the same device. When the bond is later destroyed, dev_xdp_uninstall() calls bond_xdp_set(dev, NULL, NULL) to remove the program, which hits the bond_xdp_check() guard and returns -EOPNOTSUPP, triggering: WARN_ON(dev_xdp_install(dev, mode, bpf_op, NULL, 0, NULL)) Fix this by rejecting xmit_hash_policy changes to vlan+srcmac when an XDP program is loaded on a bond in 802.3ad or balance-xor mode. commit `39a0876d59` ("net, bonding: Disallow vlan+srcmac with XDP") introduced bond_xdp_check() which returns false for 802.3ad/balance-xor modes when xmit_hash_policy is vlan+srcmac. The check was wired into bond_xdp_set() to reject XDP attachment with an incompatible policy, but the symmetric path -- preventing xmit_hash_policy from being changed to an incompatible value after XDP is already loaded -- was left unguarded in bond_option_xmit_hash_policy_set(). Note: commit `094ee6017e` ("bonding: check xdp prog when set bond mode") later added a similar guard to bond_option_mode_set(), but bond_option_xmit_hash_policy_set() remained unprotected. Reported-by: syzbot+5a287bcdc08104bc3132@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6995aff6.050a0220.2eeac1.014e.GAE@google.com/T/ Fixes: `39a0876d59` ("net, bonding: Disallow vlan+srcmac with XDP") Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Link: https://patch.msgid.link/20260226080306.98766-2-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-03-03 10:47:37 +01:00
Arthur Kiyanovski	1939d9816d	MAINTAINERS: ena: update AMAZON ETHERNET maintainers Remove Shay Agroskin and Saeed Bishara. Promote David Arinzon to maintainer. Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Link: https://patch.msgid.link/20260301191652.5916-1-akiyano@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 19:03:34 -08:00
Simon Baatz	3f10543c5b	selftests/net: packetdrill: restore tcp_rcv_big_endseq.pkt Commit `1cc93c48b5` ("selftests/net: packetdrill: remove tests for tcp_rcv_*big") removed the test for the reverted commit `1d2fbaad7c` ("tcp: stronger sk_rcvbuf checks") but also the one for commit `9ca48d616e` ("tcp: do not accept packets beyond window"). Restore the test with the necessary adaptation: expect a delayed ACK instead of an immediate one, since tcp_can_ingest() does not fail anymore for the last data packet. Signed-off-by: Simon Baatz <gmbnomis@gmail.com> Link: https://patch.msgid.link/20260301-tcp_rcv_big_endseq-v1-1-86ab7415ab58@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:47:46 -08:00
Mieczyslaw Nalewaj	7cbe98f7be	net: dsa: realtek: rtl8365mb: fix rtl8365mb_phy_ocp_write return value Function rtl8365mb_phy_ocp_write() always returns 0, even when an error occurs during register access. This patch fixes the return value to propagate the actual error code from regmap operations. Link: https://lore.kernel.org/netdev/a2dfde3c-d46f-434b-9d16-1e251e449068@yahoo.com/ Fixes: `2796728460` ("net: dsa: realtek: rtl8365mb: serialize indirect PHY register access") Signed-off-by: Mieczyslaw Nalewaj <namiltd@yahoo.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Reviewed-by: Linus Walleij <linusw@kernel.org> Link: https://patch.msgid.link/20260301-realtek_namiltd_fix1-v1-1-43a6bb707f9c@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-03-02 18:32:40 -08:00
Raju Rangoju	9439a661c2	amd-xgbe: fix MAC_TCR_SS register width for 2.5G and 10M speeds Extend the MAC_TCR_SS (Speed Select) register field width from 2 bits to 3 bits to properly support all speed settings. The MAC_TCR register's SS field encoding requires 3 bits to represent all supported speeds: - 0x00: 10Gbps (XGMII) - 0x02: 2.5Gbps (GMII) / 100Mbps - 0x03: 1Gbps / 10Mbps - 0x06: 2.5Gbps (XGMII) - P100a only With only 2 bits, values 0x04-0x07 cannot be represented, which breaks 2.5G XGMII mode on newer platforms and causes incorrect speed select values to be programmed. Fixes: `07445f3c7c` ("amd-xgbe: Add support for 10 Mbps speed") Co-developed-by: Guruvendra Punugupati <Guruvendra.Punugupati@amd.com> Signed-off-by: Guruvendra Punugupati <Guruvendra.Punugupati@amd.com> Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Link: https://patch.msgid.link/20260226170753.250312-1-Raju.Rangoju@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 14:22:34 -08:00
MD Danish Anwar	147792c395	net: ti: icssg-prueth: Fix ping failure after offload mode setup when link speed is not 1G When both eth interfaces with links up are added to a bridge or hsr interface, ping fails if the link speed is not 1Gbps (e.g., 100Mbps). The issue is seen because when switching to offload (bridge/hsr) mode, prueth_emac_restart() restarts the firmware and clears DRAM with memset_io(), setting all memory to 0. This includes PORT_LINK_SPEED_OFFSET which firmware reads for link speed. The value 0 corresponds to FW_LINK_SPEED_1G (0x00), so for 1Gbps links the default value is correct and ping works. For 100Mbps links, the firmware needs FW_LINK_SPEED_100M (0x01) but gets 0 instead, causing ping to fail. The function emac_adjust_link() is called to reconfigure, but it detects no state change (emac->link is still 1, speed/duplex match PHY) so new_state remains false and icssg_config_set_speed() is never called to correct the firmware speed value. The fix resets emac->link to 0 before calling emac_adjust_link() in prueth_emac_common_start(). This forces new_state=true, ensuring icssg_config_set_speed() is called to write the correct speed value to firmware memory. Fixes: `06feac1540` ("net: ti: icssg-prueth: Fix emac link speed handling") Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Link: https://patch.msgid.link/20260226102356.2141871-1-danishanwar@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 13:41:35 -08:00
Jiayuan Chen	101bacb303	atm: lec: fix null-ptr-deref in lec_arp_clear_vccs syzkaller reported a null-ptr-deref in lec_arp_clear_vccs(). This issue can be easily reproduced using the syzkaller reproducer. In the ATM LANE (LAN Emulation) module, the same atm_vcc can be shared by multiple lec_arp_table entries (e.g., via entry->vcc or entry->recv_vcc). When the underlying VCC is closed, lec_vcc_close() iterates over all ARP entries and calls lec_arp_clear_vccs() for each matched entry. For example, when lec_vcc_close() iterates through the hlists in priv->lec_arp_empty_ones or other ARP tables: 1. In the first iteration, for the first matched ARP entry sharing the VCC, lec_arp_clear_vccs() frees the associated vpriv (which is vcc->user_back) and sets vcc->user_back to NULL. 2. In the second iteration, for the next matched ARP entry sharing the same VCC, lec_arp_clear_vccs() is called again. It obtains a NULL vpriv from vcc->user_back (via LEC_VCC_PRIV(vcc)) and then attempts to dereference it via `vcc->pop = vpriv->old_pop`, leading to a null-ptr-deref crash. Fix this by adding a null check for vpriv before dereferencing it. If vpriv is already NULL, it means the VCC has been cleared by a previous call, so we can safely skip the cleanup and just clear the entry's vcc/recv_vcc pointers. The entire cleanup block (including vcc_release_async()) is placed inside the vpriv guard because a NULL vpriv indicates the VCC has already been fully released by a prior iteration — repeating the teardown would redundantly set flags and trigger callbacks on an already-closing socket. The Fixes tag points to the initial commit because the entry->vcc path has been vulnerable since the original code. The entry->recv_vcc path was later added by commit `8d9f73c0ad` ("atm: fix a memory leak of vcc->user_back") with the same pattern, and both paths are fixed here. Reported-by: syzbot+72e3ea390c305de0e259@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68c95a83.050a0220.3c6139.0e5c.GAE@google.com/T/ Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Link: https://patch.msgid.link/20260225123250.189289-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 09:33:26 -08:00
Guenter Roeck	74badb9c20	dpaa2-switch: Fix interrupt storm after receiving bad if_id in IRQ handler Commit `31a7a0bbeb` ("dpaa2-switch: add bounds check for if_id in IRQ handler") introduces a range check for if_id to avoid an out-of-bounds access. If an out-of-bounds if_id is detected, the interrupt status is not cleared. This may result in an interrupt storm. Clear the interrupt status after detecting an out-of-bounds if_id to avoid the problem. Found by an experimental AI code review agent at Google. Fixes: `31a7a0bbeb` ("dpaa2-switch: add bounds check for if_id in IRQ handler") Cc: Junrui Luo <moonafterrain@outlook.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Link: https://patch.msgid.link/20260227055812.1777915-1-linux@roeck-us.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 09:01:41 -08:00
Jakub Kicinski	0eb5965b29	Merge branch 'xsk-fixes-for-af_xdp-fragment-handling' Nikhil P. Rao says: ==================== xsk: Fixes for AF_XDP fragment handling This series fixes two issues in AF_XDP zero-copy fragment handling: Patch 1 fixes a buffer leak caused by incorrect list node handling after commit `b692bf9a75`. The list_node field is now reused for both the xskb pool list and the buffer free list. Using list_del() instead of list_del_init() causes list_empty() checks in xp_free() to fail, preventing buffers from being added to the free list. Patch 2 fixes partial packet delivery to userspace. In the zero-copy path, if the Rx queue fills up while enqueuing fragments, the remaining fragments are dropped, causing the application to receive incomplete packets. The fix ensures the Rx queue has sufficient space for all fragments before starting to enqueue them. [1] https://lore.kernel.org/oe-kbuild-all/202602051720.YfZO23pZ-lkp@intel.com/ [2] https://lore.kernel.org/oe-kbuild-all/202602172046.vf9DtpdF-lkp@intel.com/ ==================== Link: https://patch.msgid.link/20260225000456.107806-1-nikhil.rao@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:55:15 -08:00
Nikhil P. Rao	f7387d6579	xsk: Fix zero-copy AF_XDP fragment drop AF_XDP should ensure that only a complete packet is sent to application. In the zero-copy case, if the Rx queue gets full as fragments are being enqueued, the remaining fragments are dropped. For the multi-buffer case, add a check to ensure that the Rx queue has enough space for all fragments of a packet before starting to enqueue them. Fixes: `24ea50127e` ("xsk: support mbuf on ZC RX") Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com> Link: https://patch.msgid.link/20260225000456.107806-3-nikhil.rao@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:55:11 -08:00
Nikhil P. Rao	60abb0ac11	xsk: Fix fragment node deletion to prevent buffer leak After commit `b692bf9a75` ("xsk: Get rid of xdp_buff_xsk::xskb_list_node"), the list_node field is reused for both the xskb pool list and the buffer free list, this causes a buffer leak as described below. xp_free() checks if a buffer is already on the free list using list_empty(&xskb->list_node). When list_del() is used to remove a node from the xskb pool list, it doesn't reinitialize the node pointers. This means list_empty() will return false even after the node has been removed, causing xp_free() to incorrectly skip adding the buffer to the free list. Fix this by using list_del_init() instead of list_del() in all fragment handling paths, this ensures the list node is reinitialized after removal, allowing the list_empty() to work correctly. Fixes: `b692bf9a75` ("xsk: Get rid of xdp_buff_xsk::xskb_list_node") Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com> Link: https://patch.msgid.link/20260225000456.107806-2-nikhil.rao@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:55:11 -08:00
Jakub Kicinski	6df0022b6c	Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2026-02-19 (idpf, ice, i40e, ixgbevf, e1000e) For idpf: Li Li moves the check for software marker to occur after incrementing next to clean to avoid re-encountering the same packet. He also adds a couple of checks to prevent NULL pointer dereferences and NULLs rss_key, after free, in error path so that later checks are properly evaluated. Brian Vazquez adjusts IRQ naming to have correlation with netdev naming. Sreedevi removes validation of action type as part of ntuple rule deletion. For ice: Aaron Ma breaks RDMA initialization into two steps and adjusts calls so that VSIs are entirely configured before plugging. Michal Schmidt fixes initialization of loopback VSI to have proper resources allocated to allow for loopback testing to occur. For i40e: Thomas Gleixner fixes a leak of preempt count by replacing get_cpu() with smp_processor_id(). For ixgbevf: Jedrzej adds a check for mailbox version before attempting to call an associated link state call that is supported in that mailbox version. For e1000e: Vitaly clears power gating feature for Panther Lake systems to avoid packet issues. * '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: e1000e: clear DPG_EN after reset to avoid autonomous power-gating e1000e: introduce new board type for Panther Lake PCH ixgbevf: fix link setup issue i40e: Fix preempt count leak in napi poll tracepoint ice: fix crash in ethtool offline loopback test ice: recap the VSI and QoS info after rebuild idpf: Fix flow rule delete failure due to invalid validation idpf: change IRQ naming to match netdev and ethtool queue numbering idpf: nullify pointers after they are freed idpf: skip deallocating txq group's txqs if it is NULL idpf: skip deallocating bufq_sets from rx_qgrp if it is NULL idpf: increment completion queue next_to_clean in sw marker wait routine ==================== Link: https://patch.msgid.link/20260225211546.1949260-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 08:43:56 -08:00
Jakub Kicinski	1cc93c48b5	selftests/net: packetdrill: remove tests for tcp_rcv_*big Since commit `1d2fbaad7c` ("tcp: stronger sk_rcvbuf checks") has been reverted we need to remove the corresponding tests. Link: https://lore.kernel.org/20260227003359.2391017-1-kuba@kernel.org Link: https://patch.msgid.link/20260227033446.2596457-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 07:55:52 -08:00
Jakub Kicinski	026dfef287	tcp: give up on stronger sk_rcvbuf checks (for now) We hit another corner case which leads to TcpExtTCPRcvQDrop Connections which send RPCs in the 20-80kB range over loopback experience spurious drops. The exact conditions for most of the drops I investigated are that: - socket exchanged >1MB of data so its not completely fresh - rcvbuf is around 128kB (default, hasn't grown) - there is ~60kB of data in rcvq - skb > 64kB arrives The sum of skb->len (!) of both of the skbs (the one already in rcvq and the arriving one) is larger than rwnd. My suspicion is that this happens because __tcp_select_window() rounds the rwnd up to (1 << wscale) if less than half of the rwnd has been consumed. Eric suggests that given the number of Fixes we already have pointing to `1d2fbaad7c` it's probably time to give up on it, until a bigger revamp of rmem management. Also while we could risk tweaking the rwnd math, there are other drops on workloads I investigated, after the commit in question, not explained by this phenomenon. Suggested-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/20260225122355.585fd57b@kernel.org Fixes: `1d2fbaad7c` ("tcp: stronger sk_rcvbuf checks") Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260227003359.2391017-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 07:55:39 -08:00
Kuniyuki Iwashima	6996a2d2d0	udp: Unhash auto-bound connected sk from 4-tuple hash table when disconnected. Let's say we bind() an UDP socket to the wildcard address with a non-zero port, connect() it to an address, and disconnect it from the address. bind() sets SOCK_BINDPORT_LOCK on sk->sk_userlocks (but not SOCK_BINDADDR_LOCK), and connect() calls udp_lib_hash4() to put the socket into the 4-tuple hash table. Then, __udp_disconnect() calls sk->sk_prot->rehash(sk). It computes a new hash based on the wildcard address and moves the socket to a new slot in the 4-tuple hash table, leaving a garbage in the chain that no packet hits. Let's remove such a socket from 4-tuple hash table when disconnected. Note that udp_sk(sk)->udp_portaddr_hash needs to be udpated after udp_hash4_dec(hslot2) in udp_unhash4(). Fixes: `78c91ae2c6` ("ipv4/udp: Add 4-tuple hash for connected socket") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260227035547.3321327-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-28 07:46:24 -08:00
Long Li	dabffd0854	net: mana: Ring doorbell at 4 CQ wraparounds MANA hardware requires at least one doorbell ring every 8 wraparounds of the CQ. The driver rings the doorbell as a form of flow control to inform hardware that CQEs have been consumed. The NAPI poll functions mana_poll_tx_cq() and mana_poll_rx_cq() can poll up to CQE_POLLING_BUFFER (512) completions per call. If the CQ has fewer than 512 entries, a single poll call can process more than 4 wraparounds without ringing the doorbell. The doorbell threshold check also uses ">" instead of ">=", delaying the ring by one extra CQE beyond 4 wraparounds. Combined, these issues can cause the driver to exceed the 8-wraparound hardware limit, leading to missed completions and stalled queues. Fix this by capping the number of CQEs polled per call to 4 wraparounds of the CQ in both TX and RX paths. Also change the doorbell threshold from ">" to ">=" so the doorbell is rung as soon as 4 wraparounds are reached. Cc: stable@vger.kernel.org Fixes: `58a63729c9` ("net: mana: Fix doorbell out of order violation and avoid unnecessary doorbell rings") Signed-off-by: Long Li <longli@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20260226192833.1050807-1-longli@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:29:38 -08:00
Valentin Spreckels	15fba71533	net: usb: r8152: add TRENDnet TUC-ET2G The TRENDnet TUC-ET2G is a RTL8156 based usb ethernet adapter. Add its vendor and product IDs. Signed-off-by: Valentin Spreckels <valentin@spreckels.dev> Link: https://patch.msgid.link/20260226195409.7891-2-valentin@spreckels.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:27:31 -08:00
Victor Nogueira	b14e82abf7	selftests/tc-testing: Create tests to exercise act_ct binding restrictions Add 4 test cases to exercise new act_ct binding restrictions: - Try to attach act_ct to an ets qdisc - Attach act_ct to an ingress qdisc - Attach act_ct to a clsact/egress qdisc - Attach act_ct to a shared block Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260225134349.1287037-2-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:06:21 -08:00
Victor Nogueira	11cb63b0d1	net/sched: Only allow act_ct to bind to clsact/ingress qdiscs and shared blocks As Paolo said earlier [1]: "Since the blamed commit below, classify can return TC_ACT_CONSUMED while the current skb being held by the defragmentation engine. As reported by GangMin Kim, if such packet is that may cause a UaF when the defrag engine later on tries to tuch again such packet." act_ct was never meant to be used in the egress path, however some users are attaching it to egress today [2]. Attempting to reach a middle ground, we noticed that, while most qdiscs are not handling TC_ACT_CONSUMED, clsact/ingress qdiscs are. With that in mind, we address the issue by only allowing act_ct to bind to clsact/ingress qdiscs and shared blocks. That way it's still possible to attach act_ct to egress (albeit only with clsact). [1] https://lore.kernel.org/netdev/674b8cbfc385c6f37fb29a1de08d8fe5c2b0fbee.1771321118.git.pabeni@redhat.com/ [2] https://lore.kernel.org/netdev/cc6bfb4a-4a2b-42d8-b9ce-7ef6644fb22b@ovn.org/ Reported-by: GangMin Kim <km.kim1503@gmail.com> Fixes: `3f14b377d0` ("net/sched: act_ct: fix skb leak and crash on ooo frags") CC: stable@vger.kernel.org Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260225134349.1287037-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 19:06:21 -08:00
Florian Westphal	ba14798653	selftests: netfilter: nft_queue.sh: avoid flakes on debug kernels Jakub reports test flakes on debug kernels: FAIL: test_udp_gro_ct: Expected software segmentation to occur, had 23 and 17 This test assumes that the kernels nfnetlink_queue module sees N GSO packets, segments them into M skbs and queues them to userspace for reinjection. Hence, if M >= N, no segmentation occurred. However, its possible that this happens: - nfnetlink_queue gets GSO packet - segments that into n skbs - userspace buffer is full, kernel drops the segmented skbs -> "toqueue" counter incremented by 1, "fromqueue" is unchanged. If this happens often enough in a single run, M >= N check triggers incorrectly. To solve this, allow the nf_queue.c test program to set the FAIL_OPEN flag so that the segmented skbs bypass the queueing step in the kernel if the receive buffer is full. Also, reduce number of sending socat instances, decrease their priority and increase nice value for the nf_queue program itself to reduce the probability of overruns happening in the first place. Fixes: `59ecffa399` ("selftests: netfilter: nft_queue.sh: add udp fraglist gro test case") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20260218184114.0b405b72@kernel.org/ Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260226161920.1205-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 18:36:59 -08:00
Jakub Kicinski	71347b9d8c	Merge branch 'net-sched-sch_cake-fixes-for-cake_mq' Jonas Köppeler says: ==================== net/sched: sch_cake: fixes for cake_mq This patch contains two fixes for cake_mq: - do not sync when bandwidth is unlimited - adjust the rates for all tins during sync ==================== Link: https://patch.msgid.link/20260226-cake-mq-skip-sync-bandwidth-unlimited-v1-0-01830bb4db87@tu-berlin.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 18:35:42 -08:00
Jonas Köppeler	15c2715a52	net/sched: sch_cake: fixup cake_mq rate adjustment for diffserv config cake_mq's rate adjustment during the sync periods did not adjust the rates for every tin in a diffserv config. This lead to inconsistencies of rates between the tins. Fix this by setting the rates for all tins during synchronization. Fixes: `1bddd758ba` ("net/sched: sch_cake: share shaper state across sub-instances of cake_mq") Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://patch.msgid.link/20260226-cake-mq-skip-sync-bandwidth-unlimited-v1-2-01830bb4db87@tu-berlin.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 18:35:40 -08:00
Jonas Köppeler	0b3cd139be	net/sched: sch_cake: avoid sync overhead when unlimited Skip inter-instance sync when no rate limit is configured, as it serves no purpose and only adds overhead. Fixes: `1bddd758ba` ("net/sched: sch_cake: share shaper state across sub-instances of cake_mq") Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://patch.msgid.link/20260226-cake-mq-skip-sync-bandwidth-unlimited-v1-1-01830bb4db87@tu-berlin.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 18:35:40 -08:00
Eric Dumazet	29252397bc	inet: annotate data-races around isk->inet_num UDP/TCP lookups are using RCU, thus isk->inet_num accesses should use READ_ONCE() and WRITE_ONCE() where needed. Fixes: `3ab5aee7fe` ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260225203545.1512417-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 17:16:59 -08:00
Paul Moses	62413a9c3c	net/sched: act_gate: snapshot parameters with RCU on replace The gate action can be replaced while the hrtimer callback or dump path is walking the schedule list. Convert the parameters to an RCU-protected snapshot and swap updates under tcf_lock, freeing the previous snapshot via call_rcu(). When REPLACE omits the entry list, preserve the existing schedule so the effective state is unchanged. Fixes: `a51c328df3` ("net: qos: introduce a gate control flow action") Cc: stable@vger.kernel.org Signed-off-by: Paul Moses <p@1g4.org> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260223150512.2251594-2-p@1g4.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-27 16:10:36 -08:00
Chintan Vankar	be11a53722	net: ethernet: ti: am65-cpsw-nuss/cpsw-ale: Fix multicast entry handling in ALE table In the current implementation, flushing multicast entries in MAC mode incorrectly deletes entries for all ports instead of only the target port, disrupting multicast traffic on other ports. The cause is adding multicast entries by setting only host port bit, and not setting the MAC port bits. Fix this by setting the MAC port's bit in the port mask while adding the multicast entry. Also fix the flush logic to preserve the host port bit during removal of MAC port and free ALE entries when mask contains only host port. Fixes: `5c50a856d5` ("drivers: net: ethernet: cpsw: add multicast address to ALE table") Signed-off-by: Chintan Vankar <c-vankar@ti.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260224181359.2055322-1-c-vankar@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:43:54 -08:00
Jakub Kicinski	7e5b450c49	Merge branch 'bridge-check-relevant-options-in-vlan-range-grouping' Danielle Ratson says: ==================== bridge: Check relevant options in VLAN range grouping The br_vlan_opts_eq_range() function determines if consecutive VLANs can be grouped together in a range for compact netlink notifications. It currently checks state, tunnel info, and multicast router configuration, but misses two categories of per-VLAN options that affect the output: 1. User-visible priv_flags (neigh_suppress, mcast_enabled) 2. Port multicast context options (mcast_max_groups, mcast_n_groups) When VLANs have different settings for these options, they are incorrectly grouped into ranges, causing netlink notifications to report only one VLAN's settings for the entire range. Fix by checking priv_flags equality, but only for flags that affect netlink output (BR_VLFLAG_NEIGH_SUPPRESS_ENABLED and BR_VLFLAG_MCAST_ENABLED), and comparing multicast context options (mcast_max_groups, mcast_n_groups). Add a test with four test cases for each option, to ensure that VLANs with different values are not grouped into ranges and VLANs with matching values are properly grouped together. ==================== Link: https://patch.msgid.link/20260225143956.3995415-1-danieller@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:24:33 -08:00
Danielle Ratson	13540021be	selftests: net: Add bridge VLAN range grouping tests Add a new test file bridge_vlan_dump.sh with four test cases that verify VLANs with different per-VLAN options are not incorrectly grouped into ranges in the dump output. The tests verify the kernel's br_vlan_opts_eq_range() function correctly prevents VLAN range grouping when neigh_suppress, mcast_max_groups, mcast_n_groups, or mcast_enabled options differ. Each test verifies that VLANs with different option values appear as individual entries rather than ranges, and that VLANs with matching values are properly grouped together. Example output: $ ./bridge_vlan_dump.sh TEST: VLAN range grouping with neigh_suppress [ OK ] TEST: VLAN range grouping with mcast_max_groups [ OK ] TEST: VLAN range grouping with mcast_n_groups [ OK ] TEST: VLAN range grouping with mcast_enabled [ OK ] Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20260225143956.3995415-3-danieller@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:24:29 -08:00
Danielle Ratson	93c9475c04	bridge: Check relevant per-VLAN options in VLAN range grouping The br_vlan_opts_eq_range() function determines if consecutive VLANs can be grouped together in a range for compact netlink notifications. It currently checks state, tunnel info, and multicast router configuration, but misses two categories of per-VLAN options that affect the output: 1. User-visible priv_flags (neigh_suppress, mcast_enabled) 2. Port multicast context (mcast_max_groups, mcast_n_groups) When VLANs have different settings for these options, they are incorrectly grouped into ranges, causing netlink notifications to report only one VLAN's settings for the entire range. Fix by checking priv_flags equality, but only for flags that affect netlink output (BR_VLFLAG_NEIGH_SUPPRESS_ENABLED and BR_VLFLAG_MCAST_ENABLED), and comparing multicast context (mcast_max_groups and mcast_n_groups). Example showing the bugs before the fix: $ bridge vlan set vid 10 dev dummy1 neigh_suppress on $ bridge vlan set vid 11 dev dummy1 neigh_suppress off $ bridge -d vlan show dev dummy1 port vlan-id dummy1 10-11 ... neigh_suppress on $ bridge vlan set vid 10 dev dummy1 mcast_max_groups 100 $ bridge vlan set vid 11 dev dummy1 mcast_max_groups 200 $ bridge -d vlan show dev dummy1 port vlan-id dummy1 10-11 ... mcast_max_groups 100 After the fix, VLANs 10 and 11 are shown as separate entries with their correct individual settings. Fixes: `a1aee20d5d` ("net: bridge: Add netlink knobs for number / maximum MDB entries") Fixes: `83f6d60079` ("bridge: vlan: Allow setting VLAN neighbor suppression state") Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260225143956.3995415-2-danieller@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:24:29 -08:00
Eric Dumazet	2ef2b20cf4	net: annotate data-races around sk->sk_{data_ready,write_space} skmsg (and probably other layers) are changing these pointers while other cpus might read them concurrently. Add corresponding READ_ONCE()/WRITE_ONCE() annotations for UDP, TCP and AF_UNIX. Fixes: `604326b41a` ("bpf, sockmap: convert to generic sk_msg interface") Reported-by: syzbot+87f770387a9e5dc6b79b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/699ee9fc.050a0220.1cd54b.0009.GAE@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260225131547.1085509-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:23:03 -08:00
Jakub Kicinski	754a3d081a	Here is a batman-adv bugfix: - Avoid double-rtnl_lock ELP metric worker, by Sven Eckelmann -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmmes84WHHN3QHNpbW9u d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoSxDD/wI/ssEvqmay/4okfp6Fk/+hjLi 2BvCLwKei8JKqsnNvUSW7I+inrp0AilwfUuMqQlIiOdz6zJ6O4s4SXdiwl8TH49p uVp4dSwoOPHzBKaPH+dU15fcLD4yBqRYnl6gyxem7hWtsDU04fn96se7lagUdJc/ 35LZ2ni9cRmxgmvcLECNGOj4Tm7TxbcG0wkifS/rIO7gd05rXb7c7T1lCGRPeBf4 2i4RVQXwSEVhff1ig7yU/1gs2FUzIKnrlKHayyfYkynEI37Ggc4IBiqLkdyBuxJ4 Z+qlCfumrtdrt79kirzezrcWEzQEj5Yn3fnXj0X27QYy5FJVKnLczHnGuLUSUqzl QgwvQ87tNwEmz50ODsq+TFY9GuowWJ5yLTMFb18u/5hJrAGvux5wU+mIbloTOpBg M/kMv8kZIMNzVEirxbD08Ygx9Fsxu3UWGptDAunlv1GkHBj7XqA2Jkoq77eDfxx+ lIa0tu1s/y1eTb5tA9JXUn0BsoNrafDIY5zrjz+lDKYpmmeNUgiTbQBGuVCZ+t2o EWLYPxdV84QpwuoaXZ/ZkD0YVAx/sfDLptxaBGViWbThLGVYxYSELePO94Mkr6Os Fa/8gEg0Z+jNUZ3UfVVnjyjPaa5/BM2vtbwSQFgv1udJGLoa/AkWIcOEgkeZAzWc B5cubcmbSHx4mBCEuA== =Ng/2 -----END PGP SIGNATURE----- Merge tag 'batadv-net-pullrequest-20260225' of https://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here is a batman-adv bugfix: - Avoid double-rtnl_lock ELP metric worker, by Sven Eckelmann * tag 'batadv-net-pullrequest-20260225' of https://git.open-mesh.org/linux-merge: batman-adv: Avoid double-rtnl_lock ELP metric worker ==================== Link: https://patch.msgid.link/20260225084614.229077-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 19:15:09 -08:00
Davide Caratti	e35626f610	net/sched: ets: fix divide by zero in the offload path Offloading ETS requires computing each class' WRR weight: this is done by averaging over the sums of quanta as 'q_sum' and 'q_psum'. Using unsigned int, the same integer size as the individual DRR quanta, can overflow and even cause division by zero, like it happened in the following splat: Oops: divide error: 0000 [#1] SMP PTI CPU: 13 UID: 0 PID: 487 Comm: tc Tainted: G E 6.19.0-virtme #45 PREEMPT(full) Tainted: [E]=UNSIGNED_MODULE Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 RIP: 0010:ets_offload_change+0x11f/0x290 [sch_ets] Code: e4 45 31 ff eb 03 41 89 c7 41 89 cb 89 ce 83 f9 0f 0f 87 b7 00 00 00 45 8b 08 31 c0 45 01 cc 45 85 c9 74 09 41 6b c4 64 31 d2 <41> f7 f2 89 c2 44 29 fa 45 89 df 41 83 fb 0f 0f 87 c7 00 00 00 44 RSP: 0018:ffffd0a180d77588 EFLAGS: 00010246 RAX: 00000000ffffff38 RBX: ffff8d3d482ca000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffd0a180d77660 RBP: ffffd0a180d77690 R08: ffff8d3d482ca2d8 R09: 00000000fffffffe R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffe R13: ffff8d3d472f2000 R14: 0000000000000003 R15: 0000000000000000 FS: 00007f440b6c2740(0000) GS:ffff8d3dc9803000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000003cdd2000 CR3: 0000000007b58002 CR4: 0000000000172ef0 Call Trace: <TASK> ets_qdisc_change+0x870/0xf40 [sch_ets] qdisc_create+0x12b/0x540 tc_modify_qdisc+0x6d7/0xbd0 rtnetlink_rcv_msg+0x168/0x6b0 netlink_rcv_skb+0x5c/0x110 netlink_unicast+0x1d6/0x2b0 netlink_sendmsg+0x22e/0x470 ____sys_sendmsg+0x38a/0x3c0 ___sys_sendmsg+0x99/0xe0 __sys_sendmsg+0x8a/0xf0 do_syscall_64+0x111/0xf80 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f440b81c77e Code: 4d 89 d8 e8 d4 bc 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa RSP: 002b:00007fff951e4c10 EFLAGS: 00000202 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000481820 RCX: 00007f440b81c77e RDX: 0000000000000000 RSI: 00007fff951e4cd0 RDI: 0000000000000003 RBP: 00007fff951e4c20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff951f4fa8 R13: 00000000699ddede R14: 00007f440bb01000 R15: 0000000000486980 </TASK> Modules linked in: sch_ets(E) netdevsim(E) ---[ end trace 0000000000000000 ]--- RIP: 0010:ets_offload_change+0x11f/0x290 [sch_ets] Code: e4 45 31 ff eb 03 41 89 c7 41 89 cb 89 ce 83 f9 0f 0f 87 b7 00 00 00 45 8b 08 31 c0 45 01 cc 45 85 c9 74 09 41 6b c4 64 31 d2 <41> f7 f2 89 c2 44 29 fa 45 89 df 41 83 fb 0f 0f 87 c7 00 00 00 44 RSP: 0018:ffffd0a180d77588 EFLAGS: 00010246 RAX: 00000000ffffff38 RBX: ffff8d3d482ca000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffd0a180d77660 RBP: ffffd0a180d77690 R08: ffff8d3d482ca2d8 R09: 00000000fffffffe R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffe R13: ffff8d3d472f2000 R14: 0000000000000003 R15: 0000000000000000 FS: 00007f440b6c2740(0000) GS:ffff8d3dc9803000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000003cdd2000 CR3: 0000000007b58002 CR4: 0000000000172ef0 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x30000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]--- Fix this using 64-bit integers for 'q_sum' and 'q_psum'. Cc: stable@vger.kernel.org Fixes: `d35eb52bd2` ("net: sch_ets: Make the ETS qdisc offloadable") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/28504887df314588c7255e9911769c36f751edee.1771964872.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-26 18:28:47 -08:00

1 2 3 4 5 ...

1427066 commits