linux

mirror of https://github.com/torvalds/linux.git synced 2026-03-13 22:36:17 +01:00

Author	SHA1	Message	Date
Kuniyuki Iwashima	68889dfd54	mptcp: Fix up subflow's memcg when CONFIG_SOCK_CGROUP_DATA=n. When sk_alloc() allocates a socket, mem_cgroup_sk_alloc() sets sk->sk_memcg based on the current task. MPTCP subflow socket creation is triggered from userspace or an in-kernel worker. In the latter case, sk->sk_memcg is not what we want. So, we fix it up from the parent socket's sk->sk_memcg in mptcp_attach_cgroup(). Although the code is placed under #ifdef CONFIG_MEMCG, it is buried under #ifdef CONFIG_SOCK_CGROUP_DATA. The two configs are orthogonal. If CONFIG_MEMCG is enabled without CONFIG_SOCK_CGROUP_DATA, the subflow's memory usage is not charged correctly. Let's move the code out of the wrong ifdef guard. Note that sk->sk_memcg is freed in sk_prot_free() and the parent sk holds the refcnt of memcg->css here, so we don't need to use css_tryget(). Fixes: `3764b0c565` ("mptcp: attach subflow socket to parent cgroup") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://patch.msgid.link/20250815201712.1745332-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 19:20:58 -07:00
Jakub Kicinski	5c69e0b395	Merge branch 'stmmac-stop-silently-dropping-bad-checksum-packets' Oleksij Rempel says: ==================== stmmac: stop silently dropping bad checksum packets this series reworks how stmmac handles receive checksum offload (CoE) errors on dwmac4. At present, when CoE is enabled, the hardware silently discards any frame that fails checksum validation. These packets never reach the driver and are not accounted in the generic drop statistics. They are only visible in the stmmac-specific counters as "payload error" or "header error" packets, which makes it harder to debug or monitor network issues. Following discussion [1], the driver is reworked to propagate checksum error information up to the stack. With these changes, CoE stays enabled, but frames that fail hardware validation are no longer dropped in hardware. Instead, the driver marks them with CHECKSUM_NONE so the network stack can validate, drop, and properly account them in the standard drop statistics. [1] https://lore.kernel.org/all/20250625132117.1b3264e8@kernel.org/ ==================== Link: https://patch.msgid.link/20250818090217.2789521-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 18:33:09 -07:00
Oleksij Rempel	fe40427976	net: stmmac: dwmac4: stop hardware from dropping checksum-error packets Tell the MAC not to discard frames that fail TCP/IP checksum validation. By default, when the hardware checksum engine (CoE) is enabled, dwmac4 silently drops any packet where the offload engine detects a checksum error. These frames are not reported to the driver and are not counted in any statistics as dropped packets. Set the MTL_OP_MODE_DIS_TCP_EF bit when initializing the Rx channel so that all packets are delivered, even if they failed hardware checksum validation. CoE remains enabled, but instead of dropping such frames, the driver propagates the error status and marks the skb with CHECKSUM_NONE. This allows the stack to verify and drop the packet while updating statistics. This change follows the decision made in the discussion: Link: https://lore.kernel.org/all/20250625132117.1b3264e8@kernel.org/ It depends on the previous patches that added proper error propagation in the Rx path. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250818090217.2789521-4-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 18:33:06 -07:00
Oleksij Rempel	644b8437cc	net: stmmac: dwmac4: report Rx checksum errors in status Propagate hardware checksum failures from the descriptor parser to the caller. Currently, dwmac4_wrback_get_rx_status() updates stats when the Rx descriptor signals an IP header or payload checksum error, but it does not reflect this in its return value. The higher-level stmmac_rx() code therefore cannot tell that hardware checksum validation failed. Set the csum_none flag in the returned status when either RDES1_IP_HDR_ERROR or RDES1_IP_PAYLOAD_ERROR is present. This aligns dwmac4 with enh_desc_coe_rdes0() and lets stmmac_rx() mark the skb as CHECKSUM_NONE for software verification. This is a preparatory step for disabling the hardware filter that drops frames which do not pass checksum validation. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250818090217.2789521-3-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 18:33:05 -07:00
Oleksij Rempel	ee0aace5f8	net: stmmac: Correctly handle Rx checksum offload errors The stmmac_rx function would previously set skb->ip_summed to CHECKSUM_UNNECESSARY if hardware checksum offload (CoE) was enabled and the packet was of a known IP ethertype. However, this logic failed to check if the hardware had actually reported a checksum error. The hardware status, indicating a header or payload checksum failure, was being ignored at this stage. This could cause corrupt packets to be passed up the network stack as valid. This patch corrects the logic by checking the `csum_none` status flag, which is set when the hardware reports a checksum error. If this flag is set, skb->ip_summed is now correctly set to CHECKSUM_NONE, ensuring the kernel's network stack will perform its own validation and properly handle the corrupt packet. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250818090217.2789521-2-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 18:33:05 -07:00
Jakub Kicinski	8beead2d15	Merge branch 'there-are-a-cleancode-and-a-parameter-check-for-hns3-driver' Jijie Shao says: ==================== There are a cleancode and a parameter check for hns3 driver This patchset includes: 1. a parameter check omitted from fix code in net branch https://lore.kernel.org/all/20250723072900.GV2459@horms.kernel.org/ 2. a small clean code ==================== Link: https://patch.msgid.link/20250815100414.949752-1-shaojijie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 18:11:19 -07:00
Jijie Shao	021f989c86	net: hns3: change the function return type from int to bool hclge_only_alloc_priv_buff() only return true or false, So, change the function return type from integer to boolean. Signed-off-by: Jijie Shao <shaojijie@huawei.com> Link: https://patch.msgid.link/20250815100414.949752-3-shaojijie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 18:11:15 -07:00
Jijie Shao	e16e973c57	net: hns3: add parameter check for tx_copybreak and tx_spare_buf_size Since the driver always enables tx bounce buffer, there are minimum values for `copybreak` and `tx_spare_buf_size`. This patch will check and reject configurations with values smaller than these minimums. Closes: https://lore.kernel.org/all/20250723072900.GV2459@horms.kernel.org/ Signed-off-by: Jijie Shao <shaojijie@huawei.com> Link: https://patch.msgid.link/20250815100414.949752-2-shaojijie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 18:11:15 -07:00
Markus Stockhausen	3a752e6780	net: phy: realtek: enable serdes option mode for RTL8226-CG The RTL8226-CG can make use of the serdes option mode feature to dynamically switch between SGMII and 2500base-X. From what is known the setup sequence is much simpler with no magic values. Convert the exiting config_init() into a helper that configures the PHY depending on generation 1 or 2. Call the helper from two separated new config_init() functions. Finally convert the phy_driver specs of the RTL8226-CG to make use of the new configuration and switch over to the extended read_status() function to dynamically change the interface according to the serdes mode. Remark! The logic could be simpler if the serdes mode could be set before all other generation 2 magic values. Due to missing RTL8221B test hardware the mmd command order was kept. Tested on Zyxel XGS1210-12. Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de> Link: https://patch.msgid.link/20250815082009.3678865-1-markus.stockhausen@gmx.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 18:09:52 -07:00
Miguel García	09bde6fdcd	ipv6: ip6_gre: replace strcpy with strscpy for tunnel name Replace the strcpy() call that copies the device name into tunnel->parms.name with strscpy(), to avoid potential overflow and guarantee NULL termination. This uses the two-argument form of strscpy(), where the destination size is inferred from the array type. Destination is tunnel->parms.name (size IFNAMSIZ). Tested in QEMU (Alpine rootfs): - Created IPv6 GRE tunnels over loopback - Assigned overlay IPv6 addresses - Verified bidirectional ping through the tunnel - Changed tunnel parameters at runtime (`ip -6 tunnel change`) Signed-off-by: Miguel García <miguelgarciaroman8@gmail.com> Link: https://patch.msgid.link/20250818220203.899338-1-miguelgarciaroman8@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 18:06:24 -07:00
Jakub Kicinski	9efd5152e3	Merge branch 'net-convert-to-skb_dstref_steal-and-skb_dstref_restore' Stanislav Fomichev says: ==================== net: Convert to skb_dstref_steal and skb_dstref_restore To diagnose and prevent issues similar to [0], emit warning (CONFIG_DEBUG_NET) from skb_dst_set and skb_dst_set_noref when overwriting non-null reference-counted entry. Two new helpers are added to handle special cases where the entry needs to be reset and restored: skb_dstref_steal/skb_dstref_restore. The bulk of the patches in the series converts manual _skb_refst manipulations to these new helpers. 0: https://lore.kernel.org/netdev/20250723224625.1340224-1-sdf@fomichev.me/T/#u ==================== Link: https://patch.msgid.link/20250818154032.3173645-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 17:54:47 -07:00
Stanislav Fomichev	a890348adc	net: Add skb_dst_check_unset To prevent dst_entry leaks, add warning when the non-NULL dst_entry is rewritten. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250818154032.3173645-8-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 17:54:44 -07:00
Stanislav Fomichev	3e31075a11	chtls: Convert to skb_dst_reset Going forward skb_dst_set will assert that skb dst_entry is empty during skb_dst_set. skb_dstref_steal is added to reset existing entry without doing refcnt. Chelsio driver is doing extra dst management via skb_dst_set(NULL). Replace these calls with skb_dstref_steal. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250818154032.3173645-7-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 17:54:41 -07:00
Stanislav Fomichev	da3b9d493b	staging: octeon: Convert to skb_dst_drop Instead of doing dst_release and skb_dst_set, do skb_dst_drop which should do the right thing. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250818154032.3173645-6-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 17:54:38 -07:00
Stanislav Fomichev	e97e6a1830	net: Switch to skb_dstref_steal/skb_dstref_restore for ip_route_input callers Going forward skb_dst_set will assert that skb dst_entry is empty during skb_dst_set. skb_dstref_steal is added to reset existing entry without doing refcnt. skb_dstref_restore should be used to restore the previous entry. Convert icmp_route_lookup and ip_options_rcv_srr to these helpers. Add extra call to skb_dstref_reset to icmp_route_lookup to clear the ip_route_input entry. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250818154032.3173645-5-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 17:54:35 -07:00
Stanislav Fomichev	15488d4d8d	netfilter: Switch to skb_dstref_steal to clear dst_entry Going forward skb_dst_set will assert that skb dst_entry is empty during skb_dst_set. skb_dstref_steal is added to reset existing entry without doing refcnt. Switch to skb_dstref_steal in ip[6]_route_me_harder and add a comment on why it's safe to skip skb_dstref_restore. Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250818154032.3173645-4-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 17:54:19 -07:00
Stanislav Fomichev	c829aab21e	xfrm: Switch to skb_dstref_steal to clear dst_entry Going forward skb_dst_set will assert that skb dst_entry is empty during skb_dst_set. skb_dstref_steal is added to reset existing entry without doing refcnt. Switch to skb_dstref_steal in __xfrm_route_forward and add a comment on why it's safe to skip skb_dstref_restore. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250818154032.3173645-3-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 17:54:17 -07:00
Stanislav Fomichev	c3f0c02997	net: Add skb_dstref_steal and skb_dstref_restore Going forward skb_dst_set will assert that skb dst_entry is empty during skb_dst_set to prevent potential leaks. There are few places that still manually manage dst_entry not using the helpers. Convert them to the following new helpers: - skb_dstref_steal that resets dst_entry and returns previous dst_entry value - skb_dstref_restore that restores dst_entry previously reset via skb_dstref_steal Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250818154032.3173645-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 17:54:13 -07:00
Jakub Kicinski	0e041220ea	Merge branch 'net-speedup-some-nexthop-handling-when-having-a-lot-of-nexthops' Christoph Paasch says: ==================== net: Speedup some nexthop handling when having A LOT of nexthops Configuring a very large number of nexthops is fairly possible within a reasonable time-frame. But, certain netlink commands can become extremely slow. This series addresses some of these, namely dumping and removing nexthops. v1: https://lore.kernel.org/20250724-nexthop_dump-v1-1-6b43fffd5bac@openai.com ==================== Link: https://patch.msgid.link/20250816-nexthop_dump-v2-0-491da3462118@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 17:50:36 -07:00
Christoph Paasch	b0ac6d3b56	net: When removing nexthops, don't call synchronize_net if it is not necessary When removing a nexthop, commit `90f33bffa3` ("nexthops: don't modify published nexthop groups") added a call to synchronize_rcu() (later changed to _net()) to make sure everyone sees the new nexthop-group before the rtnl-lock is released. When one wants to delete a large number of groups and nexthops, it is fastest to first flush the groups (ip nexthop flush groups) and then flush the nexthops themselves (ip -6 nexthop flush). As that way the groups don't need to be rebalanced. However, `ip -6 nexthop flush` will still take a long time if there is a very large number of nexthops because of the call to synchronize_net(). Now, if there are no more groups, there is no point in calling synchronize_net(). So, let's skip that entirely by checking if nh->grp_list is empty. This gives us a nice speedup: BEFORE: ======= $ time sudo ip -6 nexthop flush Dump was interrupted and may be inconsistent. Flushed 2097152 nexthops real 1m45.345s user 0m0.001s sys 0m0.005s $ time sudo ip -6 nexthop flush Dump was interrupted and may be inconsistent. Flushed 4194304 nexthops real 3m10.430s user 0m0.002s sys 0m0.004s AFTER: ====== $ time sudo ip -6 nexthop flush Dump was interrupted and may be inconsistent. Flushed 2097152 nexthops real 0m17.545s user 0m0.003s sys 0m0.003s $ time sudo ip -6 nexthop flush Dump was interrupted and may be inconsistent. Flushed 4194304 nexthops real 0m35.823s user 0m0.002s sys 0m0.004s Signed-off-by: Christoph Paasch <cpaasch@openai.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250816-nexthop_dump-v2-2-491da3462118@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 17:50:33 -07:00
Christoph Paasch	5236f57e7c	net: Make nexthop-dumps scale linearly with the number of nexthops When we have a (very) large number of nexthops, they do not fit within a single message. rtm_dump_walk_nexthops() thus will be called repeatedly and ctx->idx is used to avoid dumping the same nexthops again. The approach in which we avoid dumping the same nexthops is by basically walking the entire nexthop rb-tree from the left-most node until we find a node whose id is >= s_idx. That does not scale well. Instead of this inefficient approach, rather go directly through the tree to the nexthop that should be dumped (the one whose nh_id >= s_idx). This allows us to find the relevant node in O(log(n)). We have quite a nice improvement with this: Before: ======= --> ~1M nexthops: $ time ~/libnl/src/nl-nh-list \| wc -l 1050624 real 0m21.080s user 0m0.666s sys 0m20.384s --> ~2M nexthops: $ time ~/libnl/src/nl-nh-list \| wc -l 2101248 real 1m51.649s user 0m1.540s sys 1m49.908s After: ====== --> ~1M nexthops: $ time ~/libnl/src/nl-nh-list \| wc -l 1050624 real 0m1.157s user 0m0.926s sys 0m0.259s --> ~2M nexthops: $ time ~/libnl/src/nl-nh-list \| wc -l 2101248 real 0m2.763s user 0m2.042s sys 0m0.776s Signed-off-by: Christoph Paasch <cpaasch@openai.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250816-nexthop_dump-v2-1-491da3462118@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 17:50:33 -07:00
Jakub Kicinski	51992f99f0	selftests: drv-net: ncdevmem: make configure_channels() support combined channels ncdevmem tests that the kernel correctly rejects attempts to deactivate queues with MPs bound. Make the configure_channels() test support combined channels. Currently it tries to set the queue counts to rx N tx N-1, which only makes sense for devices which have IRQs per ring type. Most modern devices used combined IRQs/channels with both Rx and Tx queues. Since the math is total Rx == combined+Rx setting Rx when combined is non-zero will be increasing the total queue count, not decreasing as the test intends. Note that the test would previously also try to set the Tx ring count to Rx - 1, for some reason. Which would be 0 if the device has only 2 queues configured. With this change (device with 2 queues): setting channel count rx:1 tx:1 YNL set channels: Kernel error: 'requested channel counts are too low for existing memory provider setting (2)' Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250815231513.381652-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 17:49:35 -07:00
Jakub Kicinski	eddc821f98	selftests: drv-net: tso: increase the retransmit threshold We see quite a few flakes during the TSO test against virtualized devices in NIPA. There's often 10-30 retransmissions during the test. Sometimes as many as 100. Set the retransmission threshold at 1/4th of the wire frame target. Link: https://patch.msgid.link/20250815224100.363438-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-19 17:49:21 -07:00
Chaoyi Chen	da114122b8	net: ethernet: stmmac: dwmac-rk: Make the clk_phy could be used for external phy For external phy, clk_phy should be optional, and some external phy need the clock input from clk_phy. This patch adds support for setting clk_phy for external phy. Signed-off-by: David Wu <david.wu@rock-chips.com> Signed-off-by: Chaoyi Chen <chaoyi.chen@rock-chips.com> Link: https://patch.msgid.link/20250815023515.114-1-kernel@airkyi.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 15:57:28 +02:00
Jakub Kicinski	0283b8f134	selftests: drv-net: test the napi init state Test that threaded state (in the persistent NAPI config) gets updated even when NAPI with given ID is not allocated at the time. This test is validating commit `ccba9f6baa` ("net: update NAPI threaded config even for disabled NAPIs"). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20250815013314.2237512-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 15:46:04 +02:00
Dipayaan Roy	730ff06d3f	net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency. This patch enhances RX buffer handling in the mana driver by allocating pages from a page pool and slicing them into MTU-sized fragments, rather than dedicating a full page per packet. This approach is especially beneficial on systems with large base page sizes like 64KB. Key improvements: - Proper integration of page pool for RX buffer allocations. - MTU-sized buffer slicing to improve memory utilization. - Reduce overall per Rx queue memory footprint. - Automatic fallback to full-page buffers when: * Jumbo frames are enabled (MTU > PAGE_SIZE / 2). * The XDP path is active, to avoid complexities with fragment reuse. Testing on VMs with 64KB pages shows around 200% throughput improvement. Memory efficiency is significantly improved due to reduced wastage in page allocations. Example: We are now able to fit 35 rx buffers in a single 64kb page for MTU size of 1500, instead of 1 rx buffer per page previously. Tested: - iperf3, iperf2, and nttcp benchmarks. - Jumbo frames with MTU 9000. - Native XDP programs (XDP_PASS, XDP_DROP, XDP_TX, XDP_REDIRECT) for testing the XDP path in driver. - Memory leak detection (kmemleak). - Driver load/unload, reboot, and stress scenarios. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Link: https://patch.msgid.link/20250814140410.GA22089@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 14:42:44 +02:00
Lorenzo Bianconi	a8bdd935d1	net: airoha: Add wlan flowtable TX offload Introduce support to offload the traffic received on the ethernet NIC and forwarded to the wireless one using HW Packet Processor Engine (PPE) capabilities. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20250814-airoha-en7581-wlan-tx-offload-v1-1-72e0a312003e@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 12:22:24 +02:00
Paolo Abeni	244ada9cb7	Merge branch 'net-macb-add-taprio-traffic-scheduling-support' Vineeth Karumanchi says: ==================== net: macb: Add TAPRIO traffic scheduling support Implement Time-Aware Traffic Scheduling (TAPRIO) offload support for Cadence MACB/GEM ethernet controllers to enable IEEE 802.1Qbv compliant time-sensitive networking (TSN) capabilities. Key features implemented: - Complete TAPRIO qdisc offload infrastructure with TC_SETUP_QDISC_TAPRIO - Hardware-accelerated time-based gate control for multiple queues - Enhanced Scheduled Traffic (ENST) register configuration and management - Gate state scheduling with configurable start times, on/off intervals - Support for cycle-time based traffic scheduling with validation - Hardware capability detection via MACB_CAPS_QBV flag - Robust error handling and parameter validation - Queue-specific timing register programming (ENST_START_TIME, ENST_ON_TIME, ENST_OFF_TIME) Changes include: - Add enst_ns_to_hw_units(): Converts nanoseconds to hardware units - Add enst_max_hw_interval(): Returns max interval for given speed - Add macb_taprio_setup_replace() for TAPRIO configuration - Add macb_taprio_destroy() for cleanup and reset - Add macb_setup_tc() as TC offload entry point - Enable NETIF_F_HW_TC feature for QBV-capable hardware - Add ENST register offsets to queue configuration The implementation validates timing constraints against hardware limits, supports per-queue gate mask configuration, and provides comprehensive logging for debugging and monitoring. Hardware registers are programmed atomically with proper locking to ensure consistent state. Tested on Xilinx Versal platforms with QBV-capable MACB controllers. Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com> ==================== Link: https://patch.msgid.link/20250814071058.3062453-1-vineeth.karumanchi@amd.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 12:13:05 +02:00
Vineeth Karumanchi	d739ce4beb	net: macb: Add capability-based QBV detection and Versal support The 'exclude_qbv' bit in the designcfg_debug1 register varies across MACB/GEM IP revisions, making direct probing unreliable for detecting QBV support. This patch introduces a capability-based approach for consistent QBV feature identification across the IP family. Platform support updates: - Establish foundation for QBV detection in TAPRIO implementation - Enable MACB_CAPS_QBV for Xilinx Versal platform configuration - Fix capability line wrapping, ensuring code stays within 80 columns Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com> Link: https://patch.msgid.link/20250814071058.3062453-3-vineeth.karumanchi@amd.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 12:13:03 +02:00
Vineeth Karumanchi	89934dbf16	net: macb: Add TAPRIO traffic scheduling support Implement Time-Aware Traffic Scheduling (TAPRIO) offload support for Cadence MACB/GEM ethernet controllers to enable IEEE 802.1Qbv compliant time-sensitive networking (TSN) capabilities. Key Features: 1. Enhanced Scheduled Traffic (ENST) Register Management - Per-queue ENST registers: ENST_START_TIME, ENST_ON_TIME, ENST_OFF_TIME - Centralized control via ENST_CONTROL for gate enable/disable - Infrastructure enhancements: * Extended macb_queue structure with ENST timing control registers * Mapped ENST register offsets into queue management framework * Introduced macb_queue_enst_config for per-entry TC configuration - Timing conversion utility: * enst_ns_to_hw_units(): Converts nanoseconds to hardware units * Timing values are programmed as hardware units based on link speed * Conversion formula: time_bytes = time_ns / divisor * Speed-specific divisors: 1Gbps=8, 100Mbps=80, 10Mbps=800 - Hardware limit utility: * enst_max_hw_interval(): Returns max interval for given speed 2. TAPRIO Configuration via "tc qdisc replace" - macb_taprio_setup_replace(): Configures TAPRIO hardware offload - Parameter validation checks performed: * TC entry limit validation against available hardware queues * Base time non-negativity enforcement * Speed-adaptive timing constraint verification * Cycle time vs. total gate time consistency checks * Single-queue gate mask enforcement per scheduling entry - Programming sequence: * GEM doesn't support changing ENST registers if ENST is enabled, hence disable ENST before programming * Atomic timing register configuration (START_TIME, ON_TIME, OFF_TIME) * Enable queues via ENST_CONTROL 3. TAPRIO Cleanup via "tc qdisc destroy" - macb_taprio_destroy(): Safely removes TAPRIO configuration - Restores default queue behavior - Cleanup steps: * Reset TC state * Disable ENST * Clear timing registers * Ensure atomic updates with locking 4. Traffic Control Offload Infrastructure - macb_setup_taprio(): TAPRIO command dispatcher * Verifies hardware support * Handles runtime suspend state - macb_setup_tc(): TC_SETUP_QDISC_TAPRIO entry point - Supports REPLACE and DESTROY operations Tested on Xilinx Versal platforms with QBV-capable MACB controllers. Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com> Link: https://patch.msgid.link/20250814071058.3062453-2-vineeth.karumanchi@amd.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 12:13:03 +02:00
Paolo Abeni	6089970b07	Merge branch 'eth-fbnic-add-xdp-support-for-fbnic' Mohsin Bashir says: ==================== eth: fbnic: Add XDP support for fbnic This patch series introduces basic XDP support for fbnic. To enable this, it also includes preparatory changes such as making the HDS threshold configurable via ethtool, updating headroom for fbnic, tracking frag state in shinfo, and prefetching the first cacheline of data. V3: https://lore.kernel.org/netdev/20250812220150.161848-1-mohsin.bashr@gmail.com/ V2: https://lore.kernel.org/netdev/20250811211338.857992-1-mohsin.bashr@gmail.com/ V1: https://lore.kernel.org/netdev/20250723145926.4120434-1-mohsin.bashr@gmail.com/ ==================== Link: https://patch.msgid.link/20250813221319.3367670-1-mohsin.bashr@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 10:51:20 +02:00
Mohsin Bashir	7fedb8f267	eth: fbnic: Report XDP stats via ethtool Add support to collect XDP stats via ethtool API. We record packets and bytes sent, and packets dropped on the XDP_TX path. ethtool -S eth0 \| grep xdp \| grep -v "0" xdp_tx_queue_13_packets: 2 xdp_tx_queue_13_bytes: 16126 Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250813221319.3367670-10-mohsin.bashr@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 10:51:16 +02:00
Mohsin Bashir	5213ff0863	eth: fbnic: Collect packet statistics for XDP Add support for XDP statistics collection and reporting via rtnl_link and netdev_queue API. For XDP programs without frags support, fbnic requires MTU to be less than the HDS threshold. If an over-sized frame is received, the frame is dropped and recorded as rx_length_errors reported via ip stats to highlight that this is an error. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250813221319.3367670-9-mohsin.bashr@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 10:51:16 +02:00
Mohsin Bashir	168deb7b31	eth: fbnic: Add support for XDP_TX action Add support for XDP_TX action and cleaning the associated work. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250813221319.3367670-8-mohsin.bashr@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 10:51:16 +02:00
Mohsin Bashir	cf4facfb13	eth: fbnic: Add support for XDP queues Add support for allocating XDP_TX queues and configuring ring support. FBNIC has been designed with XDP support in mind. Each Tx queue has 2 submission queues and one completion queue, with the expectation that one of the submission queues will be used by the stack, and the other by XDP. XDP queues are populated by XDP_TX and start from index 128 in the TX queue array. The support for XDP_TX is added in the next patch. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250813221319.3367670-7-mohsin.bashr@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 10:51:16 +02:00
Mohsin Bashir	1b0a3950db	eth: fbnic: Add XDP pass, drop, abort support Add basic support for attaching an XDP program to the device and support for PASS/DROP/ABORT actions. In fbnic, buffers are always mapped as DMA_BIDIRECTIONAL. The BPF program pointer can be read either on a per-packet basis or on a per-NAPI poll basis. Both approaches are functionally equivalent, in the current code. Stick to per-packet as it limits number of arguments we need to pass around. On the XDP hot path, check that packets with fragments are only allowed when multi-buffer support is enabled for the XDP program. Ideally, this check should not be necessary because ndo_bpf verifies that for XDP programs without multi-buff support, MTU is less than the hds_thresh. However, the MTU currently does not enforce the receive size which would require cleaning up the data path and bouncing the link. For practical reasons, prioritize the ability to enter and exit BPF mode with different MTU sizes without requiring a full reconfig. Testing: Hook a simple XDP program that passes all the packets destined for a specific port iperf3 -c 192.168.1.10 -P 5 -p 12345 Connecting to host 192.168.1.10, port 12345 [ 5] local 192.168.1.9 port 46702 connected to 192.168.1.10 port 12345 [ ID] Interval Transfer Bitrate Retr Cwnd - - - - - - - - - - - - - - - - - - - - - - - - - [SUM] 1.00-2.00 sec 3.86 GBytes 33.2 Gbits/sec 0 XDP_DROP: Hook an XDP program that drops packets destined for a specific port iperf3 -c 192.168.1.10 -P 5 -p 12345 ^C- - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [SUM] 0.00-0.00 sec 0.00 Bytes 0.00 bits/sec 0 sender [SUM] 0.00-0.00 sec 0.00 Bytes 0.00 bits/sec receiver iperf3: interrupt - the client has terminated XDP with HDS: - Validate XDP attachment failure when HDS is low ~] ethtool -G eth0 hds-thresh 512 ~] sudo ip link set eth0 xdpdrv obj xdp_pass_12345.o sec xdp ~] Error: fbnic: MTU too high, or HDS threshold is too low for single buffer XDP. - Validate successful XDP attachment when HDS threshold is appropriate ~] ethtool -G eth0 hds-thresh 1536 ~] sudo ip link set eth0 xdpdrv obj xdp_pass_12345.o sec xdp - Validate when the XDP program is attached, changing HDS thresh to a lower value fails ~] ethtool -G eth0 hds-thresh 512 ~] netlink error: fbnic: Use higher HDS threshold or multi-buf capable program - Validate HDS thresh does not matter when xdp frags support is available ~] ethtool -G eth0 hds-thresh 512 ~] sudo ip link set eth0 xdpdrv obj xdp_pass_mb_12345.o sec xdp.frags Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250813221319.3367670-6-mohsin.bashr@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 10:51:16 +02:00
Mohsin Bashir	9064ab485f	eth: fbnic: Prefetch packet headers on Rx Issue a prefetch for the start of the buffer on Rx to try to avoid cache miss on packet headers. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250813221319.3367670-5-mohsin.bashr@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 10:51:16 +02:00
Mohsin Bashir	61f9a066c3	eth: fbnic: Use shinfo to track frags state on Rx Remove local fields that track frags state and instead store this information directly in the shinfo struct. This change is necessary because the current implementation can lead to inaccuracies in certain scenarios, such as when using XDP multi-buff support. Specifically, the XDP program may update nr_frags without updating the local variables, resulting in an inconsistent state. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250813221319.3367670-4-mohsin.bashr@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 10:51:16 +02:00
Mohsin Bashir	0cf5a39720	eth: fbnic: Update Headroom Fbnic currently reserves a minimum of 64B headroom, but this is insufficient for inserting additional headers (e.g., IPV6) via XDP, as only 24 bytes are available for adjustment. To address this limitation, increase the headroom to a larger value while ensuring better page use. Although the resulting headroom (192B) is smaller than the recommended value (256B), forcing the headroom to 256B would require aligning to 256B (as opposed to the current 128B), which can push the max headroom to 511B. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250813221319.3367670-3-mohsin.bashr@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 10:51:16 +02:00
Mohsin Bashir	2b30fc01a6	eth: fbnic: Add support for HDS configuration Add support for configuring the header data split threshold. For fbnic, the tcp data split support is enabled all the time. Fbnic supports a maximum buffer size of 4KB. However, the reservation for the headroom, tailroom, and padding reduce the max header size accordingly. ethtool_hds -g eth0 Ring parameters for eth0: Pre-set maximums: ... HDS thresh: 3584 Current hardware settings: ... HDS thresh: 1536 Verify hds tests in ksft-net-drv are passing ksft-net-drv]# ./drivers/net/hds.py TAP version 13 1..13 ok 1 hds.get_hds ok 2 hds.get_hds_thresh ok 3 hds.set_hds_disable # SKIP disabling of HDS not supported by ... ... ... ok 12 hds.ioctl_set_xdp ok 13 hds.ioctl_enabled_set_xdp \# Totals: pass:12 fail:0 xfail:0 xpass:0 skip:1 error:0 Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250813221319.3367670-2-mohsin.bashr@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-08-19 10:51:16 +02:00
Jakub Kicinski	38e1467392	Merge branch 'net-stmmac-eee-and-wol-cleanups' Russell King says: ==================== net: stmmac: EEE and WoL cleanups This series contains a series of cleanup patches for the EEE and WoL code in stmmac, prompted by issues raised during the last three weeks. ==================== Link: https://patch.msgid.link/aJ8avIp8DBAckgMc@shell.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-18 18:10:14 -07:00
Russell King (Oracle)	5e5b39aa6f	net: stmmac: explain the phylink_speed_down() call in stmmac_release() The call to phylink_speed_down() looks odd on the face of it. Add a comment to explain why this call is there. phylink_speed_up() is always called in __stmmac_open(), and already has a comment. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/E1umsfV-008vKv-1O@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-18 18:10:12 -07:00
Russell King (Oracle)	6a9a6ce962	net: stmmac: add helpers to indicate WoL enable status Add two helpers to abstract the WoL enable status at the PHY and MAC to make the code easier to read. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/E1umsfP-008vKp-U1@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-18 18:10:12 -07:00
Russell King (Oracle)	d09413dd25	net: stmmac: use core wake IRQ support The PM core provides management of wake IRQs along side setting the device wake enable state. In order to use this, we need to register the interrupt used to wakeup the system using devm_pm_set_wake_irq() or dev_pm_set_wake_irq(). The core will then enable or disable IRQ wake state on this interrupt as appropriate, depending on the device_set_wakeup_enable() state. device_set_wakeup_enable() does not care about having balanced enable/disable calls. Make use of this functionality, rather than explicitly managing the IRQ enable state in the set_wol() ethtool op. This removes the IRQ wake state management from stmmac. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/E1umsfK-008vKj-Pw@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-18 18:10:12 -07:00
Russell King (Oracle)	f17bd297bb	net: stmmac: remove unnecessary "stmmac: wakeup enable" print Printing "stmmac: wakeup enable" to the kernel log isn't useful - it doesn't identify the adapter, and is effectively nothing more than a debugging print. This information can be discovered by looking at /sys/device.../power/wakeup as the device_set_wakeup_enable() call updates this sysfs file. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1umsfF-008vKc-Kt@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-18 18:10:11 -07:00
Russell King (Oracle)	b181306e5e	net: stmmac: remove redundant WoL option validation The core ethtool API validates the WoL options passed from userspace against the support which the driver reports from its get_wol() method, returning EINVAL if an unsupported mode is requested. Therefore, there is no need for stmmac to implement its own validation. Remove this unnecessary code. See ethnl_set_wol() in net/ethtool/wol.c and ethtool_set_wol() in net/ethtool/ioctl.c. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1umsfA-008vKW-H1@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-18 18:10:11 -07:00
Russell King (Oracle)	49b97bc52a	net: stmmac: remove write-only mac->pmt mac_device_info->pmt is only ever written, nothing reads it. Remove this struct member. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1umsf5-008vKQ-DT@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-18 18:10:11 -07:00
Russell King (Oracle)	6d598e856d	net: stmmac: remove unnecessary checks in ethtool eee ops Phylink will check whether the MAC supports the LPI methods in struct phylink_mac_ops, and return -EOPNOTSUPP if the LPI capabilities are not provided. stmmac doesn't provide LPI capabilities if priv->dma_cap.eee is not set. Therefore, checking the state of priv->dma_cap.eee in the ethtool ops and returning -EOPNOTSUPP is redundant - let phylink handle this. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1umsf0-008vKK-A3@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-18 18:10:11 -07:00
Vishal Badole	5883cb32fc	amd-xgbe: Configure and retrieve 'tx-usecs' for Tx coalescing Ethtool has advanced with additional configurable options, but the current driver does not support tx-usecs configuration using Ethtool. Add support to configure and retrieve 'tx-usecs' using ethtool, which specifies the wait time before servicing an interrupt for Tx coalescing. Signed-off-by: Vishal Badole <Vishal.Badole@amd.com> Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Link: https://patch.msgid.link/20250816141941.126054-1-Vishal.Badole@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-18 17:50:55 -07:00
Jakub Kicinski	8e33dc6f78	Merge branch 'net-use-vmalloc_array-to-simplify-code' Qianfeng Rong says: ==================== net: use vmalloc_array() to simplify code Remove array_size() calls and replace vmalloc() with vmalloc_array() to simplify the code and maintain consistency with existing kmalloc_array() usage. vmalloc_array() is also optimized better, resulting in less instructions being used [1]. [1]: https://lore.kernel.org/lkml/abc66ec5-85a4-47e1-9759-2f60ab111971@vivo.com/ ==================== Link: https://patch.msgid.link/20250816090659.117699-1-rongqianfeng@vivo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-08-18 17:49:54 -07:00

1 2 3 4 5 ...

1381972 commits