linux/net
Jakub Kicinski 7aa767d0d3 net: consume xmit errors of GSO frames
udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests
currently in NIPA. They fail in the same exact way, TCP GRO
test stalls occasionally and the test gets killed after 10min.

These tests use veth to simulate GRO. They attach a trivial
("return XDP_PASS;") XDP program to the veth to force TSO off
and NAPI on.

Digging into the failure mode we can see that the connection
is completely stuck after a burst of drops. The sender's snd_nxt
is at sequence number N [1], but the receiver claims to have
received (rcv_nxt) up to N + 3 * MSS [2]. Last piece of the puzzle
is that senders rtx queue is not empty (let's say the block in
the rtx queue is at sequence number N - 4 * MSS [3]).

In this state, sender sends a retransmission from the rtx queue
with a single segment, and sequence numbers N-4*MSS:N-3*MSS [3].
Receiver sees it and responds with an ACK all the way up to
N + 3 * MSS [2]. But sender will reject this ack as TCP_ACK_UNSENT_DATA
because it has no recollection of ever sending data that far out [1].
And we are stuck.

The root cause is the mess of the xmit return codes. veth returns
an error when it can't xmit a frame. We end up with a loss event
like this:

  -------------------------------------------------
  |   GSO super frame 1   |   GSO super frame 2   |
  |-----------------------------------------------|
  | seg | seg | seg | seg | seg | seg | seg | seg |
  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |
  -------------------------------------------------
     x    ok    ok    <ok>|  ok    ok    ok   <x>
                          \\
			   snd_nxt

"x" means packet lost by veth, and "ok" means it went thru.
Since veth has TSO disabled in this test it sees individual segments.
Segment 1 is on the retransmit queue and will be resent.

So why did the sender not advance snd_nxt even tho it clearly did
send up to seg 8? tcp_write_xmit() interprets the return code
from the core to mean that data has not been sent at all. Since
TCP deals with GSO super frames, not individual segment the crux
of the problem is that loss of a single segment can be interpreted
as loss of all. TCP only sees the last return code for the last
segment of the GSO frame (in <> brackets in the diagram above).

Of course for the problem to occur we need a setup or a device
without a Qdisc. Otherwise Qdisc layer disconnects the protocol
layer from the device errors completely.

We have multiple ways to fix this.

 1) make veth not return an error when it lost a packet.
    While this is what I think we did in the past, the issue keeps
    reappearing and it's annoying to debug. The game of whack
    a mole is not great.

 2) fix the damn return codes
    We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the
    documentation, so maybe we should make the return code from
    ndo_start_xmit() a boolean. I like that the most, but perhaps
    some ancient, not-really-networking protocol would suffer.

 3) make TCP ignore the errors
    It is not entirely clear to me what benefit TCP gets from
    interpreting the result of ip_queue_xmit()? Specifically once
    the connection is established and we're pushing data - packet
    loss is just packet loss?

 4) this fix
    Ignore the rc in the Qdisc-less+GSO case, since it's unreliable.
    We already always return OK in the TCQ_F_CAN_BYPASS case.
    In the Qdisc-less case let's be a bit more conservative and only
    mask the GSO errors. This path is taken by non-IP-"networks"
    like CAN, MCTP etc, so we could regress some ancient thing.
    This is the simplest, but also maybe the hackiest fix?

Similar fix has been proposed by Eric in the past but never committed
because original reporter was working with an OOT driver and wasn't
providing feedback (see Link).

Link: https://lore.kernel.org/CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com
Fixes: 1f59533f9c ("qdisc: validate frames going through the direct_xmit path")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260223235100.108939-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-26 11:35:00 +01:00
..
6lowpan
9p 9p/xen: protect xen_9pfs_front_free against concurrent calls 2026-01-29 23:48:33 +00:00
802 net: remove HIPPI support and RoadRunner HIPPI driver 2026-01-20 19:12:06 -08:00
8021q net: vlan: sync VLAN features with lower device 2025-10-31 17:42:35 -07:00
appletalk net: Convert proto_ops connect() callbacks to use sockaddr_unsized 2025-11-04 19:10:32 -08:00
atm net: atm: fix crash due to unvalidated vcc pointer in sigd_send() 2026-02-10 11:24:47 +01:00
ax25 net: ax25: remove plumbing for never-implemented DAMA Master support 2026-01-30 19:19:39 -08:00
batman-adv Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-10-31 06:46:03 -07:00
bluetooth Bluetooth: L2CAP: Fix missing key size check for L2CAP_LE_CONN_REQ 2026-02-23 16:08:15 -05:00
bpf bpf: add fsession support 2026-01-24 18:49:35 -08:00
bridge Including fixes from Netfilter. 2026-02-19 10:39:08 -08:00
caif caif: fix integer underflow in cffrml_receive() 2025-12-11 01:35:41 -08:00
can can: gw: use can_gw_hops instead of sk_buff::csum_start 2026-02-05 11:58:40 +01:00
ceph libceph: adapt ceph_x_challenge_blob hashing and msgr1 message signing 2026-02-09 12:29:22 +01:00
core net: consume xmit errors of GSO frames 2026-02-26 11:35:00 +01:00
dcb Revert "Documentation: net: add flow control guide and document ethtool API" 2025-10-01 09:48:21 +02:00
devlink devlink: Refactor devlink_rate_nodes_check 2026-02-02 20:05:51 -08:00
dns_resolver net/dns_resolver: use credential guards in dns_query() 2025-11-04 12:36:51 +01:00
dsa net: dsa: add tag format for MxL862xx switches 2026-02-11 11:27:57 +01:00
ethernet net: optimize eth_type_trans() vs CONFIG_STACKPROTECTOR_STRONG=y 2025-11-24 19:27:31 -08:00
ethtool Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2026-02-05 09:54:08 -08:00
handshake net/handshake: Fix null-ptr-deref in handshake_complete() 2025-12-22 12:36:40 +01:00
hsr hsr: Implement more robust duplicate discard for HSR 2026-02-10 12:02:29 +01:00
ieee802154 net: Convert proto callbacks from sockaddr to sockaddr_unsized 2025-11-04 19:10:33 -08:00
ife
ipv4 tcp: re-enable acceptance of FIN packets when RWIN is 0 2026-02-25 19:07:02 -08:00
ipv6 udplite: Fix null-ptr-deref in __udp_enqueue_schedule_skb(). 2026-02-20 16:14:10 -08:00
iucv net/iucv: clean up iucv kernel-doc warnings 2026-02-04 20:39:58 -08:00
kcm kcm: fix zero-frag skb in frag_list on partial sendmsg error 2026-02-23 17:26:55 -08:00
key pfkey: Deprecate pfkey 2025-10-30 09:03:12 +01:00
l2tp l2tp: avoid one data-race in l2tp_tunnel_del_work() 2026-01-19 09:55:41 -08:00
l3mdev
lapb
llc net: Convert proto_ops connect() callbacks to use sockaddr_unsized 2025-11-04 19:10:32 -08:00
mac80211 wifi: mac80211: fix NULL pointer dereference in mesh_rx_csa_frame() 2026-02-24 10:03:10 +01:00
mac802154
mctp net: mctp: ensure our nlmsg responses are initialised 2026-02-12 18:35:45 -08:00
mpls mpls: Drop RTNL for RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE. 2025-11-03 17:40:54 -08:00
mptcp tcp: fix potential race in tcp_v6_syn_recv_sock() 2026-02-19 14:02:19 -08:00
ncsi
netfilter netfilter: nf_tables: fix use-after-free in nf_tables_addchain() 2026-02-17 15:04:20 +01:00
netlabel audit: add record for multiple task security contexts 2025-08-30 10:15:30 -04:00
netlink net: Convert proto_ops connect() callbacks to use sockaddr_unsized 2025-11-04 19:10:32 -08:00
netrom netrom: fix double-free in nr_route_frame() 2026-01-20 19:15:40 -08:00
nfc net: nfc: nci: Fix parameter validation for packet data 2026-02-19 09:32:51 -08:00
nsh
openvswitch net: openvswitch: fix data race in ovs_vport_get_upcall_stats 2026-01-22 12:55:22 +01:00
packet net: add vlan_get_protocol_offset_inline() helper 2026-02-05 16:33:52 +01:00
phonet net: Convert proto callbacks from sockaddr to sockaddr_unsized 2025-11-04 19:10:33 -08:00
psample
psp psp: use sk->sk_hash in psp_write_headers() 2026-02-19 14:04:23 -08:00
qrtr net: qrtr: Drop the MHI auto_queue feature for IPCR DL channels 2025-12-31 16:24:04 +05:30
rds net/rds: fix recursive lock in rds_tcp_conn_slots_available 2026-02-24 10:11:04 +01:00
rfkill net: replace use of system_wq with system_percpu_wq 2025-09-22 17:40:30 -07:00
rose net: rose: fix invalid array index in rose_kill_by_device() 2025-12-30 11:45:51 +01:00
rxrpc rxrpc: Fix data-race warning and potential load/store tearing 2026-01-21 19:59:29 -08:00
sched net/sched: act_skbedit: fix divide-by-zero in tcf_skbedit_hash() 2026-02-17 17:27:39 -08:00
sctp sctp: move SCTP_CMD_ASSOC_SHKEY right after SCTP_CMD_PEER_INIT 2026-01-17 15:10:34 -08:00
shaper tools: ynl-gen: add regeneration comment 2025-11-25 19:20:42 -08:00
smc tcp: fix potential race in tcp_v6_syn_recv_sock() 2026-02-19 14:02:19 -08:00
strparser Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-11-13 12:35:38 -08:00
sunrpc NFS Client Updates for Linux 7.0 2026-02-12 17:49:33 -08:00
switchdev
tipc tipc: fix duplicate publication key in tipc_service_insert_publ() 2026-02-23 17:40:52 -08:00
tls tls: Fix race condition in tls_sw_cancel_work_tx() 2026-02-23 17:08:14 -08:00
unix af_unix: Fix memleak of newsk in unix_stream_connect(). 2026-02-11 13:01:13 +01:00
vmw_vsock vsock: lock down child_ns_mode as write-once 2026-02-26 11:10:03 +01:00
wireless wifi: radiotap: reject radiotap with unknown bits 2026-02-23 09:23:44 +01:00
x25 net: Convert proto_ops connect() callbacks to use sockaddr_unsized 2025-11-04 19:10:32 -08:00
xdp Revert "Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp'" 2026-01-20 18:06:01 -08:00
xfrm ipsec-2026-02-20 2026-02-20 15:57:55 -08:00
compat.c socket: Unify getsockname and getpeername implementation 2025-11-26 13:45:23 -07:00
devres.c
Kconfig net: Kconfig: discourage drop_monitor enablement 2025-10-17 16:29:26 -07:00
Kconfig.debug
Makefile psp: base PSP device support 2025-09-18 12:32:06 +02:00
socket.c net: Drop the lock in skb_may_tx_timestamp() 2026-02-24 11:27:29 +01:00
sysctl_net.c