Date: Fri, 15 May 2026 08:41:03 -0700
From: Stanislav Fomichev
To: Jason Xing
Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, bjorn@kernel.org, magnus.karlsson@intel.com,
	maciej.fijalkowski@intel.com, jonathan.lemon@gmail.com,
	sdf@fomichev.me, ast@kernel.org, daniel@iogearbox.net,
	hawk@kernel.org, john.fastabend@gmail.com, horms@kernel.org,
	andrew+netdev@lunn.ch, bpf@vger.kernel.org, netdev@vger.kernel.org,
	Jason Xing
Subject: Re: [PATCH net 3/4] xsk: drain continuation descs after overflow in xsk_build_skb()
Message-ID: 
References: <20260510012310.88570-1-kerneljasonxing@gmail.com>
 <20260510012310.88570-4-kerneljasonxing@gmail.com>
X-Mailing-List: bpf@vger.kernel.org
In-Reply-To: 

On 05/15, Jason Xing wrote:
> On Fri, May 15, 2026 at 8:29 AM Stanislav Fomichev wrote:
> >
> > On 05/14, Jason Xing wrote:
> > > On Thu, May 14, 2026 at 12:27 AM Stanislav Fomichev
> > > wrote:
> > > >
> > > > On 05/10, Jason Xing wrote:
> > > > > From: Jason Xing
> > > > >
> > > > > When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
> > > > > only the current descriptor is released from the TX ring. The remaining
> > > > > continuation descriptors of the same packet stay in the ring.
> > > > > Since xs->skb is set to NULL after the drop, the TX loop picks up these
> > > > > leftover frags and misinterprets each one as the beginning of a new
> > > > > packet, corrupting the packet stream.
> > > > >
> > > > > Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
> > > > > and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised.
> > > > > The main TX loop in __xsk_generic_xmit() then handles continuation
> > > > > descriptors one at a time: each gets a normal CQ reservation (with
> > > > > backpressure), its address is submitted to the completion queue, and
> > > > > the descriptor is released from the TX ring. When the last fragment
> > > > > (without XDP_PKT_CONTD) is processed, the flag is cleared and the
> > > > > function returns -EOVERFLOW so the next call starts with a fresh
> > > > > budget for normal packets.
> > > > >
> > > > > This reuses the existing CQ backpressure and budget mechanisms, so if
> > > > > the CQ is full the function returns -EAGAIN and userspace drains the
> > > > > CQ before retrying. Zero buffer leakage, zero packet stream corruption.
> > > > >
> > > > > Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> > > > > Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> > > > > Signed-off-by: Jason Xing
> > > > > ---
> > > > >  include/net/xdp_sock.h |  1 +
> > > > >  net/xdp/xsk.c          | 22 ++++++++++++++++++++++
> > > > >  2 files changed, 23 insertions(+)
> > > > >
> > > > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > > > index 23e8861e8b25..1958d19d9925 100644
> > > > > --- a/include/net/xdp_sock.h
> > > > > +++ b/include/net/xdp_sock.h
> > > > > @@ -80,6 +80,7 @@ struct xdp_sock {
> > > > >  	 * call of __xsk_generic_xmit().
> > > > >  	 */
> > > > >  	struct sk_buff *skb;
> > > > > +	bool drain_cont;
> > > > >
> > > > >  	struct list_head map_list;
> > > > >  	/* Protects map_list */
> > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > index 3f1e590c855d..232dd7126905 100644
> > > > > --- a/net/xdp/xsk.c
> > > > > +++ b/net/xdp/xsk.c
> > > > > @@ -936,6 +936,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > >  			xs->tx->invalid_descs++;
> > > > >  		}
> > > > >  		xskq_cons_release(xs->tx);
> > > > > +		if (xp_mb_desc(desc))
> > > > > +			xs->drain_cont = true;
> > > > >  	} else {
> > > > >  		/* Let application retry */
> > > > >  		xsk_cq_cancel_locked(xs->pool, 1);
> > > > > @@ -982,6 +984,26 @@ static int __xsk_generic_xmit(struct sock *sk)
> > > > >  			goto out;
> > > > >  		}
> > > > >
> > > > > +		if (unlikely(xs->drain_cont)) {
> > > > > +			unsigned long flags;
> > > > > +			u32 idx;
> > > > > +
> > > >
> > > > [..]
> > > >
> > > > > +			spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
> > > > > +			idx = xskq_get_prod(xs->pool->cq);
> > > > > +			xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
> > > > > +			xskq_prod_submit_n(xs->pool->cq, 1);
> > > > > +			spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);
> > > >
> > > > Not sure I understand why you want this if you're still marking the desc
> > > > invalid.
> > >
> > > The key point is that as long as we read the desc from the txq, the
> > > desc should either be put back to the txq or published to the cq (so
> > > the application can keep track of it) at this point. Otherwise, the
> > > application will lose track of this desc, which breaks the whole
> > > logic.
> >
> > Yes, makes sense!
> >
> > > > > +
> > > > > +			xs->tx->invalid_descs++;
> > > > > +			xskq_cons_release(xs->tx);
> > > > > +			if (!xp_mb_desc(&desc)) {
> > > > > +				xs->drain_cont = false;
> > > > > +				err = -EOVERFLOW;
> > > > > +				goto out;
> > > > > +			}
> > > >
> > > > I also don't understand why you want to return -EOVERFLOW again?
> > > > Why not (quietly) swallow these invalid xp_mb_desc from the previous
> > > > packet and move on?
> > >
> > > This particular desc is really one of the overflow cases, right? We
> > > should warn users to handle this with this error code.
> >
> > Right, but didn't we already return -EOVERFLOW to the user once?
> > The part where you set drain_cont = true will return -EOVERFLOW. Then
> > this remainder will also return -EOVERFLOW?
>
> Right, my intention is to alert users twice in this kind of
> circumstance. Silently draining the remaining portion of the skb on
> the second pass would look a bit strange, wouldn't it?
>
> > But let's maybe step back: what's the expectation on the user
> > when we return -EOVERFLOW? In theory, the user can re-submit a new,
> > shorter packet (at the current prod index), right? And then your
> > !xp_mb_desc logic will break/swallow/-EOVERFLOW it?
>
> People would be aware that the skb is too big to handle when they
> receive the overflow warning twice. To put it more simply, on the
> second pass through the xmit path it is enough to handle only the
> remaining frags; no descs that belong to another skb should be
> touched.
>
> If the user is able to react quickly to this case, I think the whole
> skb should be re-put into the txq rather than just the remaining
> part. I'm not sure I interpret the "break" correctly here.

Let's say the user posts a packet with too many descriptors. We catch
that mid-way and return -EOVERFLOW. What is the user supposed to do
with that error? In my mind, the user should drain the TX ring
completely and start posting shorter packets. You seem to be trying to
handle the case where the user leaves the remainder of the previously
EOVERFLOW'd packet in place, why?