All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stanislav Fomichev <stfomichev@gmail.com>
To: Jason Xing <kerneljasonxing@gmail.com>
Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, bjorn@kernel.org, magnus.karlsson@intel.com,
	maciej.fijalkowski@intel.com, jonathan.lemon@gmail.com,
	sdf@fomichev.me, ast@kernel.org, daniel@iogearbox.net,
	hawk@kernel.org, john.fastabend@gmail.com, joe@dama.to,
	willemdebruijn.kernel@gmail.com, bpf@vger.kernel.org,
	netdev@vger.kernel.org, Jason Xing <kernelxing@tencent.com>
Subject: Re: [PATCH net-next] xsk: skip validating skb list in xmit path
Date: Wed, 16 Jul 2025 14:40:21 -0700	[thread overview]
Message-ID: <aHgcRdp8xKIIB7Q_@mini-arch> (raw)
In-Reply-To: <CAL+tcoB1aBsB5EtgnrPg-t4CebU2ZFFDp8L-9DjRTeOuYGQ-GQ@mail.gmail.com>

On 07/16, Jason Xing wrote:
> On Wed, Jul 16, 2025 at 4:44 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 07/15, Jason Xing wrote:
> > > On Tue, Jul 15, 2025 at 12:03 AM Stanislav Fomichev
> > > <stfomichev@gmail.com> wrote:
> > > >
> > > > On 07/13, Jason Xing wrote:
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > For xsk, it's not needed to validate and check the skb in
> > > > > validate_xmit_skb_list() in copy mode because xsk_build_skb() doesn't
> > > > > and doesn't need to prepare those requisites to validate. Xsk is just
> > > > > responsible for delivering raw data from userspace to the driver.
> > > >
> > > > So the __dev_direct_xmit was taken out of af_packet in commit 865b03f21162
> > > > ("dev: packet: make packet_direct_xmit a common function"). And a call
> > > > to validate_xmit_skb_list was added in 104ba78c9880 ("packet: on direct_xmit,
> > > > limit tso and csum to supported devices") to support TSO. Since we don't
> > > > support tso/vlan offloads in xsk_build_skb, removing validate_xmit_skb_list
> > > > seems fair.
> > >
> > > Right, if you don't mind, I think I will copy&paste your words in the
> > > next respin.
> > >
> > > >
> > > > Although, again, if you care about performance, why not use zerocopy
> > > > mode?
> > >
> > > I attached the performance impact because I'm working on the different
> > > modes in xsk to see how it really behaves. You can take it as a kind
> > > of investigation :)
> > >
> > > I like zc mode, but the fact is that:
> > > 1) with ixgbe driver, my machine could totally lose connection as long
> > > as the xsk tries to send packets. I'm still debugging it :(
> > > 2) some customers using virtio_net don't have a supported host, so
> > > copy mode is the only one choice.
> > >
> > > >
> > > > > Skipping numerous checks somehow contributes to the transmission
> > > > > especially in the extremely hot path.
> > > > >
> > > > > Performance-wise, I used './xdpsock -i enp2s0f0np0 -t  -S -s 64' to verify
> > > > > the guess and then measured on the machine with ixgbe driver. It stably
> > > > > goes up by 5.48%, which can be seen in the shown below:
> > > > > Before:
> > > > >  sock0@enp2s0f0np0:0 txonly xdp-skb
> > > > >                    pps            pkts           1.00
> > > > > rx                 0              0
> > > > > tx                 1,187,410      3,513,536
> > > > > After:
> > > > >  sock0@enp2s0f0np0:0 txonly xdp-skb
> > > > >                    pps            pkts           1.00
> > > > > rx                 0              0
> > > > > tx                 1,252,590      2,459,456
> > > > >
> > > > > This patch also removes total ~4% consumption which can be observed
> > > > > by perf:
> > > > > |--2.97%--validate_xmit_skb
> > > > > |          |
> > > > > |           --1.76%--netif_skb_features
> > > > > |                     |
> > > > > |                      --0.65%--skb_network_protocol
> > > > > |
> > > > > |--1.06%--validate_xmit_xfrm
> > > >
> > > > It is a bit surprising that mostly no-op validate_xmit_skb_list takes
> > > > 4% of the cycles. netif_skb_features taking ~2%? Any idea why? Is
> > > > it unoptimized kernel? Which driver is it?
> > >
> > > No idea on this one, sorry. I tested with different drivers (like
> > > i40e) and it turned out to be nearly the same result.
> >
> > I was trying to follow validate_xmit_skb_list, but too many things
> > happen in there. Although, without gso/vlan, most of these should
> > be no-op. Plus you have xfrm compiled in. So still surprising, let's see
> > if other people have any suggestions.
> >
> > > One of my machines looks like this:
> > > # lspci -vv | grep -i ether
> > > 02:00.0 Ethernet controller: Intel Corporation Ethernet Controller
> > > 10-Gigabit X540-AT2 (rev 01)
> > > 02:00.1 Ethernet controller: Intel Corporation Ethernet Controller
> > > 10-Gigabit X540-AT2 (rev 01)
> > > # lscpu
> > > Architecture:                x86_64
> > >   CPU op-mode(s):            32-bit, 64-bit
> > >   Address sizes:             46 bits physical, 48 bits virtual
> > >   Byte Order:                Little Endian
> > > CPU(s):                      48
> > >   On-line CPU(s) list:       0-47
> > > Vendor ID:                   GenuineIntel
> > >   BIOS Vendor ID:            Intel(R) Corporation
> > >   Model name:                Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
> > >     BIOS Model name:         Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
> > >  CPU @ 2.3GHz
> > >     BIOS CPU family:         179
> > >     CPU family:              6
> > >     Model:                   63
> > >     Thread(s) per core:      2
> > >     Core(s) per socket:      12
> > >     Socket(s):               2
> > >     Stepping:                2
> > >     CPU(s) scaling MHz:      96%
> > >     CPU max MHz:             3100.0000
> > >     CPU min MHz:             1200.0000
> > >     BogoMIPS:                4589.31
> > >     Flags:                   fpu vme de pse tsc msr pae mce cx8 apic
> > > sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss
> > > ht tm pbe syscall nx pdpe1gb rdtscp lm constant_ts
> > >                              c arch_perfmon pebs bts rep_good nopl
> > > xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
> > > ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm p
> > >                              cid dca sse4_1 sse4_2 x2apic movbe popcnt
> > > tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault
> > > epb intel_ppin ssbd ibrs ibpb stibp tpr_shadow fl
> > >                              expriority ept vpid ept_ad fsgsbase
> > > tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc
> > > cqm_occup_llc dtherm ida arat pln pts vnmi md_clear flush_l
> > >                              1d
> > >
> > > >
> > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > ---
> > > > >  include/linux/netdevice.h |  4 ++--
> > > > >  net/core/dev.c            | 10 ++++++----
> > > > >  net/xdp/xsk.c             |  2 +-
> > > > >  3 files changed, 9 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > > > index a80d21a14612..2df44c22406c 100644
> > > > > --- a/include/linux/netdevice.h
> > > > > +++ b/include/linux/netdevice.h
> > > > > @@ -3351,7 +3351,7 @@ u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
> > > > >                    struct net_device *sb_dev);
> > > > >
> > > > >  int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev);
> > > > > -int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
> > > > > +int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id, bool validate);
> > > > >
> > > > >  static inline int dev_queue_xmit(struct sk_buff *skb)
> > > > >  {
> > > > > @@ -3368,7 +3368,7 @@ static inline int dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
> > > > >  {
> > > > >       int ret;
> > > > >
> > > > > -     ret = __dev_direct_xmit(skb, queue_id);
> > > > > +     ret = __dev_direct_xmit(skb, queue_id, true);
> > > > >       if (!dev_xmit_complete(ret))
> > > > >               kfree_skb(skb);
> > > > >       return ret;
> > > >
> > > > Implementation wise, will it be better if we move a call to validate_xmit_skb_list
> > > > from __dev_direct_xmit to dev_direct_xmit (and a few other callers of
> > > > __dev_direct_xmit)? This will avoid the new flag.
> > >
> > > __dev_direct_xmit() helper was developed to serve the xsk type a few
> > > years ago. For now it has only two callers. If we expect to avoid a
> > > new parameter, we will also move the dev check[1] as below to the
> > > callers of __dev_direct_xmit(). Then move validate_xmit_skb_list to
> > > __dev_direct_xmit(). It's not that concise, I assume? I'm not sure if
> > > I miss your point.
> > >
> > > [1]
> > > if (unlikely(!netif_running(dev) ||  !netif_carrier_ok(dev)))
> > >         goto drop;
> >
> > We can keep the check in its original place. I don't think the order of
> > the checks matters? I was thinking something along these (untested)
> > lines. Avoids one conditional in each path (not that it matters)..
> 
> Sorry, in my point of view, one additional flag is really not a big
> deal even though it is on the hot path. A new flag doesn't cause any

[..]

> side effects at all, does it? On the contrary, running on top of the
> following code snippet, if the netdevice is down, dev_direct_xmit()
> will take a lot of time to run validate_xmit_skb_list() before two
> simple device if-statements.

Not sure that's worth optimizing (unless someone actively complains). We want
to send the packets fast, not drop them fast :-)

  reply	other threads:[~2025-07-16 21:40 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-13  2:57 [PATCH net-next] xsk: skip validating skb list in xmit path Jason Xing
2025-07-14 16:03 ` Stanislav Fomichev
2025-07-14 23:53   ` Jason Xing
2025-07-15 20:44     ` Stanislav Fomichev
2025-07-15 23:24       ` Jason Xing
2025-07-16 21:40         ` Stanislav Fomichev [this message]
2025-07-15 23:19     ` Jakub Kicinski
2025-07-15 23:39       ` Jason Xing

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aHgcRdp8xKIIB7Q_@mini-arch \
    --to=stfomichev@gmail.com \
    --cc=ast@kernel.org \
    --cc=bjorn@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=joe@dama.to \
    --cc=john.fastabend@gmail.com \
    --cc=jonathan.lemon@gmail.com \
    --cc=kerneljasonxing@gmail.com \
    --cc=kernelxing@tencent.com \
    --cc=kuba@kernel.org \
    --cc=maciej.fijalkowski@intel.com \
    --cc=magnus.karlsson@intel.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sdf@fomichev.me \
    --cc=willemdebruijn.kernel@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.