linux-kernel.vger.kernel.org archive mirror
* [BUG] veth: TX drops with NAPI enabled and crash in combination with qdisc
@ 2025-06-04 15:33 Marcus Wichelmann
  2025-06-05 16:15 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 8+ messages in thread
From: Marcus Wichelmann @ 2025-06-04 15:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, bpf, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, linux-kernel

Hi,

while experimenting with XDP_REDIRECT from a veth-pair to another interface, I
noticed that the veth-pair loses lots of packets when multiple TCP streams go
through it, resulting in stalling TCP connections and noticeable instabilities.

This doesn't seem to be an issue with just XDP but rather occurs whenever the
NAPI mode of the veth driver is active.
I managed to reproduce the same behavior just by bringing the veth-pair into
NAPI mode (see commit d3256efd8e8b ("veth: allow enabling NAPI even without
XDP")) and running multiple TCP streams through it using a network namespace.

Here is how I reproduced it:

  ip netns add lb
  ip link add dev to-lb type veth peer name in-lb netns lb

  # Enable NAPI
  ethtool -K to-lb gro on
  ethtool -K to-lb tso off
  ip netns exec lb ethtool -K in-lb gro on
  ip netns exec lb ethtool -K in-lb tso off

  ip link set dev to-lb up
  ip -netns lb link set dev in-lb up

Then run an HTTP server inside the "lb" namespace that serves a large file:

  fallocate -l 10G testfiles/10GB.bin
  caddy file-server --root testfiles/

Download this file from within the root namespace multiple times in parallel:

  curl http://[fe80::...%to-lb]/10GB.bin -o /dev/null

In my tests, I ran four parallel curls at the same time and after just a few
seconds, three of them stalled while the other one "won" over the full bandwidth
and completed the download.

This is probably a result of the veth's ptr_ring running full, causing many
packet drops on TX, and the TCP congestion control reacting to that.
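
For what it's worth, the drops should show up in the TX "dropped" counter of the
transmitting side (in-lb inside the namespace for the data direction, to-lb for the
ACKs). A quick way to check, assuming iproute2 is available:

  ip netns exec lb ip -s link show dev in-lb
  ip -s link show dev to-lb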

In this context, I also took notice of Jesper's patch which describes a very
similar issue and should help to resolve this:
  commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
  reduce TX drops")

But when repeating the above test with the latest mainline, which includes this
patch, and enabling a qdisc via
  tc qdisc add dev in-lb root sfq perturb 10
the kernel crashed just after starting the second TCP stream (see output below).
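
To rule out a configuration mistake on my side, checking that the qdisc is actually
attached and looking at its counters should be enough (tc from iproute2):

  ip netns exec lb tc -s qdisc show dev in-lb

With Jesper's patch doing its job, I would expect this to show backlog and requeues on
the sfq rather than a growing drop counter on the veth device.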

So I have two questions:
- Is my understanding of the described issue correct and is Jesper's patch
  sufficient to solve this?
- Is my qdisc configuration correct for making use of this patch, and is the
  kernel crash likely a bug?

------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:12
index 65535 is out of range for type 'sfq_head [128]'
CPU: 1 UID: 0 PID: 24 Comm: ksoftirqd/1 Not tainted 6.15.0+ #1 PREEMPT(voluntary) 
Hardware name: GIGABYTE MP32-AR1-SW-HZ-001/MP32-AR1-00, BIOS F31n (SCP: 2.10.20220810) 09/30/2022
Call trace:
 show_stack+0x24/0x50 (C)
 dump_stack_lvl+0x80/0x140
 dump_stack+0x1c/0x38
 __ubsan_handle_out_of_bounds+0xd0/0x128
 sfq_dequeue+0x37c/0x3e0 [sch_sfq]
 __qdisc_run+0x90/0x760
 net_tx_action+0x1b8/0x3b0
 handle_softirqs+0x13c/0x418
 run_ksoftirqd+0x9c/0xe8
 smpboot_thread_fn+0x1c0/0x2e0
 kthread+0x150/0x230
 ret_from_fork+0x10/0x20
---[ end trace ]---
------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:208:8
index 65535 is out of range for type 'sfq_head [128]'
CPU: 1 UID: 0 PID: 24 Comm: ksoftirqd/1 Not tainted 6.15.0+ #1 PREEMPT(voluntary) 
Hardware name: GIGABYTE MP32-AR1-SW-HZ-001/MP32-AR1-00, BIOS F31n (SCP: 2.10.20220810) 09/30/2022
Call trace:
 show_stack+0x24/0x50 (C)
 dump_stack_lvl+0x80/0x140
 dump_stack+0x1c/0x38
 __ubsan_handle_out_of_bounds+0xd0/0x128
 sfq_dequeue+0x394/0x3e0 [sch_sfq]
 __qdisc_run+0x90/0x760
 net_tx_action+0x1b8/0x3b0
 handle_softirqs+0x13c/0x418
 run_ksoftirqd+0x9c/0xe8
 smpboot_thread_fn+0x1c0/0x2e0
 kthread+0x150/0x230
 ret_from_fork+0x10/0x20
---[ end trace ]---
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000005
Mem abort info:
  ESR = 0x0000000096000004
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x04: level 0 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
  CM = 0, WnR = 0, TnD = 0, TagAccess = 0
  GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=000008002ad67000
[0000000000000005] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 0000000096000004 [#1]  SMP

CPU: Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz

# tc qdisc
qdisc sfq 8001: dev in-lb root refcnt 81 limit 127p quantum 1514b depth 127 divisor 1024 perturb 10sec 

Thanks,
Marcus


* Re: [BUG] veth: TX drops with NAPI enabled and crash in combination with qdisc
  2025-06-04 15:33 [BUG] veth: TX drops with NAPI enabled and crash in combination with qdisc Marcus Wichelmann
@ 2025-06-05 16:15 ` Toke Høiland-Jørgensen
  2025-06-05 16:46   ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-06-05 16:15 UTC (permalink / raw)
  To: Marcus Wichelmann, Jesper Dangaard Brouer, bpf, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, linux-kernel

Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de> writes:

> Hi,
>
> while experimenting with XDP_REDIRECT from a veth-pair to another interface, I
> noticed that the veth-pair loses lots of packets when multiple TCP streams go
> through it, resulting in stalling TCP connections and noticeable instabilities.
>
> This doesn't seem to be an issue with just XDP but rather occurs whenever the
> NAPI mode of the veth driver is active.
> I managed to reproduce the same behavior just by bringing the veth-pair into
> NAPI mode (see commit d3256efd8e8b ("veth: allow enabling NAPI even without
> XDP")) and running multiple TCP streams through it using a network namespace.
>
> Here is how I reproduced it:
>
>   ip netns add lb
>   ip link add dev to-lb type veth peer name in-lb netns lb
>
>   # Enable NAPI
>   ethtool -K to-lb gro on
>   ethtool -K to-lb tso off
>   ip netns exec lb ethtool -K in-lb gro on
>   ip netns exec lb ethtool -K in-lb tso off
>
>   ip link set dev to-lb up
>   ip -netns lb link set dev in-lb up
>
> Then run an HTTP server inside the "lb" namespace that serves a large file:
>
>   fallocate -l 10G testfiles/10GB.bin
>   caddy file-server --root testfiles/
>
> Download this file from within the root namespace multiple times in parallel:
>
>   curl http://[fe80::...%to-lb]/10GB.bin -o /dev/null
>
> In my tests, I ran four parallel curls at the same time and after just a few
> seconds, three of them stalled while the other one "won" over the full bandwidth
> and completed the download.
>
> This is probably a result of the veth's ptr_ring running full, causing many
> packet drops on TX, and the TCP congestion control reacting to that.
>
> In this context, I also took notice of Jesper's patch which describes a very
> similar issue and should help to resolve this:
>   commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
>   reduce TX drops")
>
> But when repeating the above test with the latest mainline, which includes this
> patch, and enabling a qdisc via
>   tc qdisc add dev in-lb root sfq perturb 10
> the kernel crashed just after starting the second TCP stream (see output below).
>
> So I have two questions:
> - Is my understanding of the described issue correct and is Jesper's patch
>   sufficient to solve this?

Hmm, yeah, this does sound likely.

> - Is my qdisc configuration correct for making use of this patch, and is the
>   kernel crash likely a bug?
>
> ------------[ cut here ]------------
> UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:12
> index 65535 is out of range for type 'sfq_head [128]'

This (the 'index 65535') kinda screams "integer underflow". So certainly
looks like a kernel bug, yeah. Don't see any obvious reason why Jesper's
patch would trigger this; maybe Eric has an idea?

Does this happen with other qdiscs as well, or is it specific to sfq?

-Toke



* Re: [BUG] veth: TX drops with NAPI enabled and crash in combination with qdisc
  2025-06-05 16:15 ` Toke Høiland-Jørgensen
@ 2025-06-05 16:46   ` Eric Dumazet
  2025-06-05 22:11     ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2025-06-05 16:46 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Marcus Wichelmann, Jesper Dangaard Brouer, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrew Lunn,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, linux-kernel

On Thu, Jun 5, 2025 at 9:15 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de> writes:
>
> > Hi,
> >
> > while experimenting with XDP_REDIRECT from a veth-pair to another interface, I
> > noticed that the veth-pair loses lots of packets when multiple TCP streams go
> > through it, resulting in stalling TCP connections and noticeable instabilities.
> >
> > This doesn't seem to be an issue with just XDP but rather occurs whenever the
> > NAPI mode of the veth driver is active.
> > I managed to reproduce the same behavior just by bringing the veth-pair into
> > NAPI mode (see commit d3256efd8e8b ("veth: allow enabling NAPI even without
> > XDP")) and running multiple TCP streams through it using a network namespace.
> >
> > Here is how I reproduced it:
> >
> >   ip netns add lb
> >   ip link add dev to-lb type veth peer name in-lb netns lb
> >
> >   # Enable NAPI
> >   ethtool -K to-lb gro on
> >   ethtool -K to-lb tso off
> >   ip netns exec lb ethtool -K in-lb gro on
> >   ip netns exec lb ethtool -K in-lb tso off
> >
> >   ip link set dev to-lb up
> >   ip -netns lb link set dev in-lb up
> >
> > Then run an HTTP server inside the "lb" namespace that serves a large file:
> >
> >   fallocate -l 10G testfiles/10GB.bin
> >   caddy file-server --root testfiles/
> >
> > Download this file from within the root namespace multiple times in parallel:
> >
> >   curl http://[fe80::...%to-lb]/10GB.bin -o /dev/null
> >
> > In my tests, I ran four parallel curls at the same time and after just a few
> > seconds, three of them stalled while the other one "won" over the full bandwidth
> > and completed the download.
> >
> > This is probably a result of the veth's ptr_ring running full, causing many
> > packet drops on TX, and the TCP congestion control reacting to that.
> >
> > In this context, I also took notice of Jesper's patch which describes a very
> > similar issue and should help to resolve this:
> >   commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> >   reduce TX drops")
> >
> > But when repeating the above test with the latest mainline, which includes this
> > patch, and enabling a qdisc via
> >   tc qdisc add dev in-lb root sfq perturb 10
> > the kernel crashed just after starting the second TCP stream (see output below).
> >
> > So I have two questions:
> > - Is my understanding of the described issue correct and is Jesper's patch
> >   sufficient to solve this?
>
> Hmm, yeah, this does sound likely.
>
> > - Is my qdisc configuration correct for making use of this patch, and is the
> >   kernel crash likely a bug?
> >
> > ------------[ cut here ]------------
> > UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:12
> > index 65535 is out of range for type 'sfq_head [128]'
>
> This (the 'index 65535') kinda screams "integer underflow". So certainly
> looks like a kernel bug, yeah. Don't see any obvious reason why Jesper's
> patch would trigger this; maybe Eric has an idea?
>
> Does this happen with other qdiscs as well, or is it specific to sfq?

This seems like a bug in sfq, we already had recent fixes in it, and
other fixes in net/sched vs qdisc_tree_reduce_backlog()

It is possible qdisc_pkt_len() could be wrong in this use case (TSO off ?)


* Re: [BUG] veth: TX drops with NAPI enabled and crash in combination with qdisc
  2025-06-05 16:46   ` Eric Dumazet
@ 2025-06-05 22:11     ` Eric Dumazet
  2025-06-05 22:17       ` Marcus Wichelmann
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2025-06-05 22:11 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Marcus Wichelmann, Jesper Dangaard Brouer, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrew Lunn,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, linux-kernel

On Thu, Jun 5, 2025 at 9:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Jun 5, 2025 at 9:15 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de> writes:
> >
> > > Hi,
> > >
> > > while experimenting with XDP_REDIRECT from a veth-pair to another interface, I
> > > noticed that the veth-pair loses lots of packets when multiple TCP streams go
> > > through it, resulting in stalling TCP connections and noticeable instabilities.
> > >
> > > This doesn't seem to be an issue with just XDP but rather occurs whenever the
> > > NAPI mode of the veth driver is active.
> > > I managed to reproduce the same behavior just by bringing the veth-pair into
> > > NAPI mode (see commit d3256efd8e8b ("veth: allow enabling NAPI even without
> > > XDP")) and running multiple TCP streams through it using a network namespace.
> > >
> > > Here is how I reproduced it:
> > >
> > >   ip netns add lb
> > >   ip link add dev to-lb type veth peer name in-lb netns lb
> > >
> > >   # Enable NAPI
> > >   ethtool -K to-lb gro on
> > >   ethtool -K to-lb tso off
> > >   ip netns exec lb ethtool -K in-lb gro on
> > >   ip netns exec lb ethtool -K in-lb tso off
> > >
> > >   ip link set dev to-lb up
> > >   ip -netns lb link set dev in-lb up
> > >
> > > Then run an HTTP server inside the "lb" namespace that serves a large file:
> > >
> > >   fallocate -l 10G testfiles/10GB.bin
> > >   caddy file-server --root testfiles/
> > >
> > > Download this file from within the root namespace multiple times in parallel:
> > >
> > >   curl http://[fe80::...%to-lb]/10GB.bin -o /dev/null
> > >
> > > In my tests, I ran four parallel curls at the same time and after just a few
> > > seconds, three of them stalled while the other one "won" over the full bandwidth
> > > and completed the download.
> > >
> > > This is probably a result of the veth's ptr_ring running full, causing many
> > > packet drops on TX, and the TCP congestion control reacting to that.
> > >
> > > In this context, I also took notice of Jesper's patch which describes a very
> > > similar issue and should help to resolve this:
> > >   commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> > >   reduce TX drops")
> > >
> > > But when repeating the above test with the latest mainline, which includes this
> > > patch, and enabling a qdisc via
> > >   tc qdisc add dev in-lb root sfq perturb 10
> > > the kernel crashed just after starting the second TCP stream (see output below).
> > >
> > > So I have two questions:
> > > - Is my understanding of the described issue correct and is Jesper's patch
> > >   sufficient to solve this?
> >
> > Hmm, yeah, this does sound likely.
> >
> > > - Is my qdisc configuration correct for making use of this patch, and is the
> > >   kernel crash likely a bug?
> > >
> > > ------------[ cut here ]------------
> > > UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:12
> > > index 65535 is out of range for type 'sfq_head [128]'
> >
> > This (the 'index 65535') kinda screams "integer underflow". So certainly
> > looks like a kernel bug, yeah. Don't see any obvious reason why Jesper's
> > patch would trigger this; maybe Eric has an idea?
> >
> > Does this happen with other qdiscs as well, or is it specific to sfq?
>
> This seems like a bug in sfq, we already had recent fixes in it, and
> other fixes in net/sched vs qdisc_tree_reduce_backlog()
>
> It is possible qdisc_pkt_len() could be wrong in this use case (TSO off ?)

This seems to be a very old bug, indeed caused by sch->gso_skb
contribution to sch->q.qlen

diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index b912ad99aa15d95b297fb28d0fd0baa9c21ab5cd..77fa02f2bfcd56a36815199aa2e7987943ea226f 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -310,7 +310,10 @@ static unsigned int sfq_drop(struct Qdisc *sch, struct sk_buff **to_free)
                /* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */
                x = q->tail->next;
                slot = &q->slots[x];
-               q->tail->next = slot->next;
+               if (slot->next == x)
+                       q->tail = NULL; /* no more active slots */
+               else
+                       q->tail->next = slot->next;
                q->ht[slot->hash] = SFQ_EMPTY_SLOT;
                goto drop;
        }


* Re: [BUG] veth: TX drops with NAPI enabled and crash in combination with qdisc
  2025-06-05 22:11     ` Eric Dumazet
@ 2025-06-05 22:17       ` Marcus Wichelmann
  2025-06-06  9:06         ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Marcus Wichelmann @ 2025-06-05 22:17 UTC (permalink / raw)
  To: Eric Dumazet, Toke Høiland-Jørgensen
  Cc: Jesper Dangaard Brouer, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrew Lunn, David S. Miller,
	Jakub Kicinski, Paolo Abeni, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, linux-kernel

On 06.06.25 at 00:11, Eric Dumazet wrote:
> On Thu, Jun 5, 2025 at 9:46 AM Eric Dumazet <edumazet@google.com> wrote:
>>
>> On Thu, Jun 5, 2025 at 9:15 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>
>>> Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de> writes:
>>>
>>>> Hi,
>>>>
>>>> while experimenting with XDP_REDIRECT from a veth-pair to another interface, I
>>>> noticed that the veth-pair loses lots of packets when multiple TCP streams go
>>>> through it, resulting in stalling TCP connections and noticeable instabilities.
>>>>
>>>> This doesn't seem to be an issue with just XDP but rather occurs whenever the
>>>> NAPI mode of the veth driver is active.
>>>> I managed to reproduce the same behavior just by bringing the veth-pair into
>>>> NAPI mode (see commit d3256efd8e8b ("veth: allow enabling NAPI even without
>>>> XDP")) and running multiple TCP streams through it using a network namespace.
>>>>
>>>> Here is how I reproduced it:
>>>>
>>>>   ip netns add lb
>>>>   ip link add dev to-lb type veth peer name in-lb netns lb
>>>>
>>>>   # Enable NAPI
>>>>   ethtool -K to-lb gro on
>>>>   ethtool -K to-lb tso off
>>>>   ip netns exec lb ethtool -K in-lb gro on
>>>>   ip netns exec lb ethtool -K in-lb tso off
>>>>
>>>>   ip link set dev to-lb up
>>>>   ip -netns lb link set dev in-lb up
>>>>
>>>> Then run an HTTP server inside the "lb" namespace that serves a large file:
>>>>
>>>>   fallocate -l 10G testfiles/10GB.bin
>>>>   caddy file-server --root testfiles/
>>>>
>>>> Download this file from within the root namespace multiple times in parallel:
>>>>
>>>>   curl http://[fe80::...%to-lb]/10GB.bin -o /dev/null
>>>>
>>>> In my tests, I ran four parallel curls at the same time and after just a few
>>>> seconds, three of them stalled while the other one "won" over the full bandwidth
>>>> and completed the download.
>>>>
>>>> This is probably a result of the veth's ptr_ring running full, causing many
>>>> packet drops on TX, and the TCP congestion control reacting to that.
>>>>
>>>> In this context, I also took notice of Jesper's patch which describes a very
>>>> similar issue and should help to resolve this:
>>>>   commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
>>>>   reduce TX drops")
>>>>
>>>> But when repeating the above test with the latest mainline, which includes this
>>>> patch, and enabling a qdisc via
>>>>   tc qdisc add dev in-lb root sfq perturb 10
>>>> the kernel crashed just after starting the second TCP stream (see output below).
>>>>
>>>> So I have two questions:
>>>> - Is my understanding of the described issue correct and is Jesper's patch
>>>>   sufficient to solve this?
>>>
>>> Hmm, yeah, this does sound likely.
>>>
>>>> - Is my qdisc configuration correct for making use of this patch, and is the
>>>>   kernel crash likely a bug?
>>>>
>>>> ------------[ cut here ]------------
>>>> UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:12
>>>> index 65535 is out of range for type 'sfq_head [128]'
>>>
>>> This (the 'index 65535') kinda screams "integer underflow". So certainly
>>> looks like a kernel bug, yeah. Don't see any obvious reason why Jesper's
>>> patch would trigger this; maybe Eric has an idea?
>>>
>>> Does this happen with other qdiscs as well, or is it specific to sfq?
>>
>> This seems like a bug in sfq, we already had recent fixes in it, and
>> other fixes in net/sched vs qdisc_tree_reduce_backlog()
>>
>> It is possible qdisc_pkt_len() could be wrong in this use case (TSO off ?)
> 
> This seems to be a very old bug, indeed caused by sch->gso_skb
> contribution to sch->q.qlen
> 
> diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
> index b912ad99aa15d95b297fb28d0fd0baa9c21ab5cd..77fa02f2bfcd56a36815199aa2e7987943ea226f 100644
> --- a/net/sched/sch_sfq.c
> +++ b/net/sched/sch_sfq.c
> @@ -310,7 +310,10 @@ static unsigned int sfq_drop(struct Qdisc *sch, struct sk_buff **to_free)
>                 /* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */
>                 x = q->tail->next;
>                 slot = &q->slots[x];
> -               q->tail->next = slot->next;
> +               if (slot->next == x)
> +                       q->tail = NULL; /* no more active slots */
> +               else
> +                       q->tail->next = slot->next;
>                 q->ht[slot->hash] = SFQ_EMPTY_SLOT;
>                 goto drop;
>         }
> 

Hi,

thank you for looking into it.
I'll give your patch a try and will also do tests with other qdiscs when I'm back
in the office.

Marcus


* Re: [BUG] veth: TX drops with NAPI enabled and crash in combination with qdisc
  2025-06-05 22:17       ` Marcus Wichelmann
@ 2025-06-06  9:06         ` Eric Dumazet
  2025-06-10 14:40           ` Marcus Wichelmann
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2025-06-06  9:06 UTC (permalink / raw)
  To: Marcus Wichelmann
  Cc: Toke Høiland-Jørgensen, Jesper Dangaard Brouer, bpf,
	netdev, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Lunn, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, linux-kernel

On Thu, Jun 5, 2025 at 3:17 PM Marcus Wichelmann
<marcus.wichelmann@hetzner-cloud.de> wrote:
>
> On 06.06.25 at 00:11, Eric Dumazet wrote:
> > On Thu, Jun 5, 2025 at 9:46 AM Eric Dumazet <edumazet@google.com> wrote:
> >>
> >> On Thu, Jun 5, 2025 at 9:15 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>>
> >>> Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de> writes:
> >>>
> >>>> Hi,
> >>>>
> >>>> while experimenting with XDP_REDIRECT from a veth-pair to another interface, I
> >>>> noticed that the veth-pair loses lots of packets when multiple TCP streams go
> >>>> through it, resulting in stalling TCP connections and noticeable instabilities.
> >>>>
> >>>> This doesn't seem to be an issue with just XDP but rather occurs whenever the
> >>>> NAPI mode of the veth driver is active.
> >>>> I managed to reproduce the same behavior just by bringing the veth-pair into
> >>>> NAPI mode (see commit d3256efd8e8b ("veth: allow enabling NAPI even without
> >>>> XDP")) and running multiple TCP streams through it using a network namespace.
> >>>>
> >>>> Here is how I reproduced it:
> >>>>
> >>>>   ip netns add lb
> >>>>   ip link add dev to-lb type veth peer name in-lb netns lb
> >>>>
> >>>>   # Enable NAPI
> >>>>   ethtool -K to-lb gro on
> >>>>   ethtool -K to-lb tso off
> >>>>   ip netns exec lb ethtool -K in-lb gro on
> >>>>   ip netns exec lb ethtool -K in-lb tso off
> >>>>
> >>>>   ip link set dev to-lb up
> >>>>   ip -netns lb link set dev in-lb up
> >>>>
> >>>> Then run an HTTP server inside the "lb" namespace that serves a large file:
> >>>>
> >>>>   fallocate -l 10G testfiles/10GB.bin
> >>>>   caddy file-server --root testfiles/
> >>>>
> >>>> Download this file from within the root namespace multiple times in parallel:
> >>>>
> >>>>   curl http://[fe80::...%to-lb]/10GB.bin -o /dev/null
> >>>>
> >>>> In my tests, I ran four parallel curls at the same time and after just a few
> >>>> seconds, three of them stalled while the other one "won" over the full bandwidth
> >>>> and completed the download.
> >>>>
> >>>> This is probably a result of the veth's ptr_ring running full, causing many
> >>>> packet drops on TX, and the TCP congestion control reacting to that.
> >>>>
> >>>> In this context, I also took notice of Jesper's patch which describes a very
> >>>> similar issue and should help to resolve this:
> >>>>   commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> >>>>   reduce TX drops")
> >>>>
> >>>> But when repeating the above test with the latest mainline, which includes this
> >>>> patch, and enabling a qdisc via
> >>>>   tc qdisc add dev in-lb root sfq perturb 10
> >>>> the kernel crashed just after starting the second TCP stream (see output below).
> >>>>
> >>>> So I have two questions:
> >>>> - Is my understanding of the described issue correct and is Jesper's patch
> >>>>   sufficient to solve this?
> >>>
> >>> Hmm, yeah, this does sound likely.
> >>>
> >>>> - Is my qdisc configuration correct for making use of this patch, and is the
> >>>>   kernel crash likely a bug?
> >>>>
> >>>> ------------[ cut here ]------------
> >>>> UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:12
> >>>> index 65535 is out of range for type 'sfq_head [128]'
> >>>
> >>> This (the 'index 65535') kinda screams "integer underflow". So certainly
> >>> looks like a kernel bug, yeah. Don't see any obvious reason why Jesper's
> >>> patch would trigger this; maybe Eric has an idea?
> >>>
> >>> Does this happen with other qdiscs as well, or is it specific to sfq?
> >>
> >> This seems like a bug in sfq, we already had recent fixes in it, and
> >> other fixes in net/sched vs qdisc_tree_reduce_backlog()
> >>
> >> It is possible qdisc_pkt_len() could be wrong in this use case (TSO off ?)
> >
> > This seems to be a very old bug, indeed caused by sch->gso_skb
> > contribution to sch->q.qlen
> >
> > diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
> > index b912ad99aa15d95b297fb28d0fd0baa9c21ab5cd..77fa02f2bfcd56a36815199aa2e7987943ea226f 100644
> > --- a/net/sched/sch_sfq.c
> > +++ b/net/sched/sch_sfq.c
> > @@ -310,7 +310,10 @@ static unsigned int sfq_drop(struct Qdisc *sch, struct sk_buff **to_free)
> >                 /* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */
> >                 x = q->tail->next;
> >                 slot = &q->slots[x];
> > -               q->tail->next = slot->next;
> > +               if (slot->next == x)
> > +                       q->tail = NULL; /* no more active slots */
> > +               else
> > +                       q->tail->next = slot->next;
> >                 q->ht[slot->hash] = SFQ_EMPTY_SLOT;
> >                 goto drop;
> >         }
> >
>
> Hi,
>
> thank you for looking into it.
> I'll give your patch a try and will also do tests with other qdiscs when I'm back
> in the office.
>

I have been using this repro:

ip netns add lb

ip link add dev to-lb type veth peer name in-lb netns lb

# force qdisc to requeue gso_skb
ethtool -K to-lb tso off

# Enable NAPI
ip netns exec lb ethtool -K in-lb gro on

ip link set dev to-lb up
ip -netns lb link set dev in-lb up

ip addr add dev to-lb 192.168.20.1/24
ip -netns lb addr add dev in-lb 192.168.20.2/24

tc qdisc replace dev to-lb root sfq limit 100

ip netns exec lb netserver

netperf -H 192.168.20.2 -l 100 &
netperf -H 192.168.20.2 -l 100 &
netperf -H 192.168.20.2 -l 100 &
netperf -H 192.168.20.2 -l 100 &
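
To tell whether the bug triggers (or, with a fixed kernel, whether sfq just keeps a
backlog and requeues), it should be enough to watch the qdisc counters and the kernel
log while the four flows run (tc from iproute2 assumed):

tc -s qdisc show dev to-lb
dmesg -w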


* Re: [BUG] veth: TX drops with NAPI enabled and crash in combination with qdisc
  2025-06-06  9:06         ` Eric Dumazet
@ 2025-06-10 14:40           ` Marcus Wichelmann
  2025-06-10 14:49             ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Marcus Wichelmann @ 2025-06-10 14:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Toke Høiland-Jørgensen, Jesper Dangaard Brouer, bpf,
	netdev, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Lunn, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, linux-kernel

On 06.06.25 at 11:06, Eric Dumazet wrote:
> On Thu, Jun 5, 2025 at 3:17 PM Marcus Wichelmann
> <marcus.wichelmann@hetzner-cloud.de> wrote:
>>
>> On 06.06.25 at 00:11, Eric Dumazet wrote:
>>> On Thu, Jun 5, 2025 at 9:46 AM Eric Dumazet <edumazet@google.com> wrote:
>>>>
>>>> On Thu, Jun 5, 2025 at 9:15 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>>>
>>>>> Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de> writes:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> while experimenting with XDP_REDIRECT from a veth-pair to another interface, I
>>>>>> noticed that the veth-pair loses lots of packets when multiple TCP streams go
>>>>>> through it, resulting in stalling TCP connections and noticeable instabilities.
>>>>>>
>>>>>> This doesn't seem to be an issue with just XDP but rather occurs whenever the
>>>>>> NAPI mode of the veth driver is active.
>>>>>> I managed to reproduce the same behavior just by bringing the veth-pair into
>>>>>> NAPI mode (see commit d3256efd8e8b ("veth: allow enabling NAPI even without
>>>>>> XDP")) and running multiple TCP streams through it using a network namespace.
>>>>>>
>>>>>> Here is how I reproduced it:
>>>>>>
>>>>>>   ip netns add lb
>>>>>>   ip link add dev to-lb type veth peer name in-lb netns lb
>>>>>>
>>>>>>   # Enable NAPI
>>>>>>   ethtool -K to-lb gro on
>>>>>>   ethtool -K to-lb tso off
>>>>>>   ip netns exec lb ethtool -K in-lb gro on
>>>>>>   ip netns exec lb ethtool -K in-lb tso off
>>>>>>
>>>>>>   ip link set dev to-lb up
>>>>>>   ip -netns lb link set dev in-lb up
>>>>>>
>>>>>> Then run an HTTP server inside the "lb" namespace that serves a large file:
>>>>>>
>>>>>>   fallocate -l 10G testfiles/10GB.bin
>>>>>>   caddy file-server --root testfiles/
>>>>>>
>>>>>> Download this file from within the root namespace multiple times in parallel:
>>>>>>
>>>>>>   curl http://[fe80::...%to-lb]/10GB.bin -o /dev/null
>>>>>>
>>>>>> In my tests, I ran four parallel curls at the same time and after just a few
>>>>>> seconds, three of them stalled while the other one "won" over the full bandwidth
>>>>>> and completed the download.
>>>>>>
>>>>>> This is probably a result of the veth's ptr_ring running full, causing many
>>>>>> packet drops on TX, and the TCP congestion control reacting to that.
>>>>>>
>>>>>> In this context, I also took notice of Jesper's patch which describes a very
>>>>>> similar issue and should help to resolve this:
>>>>>>   commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
>>>>>>   reduce TX drops")
>>>>>>
>>>>>> But when repeating the above test with the latest mainline, which includes this
>>>>>> patch, and enabling a qdisc via
>>>>>>   tc qdisc add dev in-lb root sfq perturb 10
>>>>>> the kernel crashed just after starting the second TCP stream (see output below).
>>>>>>
>>>>>> So I have two questions:
>>>>>> - Is my understanding of the described issue correct and is Jesper's patch
>>>>>>   sufficient to solve this?
>>>>>
>>>>> Hmm, yeah, this does sound likely.
>>>>>
>>>>>> - Is my qdisc configuration correct for making use of this patch, and is the
>>>>>>   kernel crash likely a bug?
>>>>>>
>>>>>> ------------[ cut here ]------------
>>>>>> UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:12
>>>>>> index 65535 is out of range for type 'sfq_head [128]'
>>>>>
>>>>> This (the 'index 65535') kinda screams "integer underflow". So certainly
>>>>> looks like a kernel bug, yeah. Don't see any obvious reason why Jesper's
>>>>> patch would trigger this; maybe Eric has an idea?
>>>>>
>>>>> Does this happen with other qdiscs as well, or is it specific to sfq?
>>>>
>>>> This seems like a bug in sfq, we already had recent fixes in it, and
>>>> other fixes in net/sched vs qdisc_tree_reduce_backlog()
>>>>
>>>> It is possible qdisc_pkt_len() could be wrong in this use case (TSO off ?)
>>>
>>> This seems to be a very old bug, indeed caused by sch->gso_skb
>>> contribution to sch->q.qlen
>>>
>>> diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
>>> index b912ad99aa15d95b297fb28d0fd0baa9c21ab5cd..77fa02f2bfcd56a36815199aa2e7987943ea226f 100644
>>> --- a/net/sched/sch_sfq.c
>>> +++ b/net/sched/sch_sfq.c
>>> @@ -310,7 +310,10 @@ static unsigned int sfq_drop(struct Qdisc *sch, struct sk_buff **to_free)
>>>                 /* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */
>>>                 x = q->tail->next;
>>>                 slot = &q->slots[x];
>>> -               q->tail->next = slot->next;
>>> +               if (slot->next == x)
>>> +                       q->tail = NULL; /* no more active slots */
>>> +               else
>>> +                       q->tail->next = slot->next;
>>>                 q->ht[slot->hash] = SFQ_EMPTY_SLOT;
>>>                 goto drop;
>>>         }
>>>
>>
>> Hi,
>>
>> thank you for looking into it.
>> I'll give your patch a try and will also do tests with other qdiscs when I'm back
>> in the office.
>>
> 
> I have been using this repro:
> 
> [...]

Hi,

I can confirm that the sfq qdisc is now stable in this setup, thanks to your fix.

I also experimented with other qdiscs and fq_codel works as well.
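
For anyone who wants to repeat that test, a command along these lines should do (I
would not expect the exact fq_codel parameters to matter much here):

  ip netns exec lb tc qdisc replace dev in-lb root fq_codel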

The sfq and fq_codel qdiscs now work hand in hand with Jesper's patch to resolve the original
issue. Multiple TCP connections run very stably, even when NAPI/XDP is active on the veth
device, and I can see that the packets are being requeued instead of being dropped in the
veth driver.

Thank you for your help!

Marcus


* Re: [BUG] veth: TX drops with NAPI enabled and crash in combination with qdisc
  2025-06-10 14:40           ` Marcus Wichelmann
@ 2025-06-10 14:49             ` Eric Dumazet
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2025-06-10 14:49 UTC (permalink / raw)
  To: Marcus Wichelmann
  Cc: Toke Høiland-Jørgensen, Jesper Dangaard Brouer, bpf,
	netdev, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Lunn, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, linux-kernel

On Tue, Jun 10, 2025 at 7:41 AM Marcus Wichelmann
<marcus.wichelmann@hetzner-cloud.de> wrote:
>
> On 06.06.25 at 11:06, Eric Dumazet wrote:
> > On Thu, Jun 5, 2025 at 3:17 PM Marcus Wichelmann
> > <marcus.wichelmann@hetzner-cloud.de> wrote:
> >>
> >> On 06.06.25 at 00:11, Eric Dumazet wrote:
> >>> On Thu, Jun 5, 2025 at 9:46 AM Eric Dumazet <edumazet@google.com> wrote:
> >>>>
> >>>> On Thu, Jun 5, 2025 at 9:15 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>>>>
> >>>>> Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de> writes:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> while experimenting with XDP_REDIRECT from a veth-pair to another interface, I
> >>>>>> noticed that the veth-pair loses lots of packets when multiple TCP streams go
> >>>>>> through it, resulting in stalling TCP connections and noticeable instabilities.
> >>>>>>
> >>>>>> This doesn't seem to be an issue with just XDP but rather occurs whenever the
> >>>>>> NAPI mode of the veth driver is active.
> >>>>>> I managed to reproduce the same behavior just by bringing the veth-pair into
> >>>>>> NAPI mode (see commit d3256efd8e8b ("veth: allow enabling NAPI even without
> >>>>>> XDP")) and running multiple TCP streams through it using a network namespace.
> >>>>>>
> >>>>>> Here is how I reproduced it:
> >>>>>>
> >>>>>>   ip netns add lb
> >>>>>>   ip link add dev to-lb type veth peer name in-lb netns lb
> >>>>>>
> >>>>>>   # Enable NAPI
> >>>>>>   ethtool -K to-lb gro on
> >>>>>>   ethtool -K to-lb tso off
> >>>>>>   ip netns exec lb ethtool -K in-lb gro on
> >>>>>>   ip netns exec lb ethtool -K in-lb tso off
> >>>>>>
> >>>>>>   ip link set dev to-lb up
> >>>>>>   ip -netns lb link set dev in-lb up
> >>>>>>
> >>>>>> Then run an HTTP server inside the "lb" namespace that serves a large file:
> >>>>>>
> >>>>>>   fallocate -l 10G testfiles/10GB.bin
> >>>>>>   caddy file-server --root testfiles/
> >>>>>>
> >>>>>> Download this file from within the root namespace multiple times in parallel:
> >>>>>>
> >>>>>>   curl http://[fe80::...%to-lb]/10GB.bin -o /dev/null
> >>>>>>
> >>>>>> In my tests, I ran four parallel curls at the same time and after just a few
> >>>>>> seconds, three of them stalled while the other one "won" over the full bandwidth
> >>>>>> and completed the download.
> >>>>>>
> >>>>>> This is probably a result of the veth's ptr_ring running full, causing many
> >>>>>> packet drops on TX, and the TCP congestion control reacting to that.
> >>>>>>
> >>>>>> In this context, I also took notice of Jesper's patch which describes a very
> >>>>>> similar issue and should help to resolve this:
> >>>>>>   commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> >>>>>>   reduce TX drops")
> >>>>>>
> >>>>>> But when repeating the above test with the latest mainline, which includes this
> >>>>>> patch, and enabling a qdisc via
> >>>>>>   tc qdisc add dev in-lb root sfq perturb 10
> >>>>>> the kernel crashed just after starting the second TCP stream (see output below).
> >>>>>>
> >>>>>> So I have two questions:
> >>>>>> - Is my understanding of the described issue correct and is Jesper's patch
> >>>>>>   sufficient to solve this?
> >>>>>
> >>>>> Hmm, yeah, this does sound likely.
> >>>>>
> >>>>>> - Is my qdisc configuration correct for making use of this patch, and is the
> >>>>>>   kernel crash likely a bug?
> >>>>>>
> >>>>>> ------------[ cut here ]------------
> >>>>>> UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:12
> >>>>>> index 65535 is out of range for type 'sfq_head [128]'
> >>>>>
> >>>>> This (the 'index 65535') kinda screams "integer underflow". So certainly
> >>>>> looks like a kernel bug, yeah. Don't see any obvious reason why Jesper's
> >>>>> patch would trigger this; maybe Eric has an idea?
> >>>>>
> >>>>> Does this happen with other qdiscs as well, or is it specific to sfq?
> >>>>
> >>>> This seems like a bug in sfq, we already had recent fixes in it, and
> >>>> other fixes in net/sched vs qdisc_tree_reduce_backlog()
> >>>>
> >>>> It is possible qdisc_pkt_len() could be wrong in this use case (TSO off ?)
> >>>
> >>> This seems to be a very old bug, indeed caused by sch->gso_skb
> >>> contribution to sch->q.qlen
> >>>
> >>> diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
> >>> index b912ad99aa15d95b297fb28d0fd0baa9c21ab5cd..77fa02f2bfcd56a36815199aa2e7987943ea226f 100644
> >>> --- a/net/sched/sch_sfq.c
> >>> +++ b/net/sched/sch_sfq.c
> >>> @@ -310,7 +310,10 @@ static unsigned int sfq_drop(struct Qdisc *sch, struct sk_buff **to_free)
> >>>                 /* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */
> >>>                 x = q->tail->next;
> >>>                 slot = &q->slots[x];
> >>> -               q->tail->next = slot->next;
> >>> +               if (slot->next == x)
> >>> +                       q->tail = NULL; /* no more active slots */
> >>> +               else
> >>> +                       q->tail->next = slot->next;
> >>>                 q->ht[slot->hash] = SFQ_EMPTY_SLOT;
> >>>                 goto drop;
> >>>         }
> >>>
> >>
> >> Hi,
> >>
> >> thank you for looking into it.
> >> I'll give your patch a try and will also do tests with other qdiscs when I'm back
> >> in the office.
> >>
> >
> > I have been using this repro:
> >
> > [...]
>
> Hi,
>
> I can confirm that the sfq qdisc is now stable in this setup, thanks to your fix.
>
> I also experimented with other qdiscs and fq_codel works as well.
>
> The sfq and fq_codel qdiscs now work hand in hand with Jesper's patch to resolve the original
> issue. Multiple TCP connections run very stably, even when NAPI/XDP is active on the veth
> device, and I can see that the packets are being requeued instead of being dropped in the
> veth driver.
>
> Thank you for your help!

Well, thanks to you for providing a very clean report, including repro
instructions!

