[PATCH] net: mvneta: fix Tx interrupt delay

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] net: mvneta: fix Tx interrupt delay
@ 2014-12-02  7:13 Willy Tarreau
  2014-12-02 12:18 ` Eric Dumazet
  2014-12-09  1:44 ` David Miller
  0 siblings, 2 replies; 14+ messages in thread
From: Willy Tarreau @ 2014-12-02  7:13 UTC (permalink / raw)
  To: netdev; +Cc: Maggie Mae Roxas, Thomas Petazzoni, Gregory CLEMENT,
	Ezequiel Garcia

The mvneta driver sets the amount of Tx coalesce packets to 16 by
default. Normally that does not cause any trouble since the driver
uses a much larger Tx ring size (532 packets). But some sockets
might run with very small buffers, much smaller than the equivalent
of 16 packets. This is what ping is doing for example, by setting
SNDBUF to 324 bytes rounded up to 2kB by the kernel.

The problem is that there is no documented method to force a specific
packet to emit an interrupt (eg: the last of the ring) nor is it
possible to make the NIC emit an interrupt after a given delay.

In this case, it causes trouble, because when ping sends packets over
its raw socket, the few first packets leave the system, and the first
15 packets will be emitted without an IRQ being generated, so without
the skbs being freed. And since the socket's buffer is small, there's
no way to reach that amount of packets, and the ping ends up with
"send: no buffer available" after sending 6 packets. Running with 3
instances of ping in parallel is enough to hide the problem, because
with 6 packets per instance, that's 18 packets total, which is enough
to grant a Tx interrupt before all are sent.

The original driver in the LSP kernel worked around this design flaw
by using a software timer to clean up the Tx descriptors. This timer
was slow and caused terrible network performance on some Tx-bound
workloads (such as routing) but was enough to make tools like ping
work correctly.

Instead here, we simply set the packet counts before interrupt to 1.
This ensures that each packet sent will produce an interrupt. NAPI
takes care of coalescing interrupts since the interrupt is disabled
once generated.

No measurable performance impact nor CPU usage were observed on small
nor large packets, including when saturating the link on Tx, and this
fixes tools like ping which rely on too small a send buffer. If one
wants to increase this value for certain workloads where it is safe
to do so, "ethtool -C $dev tx-frames" will override this default
setting.

This fix needs to be applied to stable kernels starting with 3.10.

Tested-By: Maggie Mae Roxas <maggie.mae.roxas@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

---
 drivers/net/ethernet/marvell/mvneta.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 4762994..35bfba7 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -214,7 +214,7 @@
 /* Various constants */

 /* Coalescing */
-#define MVNETA_TXDONE_COAL_PKTS		16
+#define MVNETA_TXDONE_COAL_PKTS		1
 #define MVNETA_RX_COAL_PKTS		32
 #define MVNETA_RX_COAL_USEC		100

-- 
1.7.12.2.21.g234cd45.dirty

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] net: mvneta: fix Tx interrupt delay
  2014-12-02  7:13 [PATCH] net: mvneta: fix Tx interrupt delay Willy Tarreau
@ 2014-12-02 12:18 ` Eric Dumazet
  2014-12-02 12:30   ` [PATCH] net: mvneta: fix race condition in mvneta_tx() Eric Dumazet
                     ` (2 more replies)
  2014-12-09  1:44 ` David Miller
  1 sibling, 3 replies; 14+ messages in thread
From: Eric Dumazet @ 2014-12-02 12:18 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: netdev, Maggie Mae Roxas, Thomas Petazzoni, Gregory CLEMENT,
	Ezequiel Garcia

On Tue, 2014-12-02 at 08:13 +0100, Willy Tarreau wrote:
> The mvneta driver sets the amount of Tx coalesce packets to 16 by
> default. Normally that does not cause any trouble since the driver
> uses a much larger Tx ring size (532 packets). But some sockets
> might run with very small buffers, much smaller than the equivalent
> of 16 packets. This is what ping is doing for example, by setting
> SNDBUF to 324 bytes rounded up to 2kB by the kernel.
> 
> The problem is that there is no documented method to force a specific
> packet to emit an interrupt (eg: the last of the ring) nor is it
> possible to make the NIC emit an interrupt after a given delay.
> 
> In this case, it causes trouble, because when ping sends packets over
> its raw socket, the few first packets leave the system, and the first
> 15 packets will be emitted without an IRQ being generated, so without
> the skbs being freed. And since the socket's buffer is small, there's
> no way to reach that amount of packets, and the ping ends up with
> "send: no buffer available" after sending 6 packets. Running with 3
> instances of ping in parallel is enough to hide the problem, because
> with 6 packets per instance, that's 18 packets total, which is enough
> to grant a Tx interrupt before all are sent.
> 
> The original driver in the LSP kernel worked around this design flaw
> by using a software timer to clean up the Tx descriptors. This timer
> was slow and caused terrible network performance on some Tx-bound
> workloads (such as routing) but was enough to make tools like ping
> work correctly.
> 
> Instead here, we simply set the packet counts before interrupt to 1.
> This ensures that each packet sent will produce an interrupt. NAPI
> takes care of coalescing interrupts since the interrupt is disabled
> once generated.
> 
> No measurable performance impact nor CPU usage were observed on small
> nor large packets, including when saturating the link on Tx, and this
> fixes tools like ping which rely on too small a send buffer. If one
> wants to increase this value for certain workloads where it is safe
> to do so, "ethtool -C $dev tx-frames" will override this default
> setting.
> 
> This fix needs to be applied to stable kernels starting with 3.10.
> 
> Tested-By: Maggie Mae Roxas <maggie.mae.roxas@gmail.com>
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> 
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 4762994..35bfba7 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -214,7 +214,7 @@
>  /* Various constants */
>  
>  /* Coalescing */
> -#define MVNETA_TXDONE_COAL_PKTS		16
> +#define MVNETA_TXDONE_COAL_PKTS		1
>  #define MVNETA_RX_COAL_PKTS		32
>  #define MVNETA_RX_COAL_USEC		100
>  


I am surprised TCP even worked correctly with this problem.

I highly suggest BQL for this driver, now this issue is fixed.

I wonder if this high setting was not because of race conditions in the
driver :

mvneta_tx() seems to access skb->len too late, TX completion might have
already freed skb :

                u64_stats_update_begin(&stats->syncp);
                stats->tx_packets++;
                stats->tx_bytes  += skb->len;         // potential use after free
                u64_stats_update_end(&stats->syncp);

Thanks Willy !

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH] net: mvneta: fix race condition in mvneta_tx()
  2014-12-02 12:18 ` Eric Dumazet
@ 2014-12-02 12:30   ` Eric Dumazet
  2014-12-02 12:37     ` Eric Dumazet
  2014-12-02 12:39   ` [PATCH] net: mvneta: fix Tx interrupt delay Willy Tarreau
  2014-12-02 17:12   ` Ezequiel Garcia
  2 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2014-12-02 12:30 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: netdev, Maggie Mae Roxas, Thomas Petazzoni, Gregory CLEMENT,
	Ezequiel Garcia

From: Eric Dumazet <edumazet@google.com>

mvneta_tx() dereferences skb to get skb->len too late,
as hardware might have completed the transmit and TX completion
could have freed the skb from another cpu.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/marvell/mvneta.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index ccc3ce2e8c8c7c8d929b2ff5e271bf62948c7012..34f58150d693b4215b58285c1dcbb4b16cc03697 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1721,6 +1721,7 @@ static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
 	u16 txq_id = skb_get_queue_mapping(skb);
 	struct mvneta_tx_queue *txq = &pp->txqs[txq_id];
 	struct mvneta_tx_desc *tx_desc;
+	int len = skb->len;
 	int frags = 0;
 	u32 tx_cmd;
 
@@ -1788,7 +1789,7 @@ out:
 
 		u64_stats_update_begin(&stats->syncp);
 		stats->tx_packets++;
-		stats->tx_bytes  += skb->len;
+		stats->tx_bytes  += len;
 		u64_stats_update_end(&stats->syncp);
 	} else {
 		dev->stats.tx_dropped++;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] net: mvneta: fix race condition in mvneta_tx()
  2014-12-02 12:30   ` [PATCH] net: mvneta: fix race condition in mvneta_tx() Eric Dumazet
@ 2014-12-02 12:37     ` Eric Dumazet
  2014-12-02 12:40       ` Willy Tarreau
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2014-12-02 12:37 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: netdev, Maggie Mae Roxas, Thomas Petazzoni, Gregory CLEMENT,
	Ezequiel Garcia

On Tue, 2014-12-02 at 04:30 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> mvneta_tx() dereferences skb to get skb->len too late,
> as hardware might have completed the transmit and TX completion
> could have freed the skb from another cpu.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>



For completeness, this bug was added in linux-3.14 and seems a stable
candidate.

Fixes: 71f6d1b31fb1 ("net: mvneta: replace Tx timer with a real interrupt")

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] net: mvneta: fix Tx interrupt delay
  2014-12-02 12:18 ` Eric Dumazet
  2014-12-02 12:30   ` [PATCH] net: mvneta: fix race condition in mvneta_tx() Eric Dumazet
@ 2014-12-02 12:39   ` Willy Tarreau
  2014-12-02 13:04     ` Eric Dumazet
  2014-12-02 17:12   ` Ezequiel Garcia
  2 siblings, 1 reply; 14+ messages in thread
From: Willy Tarreau @ 2014-12-02 12:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Maggie Mae Roxas, Thomas Petazzoni, Gregory CLEMENT,
	Ezequiel Garcia

Hi Eric,

On Tue, Dec 02, 2014 at 04:18:08AM -0800, Eric Dumazet wrote:
> > diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> > index 4762994..35bfba7 100644
> > --- a/drivers/net/ethernet/marvell/mvneta.c
> > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > @@ -214,7 +214,7 @@
> >  /* Various constants */
> >  
> >  /* Coalescing */
> > -#define MVNETA_TXDONE_COAL_PKTS		16
> > +#define MVNETA_TXDONE_COAL_PKTS		1
> >  #define MVNETA_RX_COAL_PKTS		32
> >  #define MVNETA_RX_COAL_USEC		100
> >  
> 
> 
> I am surprised TCP even worked correctly with this problem.

There are multiple explanations to this :
  - we used to flush Tx queues upon Rx interrupt, which used to hide
    the problem.

  - we tend to have large socket buffers, which cover the issue. I've
    never had any issue at high data rates. After all only 23kB send
    buffer is needed to get it to work.

  - most often you have multiple parallel streams which hide the issue
    even more.

> I highly suggest BQL for this driver, now this issue is fixed.

How does that work ?

> I wonder if this high setting was not because of race conditions in the
> driver :
>
> mvneta_tx() seems to access skb->len too late, TX completion might have
> already freed skb :
> 
>                 u64_stats_update_begin(&stats->syncp);
>                 stats->tx_packets++;
>                 stats->tx_bytes  += skb->len;         // potential use after free
>                 u64_stats_update_end(&stats->syncp);

Good catch! But no, this is unrelated since it does not fix the race either.
Initially this driver used to implement a timer to flush the Tx queue after
10ms, resulting in abysmal Tx-only performance as you can easily imagine. In
my opinion there's a design flaw in the chip, and they did everything they
could to workaround it (let's not say "hide it"), but that was not enough.
When I "fixed" the performance issue by enabling the Tx interrupt, I kept
the value 16 which gave pretty good results to me, without realizing that
there was this corner case :-/

> Thanks Willy !

Thanks for your review :-)

Willy

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] net: mvneta: fix race condition in mvneta_tx()
  2014-12-02 12:37     ` Eric Dumazet
@ 2014-12-02 12:40       ` Willy Tarreau
  2014-12-09  1:55         ` David Miller
  0 siblings, 1 reply; 14+ messages in thread
From: Willy Tarreau @ 2014-12-02 12:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Maggie Mae Roxas, Thomas Petazzoni, Gregory CLEMENT,
	Ezequiel Garcia

On Tue, Dec 02, 2014 at 04:37:15AM -0800, Eric Dumazet wrote:
> On Tue, 2014-12-02 at 04:30 -0800, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > mvneta_tx() dereferences skb to get skb->len too late,
> > as hardware might have completed the transmit and TX completion
> > could have freed the skb from another cpu.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> 
> 
> For completeness, this bug was added in linux-3.14 and seems a stable
> candidate.
> 
> Fixes: 71f6d1b31fb1 ("net: mvneta: replace Tx timer with a real interrupt")

Absolutely, we backported this one back to 3.10.

Thanks Eric!
Willy

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] net: mvneta: fix Tx interrupt delay
  2014-12-02 12:39   ` [PATCH] net: mvneta: fix Tx interrupt delay Willy Tarreau
@ 2014-12-02 13:04     ` Eric Dumazet
  2014-12-02 13:24       ` Willy Tarreau
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2014-12-02 13:04 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: netdev, Maggie Mae Roxas, Thomas Petazzoni, Gregory CLEMENT,
	Ezequiel Garcia

On Tue, 2014-12-02 at 13:39 +0100, Willy Tarreau wrote:
> Hi Eric,
> 
> On Tue, Dec 02, 2014 at 04:18:08AM -0800, Eric Dumazet wrote:
> 
> > I highly suggest BQL for this driver, now this issue is fixed.
> 
> How does that work ?

This works very well ;)

Its super easy to implement, take a look at commits
bdbc063129e811264cd6c311d8c2d9b95de01231 or
7070ce0a6419a118842298bc967061ad6cea40db

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] net: mvneta: fix Tx interrupt delay
  2014-12-02 13:04     ` Eric Dumazet
@ 2014-12-02 13:24       ` Willy Tarreau
  2014-12-02 15:17         ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Willy Tarreau @ 2014-12-02 13:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Maggie Mae Roxas, Thomas Petazzoni, Gregory CLEMENT,
	Ezequiel Garcia

On Tue, Dec 02, 2014 at 05:04:18AM -0800, Eric Dumazet wrote:
> On Tue, 2014-12-02 at 13:39 +0100, Willy Tarreau wrote:
> > Hi Eric,
> > 
> > On Tue, Dec 02, 2014 at 04:18:08AM -0800, Eric Dumazet wrote:
> > 
> > > I highly suggest BQL for this driver, now this issue is fixed.
> > 
> > How does that work ?
> 
> This works very well ;)
> 
> Its super easy to implement, take a look at commits
> bdbc063129e811264cd6c311d8c2d9b95de01231 or
> 7070ce0a6419a118842298bc967061ad6cea40db

Thanks but I'm not sure I entirely understand the concept. Is it to
notify the sender that the packets were already queued for the NIC ?
And if so, how does that improve the situation ? I'm sorry if this
sounds like a stupid question, it's just that the concept by itself
is not clear to me.

Willy

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] net: mvneta: fix Tx interrupt delay
  2014-12-02 13:24       ` Willy Tarreau
@ 2014-12-02 15:17         ` Eric Dumazet
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2014-12-02 15:17 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: netdev, Maggie Mae Roxas, Thomas Petazzoni, Gregory CLEMENT,
	Ezequiel Garcia

On Tue, 2014-12-02 at 14:24 +0100, Willy Tarreau wrote:

> Thanks but I'm not sure I entirely understand the concept. Is it to
> notify the sender that the packets were already queued for the NIC ?
> And if so, how does that improve the situation ? I'm sorry if this
> sounds like a stupid question, it's just that the concept by itself
> is not clear to me.

http://lwn.net/Articles/454390/

BQL is first step to fight so called bufferbloat.

https://www.ietf.org/proceedings/86/slides/slides-86-iccrg-0.pdf

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] net: mvneta: fix Tx interrupt delay
  2014-12-02 12:18 ` Eric Dumazet
  2014-12-02 12:30   ` [PATCH] net: mvneta: fix race condition in mvneta_tx() Eric Dumazet
  2014-12-02 12:39   ` [PATCH] net: mvneta: fix Tx interrupt delay Willy Tarreau
@ 2014-12-02 17:12   ` Ezequiel Garcia
  2014-12-02 17:31     ` Dave Taht
  2014-12-02 18:39     ` Eric Dumazet
  2 siblings, 2 replies; 14+ messages in thread
From: Ezequiel Garcia @ 2014-12-02 17:12 UTC (permalink / raw)
  To: Eric Dumazet, Willy Tarreau
  Cc: netdev, Maggie Mae Roxas, Thomas Petazzoni, Gregory CLEMENT

Eric,

On 12/02/2014 09:18 AM, Eric Dumazet wrote:
[..]
> 
> I am surprised TCP even worked correctly with this problem.
> 
> I highly suggest BQL for this driver, now this issue is fixed.
> 

Implementing BQL for the mvneta driver was something I wanted to do a
while ago, but you explained that these kind drivers (i.e. those with
software TSO) didn't need BQL, because the latency that resulted from
the ring was too small.

Quoting (http://www.spinics.net/lists/netdev/msg284439.html):
""
Note that a full size TSO packet (44 or 45 MSS) requires about 88 or 90
descriptors.

So I do not think BQL is really needed, because a 512 slots TX ring wont
add a big latency : About 5 ms max.

BQL is generally nice for NIC supporting hardware TSO, where a 64KB TSO
packet consumes 3 or 4 descriptors.

Also note that TCP Small Queues should limit TX ring occupancy of a
single bulk flow anyway.
""

Maybe I misunderstood something?
-- 
Ezequiel García, Free Electrons
Embedded Linux, Kernel and Android Engineering
http://free-electrons.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] net: mvneta: fix Tx interrupt delay
  2014-12-02 17:12   ` Ezequiel Garcia
@ 2014-12-02 17:31     ` Dave Taht
  2014-12-02 18:39     ` Eric Dumazet
  1 sibling, 0 replies; 14+ messages in thread
From: Dave Taht @ 2014-12-02 17:31 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: Eric Dumazet, Willy Tarreau, netdev@vger.kernel.org,
	Maggie Mae Roxas, Thomas Petazzoni, Gregory CLEMENT

On Tue, Dec 2, 2014 at 9:12 AM, Ezequiel Garcia
<ezequiel.garcia@free-electrons.com> wrote:
> Eric,
>
> On 12/02/2014 09:18 AM, Eric Dumazet wrote:
> [..]
>>
>> I am surprised TCP even worked correctly with this problem.
>>
>> I highly suggest BQL for this driver, now this issue is fixed.
>>
>
> Implementing BQL for the mvneta driver was something I wanted to do a
> while ago, but you explained that these kind drivers (i.e. those with
> software TSO) didn't need BQL, because the latency that resulted from
> the ring was too small.
>
> Quoting (http://www.spinics.net/lists/netdev/msg284439.html):
> ""
> Note that a full size TSO packet (44 or 45 MSS) requires about 88 or 90
> descriptors.
>
> So I do not think BQL is really needed, because a 512 slots TX ring wont
> add a big latency : About 5 ms max.

at a gigabit.

Wildly variable with pause frames. And: 50ms at 100mbit. 500ms at 10mbit.

Secondly, why accept any latency in the driver when you can get it well below
2ms in most circumstances with a couple lines of BQL code?

>
> BQL is generally nice for NIC supporting hardware TSO, where a 64KB TSO
> packet consumes 3 or 4 descriptors.

BQL is nice at any speed, and darn useful at lower speeds,
as well.

>
> Also note that TCP Small Queues should limit TX ring occupancy of a
> single bulk flow anyway.

We only do one upload or one download at a time on our systems,
followed by a ping measurement... never do more than one flow
at the same time, right?


> ""
>
> Maybe I misunderstood something?

Try it, test it, you'll like it.

> --
> Ezequiel García, Free Electrons
> Embedded Linux, Kernel and Android Engineering
> http://free-electrons.com
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Dave Täht

http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] net: mvneta: fix Tx interrupt delay
  2014-12-02 17:12   ` Ezequiel Garcia
  2014-12-02 17:31     ` Dave Taht
@ 2014-12-02 18:39     ` Eric Dumazet
  1 sibling, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2014-12-02 18:39 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: Willy Tarreau, netdev, Maggie Mae Roxas, Thomas Petazzoni,
	Gregory CLEMENT

On Tue, 2014-12-02 at 14:12 -0300, Ezequiel Garcia wrote:
> Eric,
> 
> On 12/02/2014 09:18 AM, Eric Dumazet wrote:
> [..]
> > 
> > I am surprised TCP even worked correctly with this problem.
> > 
> > I highly suggest BQL for this driver, now this issue is fixed.
> > 
> 
> Implementing BQL for the mvneta driver was something I wanted to do a
> while ago, but you explained that these kind drivers (i.e. those with
> software TSO) didn't need BQL, because the latency that resulted from
> the ring was too small.
> 
> Quoting (http://www.spinics.net/lists/netdev/msg284439.html):
> ""
> Note that a full size TSO packet (44 or 45 MSS) requires about 88 or 90
> descriptors.
> 
> So I do not think BQL is really needed, because a 512 slots TX ring wont
> add a big latency : About 5 ms max.
> 
> BQL is generally nice for NIC supporting hardware TSO, where a 64KB TSO
> packet consumes 3 or 4 descriptors.
> 
> Also note that TCP Small Queues should limit TX ring occupancy of a
> single bulk flow anyway.
> ""
> 
> Maybe I misunderstood something?

This was indeed the case, but we added recently xmit_more support, and
it uses BQL information. So you might add BQL anyway, if xmit_more
support is useful for this hardware.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] net: mvneta: fix Tx interrupt delay
  2014-12-02  7:13 [PATCH] net: mvneta: fix Tx interrupt delay Willy Tarreau
  2014-12-02 12:18 ` Eric Dumazet
@ 2014-12-09  1:44 ` David Miller
  1 sibling, 0 replies; 14+ messages in thread
From: David Miller @ 2014-12-09  1:44 UTC (permalink / raw)
  To: w
  Cc: netdev, maggie.mae.roxas, thomas.petazzoni, gregory.clement,
	ezequiel.garcia

From: Willy Tarreau <w@1wt.eu>
Date: Tue, 2 Dec 2014 08:13:04 +0100

> The mvneta driver sets the amount of Tx coalesce packets to 16 by
> default. Normally that does not cause any trouble since the driver
> uses a much larger Tx ring size (532 packets). But some sockets
> might run with very small buffers, much smaller than the equivalent
> of 16 packets. This is what ping is doing for example, by setting
> SNDBUF to 324 bytes rounded up to 2kB by the kernel.
> 
> The problem is that there is no documented method to force a specific
> packet to emit an interrupt (eg: the last of the ring) nor is it
> possible to make the NIC emit an interrupt after a given delay.
> 
> In this case, it causes trouble, because when ping sends packets over
> its raw socket, the few first packets leave the system, and the first
> 15 packets will be emitted without an IRQ being generated, so without
> the skbs being freed. And since the socket's buffer is small, there's
> no way to reach that amount of packets, and the ping ends up with
> "send: no buffer available" after sending 6 packets. Running with 3
> instances of ping in parallel is enough to hide the problem, because
> with 6 packets per instance, that's 18 packets total, which is enough
> to grant a Tx interrupt before all are sent.
> 
> The original driver in the LSP kernel worked around this design flaw
> by using a software timer to clean up the Tx descriptors. This timer
> was slow and caused terrible network performance on some Tx-bound
> workloads (such as routing) but was enough to make tools like ping
> work correctly.
> 
> Instead here, we simply set the packet counts before interrupt to 1.
> This ensures that each packet sent will produce an interrupt. NAPI
> takes care of coalescing interrupts since the interrupt is disabled
> once generated.
> 
> No measurable performance impact nor CPU usage were observed on small
> nor large packets, including when saturating the link on Tx, and this
> fixes tools like ping which rely on too small a send buffer. If one
> wants to increase this value for certain workloads where it is safe
> to do so, "ethtool -C $dev tx-frames" will override this default
> setting.
> 
> This fix needs to be applied to stable kernels starting with 3.10.
> 
> Tested-By: Maggie Mae Roxas <maggie.mae.roxas@gmail.com>
> Signed-off-by: Willy Tarreau <w@1wt.eu>

Applied and queued up for -stable, thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] net: mvneta: fix race condition in mvneta_tx()
  2014-12-02 12:40       ` Willy Tarreau
@ 2014-12-09  1:55         ` David Miller
  0 siblings, 0 replies; 14+ messages in thread
From: David Miller @ 2014-12-09  1:55 UTC (permalink / raw)
  To: w
  Cc: eric.dumazet, netdev, maggie.mae.roxas, thomas.petazzoni,
	gregory.clement, ezequiel.garcia

From: Willy Tarreau <w@1wt.eu>
Date: Tue, 2 Dec 2014 13:40:43 +0100

> On Tue, Dec 02, 2014 at 04:37:15AM -0800, Eric Dumazet wrote:
>> On Tue, 2014-12-02 at 04:30 -0800, Eric Dumazet wrote:
>> > From: Eric Dumazet <edumazet@google.com>
>> > 
>> > mvneta_tx() dereferences skb to get skb->len too late,
>> > as hardware might have completed the transmit and TX completion
>> > could have freed the skb from another cpu.
>> > 
>> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> 
>> 
>> 
>> For completeness, this bug was added in linux-3.14 and seems a stable
>> candidate.
>> 
>> Fixes: 71f6d1b31fb1 ("net: mvneta: replace Tx timer with a real interrupt")
> 
> Absolutely, we backported this one back to 3.10.

Applied, and queued up for -stable, thanks everyone.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2014-12-09  1:56 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-12-02  7:13 [PATCH] net: mvneta: fix Tx interrupt delay Willy Tarreau
2014-12-02 12:18 ` Eric Dumazet
2014-12-02 12:30   ` [PATCH] net: mvneta: fix race condition in mvneta_tx() Eric Dumazet
2014-12-02 12:37     ` Eric Dumazet
2014-12-02 12:40       ` Willy Tarreau
2014-12-09  1:55         ` David Miller
2014-12-02 12:39   ` [PATCH] net: mvneta: fix Tx interrupt delay Willy Tarreau
2014-12-02 13:04     ` Eric Dumazet
2014-12-02 13:24       ` Willy Tarreau
2014-12-02 15:17         ` Eric Dumazet
2014-12-02 17:12   ` Ezequiel Garcia
2014-12-02 17:31     ` Dave Taht
2014-12-02 18:39     ` Eric Dumazet
2014-12-09  1:44 ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).