netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* r8169 rx_missed increasing in bursts (regression)
@ 2013-01-08  8:28 Timo Teras
  2013-01-08 22:58 ` Francois Romieu
  0 siblings, 1 reply; 11+ messages in thread
From: Timo Teras @ 2013-01-08  8:28 UTC (permalink / raw)
  To: Francois Romieu, netdev

While upgrading IPsec gateway, I noticed that few boxes have started to
drop packets since upgrading from 2.6.38.8 to 3.3+ kernels. Known bad
kernels are 3.3.8 and 3.4.24.

This happens with:
r8169 0000:02:00.0: eth0: RTL8168e/8111e at 0xf8318000, 00:30:18:a3:ae:e4, XID 0c200000 IRQ 68
r8169 0000:02:00.0: eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]

as well as with:
r8169 0000:02:00.0: eth0: RTL8168c/8111c at 0xf8360000, 00:30:18:a1:6e:58, XID 1c4000c0 IRQ 67

The boxes have relatively high softirq usage due to the fact that they
are forwarding data over IPsec tunnels; and the forwarded traffic
getting encrypted is done in softirq.

The symptoms include that "watch ethtool -S eth0" says rx_missed
increases in bursts. No other "dropped" stat counter is increasing.

This is happens only when the box is getting lot of traffic, is hard to
reproduce and happens only on few of the nodes. It might be also
related to specific network config: e.g. if the r8169 interfaces are
bonded or not, and if vlans are used or not.

My current hypothesis is that due to high softirq and recent(ish)
commit da78dbf "r8169: remove work from irq handler" moving more work
to softirq makes the receive path now suffer from latency from getting
irq to reading packets from the NIC on these boxes. And that at times
the rx fifo can get full causing a missed packet or so.

This might be further escalated by the bug fixed in commit 7dbb491
"r8169: avoid NAPI scheduling delay" (which is not present in -stable
trees). So my guess is that when a packet is lost it generates
RxOverflow triggering rtl_slow_event_work (but nothing is done with
this IRQ - not even printk). And this just causes the IRQs to be left
off due to the bug above - and ends up dropping a "burst" of packets.

So would it be sensible to do something like:
-#define NUM_RX_DESC    256     /* Number of Rx descriptor registers */
+#define NUM_RX_DESC    512     /* Number of Rx descriptor registers */

And cherry-picking the commit 7dbb491? Perhaps this could be pushed to
the -stable queues too.

Thanks,
 Timo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: r8169 rx_missed increasing in bursts (regression)
  2013-01-08  8:28 r8169 rx_missed increasing in bursts (regression) Timo Teras
@ 2013-01-08 22:58 ` Francois Romieu
  2013-01-09  9:58   ` Timo Teras
  0 siblings, 1 reply; 11+ messages in thread
From: Francois Romieu @ 2013-01-08 22:58 UTC (permalink / raw)
  To: Timo Teras; +Cc: netdev

Timo Teras <timo.teras@iki.fi> :
[...]
> My current hypothesis is that due to high softirq and recent(ish)
> commit da78dbf "r8169: remove work from irq handler" moving more work
> to softirq makes the receive path now suffer from latency from getting
> irq to reading packets from the NIC on these boxes. And that at times
> the rx fifo can get full causing a missed packet or so.

This hypothesis won't explain the regression in 3.3.8 since 3.3.x does
not include commit da78dbf.

Do you notice any netdev watchdog message in dmesg ?

'perf top' may exhibit something unusual too.

> This might be further escalated by the bug fixed in commit 7dbb491
> "r8169: avoid NAPI scheduling delay" (which is not present in -stable
> trees).

Right, it would had been worth adding to -stable.

However it only 1) is a problem for 3.4.x (fixed in 3.5) and 2) triggers
when returning from the slow work thread - which should not be used much.

[...]
> So would it be sensible to do something like:
> -#define NUM_RX_DESC    256     /* Number of Rx descriptor registers */
> +#define NUM_RX_DESC    512     /* Number of Rx descriptor registers */

You can try it but it may actually increase the amount of heavy work done
in softirq.

> And cherry-picking the commit 7dbb491? Perhaps this could be pushed to
> the -stable queues too.

3.4.x at most.

-- 
Ueimor

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: r8169 rx_missed increasing in bursts (regression)
  2013-01-08 22:58 ` Francois Romieu
@ 2013-01-09  9:58   ` Timo Teras
  2013-01-09 17:14     ` Timo Teras
  0 siblings, 1 reply; 11+ messages in thread
From: Timo Teras @ 2013-01-09  9:58 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev

On Tue, 8 Jan 2013 23:58:33 +0100 Francois Romieu
<romieu@fr.zoreil.com> wrote:

> Timo Teras <timo.teras@iki.fi> :
> [...]
> > My current hypothesis is that due to high softirq and recent(ish)
> > commit da78dbf "r8169: remove work from irq handler" moving more
> > work to softirq makes the receive path now suffer from latency from
> > getting irq to reading packets from the NIC on these boxes. And
> > that at times the rx fifo can get full causing a missed packet or
> > so.
> 
> This hypothesis won't explain the regression in 3.3.8 since 3.3.x does
> not include commit da78dbf.
> 
> Do you notice any netdev watchdog message in dmesg ?

In production boxes. No.

The lab environment where we tried to reproduce this, we received:
NOHZ: local_softirq_pending 08

Which is likely related, but separate issue. And fixed by commit
da78dbf. So seems that just got upgraded to "regression fix".

> 'perf top' may exhibit something unusual too.

Will try this.

I did notice that:
/proc/net/softnet_stat's 3rd field aka. softnet_data.time_squeeze keeps
incrementing when ever rx_missed increases. Sometiems time_squeeze
increments on it own. But rx_missed never increases without time_squeeze
bumping up seriously too.

> > This might be further escalated by the bug fixed in commit 7dbb491
> > "r8169: avoid NAPI scheduling delay" (which is not present in
> > -stable trees).
> 
> Right, it would had been worth adding to -stable.
> 
> However it only 1) is a problem for 3.4.x (fixed in 3.5) and 2)
> triggers when returning from the slow work thread - which should not
> be used much.

Ok. Didn't realize 3.3.x did not include it. So something else is broke
too.

The slow thread handles the RxOverflow, and in rx_missed case is taken
relatively often. Maybe add a printk there.

> [...]
> > So would it be sensible to do something like:
> > -#define NUM_RX_DESC    256     /* Number of Rx descriptor
> > registers */ +#define NUM_RX_DESC    512     /* Number of Rx
> > descriptor registers */
> 
> You can try it but it may actually increase the amount of heavy work
> done in softirq.

Ok. Will try this and some other things along with added debug logging.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: r8169 rx_missed increasing in bursts (regression)
  2013-01-09  9:58   ` Timo Teras
@ 2013-01-09 17:14     ` Timo Teras
  2013-01-15  8:11       ` Timo Teras
  0 siblings, 1 reply; 11+ messages in thread
From: Timo Teras @ 2013-01-09 17:14 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev

On Wed, 9 Jan 2013 11:58:50 +0200 Timo Teras <timo.teras@iki.fi> wrote:

> On Tue, 8 Jan 2013 23:58:33 +0100 Francois Romieu
> <romieu@fr.zoreil.com> wrote:
> 
> > Timo Teras <timo.teras@iki.fi> :
> > [...]
> > > My current hypothesis is that due to high softirq and recent(ish)
> > > commit da78dbf "r8169: remove work from irq handler" moving more
> > > work to softirq makes the receive path now suffer from latency
> > > from getting irq to reading packets from the NIC on these boxes.
> > > And that at times the rx fifo can get full causing a missed
> > > packet or so.
> > 
> > This hypothesis won't explain the regression in 3.3.8 since 3.3.x
> > does not include commit da78dbf.
> > 
> > Do you notice any netdev watchdog message in dmesg ?
> 
> In production boxes. No.
> 
> The lab environment where we tried to reproduce this, we received:
> NOHZ: local_softirq_pending 08
> 
> Which is likely related, but separate issue. And fixed by commit
> da78dbf. So seems that just got upgraded to "regression fix".
> 
> > 'perf top' may exhibit something unusual too.
> 
> Will try this.
> 
> I did notice that:
> /proc/net/softnet_stat's 3rd field aka. softnet_data.time_squeeze
> keeps incrementing when ever rx_missed increases. Sometiems
> time_squeeze increments on it own. But rx_missed never increases
> without time_squeeze bumping up seriously too.

Did more general observing.

It seems that the rx_missed is not directly related to traffic amount.
At times the box is handling easily 10000+ pps, while packet loss can
happen at other times on 4000-8000pps levels.

Generally time_squeeze does not happen, and the box is at 20-30%
softirq. Some times time_squeeze bumps up with one (within a one second
interval) or two and packet loss does not happen.

When rx_missed is getting bumped, time_squeeze goes up with 1-3, and
rx_missed goes up with 50-1000 packets. Usually around 200 packets. (1
second sampling period)

I did find a strong correlation that rx_misses happen usually when the
box has dropped a packet due to iptables DROP/REJECT rule, or some
other reason (e.g. I'm seeing once in a while dmesg contain:
"nf_ct_sip: dropping packet").

Any ideas why a netfilter packet drop might cause netdevice rx to stall
long enough to saturate the hardware receive queue?

- Timo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: r8169 rx_missed increasing in bursts (regression)
  2013-01-09 17:14     ` Timo Teras
@ 2013-01-15  8:11       ` Timo Teras
  2013-01-15 22:53         ` Francois Romieu
  0 siblings, 1 reply; 11+ messages in thread
From: Timo Teras @ 2013-01-15  8:11 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev

On Wed, 9 Jan 2013 19:14:56 +0200 Timo Teras <timo.teras@iki.fi> wrote:

> It seems that the rx_missed is not directly related to traffic amount.
> At times the box is handling easily 10000+ pps, while packet loss can
> happen at other times on 4000-8000pps levels.
> 
> Generally time_squeeze does not happen, and the box is at 20-30%
> softirq. Some times time_squeeze bumps up with one (within a one
> second interval) or two and packet loss does not happen.
> 
> When rx_missed is getting bumped, time_squeeze goes up with 1-3, and
> rx_missed goes up with 50-1000 packets. Usually around 200 packets. (1
> second sampling period)
> 
> I did find a strong correlation that rx_misses happen usually when the
> box has dropped a packet due to iptables DROP/REJECT rule, or some
> other reason (e.g. I'm seeing once in a while dmesg contain:
> "nf_ct_sip: dropping packet").
> 
> Any ideas why a netfilter packet drop might cause netdevice rx to
> stall long enough to saturate the hardware receive queue?

Ok, found the main culprit. Embarassingly, there was LOG targets, which
wound up writing to ttyS0. That apparently is synchronous operation, so
the printk's made the napi poller choke whenever packets got dropped.

However, I do still observe minor packet drops on specific loads.
Increasing the rx fifo size helped noticeably, but even at 512 it still
does some drops. Seems that Realtek's r8168 driver has rx fifo size of
1024.

Would it be feasible to make the fifo size a module parameter, or
somehow autotuned according to available system memory?

I also noticed that since recent commits, the dirty_rx is always
identical cur_rx. Basically, dirty_rx can be just removed. Should I
send a patch for this?

 - Timo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: r8169 rx_missed increasing in bursts (regression)
  2013-01-15  8:11       ` Timo Teras
@ 2013-01-15 22:53         ` Francois Romieu
  2013-01-16  7:01           ` [PATCH] r8169: remove unneeded dirty_rx index Timo Teräs
  0 siblings, 1 reply; 11+ messages in thread
From: Francois Romieu @ 2013-01-15 22:53 UTC (permalink / raw)
  To: Timo Teras; +Cc: netdev

Timo Teras <timo.teras@iki.fi> :
[...]
> Would it be feasible to make the fifo size a module parameter, or
> somehow autotuned according to available system memory?

If by fifo size you mean descriptor ring size, I can do it through ethtool.

> I also noticed that since recent commits, the dirty_rx is always
> identical cur_rx. Basically, dirty_rx can be just removed. Should I
> send a patch for this?

Yes, please.

-- 
Ueimor

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] r8169: remove unneeded dirty_rx index
  2013-01-15 22:53         ` Francois Romieu
@ 2013-01-16  7:01           ` Timo Teräs
  2013-01-16 21:25             ` David Miller
  2013-01-16 22:16             ` Francois Romieu
  0 siblings, 2 replies; 11+ messages in thread
From: Timo Teräs @ 2013-01-16  7:01 UTC (permalink / raw)
  To: Francois Romieu, netdev; +Cc: Timo Teräs

After commit 6f0333b ("r8169: use 50% less ram for RX ring") the rx
ring buffers are always copied making dirty_rx useless.

Signed-off-by: Timo Teräs <timo.teras@iki.fi>
---
 drivers/net/ethernet/realtek/r8169.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 89184f3..97fdbb1 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -83,7 +83,7 @@ static const int multicast_filter_limit = 32;
 #define R8169_REGS_SIZE		256
 #define R8169_NAPI_WEIGHT	64
 #define NUM_TX_DESC	64	/* Number of Tx descriptor registers */
-#define NUM_RX_DESC	256	/* Number of Rx descriptor registers */
+#define NUM_RX_DESC	256U	/* Number of Rx descriptor registers */
 #define R8169_TX_RING_BYTES	(NUM_TX_DESC * sizeof(struct TxDesc))
 #define R8169_RX_RING_BYTES	(NUM_RX_DESC * sizeof(struct RxDesc))
 
@@ -727,7 +727,6 @@ struct rtl8169_private {
 	u16 mac_version;
 	u32 cur_rx; /* Index into the Rx descriptor buffer of next Rx pkt. */
 	u32 cur_tx; /* Index into the Tx descriptor buffer of next Rx pkt. */
-	u32 dirty_rx;
 	u32 dirty_tx;
 	struct rtl8169_stats rx_stats;
 	struct rtl8169_stats tx_stats;
@@ -4177,7 +4176,7 @@ static void rtl_init_rxcfg(struct rtl8169_private *tp)
 
 static void rtl8169_init_ring_indexes(struct rtl8169_private *tp)
 {
-	tp->dirty_tx = tp->dirty_rx = tp->cur_tx = tp->cur_rx = 0;
+	tp->dirty_tx = tp->cur_tx = tp->cur_rx = 0;
 }
 
 static void rtl_hw_jumbo_enable(struct rtl8169_private *tp)
@@ -5920,7 +5919,7 @@ static void rtl8169_pcierr_interrupt(struct net_device *dev)
 		PCI_STATUS_REC_TARGET_ABORT | PCI_STATUS_SIG_TARGET_ABORT));
 
 	/* The infamous DAC f*ckup only happens at boot time */
-	if ((tp->cp_cmd & PCIDAC) && !tp->dirty_rx && !tp->cur_rx) {
+	if ((tp->cp_cmd & PCIDAC) && !tp->cur_rx) {
 		void __iomem *ioaddr = tp->mmio_addr;
 
 		netif_info(tp, intr, dev, "disabling PCI DAC\n");
@@ -6035,10 +6034,8 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget
 	unsigned int count;
 
 	cur_rx = tp->cur_rx;
-	rx_left = NUM_RX_DESC + tp->dirty_rx - cur_rx;
-	rx_left = min(rx_left, budget);
 
-	for (; rx_left > 0; rx_left--, cur_rx++) {
+	for (rx_left = min(budget, NUM_RX_DESC); rx_left > 0; rx_left--, cur_rx++) {
 		unsigned int entry = cur_rx % NUM_RX_DESC;
 		struct RxDesc *desc = tp->RxDescArray + entry;
 		u32 status;
@@ -6123,8 +6120,6 @@ process_pkt:
 	count = cur_rx - tp->cur_rx;
 	tp->cur_rx = cur_rx;
 
-	tp->dirty_rx += count;
-
 	return count;
 }
 
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] r8169: remove unneeded dirty_rx index
  2013-01-16  7:01           ` [PATCH] r8169: remove unneeded dirty_rx index Timo Teräs
@ 2013-01-16 21:25             ` David Miller
  2013-01-16 21:26               ` Francois Romieu
  2013-01-16 22:16             ` Francois Romieu
  1 sibling, 1 reply; 11+ messages in thread
From: David Miller @ 2013-01-16 21:25 UTC (permalink / raw)
  To: timo.teras; +Cc: romieu, netdev

From: Timo Teräs <timo.teras@iki.fi>
Date: Wed, 16 Jan 2013 09:01:24 +0200

> @@ -6035,10 +6034,8 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget
>  	unsigned int count;
>  
>  	cur_rx = tp->cur_rx;
> -	rx_left = NUM_RX_DESC + tp->dirty_rx - cur_rx;
> -	rx_left = min(rx_left, budget);
>  
> -	for (; rx_left > 0; rx_left--, cur_rx++) {
> +	for (rx_left = min(budget, NUM_RX_DESC); rx_left > 0; rx_left--, cur_rx++) {

Is this really equivalent?  I think cur_rx still needs to be
incorporated into whatever you use as the initial rx_left.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] r8169: remove unneeded dirty_rx index
  2013-01-16 21:25             ` David Miller
@ 2013-01-16 21:26               ` Francois Romieu
  0 siblings, 0 replies; 11+ messages in thread
From: Francois Romieu @ 2013-01-16 21:26 UTC (permalink / raw)
  To: David Miller; +Cc: timo.teras, netdev

David Miller <davem@davemloft.net> :
[...]
> > @@ -6035,10 +6034,8 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget
> >  	unsigned int count;
> >  
> >  	cur_rx = tp->cur_rx;
> > -	rx_left = NUM_RX_DESC + tp->dirty_rx - cur_rx;
> > -	rx_left = min(rx_left, budget);
> >  
> > -	for (; rx_left > 0; rx_left--, cur_rx++) {
> > +	for (rx_left = min(budget, NUM_RX_DESC); rx_left > 0; rx_left--, cur_rx++) {
> 
> Is this really equivalent?  I think cur_rx still needs to be
> incorporated into whatever you use as the initial rx_left.

The rx ring can't contain hole so the code above must be read with
"tp->cur_rx == tp->dirty_rx" in mind.

-- 
Ueimor

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] r8169: remove unneeded dirty_rx index
  2013-01-16  7:01           ` [PATCH] r8169: remove unneeded dirty_rx index Timo Teräs
  2013-01-16 21:25             ` David Miller
@ 2013-01-16 22:16             ` Francois Romieu
  2013-01-16 23:02               ` David Miller
  1 sibling, 1 reply; 11+ messages in thread
From: Francois Romieu @ 2013-01-16 22:16 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev

Timo Teräs <timo.teras@iki.fi> :
> After commit 6f0333b ("r8169: use 50% less ram for RX ring") the rx
> ring buffers are always copied making dirty_rx useless.
> 
> Signed-off-by: Timo Teräs <timo.teras@iki.fi>

Acked-by: Francois Romieu <romieu@fr.zoreil.com>

Even if it is obviously not -net material, please specify the branch you
aim at (eg [PATCH net-next] or [PATCH net]).

-- 
Ueimor

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] r8169: remove unneeded dirty_rx index
  2013-01-16 22:16             ` Francois Romieu
@ 2013-01-16 23:02               ` David Miller
  0 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2013-01-16 23:02 UTC (permalink / raw)
  To: romieu; +Cc: timo.teras, netdev

From: Francois Romieu <romieu@fr.zoreil.com>
Date: Wed, 16 Jan 2013 23:16:40 +0100

> Timo Teräs <timo.teras@iki.fi> :
>> After commit 6f0333b ("r8169: use 50% less ram for RX ring") the rx
>> ring buffers are always copied making dirty_rx useless.
>> 
>> Signed-off-by: Timo Teräs <timo.teras@iki.fi>
> 
> Acked-by: Francois Romieu <romieu@fr.zoreil.com>

Applied to net-next.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2013-01-16 23:02 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-01-08  8:28 r8169 rx_missed increasing in bursts (regression) Timo Teras
2013-01-08 22:58 ` Francois Romieu
2013-01-09  9:58   ` Timo Teras
2013-01-09 17:14     ` Timo Teras
2013-01-15  8:11       ` Timo Teras
2013-01-15 22:53         ` Francois Romieu
2013-01-16  7:01           ` [PATCH] r8169: remove unneeded dirty_rx index Timo Teräs
2013-01-16 21:25             ` David Miller
2013-01-16 21:26               ` Francois Romieu
2013-01-16 22:16             ` Francois Romieu
2013-01-16 23:02               ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).