* twice past the taps, thence out to net?
@ 2011-12-14 19:27 Rick Jones
2011-12-15 0:58 ` Benjamin Poirier
2011-12-15 2:12 ` Vijay Subramanian
0 siblings, 2 replies; 15+ messages in thread
From: Rick Jones @ 2011-12-14 19:27 UTC (permalink / raw)
To: tcpdump-workers, netdev
While looking at "something else" with tcpdump/tcptrace, tcptrace
emitted lots of notices about hardware-duplicated packets being detected
(same TCP sequence number and IP datagram ID). Sure enough, if I go
into the tcpdump trace (taken on the sender) I can find instances of
what it was talking about, separated in time by rather less than I would
expect the RTO to be, and as often as not with few, if any, intervening
ACKs arriving to trigger anything like fast retransmit. And besides,
those would have a different IP datagram ID, no?
I did manage to reproduce the issue with plain netperf tcp_stream tests.
I had one sending system running 30 concurrent netperf tcp_stream tests to
30 other receiving systems. There are "hardware duplicates" in the
sending trace, but no duplicate segments (that I can find thus far) in
the two receiver-side traces I took. Of course that doesn't
"conclusively" rule out two actual sends, but it suggests there weren't.
While I work through the "obtain permission" path to post the packet
traces (don't ask...) I thought I would ask if anyone else has seen
something similar.
In this case, all the systems are running a 2.6.38-8 Ubuntu kernel (the
same sorts of issues which delay my just putting the traces up on
netperf.org preclude a later kernel, and I've no other test systems :(
), with Intel 82576 interfaces being driven by:
$ sudo ethtool -i eth0
driver: igb
version: 2.1.0-k2
firmware-version: 1.8-2
bus-info: 0000:05:00.0
All the systems were connected to the same switch.
This may be projecting, but given that the interface was fully saturated,
and there were 30 concurrent streams making 64K TSO sends, it "feels" like
some sort of "go past the packet tap and be captured, find a
queue/resource past the tap unavailable, get re-queued above the tap,
get captured again when resent" sort of thing.
Where in the Linux stack does the tap used by libpcap 1.1.1 reside?
rick jones
* Re: twice past the taps, thence out to net?
2011-12-14 19:27 twice past the taps, thence out to net? Rick Jones
@ 2011-12-15 0:58 ` Benjamin Poirier
2011-12-15 2:12 ` Vijay Subramanian
1 sibling, 0 replies; 15+ messages in thread
From: Benjamin Poirier @ 2011-12-15 0:58 UTC (permalink / raw)
To: Rick Jones; +Cc: tcpdump-workers, netdev
On 11/12/14 11:27, Rick Jones wrote:
>
> Where in the Linux stack does the tap used by libpcap 1.1.1 reside?
>
On transmission it's before TSO/GSO, via dev_queue_xmit_nit(). Taps are
registered in the ptype_all list.
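For illustration, this is roughly how a tap ends up on that list -- a
minimal sketch of a packet_type registered with ETH_P_ALL (illustrative
module code, not the actual af_packet.c path that libpcap uses):

#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/if_ether.h>
#include <linux/skbuff.h>

/* Toy handler: called for received packets and, via dev_queue_xmit_nit(),
 * for transmitted ones as well. A real tap (AF_PACKET) copies the skb to
 * a socket instead of just dropping it. */
static int toy_tap_rcv(struct sk_buff *skb, struct net_device *dev,
		       struct packet_type *pt, struct net_device *orig_dev)
{
	kfree_skb(skb);
	return 0;
}

static struct packet_type toy_tap_pt = {
	.type = htons(ETH_P_ALL),	/* ETH_P_ALL => placed on ptype_all */
	.func = toy_tap_rcv,
};

static int __init toy_tap_init(void)
{
	dev_add_pack(&toy_tap_pt);	/* registers on the ptype_all list */
	return 0;
}

static void __exit toy_tap_exit(void)
{
	dev_remove_pack(&toy_tap_pt);
}

module_init(toy_tap_init);
module_exit(toy_tap_exit);
MODULE_LICENSE("GPL");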
-Benjamin
* Re: twice past the taps, thence out to net?
2011-12-14 19:27 twice past the taps, thence out to net? Rick Jones
2011-12-15 0:58 ` Benjamin Poirier
@ 2011-12-15 2:12 ` Vijay Subramanian
2011-12-15 17:43 ` Eric Dumazet
1 sibling, 1 reply; 15+ messages in thread
From: Vijay Subramanian @ 2011-12-15 2:12 UTC (permalink / raw)
To: Rick Jones; +Cc: tcpdump-workers, netdev
On 14 December 2011 11:27, Rick Jones <rick.jones2@hp.com> wrote:
> While looking at "something else" with tcpdump/tcptrace, tcptrace emitted
> lots of notices about hardware duplicated packets being detected (same TCP
> sequence number and IP datagram ID). Sure enough, if I go into the tcpdump
> trace (taken on the sender) I can find instances of what it was talking
> about, separated in time by rather less than I would expect to be the RTO,
> and often as not with few if any intervening arriving ACKs to trigger
> anything like fast retransmit. And besides, those would have a different IP
> datagram ID no?
>
> I did manage to reproduce the issue with plain netperf tcp_stream tests. I
> had one sending system with 30 concurrent netperf tcp_stream tests to 30
> other receiving systems. There are "hardware duplicates" in the sending
> trace, but no duplicate segments (that I can find thus far) in the two
> receiver side traces I took. Of course that doesn't mean "conclusively"
> there were two actual sends but it suggests there werent.
>
> While I work through the "obtain permission" path to post the packet traces
> (don't ask...) I thought I would ask if anyone else has seen something
> similar.
>
> In this case, all the systems are running a 2.6.38-8 Ubuntu kernel (the same
> sorts of issues which delay my just putting the traces up on netperf.org
> preclude a later kernel, and I've no other test systems :( ), with Intel
> 82576 interfaces being driven by:
>
> $ sudo ethtool -i eth0
> driver: igb
> version: 2.1.0-k2
> firmware-version: 1.8-2
> bus-info: 0000:05:00.0
>
> All the systems were connected to the same switch.
>
Rick,
This may be of help.
http://www.tcptrace.org/faq_ans.html#FAQ%2021
Regards,
Vijay Subramanian
* Re: twice past the taps, thence out to net?
2011-12-15 2:12 ` Vijay Subramanian
@ 2011-12-15 17:43 ` Eric Dumazet
2011-12-15 18:32 ` Rick Jones
0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2011-12-15 17:43 UTC (permalink / raw)
To: Vijay Subramanian; +Cc: Rick Jones, tcpdump-workers, netdev
On Wednesday 14 December 2011 at 18:12 -0800, Vijay Subramanian wrote:
> On 14 December 2011 11:27, Rick Jones <rick.jones2@hp.com> wrote:
> > While looking at "something else" with tcpdump/tcptrace, tcptrace emitted
> > lots of notices about hardware duplicated packets being detected (same TCP
> > sequence number and IP datagram ID). Sure enough, if I go into the tcpdump
> > trace (taken on the sender) I can find instances of what it was talking
> > about, separated in time by rather less than I would expect to be the RTO,
> > and often as not with few if any intervening arriving ACKs to trigger
> > anything like fast retransmit. And besides, those would have a different IP
> > datagram ID no?
> >
> > I did manage to reproduce the issue with plain netperf tcp_stream tests. I
> > had one sending system with 30 concurrent netperf tcp_stream tests to 30
> > other receiving systems. There are "hardware duplicates" in the sending
> > trace, but no duplicate segments (that I can find thus far) in the two
> > receiver side traces I took. Of course that doesn't mean "conclusively"
> > there were two actual sends but it suggests there werent.
> >
> > While I work through the "obtain permission" path to post the packet traces
> > (don't ask...) I thought I would ask if anyone else has seen something
> > similar.
> >
> > In this case, all the systems are running a 2.6.38-8 Ubuntu kernel (the same
> > sorts of issues which delay my just putting the traces up on netperf.org
> > preclude a later kernel, and I've no other test systems :( ), with Intel
> > 82576 interfaces being driven by:
> >
> > $ sudo ethtool -i eth0
> > driver: igb
> > version: 2.1.0-k2
> > firmware-version: 1.8-2
> > bus-info: 0000:05:00.0
> >
> > All the systems were connected to the same switch.
> >
>
> Rick,
> This may be of help.
> http://www.tcptrace.org/faq_ans.html#FAQ%2021
More exactly, we call dev_queue_xmit_nit() from dev_hard_start_xmit()
_before_ giving the skb to the device driver.
If the device driver returns NETDEV_TX_BUSY, and a qdisc was set up on the
device, the packet is requeued.
Later, when the queue is allowed to send packets again, the packet is
retransmitted (and traced a second time in dev_queue_xmit_nit()).
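In other words, the sequence in net/sched/sch_generic.c looks roughly like
this (heavily condensed sketch, not the literal 2.6.38 code):

	/* inside sch_direct_xmit(), simplified */
	ret = dev_hard_start_xmit(skb, dev, txq);  /* calls dev_queue_xmit_nit():
						      the tap records the skb */
	if (ret == NETDEV_TX_BUSY) {
		/* driver had no room: put the skb back on the qdisc... */
		ret = dev_requeue_skb(skb, q);     /* bumps qstats.requeues */
		/* ...and when the qdisc runs again, dev_hard_start_xmit() --
		 * and therefore dev_queue_xmit_nit() -- is called for the
		 * same skb a second time, which is what tcpdump shows. */
	}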
You can see the 'requeues' counter in the "tc -s -d qdisc" output:
qdisc mq 0: dev eth2 root
Sent 29421597369 bytes 20301716 pkt (dropped 0, overlimits 0 requeues 371)
backlog 0b 0p requeues 371
* Re: twice past the taps, thence out to net?
2011-12-15 17:43 ` Eric Dumazet
@ 2011-12-15 18:32 ` Rick Jones
2011-12-15 18:44 ` Stephen Hemminger
2011-12-15 18:54 ` Eric Dumazet
0 siblings, 2 replies; 15+ messages in thread
From: Rick Jones @ 2011-12-15 18:32 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Vijay Subramanian, tcpdump-workers, netdev
> More exactly, we call dev_queue_xmit_nit() from dev_hard_start_xmit()
> _before_ giving skb to device driver.
>
> If device driver returns NETDEV_TX_BUSY, and a qdisc was setup on the
> device, packet is requeued.
>
> Later, when queue is allowed to send again packets, packet is
> retransmitted (and traced a second time in dev_queue_xmit_nit())
Is this then an unintended consequence bug, or a known feature?
rick
> You can see the 'requeues' counter from "tc -s -d qdisc" output :
>
> qdisc mq 0: dev eth2 root
> Sent 29421597369 bytes 20301716 pkt (dropped 0, overlimits 0 requeues 371)
> backlog 0b 0p requeues 371
Sure enough:
$ tc -s -d qdisc
qdisc mq 0: dev eth0 root
Sent 2212158799862 bytes 1938268098 pkt (dropped 0, overlimits 0
requeues 4975139)
backlog 0b 0p requeues 4975139
rick jones
* Re: twice past the taps, thence out to net?
2011-12-15 18:32 ` Rick Jones
@ 2011-12-15 18:44 ` Stephen Hemminger
2011-12-15 19:00 ` Eric Dumazet
2011-12-15 18:54 ` Eric Dumazet
1 sibling, 1 reply; 15+ messages in thread
From: Stephen Hemminger @ 2011-12-15 18:44 UTC (permalink / raw)
To: Rick Jones; +Cc: Eric Dumazet, Vijay Subramanian, tcpdump-workers, netdev
On Thu, 15 Dec 2011 10:32:56 -0800
Rick Jones <rick.jones2@hp.com> wrote:
>
> > More exactly, we call dev_queue_xmit_nit() from dev_hard_start_xmit()
> > _before_ giving skb to device driver.
> >
> > If device driver returns NETDEV_TX_BUSY, and a qdisc was setup on the
> > device, packet is requeued.
> >
> > Later, when queue is allowed to send again packets, packet is
> > retransmitted (and traced a second time in dev_queue_xmit_nit())
>
> Is this then an unintended consequence bug, or a known feature?
>
> rick
>
> > You can see the 'requeues' counter from "tc -s -d qdisc" output :
> >
> > qdisc mq 0: dev eth2 root
> > Sent 29421597369 bytes 20301716 pkt (dropped 0, overlimits 0 requeues 371)
> > backlog 0b 0p requeues 371
>
> Sure enough:
>
> $ tc -s -d qdisc
> qdisc mq 0: dev eth0 root
> Sent 2212158799862 bytes 1938268098 pkt (dropped 0, overlimits 0
> requeues 4975139)
> backlog 0b 0p requeues 4975139
>
> rick jones
Devices work better if the driver proactively manages stop_queue/wake_queue.
Old devices used TX_BUSY, but newer devices tend to manage the queue
themselves.
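To illustrate the difference, a sketch of the two styles in a driver's
ndo_start_xmit (ring_space(), needed_descs(), post_to_hw() and the
thresholds are made-up helpers, not real driver code):

/* Old style: push back on the core and let the qdisc requeue the skb. */
static netdev_tx_t xmit_busy_style(struct sk_buff *skb, struct net_device *dev)
{
	if (ring_space(dev) < needed_descs(skb))
		return NETDEV_TX_BUSY;	/* skb is requeued -- and tapped again */
	post_to_hw(dev, skb);
	return NETDEV_TX_OK;
}

/* Newer style: never return BUSY; stop the queue before it can overflow
 * and wake it from the TX-completion interrupt. */
static netdev_tx_t xmit_stop_wake_style(struct sk_buff *skb, struct net_device *dev)
{
	post_to_hw(dev, skb);
	if (ring_space(dev) < WORST_CASE_DESCS)
		netif_stop_queue(dev);	/* core stops handing us skbs */
	return NETDEV_TX_OK;
}

/* ...and in the TX-completion handler:
 *	if (netif_queue_stopped(dev) && ring_space(dev) >= WAKE_THRESH)
 *		netif_wake_queue(dev);
 */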
* Re: twice past the taps, thence out to net?
2011-12-15 18:44 ` Stephen Hemminger
@ 2011-12-15 19:00 ` Eric Dumazet
2011-12-15 22:22 ` Rick Jones
0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2011-12-15 19:00 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Rick Jones, Vijay Subramanian, tcpdump-workers, netdev
On Thursday 15 December 2011 at 10:44 -0800, Stephen Hemminger wrote:
> On Thu, 15 Dec 2011 10:32:56 -0800
> Rick Jones <rick.jones2@hp.com> wrote:
>
> >
> > > More exactly, we call dev_queue_xmit_nit() from dev_hard_start_xmit()
> > > _before_ giving skb to device driver.
> > >
> > > If device driver returns NETDEV_TX_BUSY, and a qdisc was setup on the
> > > device, packet is requeued.
> > >
> > > Later, when queue is allowed to send again packets, packet is
> > > retransmitted (and traced a second time in dev_queue_xmit_nit())
> >
> > Is this then an unintended consequence bug, or a known feature?
> >
> > rick
> >
> > > You can see the 'requeues' counter from "tc -s -d qdisc" output :
> > >
> > > qdisc mq 0: dev eth2 root
> > > Sent 29421597369 bytes 20301716 pkt (dropped 0, overlimits 0 requeues 371)
> > > backlog 0b 0p requeues 371
> >
> > Sure enough:
> >
> > $ tc -s -d qdisc
> > qdisc mq 0: dev eth0 root
> > Sent 2212158799862 bytes 1938268098 pkt (dropped 0, overlimits 0
> > requeues 4975139)
> > backlog 0b 0p requeues 4975139
> >
> > rick jones
>
> Device's work better if the driver proactively manages stop_queue/wake_queue.
> Old devices used TX_BUSY, but newer devices tend to manage the queue
> themselves.
>
Some 'new' drivers like igb can be fooled in case the skb is GSO-segmented?
Because igb_xmit_frame_ring() needs skb_shinfo(skb)->nr_frags + 4
descriptors, igb should stop its queue not at MAX_SKB_FRAGS + 4, but at
MAX_SKB_FRAGS * 4:
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 89d576c..989da36 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -4370,7 +4370,7 @@ netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
igb_tx_map(tx_ring, first, hdr_len);
/* Make sure there is space in the ring for the next send. */
- igb_maybe_stop_tx(tx_ring, MAX_SKB_FRAGS + 4);
+ igb_maybe_stop_tx(tx_ring, MAX_SKB_FRAGS * 4);
return NETDEV_TX_OK;
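If the theory holds, one way to check (illustrative commands, adjust the
interface and receiver names) is to record the requeue counter before and
after a test run and see whether it still climbs with the change applied:

$ tc -s -d qdisc show dev eth0 | grep requeues
$ netperf -H <receiver> -t TCP_STREAM -l 30
$ tc -s -d qdisc show dev eth0 | grep requeues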
* Re: twice past the taps, thence out to net?
2011-12-15 19:00 ` Eric Dumazet
@ 2011-12-15 22:22 ` Rick Jones
2011-12-16 4:27 ` Eric Dumazet
0 siblings, 1 reply; 15+ messages in thread
From: Rick Jones @ 2011-12-15 22:22 UTC (permalink / raw)
To: Eric Dumazet
Cc: Stephen Hemminger, Vijay Subramanian, tcpdump-workers, netdev
On 12/15/2011 11:00 AM, Eric Dumazet wrote:
>> Device's work better if the driver proactively manages stop_queue/wake_queue.
>> Old devices used TX_BUSY, but newer devices tend to manage the queue
>> themselves.
>>
>
> Some 'new' drivers like igb can be fooled in case skb is gso segmented ?
>
> Because igb_xmit_frame_ring() needs skb_shinfo(skb)->nr_frags + 4
> descriptors, igb should stop its queue not at MAX_SKB_FRAGS + 4, but
> MAX_SKB_FRAGS*4
>
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index 89d576c..989da36 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -4370,7 +4370,7 @@ netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
> igb_tx_map(tx_ring, first, hdr_len);
>
> /* Make sure there is space in the ring for the next send. */
> - igb_maybe_stop_tx(tx_ring, MAX_SKB_FRAGS + 4);
> + igb_maybe_stop_tx(tx_ring, MAX_SKB_FRAGS * 4);
>
> return NETDEV_TX_OK;
Is there a minimum transmit queue length here? I get the impression
that MAX_SKB_FRAGS is at least 16, and is 18 on a system with 4096-byte
pages. The previous addition then would be OK so long as the TX queue
was always at least 22 entries in size, but now it would always have to
be at least 72?
I guess things are "OK" at the moment:
raj@tardy:~/net-next/drivers/net/ethernet/intel/igb$ grep IGB_MIN_TXD *.[ch]
igb_ethtool.c: new_tx_count = max_t(u16, new_tx_count, IGB_MIN_TXD);
igb.h:#define IGB_MIN_TXD 80
but is that getting a little close?
rick jones
* Re: twice past the taps, thence out to net?
2011-12-15 22:22 ` Rick Jones
@ 2011-12-16 4:27 ` Eric Dumazet
2011-12-16 18:28 ` Jesse Brandeburg
2011-12-16 19:35 ` Rick Jones
0 siblings, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2011-12-16 4:27 UTC (permalink / raw)
To: Rick Jones
Cc: Stephen Hemminger, Vijay Subramanian, tcpdump-workers, netdev,
Matthew Vick, Jeff Kirsher
On Thursday 15 December 2011 at 14:22 -0800, Rick Jones wrote:
> On 12/15/2011 11:00 AM, Eric Dumazet wrote:
> >> Device's work better if the driver proactively manages stop_queue/wake_queue.
> >> Old devices used TX_BUSY, but newer devices tend to manage the queue
> >> themselves.
> >>
> >
> > Some 'new' drivers like igb can be fooled in case skb is gso segmented ?
> >
> > Because igb_xmit_frame_ring() needs skb_shinfo(skb)->nr_frags + 4
> > descriptors, igb should stop its queue not at MAX_SKB_FRAGS + 4, but
> > MAX_SKB_FRAGS*4
> >
> > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> > index 89d576c..989da36 100644
> > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > @@ -4370,7 +4370,7 @@ netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
> > igb_tx_map(tx_ring, first, hdr_len);
> >
> > /* Make sure there is space in the ring for the next send. */
> > - igb_maybe_stop_tx(tx_ring, MAX_SKB_FRAGS + 4);
> > + igb_maybe_stop_tx(tx_ring, MAX_SKB_FRAGS * 4);
> >
> > return NETDEV_TX_OK;
>
>
> Is there a minimum transmit queue length here? I get the impression
> that MAX_SKB_FRAGS is at least 16 and is 18 on a system with 4096 byte
> pages. The previous addition then would be OK so long as the TX queue
> was always at least 22 entries in size, but now it would have to always
> be at least 72?
>
> I guess things are "OK" at the moment:
>
> raj@tardy:~/net-next/drivers/net/ethernet/intel/igb$ grep IGB_MIN_TXD *.[ch]
> igb_ethtool.c: new_tx_count = max_t(u16, new_tx_count, IGB_MIN_TXD);
> igb.h:#define IGB_MIN_TXD 80
>
> but is that getting a little close?
>
> rick jones
Sure!
I only pointed out a possible problem, and didn't give a full patch, since
we also need to change the opposite threshold (when we XON the queue at
TX completion).
You can see it's not even consistent with the minimum for a single TSO
frame! Most probably your high requeue numbers come from this too-low
value, given the real requirements of the hardware (4 + nr_frags
descriptors per skb):
/* How many Tx Descriptors do we need to call netif_wake_queue ? */
#define IGB_TX_QUEUE_WAKE 16
Maybe we should CC the Intel guys.
Could you try the following patch?
Thanks!
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index c69feeb..93ce118 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -51,8 +51,8 @@ struct igb_adapter;
/* TX/RX descriptor defines */
#define IGB_DEFAULT_TXD 256
#define IGB_DEFAULT_TX_WORK 128
-#define IGB_MIN_TXD 80
-#define IGB_MAX_TXD 4096
+#define IGB_MIN_TXD max_t(unsigned, 80U, IGB_TX_QUEUE_WAKE * 2)
+#define IGB_MAX_TXD 4096
#define IGB_DEFAULT_RXD 256
#define IGB_MIN_RXD 80
@@ -121,8 +121,11 @@ struct vf_data_storage {
#define IGB_RXBUFFER_16384 16384
#define IGB_RX_HDR_LEN IGB_RXBUFFER_512
-/* How many Tx Descriptors do we need to call netif_wake_queue ? */
-#define IGB_TX_QUEUE_WAKE 16
+/* How many Tx Descriptors should be available
+ * before calling netif_wake_subqueue() ?
+ */
+#define IGB_TX_QUEUE_WAKE (MAX_SKB_FRAGS * 4)
+
/* How many Rx Buffers do we bundle into one write to the hardware ? */
#define IGB_RX_BUFFER_WRITE 16 /* Must be power of 2 */
* Re: twice past the taps, thence out to net?
2011-12-16 4:27 ` Eric Dumazet
@ 2011-12-16 18:28 ` Jesse Brandeburg
2011-12-16 19:34 ` Eric Dumazet
2011-12-16 19:35 ` Rick Jones
1 sibling, 1 reply; 15+ messages in thread
From: Jesse Brandeburg @ 2011-12-16 18:28 UTC (permalink / raw)
To: Eric Dumazet
Cc: Rick Jones, Stephen Hemminger, Vijay Subramanian, tcpdump-workers,
netdev, Matthew Vick, Jeff Kirsher
On Thu, Dec 15, 2011 at 8:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le jeudi 15 décembre 2011 à 14:22 -0800, Rick Jones a écrit :
>> On 12/15/2011 11:00 AM, Eric Dumazet wrote:
>> >> Device's work better if the driver proactively manages stop_queue/wake_queue.
>> >> Old devices used TX_BUSY, but newer devices tend to manage the queue
>> >> themselves.
>> >>
>> >
>> > Some 'new' drivers like igb can be fooled in case skb is gso segmented ?
>> >
>> > Because igb_xmit_frame_ring() needs skb_shinfo(skb)->nr_frags + 4
>> > descriptors, igb should stop its queue not at MAX_SKB_FRAGS + 4, but
>> > MAX_SKB_FRAGS*4
Can you please help me understand the need for MAX_SKB_FRAGS * 4 as
the requirement? Currently the driver uses logic like:
in hard_start_tx: hey, I just finished a tx; I should stop the qdisc if
I don't have room (in tx descriptors) for a worst-case transmit skb
(MAX_SKB_FRAGS + 4) the next time I'm called.
when cleaning from interrupt: my cleanup is done; do I have enough
free tx descriptors (should be MAX_SKB_FRAGS + 4) for a worst-case
transmit? If yes, restart the qdisc.
I'm missing the jump from the above logic to using MAX_SKB_FRAGS * 4
(== 18 * 4 == 72) as the minimum number of descriptors I need for a
worst-case TSO. Each descriptor can point to up to 16kB of contiguous
memory; typically we use 1 for offload context setup, 1 for skb->data,
and 1 for each page. I think we may be overestimating with
MAX_SKB_FRAGS + 4, but that should be no big deal.
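For a concrete worst case (illustrative numbers, assuming 4096-byte pages so
MAX_SKB_FRAGS == 18), that accounting gives:

	 1 descriptor  for the offload context setup
	 1 descriptor  for skb->data (the linear part)
	18 descriptors for the MAX_SKB_FRAGS page fragments
	--
	20 descriptors, already covered by the MAX_SKB_FRAGS + 4 == 22 reservation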
* Re: twice past the taps, thence out to net?
2011-12-16 18:28 ` Jesse Brandeburg
@ 2011-12-16 19:34 ` Eric Dumazet
0 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2011-12-16 19:34 UTC (permalink / raw)
To: Jesse Brandeburg
Cc: Rick Jones, Stephen Hemminger, Vijay Subramanian, tcpdump-workers,
netdev, Matthew Vick, Jeff Kirsher
On Friday 16 December 2011 at 10:28 -0800, Jesse Brandeburg wrote:
> On Thu, Dec 15, 2011 at 8:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le jeudi 15 décembre 2011 à 14:22 -0800, Rick Jones a écrit :
> >> On 12/15/2011 11:00 AM, Eric Dumazet wrote:
> >> >> Device's work better if the driver proactively manages stop_queue/wake_queue.
> >> >> Old devices used TX_BUSY, but newer devices tend to manage the queue
> >> >> themselves.
> >> >>
> >> >
> >> > Some 'new' drivers like igb can be fooled in case skb is gso segmented ?
> >> >
> >> > Because igb_xmit_frame_ring() needs skb_shinfo(skb)->nr_frags + 4
> >> > descriptors, igb should stop its queue not at MAX_SKB_FRAGS + 4, but
> >> > MAX_SKB_FRAGS*4
>
> can you please help me understand the need for MAX_SKB_FRAGS * 4 as
> the requirement? Currently driver uses logic like
>
> in hard_start_tx: hey I just finished a tx, I should stop the qdisc if
> I don't have room (in tx descriptors) for a worst case transmit skb
> (MAX_SKB_FRAGS + 4) the next time I'm called.
> when cleaning from interrupt: My cleanup is done, do I have enough
> free tx descriptors (should be MAX_SKB_FRAGS + 4) for a worst case
> transmit? If yes, restart qdisc.
>
> I'm missing the jump from the above logic to using MAX_SKB_FRAGS * 4
> (== (18 * 4) == 72) as the minimum number of descriptors I need for a
> worst case TSO. Each descriptor can point to up to 16kB of contiguous
> memory, typically we use 1 for offload context setup, 1 for skb->data,
> and 1 for each page. I think we may be overestimating with
> MAX_SKB_FRAGS + 4, but that should be no big deal.
Did you read my second patch?
The problem is that you wake up the queue too soon (16 available descriptors,
while a full TSO packet needs more than that).
How would you explain the high 'requeues' number if that was not the problem?
Also, it's suboptimal to wake up the queue if the available space is very low,
since only _one_ packet may be dequeued from the qdisc (you pay a high cost in
cache line bouncing).
My first patch was about a very rare event: a full TSO packet is
segmented in gso_segment() [say, if you dynamically disable sg on the eth
device and an old tcp buffer is retransmitted]: you end up with 16 skbs
delivered to the NIC. In this case we can hit the tx ring limit at the 4th or
5th skb, and Rick complains tcpdump outputs some packets several times ;)
Since igb needs 4 descriptors for a linear skb, I said 4 *
MAX_SKB_FRAGS, but the real problem is addressed in my second patch, I
believe?
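A condensed view of the wake-up check in the TX-completion path (simplified,
not the exact igb_clean_tx_irq() code, which also tests carrier state etc.):

	if (__netif_subqueue_stopped(netdev, tx_ring->queue_index) &&
	    igb_desc_unused(tx_ring) >= IGB_TX_QUEUE_WAKE)	/* only 16 today */
		netif_wake_subqueue(netdev, tx_ring->queue_index);

With IGB_TX_QUEUE_WAKE == 16, the queue can be woken when there is not yet
room for even one worst-case skb (4 + nr_frags descriptors), so the very next
transmit can return NETDEV_TX_BUSY, be requeued by the qdisc, and be traced by
the tap a second time.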
* Re: twice past the taps, thence out to net?
2011-12-16 4:27 ` Eric Dumazet
2011-12-16 18:28 ` Jesse Brandeburg
@ 2011-12-16 19:35 ` Rick Jones
2011-12-16 19:44 ` Eric Dumazet
1 sibling, 1 reply; 15+ messages in thread
From: Rick Jones @ 2011-12-16 19:35 UTC (permalink / raw)
To: Eric Dumazet
Cc: Stephen Hemminger, Vijay Subramanian, tcpdump-workers, netdev,
Matthew Vick, Jeff Kirsher
>> but is that getting a little close?
>>
>> rick jones
>
> Sure !
>
> I only pointed out a possible problem, and not gave a full patch, since
> we also need to change the opposite threshold (when we XON the queue at
> TX completion)
>
> You can see its not even consistent with the minimum for a single TSO
> frame ! Most probably your high requeue numbers come from this too low
> value given the real requirements of the hardware (4 + nr_frags
> descriptors per skb)
>
> /* How many Tx Descriptors do we need to call netif_wake_queue ? */
> #define IGB_TX_QUEUE_WAKE 16
>
>
> Maybe we should CC Intel guys
>
> Could you try following patch ?
I would *love* to. All my accessible igb-driven hardware is in an
environment locked to the kernels already there :( Not that it makes it
more possible for me to do it, but I suspect it does not require 30
receivers to reproduce the dups with netperf TCP_STREAM. Particularly
if the tx queue len is at 256 it may only take 6 or 8. In fact let me
try that now...
Yep, with just 8 destinations/concurrent TCP_STREAM tests from the one
system one can still see the duplicates in the packet trace taken on the
sender.
Perhaps we can trouble the Intel guys to try to reproduce what I've seen?
rick
>
> Thanks !
>
> diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
> index c69feeb..93ce118 100644
> --- a/drivers/net/ethernet/intel/igb/igb.h
> +++ b/drivers/net/ethernet/intel/igb/igb.h
> @@ -51,8 +51,8 @@ struct igb_adapter;
> /* TX/RX descriptor defines */
> #define IGB_DEFAULT_TXD 256
> #define IGB_DEFAULT_TX_WORK 128
> -#define IGB_MIN_TXD 80
> -#define IGB_MAX_TXD 4096
> +#define IGB_MIN_TXD max_t(unsigned, 80U, IGB_TX_QUEUE_WAKE * 2)
> +#define IGB_MAX_TXD 4096
>
> #define IGB_DEFAULT_RXD 256
> #define IGB_MIN_RXD 80
> @@ -121,8 +121,11 @@ struct vf_data_storage {
> #define IGB_RXBUFFER_16384 16384
> #define IGB_RX_HDR_LEN IGB_RXBUFFER_512
>
> -/* How many Tx Descriptors do we need to call netif_wake_queue ? */
> -#define IGB_TX_QUEUE_WAKE 16
> +/* How many Tx Descriptors should be available
> + * before calling netif_wake_subqueue() ?
> + */
> +#define IGB_TX_QUEUE_WAKE (MAX_SKB_FRAGS * 4)
> +
> /* How many Rx Buffers do we bundle into one write to the hardware ? */
> #define IGB_RX_BUFFER_WRITE 16 /* Must be power of 2 */
>
>
>
* Re: twice past the taps, thence out to net?
2011-12-16 19:35 ` Rick Jones
@ 2011-12-16 19:44 ` Eric Dumazet
2011-12-20 21:21 ` Wyborny, Carolyn
0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2011-12-16 19:44 UTC (permalink / raw)
To: Rick Jones
Cc: Stephen Hemminger, Vijay Subramanian, tcpdump-workers, netdev,
Matthew Vick, Jeff Kirsher
On Friday 16 December 2011 at 11:35 -0800, Rick Jones wrote:
> I would *love* to. All my accessible igb-driven hardware is in an
> environment locked to the kernels already there :( Not that it makes it
> more possible for me to do it, but I suspect it does not require 30
> receivers to reproduce the dups with netperf TCP_STREAM. Particularly
> if the tx queue len is at 256 it may only take 6 or 8. In fact let me
> try that now...
>
> Yep, with just 8 destinations/concurrent TCP_STREAM tests from the one
> system one can still see the duplicates in the packet trace taken on the
> sender.
>
> Perhaps we can trouble the Intel guys to try to reproduce what I've seen?
>
I do have an igb card somewhere (in fact two dual ports), I'll do the
test myself !
Thanks
* RE: twice past the taps, thence out to net?
2011-12-16 19:44 ` Eric Dumazet
@ 2011-12-20 21:21 ` Wyborny, Carolyn
0 siblings, 0 replies; 15+ messages in thread
From: Wyborny, Carolyn @ 2011-12-20 21:21 UTC (permalink / raw)
To: Eric Dumazet, Rick Jones
Cc: Stephen Hemminger, Vijay Subramanian,
tcpdump-workers@lists.tcpdump.org, netdev@vger.kernel.org,
Vick, Matthew, Kirsher, Jeffrey T
>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
>On Behalf Of Eric Dumazet
>Sent: Friday, December 16, 2011 11:45 AM
>To: Rick Jones
>Cc: Stephen Hemminger; Vijay Subramanian; tcpdump-
>workers@lists.tcpdump.org; netdev@vger.kernel.org; Vick, Matthew;
>Kirsher, Jeffrey T
>Subject: Re: twice past the taps, thence out to net?
>
>Le vendredi 16 décembre 2011 à 11:35 -0800, Rick Jones a écrit :
>
>> I would *love* to. All my accessible igb-driven hardware is in an
>> environment locked to the kernels already there :( Not that it makes
>it
>> more possible for me to do it, but I suspect it does not require 30
>> receivers to reproduce the dups with netperf TCP_STREAM. Particularly
>> if the tx queue len is at 256 it may only take 6 or 8. In fact let me
>> try that now...
>>
>> Yep, with just 8 destinations/concurrent TCP_STREAM tests from the one
>> system one can still see the duplicates in the packet trace taken on
>the
>> sender.
>>
>> Perhaps we can trouble the Intel guys to try to reproduce what I've
>seen?
>>
>
>I do have an igb card somewhere (in fact two dual ports), I'll do the
>test myself !
>
>Thanks
>
>
Let me know if I can do anything to assist. Sorry to have overlooked this thread for a bit.
Thanks,
Carolyn
Carolyn Wyborny
Linux Development
LAN Access Division
Intel Corporation
* Re: twice past the taps, thence out to net?
2011-12-15 18:32 ` Rick Jones
2011-12-15 18:44 ` Stephen Hemminger
@ 2011-12-15 18:54 ` Eric Dumazet
1 sibling, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2011-12-15 18:54 UTC (permalink / raw)
To: Rick Jones; +Cc: Vijay Subramanian, tcpdump-workers, netdev
On Thursday 15 December 2011 at 10:32 -0800, Rick Jones wrote:
> > More exactly, we call dev_queue_xmit_nit() from dev_hard_start_xmit()
> > _before_ giving skb to device driver.
> >
> > If device driver returns NETDEV_TX_BUSY, and a qdisc was setup on the
> > device, packet is requeued.
> >
> > Later, when queue is allowed to send again packets, packet is
> > retransmitted (and traced a second time in dev_queue_xmit_nit())
>
> Is this then an unintended consequence bug, or a known feature?
>
It's a well-known feature; some people attempted to remove it ;)
http://answers.softpicks.net/answers/topic/-PATCH-tcpdump-may-trace-some-outbound-packets-twice--2204640-1.htm