* Re: [E1000-devel] e1000 jumbo problems
       [not found] <40D883C2.7010106@draigBrady.com>
@ 2004-06-23 17:35 ` P
  2004-06-24  5:49   ` TCP receiver's window calculation problem Cheng Jin
       [not found]    ` <41b516cb040623114825a9c555@mail.gmail.com>
  0 siblings, 2 replies; 9+ messages in thread

From: P @ 2004-06-23 17:35 UTC
To: e1000-devel; +Cc: netdev

P@draigBrady.com wrote:
> Another related issue is that the driver uses 4KiB buffers
> for MTUs in the 1500 -> 2000 range, which seems a bit silly.
> Any particular reason for that?

I changed the driver to use 2KiB buffers for frames in the
1518 -> 2048 range (BSEX=0, LPE=1). This breaks, however, because
packets larger than the specified maximum are not dropped; instead
they're scribbled into memory, causing a lockup after a while.

I noticed in e1000_change_mtu() that adapter->hw.max_frame_size is
only set after e1000_down(); e1000_up(); Is that correct?

Are there any answers to the general questions I asked earlier?

1. Is there a public dev tree available for the e1000 driver?
2. Are there programming docs for the various GigE chipsets?

thanks,
Pádraig.

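For context, the ordering being asked about looks roughly like the sketch
below. This is a paraphrase rather than a verbatim copy of the driver, and
the frame-size constant names are assumptions; the point is only that
hw.max_frame_size is written after the down/up cycle that reprograms the
receive unit.

/* Paraphrased sketch, not verbatim driver code; ENET_HEADER_SIZE and
 * ETHERNET_FCS_SIZE are assumed names for the header/FCS constants. */
static int e1000_change_mtu(struct net_device *netdev, int new_mtu)
{
	struct e1000_adapter *adapter = netdev->priv;

	/* ... MTU range checks and rx_buffer_len selection ... */
	netdev->mtu = new_mtu;

	if (netif_running(netdev)) {
		e1000_down(adapter);
		e1000_up(adapter);	/* reprograms RCTL and the rx ring */
	}

	/* only now is the new maximum frame size recorded */
	adapter->hw.max_frame_size =
		new_mtu + ENET_HEADER_SIZE + ETHERNET_FCS_SIZE;

	return 0;
}
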
* TCP receiver's window calculation problem
  2004-06-23 17:35 ` [E1000-devel] e1000 jumbo problems P
@ 2004-06-24  5:49 ` Cheng Jin
  2004-06-24 17:43   ` John Heffner
       [not found]    ` <41b516cb040623114825a9c555@mail.gmail.com>
  1 sibling, 1 reply; 9+ messages in thread

From: Cheng Jin @ 2004-06-24 5:49 UTC
To: netdev@oss.sgi.com; +Cc: fast-support

Hi,

We have been running some iperf experiments over long-latency,
high-capacity networks for protocol testing. We noticed a strange
receiver's window limitation of 3,147,776 bytes even when the iperf
server was set up to request 32 MB of socket buffer (for which the
kernel grants twice that).

After adding printk to the various window calculation functions at the
receiver, we believe there may be a problem with the tp->rcv_ssthresh
calculation in __tcp_grow_window in tcp_input.c.

With tcp memory of 64 MB, a jumbo MTU (9000 bytes) at the receiver
(which gives a skb_true_size of 16660 bytes), and a standard MTU
(1500 bytes) at the sender (which yields a skb_len of 1448 bytes),
tp->rcv_ssthresh gets stuck at 3,148,472 (see the code segment below).
Because the TCP receiver's window needs to be a multiple of the mss
(/1448 then *1448) and of the window scaling (>>10 then <<10), the
sender sees a limit of 3,147,776 bytes.

I include example code (with the data structs stripped away and the
macros expanded) that reproduces this problem. The function
__tcp_grow_window itself may have problems for other combinations of
input.

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int __u32;

static int
__tcp_grow_window(__u32 rcv_ssthresh, __u32 tcp_full_space,
                  __u32 skb_true_size, __u32 skb_len)
{
	int truesize = skb_true_size*3/8;
	int window = tcp_full_space*3/8;

	while (rcv_ssthresh <= window) {
		if (truesize <= skb_len)
			return 2896;

		truesize >>= 1;
		window >>= 1;
	}
	return 0;
}

int main()
{
	__u32 iperf_mem = 64*1024*1024;
	__u32 skb_true_size = 16660;
	__u32 skb_len = 1448;
	__u32 rcv_ssthresh = 3148472;

	int i, incr;

	for (i = 0; i < 1000; ++i) {
		incr = __tcp_grow_window(rcv_ssthresh, iperf_mem,
		                         skb_true_size, skb_len);
		printf("i=%d incr=%d\n", i, incr);
	}
	return 0;
}

Cheng

--
Lab # 626 395 8820

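For completeness, the rounding that turns the stuck rcv_ssthresh into the
3,147,776-byte limit seen by the sender can be checked with the small
standalone program below. It only demonstrates the arithmetic described
above; the real stack performs the mss rounding and window-scale
truncation in its own helper functions.

#include <stdio.h>

int main(void)
{
	unsigned int rcv_ssthresh = 3148472;	/* value rcv_ssthresh is stuck at */
	unsigned int mss = 1448;
	unsigned int wscale = 10;		/* window scale in use here */
	unsigned int window;

	window = rcv_ssthresh / mss * mss;	/* round down to an mss multiple: 3147952 */
	window = (window >> wscale) << wscale;	/* window-scale truncation: 3147776 */

	printf("advertised window = %u\n", window);
	return 0;
}

Compiled and run, it prints 3147776, matching the limit observed with iperf.
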
* Re: TCP receiver's window calculation problem
  2004-06-24  5:49 ` TCP receiver's window calculation problem Cheng Jin
@ 2004-06-24 17:43 ` John Heffner
  2004-06-24 19:18   ` Cheng Jin
  0 siblings, 1 reply; 9+ messages in thread

From: John Heffner @ 2004-06-24 17:43 UTC
To: Cheng Jin; +Cc: netdev@oss.sgi.com, fast-support

I've run into this problem, too. This code prevents advertising more
rcvbuf space than you are likely to need. This is good for something
like an X11 connection, but obviously very bad for the bulk-transfer
mixed-MTU case. Apparently some drivers use multiple packet buffer
sizes, which helps, but at least e1000 and sk98lin do not.

If you don't want to overrun the rcvbuf bounds, the only other
recourse is to coalesce packets, which works well but is pretty
expensive. This will happen already if you take out the rcv_ssthresh
bound.

I think the most desirable answer is to not have a hard per-connection
memory bound, but this is problematic because of denial-of-service
concerns.

  -John

On Wed, 23 Jun 2004, Cheng Jin wrote:

> Hi,
>
> We have been running some iperf experiments over long-latency,
> high-capacity networks for protocol testing. We noticed a strange
> receiver's window limitation of 3,147,776 bytes even when the iperf
> server was set up to request 32 MB of socket buffer (for which the
> kernel grants twice that).
>
> After adding printk to the various window calculation functions at the
> receiver, we believe there may be a problem with the tp->rcv_ssthresh
> calculation in __tcp_grow_window in tcp_input.c.
>
> With tcp memory of 64 MB, a jumbo MTU (9000 bytes) at the receiver
> (which gives a skb_true_size of 16660 bytes), and a standard MTU
> (1500 bytes) at the sender (which yields a skb_len of 1448 bytes),
> tp->rcv_ssthresh gets stuck at 3,148,472 (see the code segment below).
> Because the TCP receiver's window needs to be a multiple of the mss
> (/1448 then *1448) and of the window scaling (>>10 then <<10), the
> sender sees a limit of 3,147,776 bytes.
>
> I include example code (with the data structs stripped away and the
> macros expanded) that reproduces this problem. The function
> __tcp_grow_window itself may have problems for other combinations of
> input.
>
> #include <stdio.h>
> #include <stdlib.h>
>
> typedef unsigned int __u32;
>
> static int
> __tcp_grow_window(__u32 rcv_ssthresh, __u32 tcp_full_space,
>                   __u32 skb_true_size, __u32 skb_len)
> {
> 	int truesize = skb_true_size*3/8;
> 	int window = tcp_full_space*3/8;
>
> 	while (rcv_ssthresh <= window) {
> 		if (truesize <= skb_len)
> 			return 2896;
>
> 		truesize >>= 1;
> 		window >>= 1;
> 	}
> 	return 0;
> }
>
> int main()
> {
> 	__u32 iperf_mem = 64*1024*1024;
> 	__u32 skb_true_size = 16660;
> 	__u32 skb_len = 1448;
> 	__u32 rcv_ssthresh = 3148472;
>
> 	int i, incr;
>
> 	for (i = 0; i < 1000; ++i) {
> 		incr = __tcp_grow_window(rcv_ssthresh, iperf_mem,
> 		                         skb_true_size, skb_len);
> 		printf("i=%d incr=%d\n", i, incr);
> 	}
> 	return 0;
> }
>
> Cheng
>
> --
> Lab # 626 395 8820

* Re: TCP receiver's window calculation problem
  2004-06-24 17:43 ` John Heffner
@ 2004-06-24 19:18 ` Cheng Jin
  2004-06-24 19:26   ` John Heffner
  0 siblings, 1 reply; 9+ messages in thread

From: Cheng Jin @ 2004-06-24 19:18 UTC
To: John Heffner; +Cc: netdev@oss.sgi.com, fast-support@cs.caltech.edu

Hi, John,

Thanks for confirming this problem.

> I've run into this problem, too. This code prevents advertising more
> rcvbuf space than you are likely to need. This is good for something like
> an X11 connection, but obviously very bad for the bulk-transfer mixed-MTU
> case.

I would think this is already taken care of at the sender: with an
application-limited cwnd, cwnd wouldn't increase beyond what is
actually being used.

> I think the most desirable answer is to not have a hard per-connection
> memory bound, but this is problematic because of denial-of-service
> concerns.

I think having a default limit on tcp memory is acceptable to prevent
DoS, but when a user increases the memory limit by explicitly setting
tcp_rmem, that should take effect. The code itself shouldn't impose any
limit like it does now.

Actually, I am not clear what that window-calculation algorithm is. Is
it recommended by some RFC?

Cheng

* Re: TCP receiver's window calculation problem
  2004-06-24 19:18 ` Cheng Jin
@ 2004-06-24 19:26 ` John Heffner
  2004-06-25  6:37   ` Cheng Jin
  0 siblings, 1 reply; 9+ messages in thread

From: John Heffner @ 2004-06-24 19:26 UTC
To: Cheng Jin; +Cc: netdev@oss.sgi.com, fast-support@cs.caltech.edu

On Thu, 24 Jun 2004, Cheng Jin wrote:
> I think having a default limit on tcp memory is acceptable to prevent
> DoS, but when a user increases the memory limit by explicitly setting
> tcp_rmem, that should take effect. The code itself shouldn't impose any
> limit like it does now.

The core of the problem is that you are describing a truesize of about
16k for each skb, but each of those only contains < 1500 bytes of
payload (1448/16660 is roughly 9%). You are wasting about 90% of your
socket memory. Announcing a 3 MB window with a 30 MB socket buffer is
the right thing to do, from a certain point of view. OTOH, it kills
performance.

> Actually, I am not clear what that window-calculation algorithm is. Is
> it recommended by some RFC?

No, it's not standard. I'm not sure who wrote this code.

  -John

* Re: TCP receiver's window calculation problem
  2004-06-24 19:26 ` John Heffner
@ 2004-06-25  6:37 ` Cheng Jin
  2004-06-25 13:43   ` John Heffner
  0 siblings, 1 reply; 9+ messages in thread

From: Cheng Jin @ 2004-06-25 6:37 UTC
To: John Heffner; +Cc: netdev@oss.sgi.com, fast-support@cs.caltech.edu

John,

>> I think having a default limit on tcp memory is acceptable to prevent
>> DoS, but when a user increases the memory limit by explicitly setting
>> tcp_rmem, that should take effect. The code itself shouldn't impose any
>> limit like it does now.
>
> The core of the problem is that you are describing a truesize of about
> 16k for each skb, but each of those only contains < 1500 bytes of
> payload. You are wasting about 90% of your socket memory. Announcing a
> 3 MB window with a 30 MB socket buffer is the right thing to do, from
> a certain point of view. OTOH, it kills performance.

The receiver is set to use a 9000-byte MTU, but the sender uses a
1500-byte MTU, which is not really a pathological case. It would have
made more sense for the receiver to allocate skbs of the right size as
incoming packets are received.

Is it for efficiency reasons that the skbs are fixed in size according
to the MTU set on the interface card? I suppose the receiver has no
real way of knowing the right MTU size at the sender.

Thanks,

Cheng

* Re: TCP receiver's window calculation problem
  2004-06-25  6:37 ` Cheng Jin
@ 2004-06-25 13:43 ` John Heffner
  0 siblings, 0 replies; 9+ messages in thread

From: John Heffner @ 2004-06-25 13:43 UTC
To: Cheng Jin; +Cc: netdev@oss.sgi.com, fast-support@cs.caltech.edu

On Thu, 24 Jun 2004, Cheng Jin wrote:
> The receiver is set to use a 9000-byte MTU, but the sender uses a
> 1500-byte MTU, which is not really a pathological case. It would have
> made more sense for the receiver to allocate skbs of the right size as
> incoming packets are received.

Some drivers are apparently optimized for this case. I have not
confirmed this.

> Is it for efficiency reasons that the skbs are fixed in size according
> to the MTU set on the interface card? I suppose the receiver has no
> real way of knowing the right MTU size at the sender.

I haven't looked too hard into this. Any device driver people want to
chime in?

  -John

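One likely reason the buffer size is fixed in advance, sketched below with
hypothetical helper names (this is not the actual e1000 code): receive
buffers have to be allocated, DMA-mapped, and handed to the NIC before any
packet arrives, so the only size the driver can use is one derived from
the local MTU via adapter->rx_buffer_len; the sender's MTU is unknown at
that point.

/* Illustrative only: refill_rx_ring(), rx_ring_has_free_slots() and
 * post_buffer_to_nic() are hypothetical names, and the real
 * e1000_alloc_rx_buffers() differs in detail.  The key point is that
 * the allocation size is fixed by the local MTU before any data arrives. */
static void refill_rx_ring(struct e1000_adapter *adapter)
{
	struct sk_buff *skb;

	while (rx_ring_has_free_slots(adapter)) {
		skb = dev_alloc_skb(adapter->rx_buffer_len);
		if (!skb)
			break;			/* retry on the next interrupt */
		skb->dev = adapter->netdev;
		post_buffer_to_nic(adapter, skb);	/* DMA-map, write descriptor */
	}
}
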
[parent not found: <41b516cb040623114825a9c555@mail.gmail.com>]
* Re: [E1000-devel] e1000 jumbo problems
       [not found] ` <41b516cb040623114825a9c555@mail.gmail.com>
@ 2004-06-24 10:36 ` P
  2004-07-01 19:51   ` [PATCH] " P
  1 sibling, 0 replies; 9+ messages in thread

From: P @ 2004-06-24 10:36 UTC
To: Chris Leech; +Cc: e1000-devel, netdev

Chris Leech wrote:
>>> Another related issue is that the driver uses 4KiB buffers
>>> for MTUs in the 1500 -> 2000 range, which seems a bit silly.
>>> Any particular reason for that?
>
> It is wasteful, but does anyone actually use an MTU in the range of
> 1501 - 2030?  It seems silly to me to go with a non-standard frame
> size, but not go up to something that might give you a performance
> benefit (9k).

I'm seeing it with MPLS in some configs. MPLS labels are just prepended
onto ethernet frames, giving frames up to 1546 bytes. Using 4KiB
buffers in this situation is wasteful of memory, but more importantly
for my application it has a noticeable impact on receive performance.

>> I changed the driver to use 2KiB buffers for frames in the
>> 1518 -> 2048 range (BSEX=0, LPE=1). This breaks, however, because
>> packets larger than the specified maximum are not dropped; instead
>> they're scribbled into memory, causing a lockup after a while.
>
> That sounds right, if you actually got the RCTL register set
> correctly.  In e1000_setup_rctl the adapter->rx_buffer_len is used to
> set that register, and it's currently written to only set LPE if the
> buffer size is bigger than 2k (thus, why 4k buffers are used even when
> the MTU is in the 1501 - 2030 range).  To use 2k buffers for slightly
> large frames, you'd want some new flag in the adapter for LPE (or
> check netdev->mtu I guess) and do something like:
>   rctl |= E1000_RCTL_SZ_2048 | E1000_RCTL_LPE

yep, that's what I did.

> e1000 devices don't have a programmable MTU for receive filtering,
> they drop anything larger than 1518 unless LPE (long packet enable) is
> set.  If LPE is set they accept anything that fits in the FIFO and has
> a valid FCS.

thanks for that. What I'm noticing now is that the same thing happens
with the official driver (5.2.52 or 5.2.30.1): set the MTU to 4000, for
example, then send in frames larger than 4096 and they're accepted?
Doing this for a while causes memory to get corrupted.

> An MTU setting needs to be valid across your ethernet, why is the
> e1000 receiving a frame larger than the MTU?  (jabber should be rare)
> But, if the length of receive buffers matches what was set in RCTL,
> larger than expected valid frames will spill over to the next buffer
> and be dropped in the driver without corrupting memory.

Are the buffers in contiguous memory? What happens for the last buffer?

>> I noticed in e1000_change_mtu() that adapter->hw.max_frame_size is
>> only set after e1000_down(); e1000_up(); Is that correct?
>
> There might be a slight race there (I'll think about it some more),
> but it's not something that would cause memory corruption.
> hw.max_frame_size is only used in a software workaround for 82543
> based copper gigabit cards (vendor:device 8086:1004) when paired with
> certain link partners.

fair enough.

>> Are there any answers to the general questions I asked earlier?
>>
>> 1. Is there a public dev tree available for the e1000 driver?
>
> No, the best source base to work from is what is in the 2.6 kernel
> tree (or Jeff's net-drivers tree).  We keep that as up to date as
> possible, and it's always fairly close to our internal development
> sources.
>
>> 2. Are there programming docs for the various GigE chipsets?
>
> Not publicly available at this time.

thanks a million,
Pádraig.

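For reference, the 2 KiB + LPE change discussed above would sit in
e1000_setup_rctl() and look something like the fragment below. This is an
illustrative sketch rather than the exact modification, although the
E1000_RCTL_* and E1000_RXBUFFER_* names are the driver's own.

	/* Illustrative fragment for e1000_setup_rctl(): when the whole frame
	 * still fits in a 2 KiB buffer, keep SZ_2048 (BSEX clear) and enable
	 * long packet reception only if the MTU actually exceeds 1500. */
	if (adapter->rx_buffer_len == E1000_RXBUFFER_2048) {
		rctl |= E1000_RCTL_SZ_2048;
		rctl &= ~E1000_RCTL_BSEX;
		if (adapter->netdev->mtu > 1500)
			rctl |= E1000_RCTL_LPE;		/* accept frames up to 2048 bytes */
		else
			rctl &= ~E1000_RCTL_LPE;	/* standard 1518-byte limit */
	}
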
* [PATCH] Re: [E1000-devel] e1000 jumbo problems
       [not found] ` <41b516cb040623114825a9c555@mail.gmail.com>
  2004-06-24 10:36   ` [E1000-devel] e1000 jumbo problems P
@ 2004-07-01 19:51   ` P
  1 sibling, 0 replies; 9+ messages in thread

From: P @ 2004-07-01 19:51 UTC
To: Chris Leech; +Cc: e1000-devel, netdev

[-- Attachment #1: Type: text/plain, Size: 2515 bytes --]

This patch is not for applying, just for discussion; comments below...

Chris Leech wrote:
>>> Another related issue is that the driver uses 4KiB buffers
>>> for MTUs in the 1500 -> 2000 range, which seems a bit silly.
>>> Any particular reason for that?
>
> It is wasteful, but does anyone actually use an MTU in the range of
> 1501 - 2030?  It seems silly to me to go with a non-standard frame
> size, but not go up to something that might give you a performance
> benefit (9k).
>
>> I changed the driver to use 2KiB buffers for frames in the
>> 1518 -> 2048 range (BSEX=0, LPE=1). This breaks, however, because
>> packets larger than the specified maximum are not dropped; instead
>> they're scribbled into memory, causing a lockup after a while.
>
> That sounds right, if you actually got the RCTL register set
> correctly.  In e1000_setup_rctl the adapter->rx_buffer_len is used to
> set that register, and it's currently written to only set LPE if the
> buffer size is bigger than 2k (thus, why 4k buffers are used even when
> the MTU is in the 1501 - 2030 range).  To use 2k buffers for slightly
> large frames, you'd want some new flag in the adapter for LPE (or
> check netdev->mtu I guess) and do something like:
>   rctl |= E1000_RCTL_SZ_2048 | E1000_RCTL_LPE
>
> e1000 devices don't have a programmable MTU for receive filtering,
> they drop anything larger than 1518 unless LPE (long packet enable) is
> set.  If LPE is set they accept anything that fits in the FIFO and has
> a valid FCS.

More accurately, e1000s accept anything (even frames larger than the
FIFO). When a large packet is written across multiple receive buffers,
only the last rx descriptor has the EOP (end of packet) flag set. The
driver doesn't handle this at all currently: it will drop the initial
buffers (because they don't have EOP set), which is fine, but it will
accept the last buffer (part of the packet). I've attached a patch that
fixes this. The patch also drops packets that fit within a buffer but
are larger than the MTU. So, in summary, the patch stops packets > MTU
being accepted by the driver.

Note also that this patch changes to using 2KiB buffers (from 4KiB) for
MTUs between 1500 and 2030, and it always enables long packet reception
(LPE), but ignore these changes as they're just for debugging.

The patch makes my system completely stable now for MTUs <= 2500;
however, I can still get the system to freeze repeatedly by sending
packets larger than this.

cheers,
Pádraig.

[-- Attachment #2: e1000-smallMTU.diff --]
[-- Type: application/x-texinfo, Size: 5497 bytes --]

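The checks the patch describes would live in the receive clean-up loop and
look roughly like the fragment below. This is a sketch of the described
behaviour, not the attached e1000-smallMTU.diff; `discarding` is a
hypothetical flag carried across loop iterations, and `max_frame_len`
stands in for however the driver derives the limit from the MTU.

	/* Sketch only (the attached e1000-smallMTU.diff is authoritative).
	 * Inside the rx-descriptor processing loop of e1000_clean_rx_irq(): */
	length = le16_to_cpu(rx_desc->length);

	if (!(rx_desc->status & E1000_RXD_STAT_EOP) || discarding) {
		/* The frame spans multiple buffers: drop every piece,
		 * including the final one carrying EOP, instead of passing
		 * a truncated tail up the stack. */
		discarding = !(rx_desc->status & E1000_RXD_STAT_EOP);
		dev_kfree_skb_irq(skb);
		goto next_desc;
	}

	if (length > max_frame_len) {
		/* An oversized frame that still fit in one buffer
		 * (possible when the buffer is larger than the MTU). */
		dev_kfree_skb_irq(skb);
		goto next_desc;
	}
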
Thread overview: 9+ messages
[not found] <40D883C2.7010106@draigBrady.com>
2004-06-23 17:35 ` [E1000-devel] e1000 jumbo problems P
2004-06-24  5:49   ` TCP receiver's window calculation problem Cheng Jin
2004-06-24 17:43     ` John Heffner
2004-06-24 19:18       ` Cheng Jin
2004-06-24 19:26         ` John Heffner
2004-06-25  6:37           ` Cheng Jin
2004-06-25 13:43             ` John Heffner
     [not found] ` <41b516cb040623114825a9c555@mail.gmail.com>
2004-06-24 10:36   ` [E1000-devel] e1000 jumbo problems P
2004-07-01 19:51   ` [PATCH] " P