* Re: [E1000-devel] e1000 jumbo problems
       [not found] <40D883C2.7010106@draigBrady.com>
@ 2004-06-23 17:35 ` P
  2004-06-24  5:49   ` TCP receiver's window calculation problem Cheng Jin
       [not found]    ` <41b516cb040623114825a9c555@mail.gmail.com>
  0 siblings, 2 replies; 9+ messages in thread

From: P @ 2004-06-23 17:35 UTC
To: e1000-devel; +Cc: netdev

P@draigBrady.com wrote:
> Another related issue is that the driver uses 4KiB buffers
> for MTUs in the 1500 -> 2000 range, which seems a bit silly.
> Any particular reason for that?

I changed the driver to use 2KiB buffers for frames in the
1518 -> 2048 range (BSEX=0, LPE=1). This breaks, however, because
packets larger than the specified maximum are not dropped; instead
they're scribbled into memory, causing a lockup after a while.

I noticed in e1000_change_mtu() that adapter->hw.max_frame_size is
only set after e1000_down(); e1000_up(); Is that correct?

Are there any answers to the general questions I asked earlier?

1. Is there a public dev tree available for the e1000 driver?
2. Are there programming docs for the various GigE chipsets?

thanks,
Pádraig.

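For context, the ordering being asked about looks roughly like the sketch
below. This is a paraphrase rather than a verbatim copy of the driver, and
the frame-size constant names are assumptions; the point is only that
hw.max_frame_size is written after the down/up cycle that reprograms the
receive unit.

/* Paraphrased sketch, not verbatim driver code; ENET_HEADER_SIZE and
 * ETHERNET_FCS_SIZE are assumed names for the header/FCS constants. */
static int e1000_change_mtu(struct net_device *netdev, int new_mtu)
{
	struct e1000_adapter *adapter = netdev->priv;

	/* ... MTU range checks and rx_buffer_len selection ... */
	netdev->mtu = new_mtu;

	if (netif_running(netdev)) {
		e1000_down(adapter);
		e1000_up(adapter);	/* reprograms RCTL and the rx ring */
	}

	/* only now is the new maximum frame size recorded */
	adapter->hw.max_frame_size =
		new_mtu + ENET_HEADER_SIZE + ETHERNET_FCS_SIZE;

	return 0;
}
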
* TCP receiver's window calculation problem
  2004-06-23 17:35 ` [E1000-devel] e1000 jumbo problems P
@ 2004-06-24  5:49 ` Cheng Jin
  2004-06-24 17:43   ` John Heffner
       [not found]    ` <41b516cb040623114825a9c555@mail.gmail.com>
  1 sibling, 1 reply; 9+ messages in thread

From: Cheng Jin @ 2004-06-24 5:49 UTC
To: netdev@oss.sgi.com; +Cc: fast-support

Hi,

We have been running some iperf experiments over long-latency,
high-capacity networks for protocol testing. We noticed a strange
receiver's window limitation of 3,147,776 bytes even when the iperf
server was set up to request 32 MB of socket buffer (for which the
kernel grants twice that).

After adding printk to the various window calculation functions at the
receiver, we believe there may be a problem with the tp->rcv_ssthresh
calculation in __tcp_grow_window in tcp_input.c.

With tcp memory of 64 MB, a jumbo MTU (9000 bytes) at the receiver
(which gives a skb_true_size of 16660 bytes), and a standard MTU
(1500 bytes) at the sender (which yields a skb_len of 1448 bytes),
tp->rcv_ssthresh gets stuck at 3,148,472 (see the code segment below).
Because the TCP receiver's window needs to be a multiple of the mss
(/1448 then *1448) and of the window scaling (>>10 then <<10), the
sender sees a limit of 3,147,776 bytes.

I include example code (with the data structs stripped away and the
macros expanded) that reproduces this problem. The function
__tcp_grow_window itself may have problems for other combinations of
input.

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int __u32;

static int
__tcp_grow_window(__u32 rcv_ssthresh, __u32 tcp_full_space,
                  __u32 skb_true_size, __u32 skb_len)
{
	int truesize = skb_true_size*3/8;
	int window = tcp_full_space*3/8;

	while (rcv_ssthresh <= window) {
		if (truesize <= skb_len)
			return 2896;

		truesize >>= 1;
		window >>= 1;
	}
	return 0;
}

int main()
{
	__u32 iperf_mem = 64*1024*1024;
	__u32 skb_true_size = 16660;
	__u32 skb_len = 1448;
	__u32 rcv_ssthresh = 3148472;

	int i, incr;

	for (i = 0; i < 1000; ++i) {
		incr = __tcp_grow_window(rcv_ssthresh, iperf_mem,
		                         skb_true_size, skb_len);
		printf("i=%d incr=%d\n", i, incr);
	}
	return 0;
}

Cheng

--
Lab # 626 395 8820

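For completeness, the rounding that turns the stuck rcv_ssthresh into the
3,147,776-byte limit seen by the sender can be checked with the small
standalone program below. It only demonstrates the arithmetic described
above; the real stack performs the mss rounding and window-scale
truncation in its own helper functions.

#include <stdio.h>

int main(void)
{
	unsigned int rcv_ssthresh = 3148472;	/* value rcv_ssthresh is stuck at */
	unsigned int mss = 1448;
	unsigned int wscale = 10;		/* window scale in use here */
	unsigned int window;

	window = rcv_ssthresh / mss * mss;	/* round down to an mss multiple: 3147952 */
	window = (window >> wscale) << wscale;	/* window-scale truncation: 3147776 */

	printf("advertised window = %u\n", window);
	return 0;
}

Compiled and run, it prints 3147776, matching the limit observed with iperf.
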
* Re: TCP receiver's window calculation problem
  2004-06-24  5:49 ` TCP receiver's window calculation problem Cheng Jin
@ 2004-06-24 17:43 ` John Heffner
  2004-06-24 19:18   ` Cheng Jin
  0 siblings, 1 reply; 9+ messages in thread

From: John Heffner @ 2004-06-24 17:43 UTC
To: Cheng Jin; +Cc: netdev@oss.sgi.com, fast-support

I've run into this problem, too. This code prevents advertising more
rcvbuf space than you are likely to need. This is good for something
like an X11 connection, but obviously very bad for the bulk-transfer
mixed-MTU case. Apparently some drivers use multiple packet buffer
sizes, which helps, but at least e1000 and sk98lin do not.

If you don't want to overrun the rcvbuf bounds, the only other
recourse is to coalesce packets, which works well but is pretty
expensive. This will happen already if you take out the rcv_ssthresh
bound.

I think the most desirable answer is to not have a hard per-connection
memory bound, but this is problematic because of denial-of-service
concerns.

  -John

On Wed, 23 Jun 2004, Cheng Jin wrote:

> Hi,
>
> We have been running some iperf experiments over long-latency,
> high-capacity networks for protocol testing. We noticed a strange
> receiver's window limitation of 3,147,776 bytes even when the iperf
> server was set up to request 32 MB of socket buffer (for which the
> kernel grants twice that).
>
> After adding printk to the various window calculation functions at the
> receiver, we believe there may be a problem with the tp->rcv_ssthresh
> calculation in __tcp_grow_window in tcp_input.c.
>
> With tcp memory of 64 MB, a jumbo MTU (9000 bytes) at the receiver
> (which gives a skb_true_size of 16660 bytes), and a standard MTU
> (1500 bytes) at the sender (which yields a skb_len of 1448 bytes),
> tp->rcv_ssthresh gets stuck at 3,148,472 (see the code segment below).
> Because the TCP receiver's window needs to be a multiple of the mss
> (/1448 then *1448) and of the window scaling (>>10 then <<10), the
> sender sees a limit of 3,147,776 bytes.
>
> I include example code (with the data structs stripped away and the
> macros expanded) that reproduces this problem. The function
> __tcp_grow_window itself may have problems for other combinations of
> input.
>
> #include <stdio.h>
> #include <stdlib.h>
>
> typedef unsigned int __u32;
>
> static int
> __tcp_grow_window(__u32 rcv_ssthresh, __u32 tcp_full_space,
>                   __u32 skb_true_size, __u32 skb_len)
> {
> 	int truesize = skb_true_size*3/8;
> 	int window = tcp_full_space*3/8;
>
> 	while (rcv_ssthresh <= window) {
> 		if (truesize <= skb_len)
> 			return 2896;
>
> 		truesize >>= 1;
> 		window >>= 1;
> 	}
> 	return 0;
> }
>
> int main()
> {
> 	__u32 iperf_mem = 64*1024*1024;
> 	__u32 skb_true_size = 16660;
> 	__u32 skb_len = 1448;
> 	__u32 rcv_ssthresh = 3148472;
>
> 	int i, incr;
>
> 	for (i = 0; i < 1000; ++i) {
> 		incr = __tcp_grow_window(rcv_ssthresh, iperf_mem,
> 		                         skb_true_size, skb_len);
> 		printf("i=%d incr=%d\n", i, incr);
> 	}
> 	return 0;
> }
>
> Cheng
>
> --
> Lab # 626 395 8820

* Re: TCP receiver's window calculation problem
  2004-06-24 17:43 ` John Heffner
@ 2004-06-24 19:18 ` Cheng Jin
  2004-06-24 19:26   ` John Heffner
  0 siblings, 1 reply; 9+ messages in thread

From: Cheng Jin @ 2004-06-24 19:18 UTC
To: John Heffner; +Cc: netdev@oss.sgi.com, fast-support@cs.caltech.edu

Hi, John,

Thanks for confirming this problem.

> I've run into this problem, too. This code prevents advertising more
> rcvbuf space than you are likely to need. This is good for something like
> an X11 connection, but obviously very bad for the bulk-transfer mixed-MTU
> case.

I would think this is already taken care of at the sender: with an
application-limited cwnd, cwnd wouldn't increase beyond what is
actually being used.

> I think the most desirable answer is to not have a hard per-connection
> memory bound, but this is problematic because of denial-of-service
> concerns.

I think having a default limit on tcp memory is acceptable to prevent
DoS, but when a user increases the memory limit by explicitly setting
tcp_rmem, that should take effect. The code itself shouldn't impose any
limit like it does now.

Actually, I am not clear what that window-calculation algorithm is. Is
it recommended by some RFC?

Cheng

* Re: TCP receiver's window calculation problem
  2004-06-24 19:18 ` Cheng Jin
@ 2004-06-24 19:26 ` John Heffner
  2004-06-25  6:37   ` Cheng Jin
  0 siblings, 1 reply; 9+ messages in thread

From: John Heffner @ 2004-06-24 19:26 UTC
To: Cheng Jin; +Cc: netdev@oss.sgi.com, fast-support@cs.caltech.edu

On Thu, 24 Jun 2004, Cheng Jin wrote:
> I think having a default limit on tcp memory is acceptable to prevent
> DoS, but when a user increases the memory limit by explicitly setting
> tcp_rmem, that should take effect. The code itself shouldn't impose any
> limit like it does now.

The core of the problem is that you are describing a truesize of about
16k for each skb, but each of those only contains < 1500 bytes of
payload (1448/16660 is roughly 9%). You are wasting about 90% of your
socket memory. Announcing a 3 MB window with a 30 MB socket buffer is
the right thing to do, from a certain point of view. OTOH, it kills
performance.

> Actually, I am not clear what that window-calculation algorithm is. Is
> it recommended by some RFC?

No, it's not standard. I'm not sure who wrote this code.

  -John

* Re: TCP receiver's window calculation problem
  2004-06-24 19:26 ` John Heffner
@ 2004-06-25  6:37 ` Cheng Jin
  2004-06-25 13:43   ` John Heffner
  0 siblings, 1 reply; 9+ messages in thread

From: Cheng Jin @ 2004-06-25 6:37 UTC
To: John Heffner; +Cc: netdev@oss.sgi.com, fast-support@cs.caltech.edu

John,

>> I think having a default limit on tcp memory is acceptable to prevent
>> DoS, but when a user increases the memory limit by explicitly setting
>> tcp_rmem, that should take effect. The code itself shouldn't impose any
>> limit like it does now.
>
> The core of the problem is that you are describing a truesize of about
> 16k for each skb, but each of those only contains < 1500 bytes of
> payload. You are wasting about 90% of your socket memory. Announcing a
> 3 MB window with a 30 MB socket buffer is the right thing to do, from
> a certain point of view. OTOH, it kills performance.

The receiver is set to use a 9000-byte MTU, but the sender uses a
1500-byte MTU, which is not really a pathological case. It would have
made more sense for the receiver to allocate skbs of the right size as
incoming packets are received.

Is it for efficiency reasons that the skbs are fixed in size according
to the MTU set on the interface card? I suppose the receiver has no
real way of knowing the right MTU size at the sender.

Thanks,

Cheng

* Re: TCP receiver's window calculation problem
  2004-06-25  6:37 ` Cheng Jin
@ 2004-06-25 13:43 ` John Heffner
  0 siblings, 0 replies; 9+ messages in thread

From: John Heffner @ 2004-06-25 13:43 UTC
To: Cheng Jin; +Cc: netdev@oss.sgi.com, fast-support@cs.caltech.edu

On Thu, 24 Jun 2004, Cheng Jin wrote:
> The receiver is set to use a 9000-byte MTU, but the sender uses a
> 1500-byte MTU, which is not really a pathological case. It would have
> made more sense for the receiver to allocate skbs of the right size as
> incoming packets are received.

Some drivers are apparently optimized for this case. I have not
confirmed this.

> Is it for efficiency reasons that the skbs are fixed in size according
> to the MTU set on the interface card? I suppose the receiver has no
> real way of knowing the right MTU size at the sender.

I haven't looked too hard into this. Any device driver people want to
chime in?

  -John

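One likely reason the buffer size is fixed in advance, sketched below with
hypothetical helper names (this is not the actual e1000 code): receive
buffers have to be allocated, DMA-mapped, and handed to the NIC before any
packet arrives, so the only size the driver can use is one derived from
the local MTU via adapter->rx_buffer_len; the sender's MTU is unknown at
that point.

/* Illustrative only: refill_rx_ring(), rx_ring_has_free_slots() and
 * post_buffer_to_nic() are hypothetical names, and the real
 * e1000_alloc_rx_buffers() differs in detail.  The key point is that
 * the allocation size is fixed by the local MTU before any data arrives. */
static void refill_rx_ring(struct e1000_adapter *adapter)
{
	struct sk_buff *skb;

	while (rx_ring_has_free_slots(adapter)) {
		skb = dev_alloc_skb(adapter->rx_buffer_len);
		if (!skb)
			break;			/* retry on the next interrupt */
		skb->dev = adapter->netdev;
		post_buffer_to_nic(adapter, skb);	/* DMA-map, write descriptor */
	}
}
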
[parent not found: <41b516cb040623114825a9c555@mail.gmail.com>]
* Re: [E1000-devel] e1000 jumbo problems
       [not found] ` <41b516cb040623114825a9c555@mail.gmail.com>
@ 2004-06-24 10:36 ` P
  2004-07-01 19:51   ` [PATCH] " P
  1 sibling, 0 replies; 9+ messages in thread

From: P @ 2004-06-24 10:36 UTC
To: Chris Leech; +Cc: e1000-devel, netdev

Chris Leech wrote:
>>> Another related issue is that the driver uses 4KiB buffers
>>> for MTUs in the 1500 -> 2000 range, which seems a bit silly.
>>> Any particular reason for that?
>
> It is wasteful, but does anyone actually use an MTU in the range of
> 1501 - 2030?  It seems silly to me to go with a non-standard frame
> size, but not go up to something that might give you a performance
> benefit (9k).

I'm seeing it with MPLS in some configs. MPLS labels are just prepended
onto ethernet frames, giving frames up to 1546 bytes. Using 4KiB
buffers in this situation is wasteful of memory, but more importantly
for my application it has a noticeable impact on receive performance.

>> I changed the driver to use 2KiB buffers for frames in the
>> 1518 -> 2048 range (BSEX=0, LPE=1). This breaks, however, because
>> packets larger than the specified maximum are not dropped; instead
>> they're scribbled into memory, causing a lockup after a while.
>
> That sounds right, if you actually got the RCTL register set
> correctly.  In e1000_setup_rctl the adapter->rx_buffer_len is used to
> set that register, and it's currently written to only set LPE if the
> buffer size is bigger than 2k (thus, why 4k buffers are used even when
> the MTU is in the 1501 - 2030 range).  To use 2k buffers for slightly
> large frames, you'd want some new flag in the adapter for LPE (or
> check netdev->mtu I guess) and do something like:
>   rctl |= E1000_RCTL_SZ_2048 | E1000_RCTL_LPE

yep, that's what I did.

> e1000 devices don't have a programmable MTU for receive filtering,
> they drop anything larger than 1518 unless LPE (long packet enable) is
> set.  If LPE is set they accept anything that fits in the FIFO and has
> a valid FCS.

thanks for that. What I'm noticing now is that the same thing happens
with the official driver (5.2.52 or 5.2.30.1): set the MTU to 4000, for
example, then send in frames larger than 4096 and they're accepted?
Doing this for a while causes memory to get corrupted.

> An MTU setting needs to be valid across your ethernet, why is the
> e1000 receiving a frame larger than the MTU?  (jabber should be rare)
> But, if the length of receive buffers matches what was set in RCTL,
> larger than expected valid frames will spill over to the next buffer
> and be dropped in the driver without corrupting memory.

Are the buffers in contiguous memory? What happens for the last buffer?

>> I noticed in e1000_change_mtu() that adapter->hw.max_frame_size is
>> only set after e1000_down(); e1000_up(); Is that correct?
>
> There might be a slight race there (I'll think about it some more),
> but it's not something that would cause memory corruption.
> hw.max_frame_size is only used in a software workaround for 82543
> based copper gigabit cards (vendor:device 8086:1004) when paired with
> certain link partners.

fair enough.

>> Are there any answers to the general questions I asked earlier?
>>
>> 1. Is there a public dev tree available for the e1000 driver?
>
> No, the best source base to work from is what is in the 2.6 kernel
> tree (or Jeff's net-drivers tree).  We keep that as up to date as
> possible, and it's always fairly close to our internal development
> sources.
>
>> 2. Are there programming docs for the various GigE chipsets?
>
> Not publicly available at this time.

thanks a million,
Pádraig.

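For reference, the 2 KiB + LPE change discussed above would sit in
e1000_setup_rctl() and look something like the fragment below. This is an
illustrative sketch rather than the exact modification, although the
E1000_RCTL_* and E1000_RXBUFFER_* names are the driver's own.

	/* Illustrative fragment for e1000_setup_rctl(): when the whole frame
	 * still fits in a 2 KiB buffer, keep SZ_2048 (BSEX clear) and enable
	 * long packet reception only if the MTU actually exceeds 1500. */
	if (adapter->rx_buffer_len == E1000_RXBUFFER_2048) {
		rctl |= E1000_RCTL_SZ_2048;
		rctl &= ~E1000_RCTL_BSEX;
		if (adapter->netdev->mtu > 1500)
			rctl |= E1000_RCTL_LPE;		/* accept frames up to 2048 bytes */
		else
			rctl &= ~E1000_RCTL_LPE;	/* standard 1518-byte limit */
	}
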
* [PATCH] Re: [E1000-devel] e1000 jumbo problems
       [not found] ` <41b516cb040623114825a9c555@mail.gmail.com>
  2004-06-24 10:36   ` [E1000-devel] e1000 jumbo problems P
@ 2004-07-01 19:51   ` P
  1 sibling, 0 replies; 9+ messages in thread

From: P @ 2004-07-01 19:51 UTC
To: Chris Leech; +Cc: e1000-devel, netdev

[-- Attachment #1: Type: text/plain, Size: 2515 bytes --]

This patch is not for applying, just for discussion; comments below...

Chris Leech wrote:
>>> Another related issue is that the driver uses 4KiB buffers
>>> for MTUs in the 1500 -> 2000 range, which seems a bit silly.
>>> Any particular reason for that?
>
> It is wasteful, but does anyone actually use an MTU in the range of
> 1501 - 2030?  It seems silly to me to go with a non-standard frame
> size, but not go up to something that might give you a performance
> benefit (9k).
>
>> I changed the driver to use 2KiB buffers for frames in the
>> 1518 -> 2048 range (BSEX=0, LPE=1). This breaks, however, because
>> packets larger than the specified maximum are not dropped; instead
>> they're scribbled into memory, causing a lockup after a while.
>
> That sounds right, if you actually got the RCTL register set
> correctly.  In e1000_setup_rctl the adapter->rx_buffer_len is used to
> set that register, and it's currently written to only set LPE if the
> buffer size is bigger than 2k (thus, why 4k buffers are used even when
> the MTU is in the 1501 - 2030 range).  To use 2k buffers for slightly
> large frames, you'd want some new flag in the adapter for LPE (or
> check netdev->mtu I guess) and do something like:
>   rctl |= E1000_RCTL_SZ_2048 | E1000_RCTL_LPE
>
> e1000 devices don't have a programmable MTU for receive filtering,
> they drop anything larger than 1518 unless LPE (long packet enable) is
> set.  If LPE is set they accept anything that fits in the FIFO and has
> a valid FCS.

More accurately, e1000s accept anything (even frames larger than the
FIFO). When a large packet is written across multiple receive buffers,
only the last rx descriptor has the EOP (end of packet) flag set. The
driver doesn't handle this at all currently: it will drop the initial
buffers (because they don't have EOP set), which is fine, but it will
accept the last buffer (part of the packet). I've attached a patch that
fixes this. The patch also drops packets that fit within a buffer but
are larger than the MTU. So, in summary, the patch stops packets > MTU
being accepted by the driver.

Note also that this patch changes to using 2KiB buffers (from 4KiB) for
MTUs between 1500 and 2030, and it always enables long packet reception
(LPE), but ignore these changes as they're just for debugging.

The patch makes my system completely stable now for MTUs <= 2500;
however, I can still get the system to freeze repeatedly by sending
packets larger than this.

cheers,
Pádraig.

[-- Attachment #2: e1000-smallMTU.diff --]
[-- Type: application/x-texinfo, Size: 5497 bytes --]

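The checks the patch describes would live in the receive clean-up loop and
look roughly like the fragment below. This is a sketch of the described
behaviour, not the attached e1000-smallMTU.diff; `discarding` is a
hypothetical flag carried across loop iterations, and `max_frame_len`
stands in for however the driver derives the limit from the MTU.

	/* Sketch only (the attached e1000-smallMTU.diff is authoritative).
	 * Inside the rx-descriptor processing loop of e1000_clean_rx_irq(): */
	length = le16_to_cpu(rx_desc->length);

	if (!(rx_desc->status & E1000_RXD_STAT_EOP) || discarding) {
		/* The frame spans multiple buffers: drop every piece,
		 * including the final one carrying EOP, instead of passing
		 * a truncated tail up the stack. */
		discarding = !(rx_desc->status & E1000_RXD_STAT_EOP);
		dev_kfree_skb_irq(skb);
		goto next_desc;
	}

	if (length > max_frame_len) {
		/* An oversized frame that still fit in one buffer
		 * (possible when the buffer is larger than the MTU). */
		dev_kfree_skb_irq(skb);
		goto next_desc;
	}
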
Thread overview: 9+ messages
[not found] <40D883C2.7010106@draigBrady.com>
2004-06-23 17:35 ` [E1000-devel] e1000 jumbo problems P
2004-06-24  5:49   ` TCP receiver's window calculation problem Cheng Jin
2004-06-24 17:43     ` John Heffner
2004-06-24 19:18       ` Cheng Jin
2004-06-24 19:26         ` John Heffner
2004-06-25  6:37           ` Cheng Jin
2004-06-25 13:43             ` John Heffner
     [not found] ` <41b516cb040623114825a9c555@mail.gmail.com>
2004-06-24 10:36   ` [E1000-devel] e1000 jumbo problems P
2004-07-01 19:51   ` [PATCH] " P