netdev.vger.kernel.org archive mirror
* Re: bizarre network timing problem
       [not found] <20071017205127.GA21334@codeblau.de>
@ 2007-10-17 21:17 ` Chuck Ebbert
  2007-10-17 22:00   ` Felix von Leitner
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Ebbert @ 2007-10-17 21:17 UTC (permalink / raw)
  To: Felix von Leitner; +Cc: linux-kernel, Netdev

On 10/17/2007 04:51 PM, Felix von Leitner wrote:
> I wrote a small read-only SMB server, and wanted to see how fast it was.
> So I used smbget to download a moderately large file from it via localhost.
> smbget only got ~70 KB/sec.
> 
> This is what the view from strace -tt on the server is:
> 
> 22:44:58.812467 read(8, "\0\0\0007\377SMB.\0\0\0\0\10\1\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3"..., 8192) = 59
> 22:44:58.812619 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b46b8e5e000
> 22:44:58.812729 fcntl(9, F_GETFL)       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
> 22:44:58.812847 epoll_ctl(7, EPOLL_CTL_DEL, 8, {0, {u32=8, u64=13323248792850399240}}) = 0
> 22:44:58.812946 epoll_ctl(7, EPOLL_CTL_ADD, 8, {EPOLLOUT, {u32=8, u64=18251433459580936}}) = 0
> 22:44:58.813039 epoll_wait(7, {{EPOLLOUT, {u32=8, u64=18251433459580936}}}, 100, 442) = 1
> 22:44:58.813132 setsockopt(8, SOL_TCP, TCP_CORK, [1], 4) = 0
> 22:44:58.813215 write(8, "\0\0\372<\377SMB.\0\0\0\0\200A\300\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3"..., 64) = 64
> 22:44:58.813323 sendfile(8, 9, [128000], 64000) = 64000
> 22:44:58.813430 setsockopt(8, SOL_TCP, TCP_CORK, [0], 4) = 0
> 22:44:58.813511 munmap(0x2b46b8e5e000, 8192) = 0
> 22:44:58.813600 epoll_wait(7, {{EPOLLOUT, {u32=8, u64=18251433459580936}}}, 100, 442) = 1
> 22:44:58.813693 epoll_ctl(7, EPOLL_CTL_DEL, 8, {0, {u32=8, u64=8}}) = 0
> 22:44:58.813778 epoll_ctl(7, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=8, u64=18252000395264008}}) = 0
> 22:44:58.813869 epoll_wait(7, {}, 100, 441) = 0
> 22:44:59.255789 epoll_wait(7, {{EPOLLIN, {u32=8, u64=18252000395264008}}}, 100, 999) = 1
> 22:44:59.688519 read(8, "\0\0\0007\377SMB.\0\0\0\0\10\1\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3"..., 8192) = 59
> 
> 
> As you can see, the time difference between reading the query and writing the
> result is very small, but there is a big delay before receiving the next request.
> 
> This is the view from a sniffer on the lo interface:
> 
> 1192653899.688385    127.0.0.1 -> 127.0.0.1    SMB Read AndX Request, FID: 0x0001, 64000 bytes at offset 192000
> 1192653899.688399    127.0.0.1 -> 127.0.0.1    TCP 445 > 42990 [ACK] Seq=192660 Ack=779 Win=33920 Len=0 TSV=359208 TSER=359208
> 1192653899.895725    127.0.0.1 -> 127.0.0.1    SMB [TCP Window Full] Read AndX Response, FID: 0x0001, 64000 bytes
> 1192653899.895793    127.0.0.1 -> 127.0.0.1    TCP 42990 > 445 [ACK] Seq=779 Ack=204948 Win=12288 Len=0 TSV=359260 TSER=359260
> 1192653899.895805    127.0.0.1 -> 127.0.0.1    NBSS NBSS Continuation Message
> 1192653899.935725    127.0.0.1 -> 127.0.0.1    TCP 42990 > 445 [ACK] Seq=779 Ack=209044 Win=12288 Len=0 TSV=359270 TSER=359260
> 1192653900.147739    127.0.0.1 -> 127.0.0.1    NBSS [TCP Window Full] NBSS Continuation Message
> 1192653900.147767    127.0.0.1 -> 127.0.0.1    TCP [TCP ZeroWindow] 42990 > 445 [ACK] Seq=779 Ack=221332 Win=0 Len=0 TSV=359323 TSER=359323
> 1192653900.147807    127.0.0.1 -> 127.0.0.1    TCP [TCP Window Update] 42990 > 445 [ACK] Seq=779 Ack=221332 Win=12288 Len=0 TSV=359323 TSER=359323
> 1192653900.147815    127.0.0.1 -> 127.0.0.1    NBSS NBSS Continuation Message
> 1192653900.147837    127.0.0.1 -> 127.0.0.1    TCP 42990 > 445 [ACK] Seq=779 Ack=225428 Win=12288 Len=0 TSV=359323 TSER=359323
> 1192653900.355754    127.0.0.1 -> 127.0.0.1    NBSS [TCP Window Full] NBSS Continuation Message
> 1192653900.355782    127.0.0.1 -> 127.0.0.1    TCP [TCP ZeroWindow] 42990 > 445 [ACK] Seq=779 Ack=237716 Win=0 Len=0 TSV=359375 TSER=359375
> 1192653900.355820    127.0.0.1 -> 127.0.0.1    TCP [TCP Window Update] 42990 > 445 [ACK] Seq=779 Ack=237716 Win=12288 Len=0 TSV=359375 TSER=359375
> 1192653900.355829    127.0.0.1 -> 127.0.0.1    NBSS NBSS Continuation Message
> 1192653900.355849    127.0.0.1 -> 127.0.0.1    TCP 42990 > 445 [ACK] Seq=779 Ack=241812 Win=12288 Len=0 TSV=359375 TSER=359375
> 1192653900.563766    127.0.0.1 -> 127.0.0.1    NBSS [TCP Window Full] NBSS Continuation Message
> 1192653900.563794    127.0.0.1 -> 127.0.0.1    TCP [TCP ZeroWindow] 42990 > 445 [ACK] Seq=779 Ack=254100 Win=0 Len=0 TSV=359427 TSER=359427
> 1192653900.563831    127.0.0.1 -> 127.0.0.1    TCP [TCP Window Update] 42990 > 445 [ACK] Seq=779 Ack=254100 Win=12288 Len=0 TSV=359427 TSER=359427
> 1192653900.563839    127.0.0.1 -> 127.0.0.1    NBSS NBSS Continuation Message
> 1192653900.563858    127.0.0.1 -> 127.0.0.1    TCP 42990 > 445 [ACK] Seq=779 Ack=256724 Win=12288 Len=0 TSV=359427 TSER=359427
> 1192653900.564444    127.0.0.1 -> 127.0.0.1    SMB Read AndX Request, FID: 0x0001, 64000 bytes at offset 256000
> 
> 
> Note the delay between the server sending the response and getting the next
> request.  Also note that there is almost no delay between the client getting the
> full response and sending the next request.
> 
> My understanding of TCP_CORK from the tcp(7) man page is that clearing it should
> flush out the queued data immediately, but the network trace suggests there is
> a 200 ms delay between the request and the outgoing data.  tcp(7) says output is
> delayed by up to 200 ms while the socket is corked, so uncorking does not appear
> to take effect.
> 
> 
> Now for the strange part: the same code works without a 200 ms delay
> over ethernet.  The delay only appears on the loopback interface.
> 
> Also, on my home desktop, which is also running 2.6.22, there is no
> delay even on loopback.
> 
> I tried 2.6.23 on my notebook, but the effect is the same.
> 
> Any ideas what could cause this?
> 

(cc: netdev)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bizarre network timing problem
  2007-10-17 21:17 ` bizarre network timing problem Chuck Ebbert
@ 2007-10-17 22:00   ` Felix von Leitner
  2007-10-17 22:16     ` Rick Jones
  0 siblings, 1 reply; 11+ messages in thread
From: Felix von Leitner @ 2007-10-17 22:00 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: linux-kernel, Netdev

Thus spake Chuck Ebbert (cebbert@redhat.com):
> > Any ideas what could cause this?
> (cc: netdev)

Maybe I should mention this, too:

accept(5, {sa_family=AF_INET6, sin6_port=htons(59821), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [18446744069414584348]) = 8
setsockopt(8, SOL_TCP, TCP_NODELAY, [1], 4) = 0

And if it were the Nagle algorithm, it should also affect the
ethernet case.

Felix

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bizarre network timing problem
  2007-10-17 22:00   ` Felix von Leitner
@ 2007-10-17 22:16     ` Rick Jones
  2007-10-18  9:42       ` Felix von Leitner
  0 siblings, 1 reply; 11+ messages in thread
From: Rick Jones @ 2007-10-17 22:16 UTC (permalink / raw)
  To: Felix von Leitner; +Cc: Chuck Ebbert, linux-kernel, Netdev

the packet trace was a bit too cooked perhaps, but there were indications that 
at times the TCP window was going to zero - perhaps something with window 
updates or persist timers?

rick jones

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bizarre network timing problem
  2007-10-17 22:16     ` Rick Jones
@ 2007-10-18  9:42       ` Felix von Leitner
  2007-10-18 17:22         ` Rick Jones
  0 siblings, 1 reply; 11+ messages in thread
From: Felix von Leitner @ 2007-10-18  9:42 UTC (permalink / raw)
  To: Rick Jones; +Cc: Chuck Ebbert, linux-kernel, Netdev

> the packet trace was a bit too cooked perhaps, but there were indications 
> that at times the TCP window was going to zero - perhaps something with 
> window updates or persist timers?

Does TCP use different window sizes on loopback?  Why is this not
happening on ethernet?

How could I test this theory?

My initial idea was that it has something to do with the different MTU on
loopback.  My initial block size was 16k, but the problem persisted when I
changed it to 64k.

Felix

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bizarre network timing problem
  2007-10-18  9:42       ` Felix von Leitner
@ 2007-10-18 17:22         ` Rick Jones
  2007-11-02 22:11           ` Felix von Leitner
  0 siblings, 1 reply; 11+ messages in thread
From: Rick Jones @ 2007-10-18 17:22 UTC (permalink / raw)
  To: Felix von Leitner; +Cc: Chuck Ebbert, linux-kernel, Netdev

Felix von Leitner wrote:
>>the packet trace was a bit too cooked perhaps, but there were indications 
>>that at times the TCP window was going to zero - perhaps something with 
>>window updates or persist timers?
> 
> 
> Does TCP use different window sizes on loopback?  Why is this not
> happening on ethernet?

I don't think it uses different window sizes on loopback, but with the 
autotuning it can be difficult to say a priori what the window size will be. 
What one can say with confidence is that the MTU and thus the MSS will be 
different between loopback and ethernet.
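(Assuming the then-default 16436-byte loopback MTU: MSS = MTU - 40 bytes of IP and
TCP headers, so 16436 - 40 = 16396 on loopback versus 1500 - 40 = 1460 on ethernet.)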

> 
> How could I test this theory?

Can you take another trace that isn't so "cooked?"  One that just sticks with 
TCP-level and below stuff?

If SMB is a one-request-at-a-time protocol (I can never remember), you could 
simulate it with a netperf TCP_RR test by passing suitable values to the 
test-specific -r option:

netperf -H <remote> -t TCP_RR -- -r <req>,<rsp>

If that shows similar behaviour then you can ass-u-me it isn't your application.
One caveat though is that TCP_CORK mode in netperf is very primitive and may not
match what you are doing; however, that may be a good thing.
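
For example, with the sizes from the strace earlier in the thread (59-byte
request, 64 + 64000 = 64064-byte response), one possible invocation would be:

netperf -H 127.0.0.1 -t TCP_RR -- -r 59,64064

(the sizes only approximate the SMB exchange).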

http://www.netperf.org/svn/netperf2/tags/netperf-2.4.4/  or
ftp://ftp.netperf.org/netperf/

to get the current netperf bits.  It is also possible to get multiple 
transactions in flight at one time if you configure netperf with --enable-burst, 
which will then enable a test-specific -b option.  With the latest netperf you 
can also switch the output of a TCP_RR test to bits or bytes per second a la the 
_STREAM tests.

rick jones

> 
> My initial idea was that it has something to do with the different MTU on
> loopback.  My initial block size was 16k, but the problem persisted when I
> changed it to 64k.
> 
> Felix


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bizarre network timing problem
  2007-10-18 17:22         ` Rick Jones
@ 2007-11-02 22:11           ` Felix von Leitner
  2007-11-02 22:33             ` Rick Jones
  0 siblings, 1 reply; 11+ messages in thread
From: Felix von Leitner @ 2007-11-02 22:11 UTC (permalink / raw)
  To: Rick Jones; +Cc: Chuck Ebbert, linux-kernel, Netdev

Thus spake Rick Jones (rick.jones2@hp.com):
> >How could I test this theory?
> Can you take another trace that isn't so "cooked?"  One that just sticks 
> with TCP-level and below stuff?

Sorry for taking so long.  Here is a tcpdump.  The side on port 445 is
the SMB server using TCP_CORK.

23:03:20.283772 IP 127.0.0.1.33230 > 127.0.0.1.445: S 1503927325:1503927325(0) win 32792 <mss 16396,sackOK,timestamp 9451736 0,nop,wscale 7>
23:03:20.283774 IP 127.0.0.1.445 > 127.0.0.1.33230: S 1513925692:1513925692(0) ack 1503927326 win 32768 <mss 16396,sackOK,timestamp 9451737 9451736,nop,wscale 7>
23:03:20.283797 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 1 win 257 <nop,nop,timestamp 9451737 9451737>
23:03:20.295851 IP 127.0.0.1.33230 > 127.0.0.1.445: P 1:195(194) ack 1 win 257 <nop,nop,timestamp 9451740 9451737>
23:03:20.295881 IP 127.0.0.1.445 > 127.0.0.1.33230: . ack 195 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.295959 IP 127.0.0.1.445 > 127.0.0.1.33230: P 1:87(86) ack 195 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.295998 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 87 win 256 <nop,nop,timestamp 9451740 9451740>
23:03:20.296063 IP 127.0.0.1.33230 > 127.0.0.1.445: P 195:287(92) ack 87 win 256 <nop,nop,timestamp 9451740 9451740>
23:03:20.296096 IP 127.0.0.1.445 > 127.0.0.1.33230: P 87:181(94) ack 287 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.296135 IP 127.0.0.1.33230 > 127.0.0.1.445: P 287:373(86) ack 181 win 255 <nop,nop,timestamp 9451740 9451740>
23:03:20.296163 IP 127.0.0.1.445 > 127.0.0.1.33230: P 181:239(58) ack 373 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.296201 IP 127.0.0.1.33230 > 127.0.0.1.445: P 373:459(86) ack 239 win 255 <nop,nop,timestamp 9451740 9451740>
23:03:20.296245 IP 127.0.0.1.445 > 127.0.0.1.33230: P 239:309(70) ack 459 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.296286 IP 127.0.0.1.33230 > 127.0.0.1.445: P 459:535(76) ack 309 win 254 <nop,nop,timestamp 9451740 9451740>
23:03:20.296314 IP 127.0.0.1.445 > 127.0.0.1.33230: P 309:461(152) ack 535 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.296361 IP 127.0.0.1.33230 > 127.0.0.1.445: P 535:594(59) ack 461 win 253 <nop,nop,timestamp 9451740 9451740>
23:03:20.296400 IP 127.0.0.1.445 > 127.0.0.1.33230: . 461:16845(16384) ack 594 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.335748 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 16845 win 125 <nop,nop,timestamp 9451750 9451740>

[note the .2 sec pause]

23:03:20.547763 IP 127.0.0.1.445 > 127.0.0.1.33230: P 16845:32845(16000) ack 594 win 265 <nop,nop,timestamp 9451803 9451750>
23:03:20.547797 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 32845 win 0 <nop,nop,timestamp 9451803 9451803>
23:03:20.547855 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 32845 win 96 <nop,nop,timestamp 9451803 9451803>
23:03:20.547863 IP 127.0.0.1.445 > 127.0.0.1.33230: P 32845:33229(384) ack 594 win 265 <nop,nop,timestamp 9451803 9451803>
23:03:20.547890 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 33229 win 96 <nop,nop,timestamp 9451803 9451803>

[note the .2 sec pause]

23:03:20.755775 IP 127.0.0.1.445 > 127.0.0.1.33230: P 33229:45517(12288) ack 594 win 265 <nop,nop,timestamp 9451855 9451803>
23:03:20.755855 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 45517 win 96 <nop,nop,timestamp 9451855 9451855>
23:03:20.755868 IP 127.0.0.1.445 > 127.0.0.1.33230: P 45517:49613(4096) ack 594 win 265 <nop,nop,timestamp 9451855 9451855>
23:03:20.755898 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 49613 win 96 <nop,nop,timestamp 9451855 9451855>

[another one]

23:03:20.963789 IP 127.0.0.1.445 > 127.0.0.1.33230: P 49613:61901(12288) ack 594 win 265 <nop,nop,timestamp 9451907 9451855>
23:03:20.963871 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 61901 win 96 <nop,nop,timestamp 9451907 9451907>
23:03:20.963885 IP 127.0.0.1.445 > 127.0.0.1.33230: P 61901:64525(2624) ack 594 win 265 <nop,nop,timestamp 9451907 9451907>
23:03:20.963909 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 64525 win 96 <nop,nop,timestamp 9451907 9451907>
23:03:20.964101 IP 127.0.0.1.33230 > 127.0.0.1.445: P 594:653(59) ack 64525 win 96 <nop,nop,timestamp 9451907 9451907>
23:03:21.003790 IP 127.0.0.1.445 > 127.0.0.1.33230: . ack 653 win 265 <nop,nop,timestamp 9451917 9451907>
23:03:21.171811 IP 127.0.0.1.445 > 127.0.0.1.33230: P 64525:76813(12288) ack 653 win 265 <nop,nop,timestamp 9451959 9451907>

You get the idea.

Anyway, now THIS is the interesting case, because we have more than one packet
in the answer, and you see the first half of the answer leaving immediately
(when I wanted the whole answer to be sent), but the second half only leaving
after the .2 sec delay.
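
(The full answer is 64 + 64000 = 64064 bytes, and in the trace above it leaves as
segments of 16384 + 16000 + 384 + 12288 + 4096 + 12288 + 2624 bytes, with each
.2 sec pause sitting between two of those fragments.)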

> If SMB is a one-request-at-a-time protocol (I can never remember),

It is.

> you 
> could simulate it with a netperf TCP_RR test by passing suitable values to 
> the test-specific -r option:

> netperf -H <remote> -t TCP_RR -- -r <req>,<rsp>

> If that shows similar behaviour then you can ass-u-me it isn't your 
> application.

Oh I'm pretty sure it's not my application, because my application performs
well over ethernet, which is after all its purpose.  Also I see the
write, the TCP uncork, then a pause, and then the packet leaving.
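
In case it helps, the send path boils down to roughly the following (a minimal
sketch mirroring the strace I posted earlier; the function and fd names are
illustrative, the 64-byte header and 64000-byte chunk just match the trace, and
epoll handling and error checks are omitted):

#include <sys/socket.h>   /* setsockopt */
#include <sys/sendfile.h> /* sendfile */
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <netinet/tcp.h>  /* TCP_CORK */
#include <unistd.h>       /* write */

/* cork, write the SMB header, sendfile the payload, then uncork */
static void send_reply(int sock, int filefd, const void *hdr, size_t hdrlen,
                       off_t off, size_t len)
{
    int one = 1, zero = 0;

    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &one, sizeof one);   /* cork */
    write(sock, hdr, hdrlen);                                    /* 64-byte SMB header */
    sendfile(sock, filefd, &off, len);                           /* 64000 bytes of file data */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &zero, sizeof zero); /* uncork: should flush */
}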

Felix

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bizarre network timing problem
  2007-11-02 22:11           ` Felix von Leitner
@ 2007-11-02 22:33             ` Rick Jones
  2007-11-02 22:38               ` Felix von Leitner
  0 siblings, 1 reply; 11+ messages in thread
From: Rick Jones @ 2007-11-02 22:33 UTC (permalink / raw)
  To: Felix von Leitner; +Cc: Chuck Ebbert, linux-kernel, Netdev

Felix von Leitner wrote:
> Thus spake Rick Jones (rick.jones2@hp.com):
> 
>>>How could I test this theory?
>>
>>Can you take another trace that isn't so "cooked?"  One that just sticks 
>>with TCP-level and below stuff?
> 
> 
> Sorry for taking so long.  Here is a tcpdump.  The side on port 445 is
> the SMB server using TCP_CORK.
> 
> 23:03:20.283772 IP 127.0.0.1.33230 > 127.0.0.1.445: S 1503927325:1503927325(0) win 32792 <mss 16396,sackOK,timestamp 9451736 0,nop,wscale 7>
> 23:03:20.283774 IP 127.0.0.1.445 > 127.0.0.1.33230: S 1513925692:1513925692(0) ack 1503927326 win 32768 <mss 16396,sackOK,timestamp 9451737 9451736,nop,wscale 7>
 >
> 23:03:20.283797 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 1 win 257 <nop,nop,timestamp 9451737 9451737>
> 23:03:20.295851 IP 127.0.0.1.33230 > 127.0.0.1.445: P 1:195(194) ack 1 win 257 <nop,nop,timestamp 9451740 9451737>
> 23:03:20.295881 IP 127.0.0.1.445 > 127.0.0.1.33230: . ack 195 win 265 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.295959 IP 127.0.0.1.445 > 127.0.0.1.33230: P 1:87(86) ack 195 win 265 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.295998 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 87 win 256 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296063 IP 127.0.0.1.33230 > 127.0.0.1.445: P 195:287(92) ack 87 win 256 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296096 IP 127.0.0.1.445 > 127.0.0.1.33230: P 87:181(94) ack 287 win 265 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296135 IP 127.0.0.1.33230 > 127.0.0.1.445: P 287:373(86) ack 181 win 255 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296163 IP 127.0.0.1.445 > 127.0.0.1.33230: P 181:239(58) ack 373 win 265 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296201 IP 127.0.0.1.33230 > 127.0.0.1.445: P 373:459(86) ack 239 win 255 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296245 IP 127.0.0.1.445 > 127.0.0.1.33230: P 239:309(70) ack 459 win 265 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296286 IP 127.0.0.1.33230 > 127.0.0.1.445: P 459:535(76) ack 309 win 254 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296314 IP 127.0.0.1.445 > 127.0.0.1.33230: P 309:461(152) ack 535 win 265 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296361 IP 127.0.0.1.33230 > 127.0.0.1.445: P 535:594(59) ack 461 win 253 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296400 IP 127.0.0.1.445 > 127.0.0.1.33230: . 461:16845(16384) ack 594 win 265 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.335748 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 16845 win 125 <nop,nop,timestamp 9451750 9451740>
> 
> [note the .2 sec pause]

I wonder if the ack 16845 win 125 not updating the window is part of it?  With a
window scale of 7, the advertised window of 125 is only 16000 bytes, and it looks,
based on what follows, as though TCP has another 16384 bytes to send.  So my guess
is that TCP was waiting to have enough window, the persist timer expired, and TCP
then had to say "oh well, send what I can."  There is probably also a coupling with
this being less than the MSS (16396).
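
Spelling out that arithmetic (window scale 7, MSS 16396 from the SYN options above):

  advertised window: 125 << 7 = 16000 bytes
  still queued:      16384 bytes  (more than the window, but less than the 16396-byte MSS)
  later windows:      96 << 7 = 12288 bytes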

> 23:03:20.547763 IP 127.0.0.1.445 > 127.0.0.1.33230: P 16845:32845(16000) ack 594 win 265 <nop,nop,timestamp 9451803 9451750>
> 23:03:20.547797 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 32845 win 0 <nop,nop,timestamp 9451803 9451803>

Notice that an ACK comes back with a zero window in it - that means that by this 
point the receiver still hasn't consumed the 16384+16000 bytes sent to it.

> 23:03:20.547855 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 32845 win 96 <nop,nop,timestamp 9451803 9451803>

Now the receiver has pulled some data, on the order of 96*128 bytes, so TCP can 
go ahead and send the remaining 384 bytes.

> 23:03:20.547863 IP 127.0.0.1.445 > 127.0.0.1.33230: P 32845:33229(384) ack 594 win 265 <nop,nop,timestamp 9451803 9451803>
> 23:03:20.547890 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 33229 win 96 <nop,nop,timestamp 9451803 9451803>
> 
> [note the .2 sec pause]

I'll bet that 96 * 128 is 12288 and we have another persist timer expiring.

I also wonder if the behaviour might be different if you were using send() 
rather than sendfile() - just random musings...

> 23:03:20.755775 IP 127.0.0.1.445 > 127.0.0.1.33230: P 33229:45517(12288) ack 594 win 265 <nop,nop,timestamp 9451855 9451803>
> 23:03:20.755855 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 45517 win 96 <nop,nop,timestamp 9451855 9451855>
> 23:03:20.755868 IP 127.0.0.1.445 > 127.0.0.1.33230: P 45517:49613(4096) ack 594 win 265 <nop,nop,timestamp 9451855 9451855>
> 23:03:20.755898 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 49613 win 96 <nop,nop,timestamp 9451855 9451855>
> 
> [another one]
> 
> 23:03:20.963789 IP 127.0.0.1.445 > 127.0.0.1.33230: P 49613:61901(12288) ack 594 win 265 <nop,nop,timestamp 9451907 9451855>
> 23:03:20.963871 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 61901 win 96 <nop,nop,timestamp 9451907 9451907>
> 23:03:20.963885 IP 127.0.0.1.445 > 127.0.0.1.33230: P 61901:64525(2624) ack 594 win 265 <nop,nop,timestamp 9451907 9451907>
> 23:03:20.963909 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 64525 win 96 <nop,nop,timestamp 9451907 9451907>
> 23:03:20.964101 IP 127.0.0.1.33230 > 127.0.0.1.445: P 594:653(59) ack 64525 win 96 <nop,nop,timestamp 9451907 9451907>
> 23:03:21.003790 IP 127.0.0.1.445 > 127.0.0.1.33230: . ack 653 win 265 <nop,nop,timestamp 9451917 9451907>
> 23:03:21.171811 IP 127.0.0.1.445 > 127.0.0.1.33230: P 64525:76813(12288) ack 653 win 265 <nop,nop,timestamp 9451959 9451907>
> 
> You get the idea.
> 
> Anyway, now THIS is the interesting case, because we have two packets in
> the answer, and you see the first half of the answer leaving immediately
> (when I wanted the whole answer to be sent) but the second only leaving
> after the .2 sec delay.

And it wasn't waiting for an ACK/window-update.  You could try:

ifconfig lo mtu 1500

and see what happens then.

>>If SMB is a one-request-at-a-time protocol (I can never remember),
> It is.

Joy.

>> you could simulate it with a netperf TCP_RR test by passing suitable values
>> to the test-specific -r option:
> 
>>netperf -H <remote> -t TCP_RR -- -r <req>,<rsp>
> 
>>If that shows similar behaviour then you can ass-u-me it isn't your 
>>application.
> 
> 
> Oh I'm pretty sure it's not my application, because my application performs
> well over ethernet, which is after all its purpose.  Also I see the
> write, the TCP uncork, then a pause, and then the packet leaving.

Well, a wise old engineer tried to teach me that the proper spelling is ass-u-me 
:) so just for grins, you might try the TCP_RR test anyway :)  And even if your 
application is correct (although I wonder why the receiver isn't sucking 
data out very quickly...), if you can reproduce the problem with netperf it will 
be easier for others to do so.

rick jones

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bizarre network timing problem
  2007-11-02 22:33             ` Rick Jones
@ 2007-11-02 22:38               ` Felix von Leitner
  2007-11-02 22:58                 ` Rick Jones
  0 siblings, 1 reply; 11+ messages in thread
From: Felix von Leitner @ 2007-11-02 22:38 UTC (permalink / raw)
  To: Rick Jones; +Cc: Chuck Ebbert, linux-kernel, Netdev

Thus spake Rick Jones (rick.jones2@hp.com):
> >Oh I'm pretty sure it's not my application, because my application performs
> >well over ethernet, which is after all its purpose.  Also I see the
> >write, the TCP uncork, then a pause, and then the packet leaving.
> Well, a wise old engineer tried to teach me that the proper spelling is 
> ass-u-me :) so just for grins, you might try the TCP_RR test anyway :)  And 
> even if your application is correct (although I wonder why the receiver 
> isn't sucking data-out very quickly...) if you can reproduce the problem 
> with netperf it will be easier for others to do so.

My application is only the server, the receiver is smbget from Samba, so
I don't feel responsible for it :-)

Still, when run over Ethernet, it works fine without waiting for
timeouts to expire.

To reproduce this:

  - smbget is from samba, you probably already have this
  - gatling (my server) can be gotten from
    cvs -d :pserver:cvs@cvs.fefe.de:/cvs -z9 co dietlibc libowfat gatling

dietlibc is not strictly needed, but it's my environment.
First build dietlibc, then libowfat, then gatling.
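
In shell terms, presumably something along these lines (the make invocations are
just a guess; check each package's README):

  cvs -d :pserver:cvs@cvs.fefe.de:/cvs -z9 co dietlibc libowfat gatling
  make -C dietlibc
  make -C libowfat
  make -C gatling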

Felix

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bizarre network timing problem
  2007-11-02 22:38               ` Felix von Leitner
@ 2007-11-02 22:58                 ` Rick Jones
  2007-11-02 23:23                   ` Felix von Leitner
  0 siblings, 1 reply; 11+ messages in thread
From: Rick Jones @ 2007-11-02 22:58 UTC (permalink / raw)
  To: Felix von Leitner; +Cc: Chuck Ebbert, linux-kernel, Netdev

Felix von Leitner wrote:
> Thus spake Rick Jones (rick.jones2@hp.com):
> 
>>>Oh I'm pretty sure it's not my application, because my application performs
>>>well over ethernet, which is after all its purpose.  Also I see the
>>>write, the TCP uncork, then a pause, and then the packet leaving.
>>
>>Well, a wise old engineer tried to teach me that the proper spelling is 
>>ass-u-me :) so just for grins, you might try the TCP_RR test anyway :)  And 
>>even if your application is correct (although I wonder why the receiver 
>>isn't sucking data-out very quickly...) if you can reproduce the problem 
>>with netperf it will be easier for others to do so.
> 
> 
> My application is only the server, the receiver is smbget from Samba, so
> I don't feel responsible for it :-)

Might want to strace it anyway... no good deed (such as reporting a potential 
issue) goes unpunished :)

> Still, when run over Ethernet, it works fine without waiting for
> timeouts to expire.

Past performance is no guarantee of current correctness :)  And over an 
Ethernet, there will be a very different set of both timings and TCP segment 
sizes compared to loopback.

My guess is that you will find setting the lo mtu to 1500 a very interesting 
experiment.

> 
> To reproduce this:
> 
>   - smbget is from samba, you probably already have this
>   - gatling (my server) can be gotten from
>     cvs -d :pserver:cvs@cvs.fefe.de:/cvs -z9 co dietlibc libowfat gatling
> 
> dietlibc is not strictly needed, but it's my environment.
> First built dietlibc, then libowfat, then gatling.
> 
> Felix

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bizarre network timing problem
  2007-11-02 22:58                 ` Rick Jones
@ 2007-11-02 23:23                   ` Felix von Leitner
  2007-11-06 21:12                     ` Rick Jones
  0 siblings, 1 reply; 11+ messages in thread
From: Felix von Leitner @ 2007-11-02 23:23 UTC (permalink / raw)
  To: Rick Jones; +Cc: Chuck Ebbert, linux-kernel, Netdev

Thus spake Rick Jones (rick.jones2@hp.com):
> Past performance is no guarantee of current correctness :)  And over an 
> Ethernet, there will be a very different set of both timings and TCP 
> segment sizes compared to loopback.

> My guess is that you will find setting the lo mtu to 1500 a very 
> interesting experiment.

Setting the MTU on lo to 1500 eliminates the problem and gives me double-digit
MB/sec throughput.

Felix

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bizarre network timing problem
  2007-11-02 23:23                   ` Felix von Leitner
@ 2007-11-06 21:12                     ` Rick Jones
  0 siblings, 0 replies; 11+ messages in thread
From: Rick Jones @ 2007-11-06 21:12 UTC (permalink / raw)
  To: Felix von Leitner; +Cc: Chuck Ebbert, linux-kernel, Netdev

Felix von Leitner wrote:
> Thus spake Rick Jones (rick.jones2@hp.com):
> 
>>Past performance is no guarantee of current correctness :)  And over an 
>>Ethernet, there will be a very different set of both timings and TCP 
>>segment sizes compared to loopback.
> 
>>My guess is that you will find setting the lo mtu to 1500 a very 
>>interesting experiment.
> 
> Setting the MTU on lo to 1500 eliminates the problem and gives me double
> digit MB/sec throughput.

I'm not in a position to test it at the moment, as my IPoIB systems are offline, 
and I'm not sure you are either, but I will note that with IPoIB bits circa OFED 1.2 
the default MTU for IPoIB goes up to 65520 bytes.  If indeed the problem you 
were seeing was related to sub-MSS sends and window probing and such, it might 
appear on IPoIB in addition to loopback.

rick jones

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2007-11-06 21:12 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20071017205127.GA21334@codeblau.de>
2007-10-17 21:17 ` bizarre network timing problem Chuck Ebbert
2007-10-17 22:00   ` Felix von Leitner
2007-10-17 22:16     ` Rick Jones
2007-10-18  9:42       ` Felix von Leitner
2007-10-18 17:22         ` Rick Jones
2007-11-02 22:11           ` Felix von Leitner
2007-11-02 22:33             ` Rick Jones
2007-11-02 22:38               ` Felix von Leitner
2007-11-02 22:58                 ` Rick Jones
2007-11-02 23:23                   ` Felix von Leitner
2007-11-06 21:12                     ` Rick Jones
