* TCP sockets stalling - help! (long)
@ 2009-11-25 4:25 Asdo
2009-11-25 11:48 ` Ilpo Järvinen
2009-11-26 22:33 ` Frederic Leroy
0 siblings, 2 replies; 11+ messages in thread
From: Asdo @ 2009-11-25 4:25 UTC (permalink / raw)
To: netdev, e1000-devel
Hi netdev and e1000 people,
I have a weird problem here.
We have a Linux "Storage" server acting as a Samba, SCP and SFTP
fileserver. All clients are Windows: XP and Vista, 32 and 64 bits. The
building has gigabit network. We transfer up to many TB at a time.
I have tried from many clients to the same Server and there is always
the same problem: after hours that a transfer is running the TCP socket
suddenly stalls: 0 bytes/sec. It doesn't slow down, it stalls suddenly.
It never disconnects though.
From that moment onward, the TCP socket stays in a stalling condition
forever (I saw myself at least 10 hours) and both sides see it
Established forever. This is in case of no other networking activity on
the same network card.
In case of other networking activity someone told me that he did see the
transfer restart (=continue) by itself (it was an SFTP that time). I
think I saw this phenomenon myself once during the tcpdump at the bottom
of this email.
This problem happens on every protocol (Samba, SCP, SFTP) and apparently
any client (for sure I tested on at least 1 Vista 64bit and 1 XP 32bit).
Also happens in the middle of transferring one file, so not necessarily
at file change. The files are being pushed from the clients to the linux
Server.
The gigabit switch between the machines was changed 2 times (the brands
also were different) without results.
I was able to obtain apparently reliable nonstop transfer if I performed
6 simultaneous transfers with Samba from the same client to the same
server. OTOH with just 1 - 3 simultaneous transfers only all of them
will eventually stall. Note that with 2 transfers the speed of each one
is already smaller than the speed of a single transfer (approx
40+40MB/sec vs 60MB/sec) however with 6 transfers it's obviously much
smaller (approx 6x12MB/sec). The bug looks like some kind of a race
condition to me.
On Samba another strange fact is that if there are multiple transfers
(i.e. 2) from the same client to the same server (please note that in
this case with Samba all transfers pass through the same TCP socket) the
two transfers stall *independently* one to the other. First after a
random time you see one stalling, and eventually after another random
time also the other one stalls. This tells that the socket actually
kinda works... I can even stop one stalled transfer from the client (the
stop operation *works* and gets a reply from the Samba server: I see
messages sniffing the Ethernet), and start it again and it would run,
and the TCP socket is always the same when I restart the transfer.
I obtained apparently reliable transfers with Samba if I introduced rate
limiting on our SMC 8024L2 gigabit switch . Even if the rate limiting
was set so high (930Mbps) that the transfer speed was actually not
influenced, the Samba transfer (1 single transfer, not multiple)
apparently worked reliably. Maybe this helps with the race condition by
separating packets better. The SFTP protocol hanged anyway though, even
on rate limiting, however consider that SFTP from windows goes very slow
at 15Mbps due to a problem of encryption performance of WinSCP, so that
one is really way lower than the rate limit and its packets might not
get separated.
The server is a linux Ubuntu 9.10 with kernel ubuntu 2.6.31-14-server .
The ethernets are two integrated Intels (I tested both ethernet ports):
Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet
Controller (Copper) (rev 01)
the driver is kernel's e1000
This is the stack trace of server processes (Server side) during an SFTP
stall:
root 327 [CUT] \_ sshd: myuser [priv]
1011 380 [CUT] \_ sshd: myuser
1011 382 [CUT] \_ /usr/lib/openssh/sftp-server
root@mystorage:/proc# cat /proc/327/stack (this is sshd)
[<ffffffff814b428a>] unix_stream_data_wait+0xaa/0x110
[<ffffffff814b50cd>] unix_stream_recvmsg+0x36d/0x570
[<ffffffff81426399>] sock_aio_read+0x149/0x150
[<ffffffff8111e232>] do_sync_read+0xf2/0x130
[<ffffffff8111e8e1>] vfs_read+0x181/0x1a0
[<ffffffff8111edec>] sys_read+0x4c/0x80
[<ffffffff81011fc2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
root@mystorage:/proc# cat /proc/380/stack (this is the second sshd)
[<ffffffff8112e644>] poll_schedule_timeout+0x34/0x50
[<ffffffff8112ef4f>] do_select+0x58f/0x6b0
[<ffffffff8112f8b5>] core_sys_select+0x185/0x2b0
[<ffffffff8112fc32>] sys_select+0x42/0x110
[<ffffffff81011fc2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
root@mystorage:/proc# cat /proc/382/stack (this is sftp-server)
[<ffffffff8112e644>] poll_schedule_timeout+0x34/0x50
[<ffffffff8112ef4f>] do_select+0x58f/0x6b0
[<ffffffff8112f8b5>] core_sys_select+0x185/0x2b0
[<ffffffff8112fc32>] sys_select+0x42/0x110
[<ffffffff81011fc2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
so everything seems normal, it just seems the socket is waiting for data...
I am really puzzled because... I don't know how the Samba protocol
works, but for SFTP which I believe is a simple protocol, I believe the
client is just pushing raw bytes onto a simple TCP socket so it should
keep doing that while it receives Acks. OTOH if it does not receive Acks
the socket would disconnect. However in my case during a stall the
socket stays Established (I checked at BOTH sides!) but the client stops
pushing bytes to the server. (That's for a stall in the middle of a file
on SFTP protocol, which *does* happen) How on earth can that be??
(update: see tcpdump below, it might explain)
I am also thinking at a hardware problem. The server is running on a
Tyan mainboard S5396WA2NRF and we had another problem with this computer
(mobo?), apparently hardware related, with 3ware disk controllers, same
problem as this guy also running another Tyan:
http://www.webhostingtalk.com/archive/index.php/t-449670.html
which I thought might be reconducible to a lost interrupt. Even on a
lost interrupt I can't understand how to solve the Ack puzzle... this is
not a TOE network card!
This is what I did try up to now (and didn't help):
Changed some Samba parameters trying various values (clearly not useful
for SFTP btw). I ended up with a pretty common configuration, which
anyway didn't help:
socket options = TCP_NODELAY IPTOS_LOWDELAY SO_RCVBUF=8192
SO_SNDBUF=8192
read raw = yes
write raw = yes
kernel oplocks = yes
max xmit = 2048
getwd cache = yes
Changed the following parameters in /proc/sys/net:
1 -> 0 > ipv6/bindv6only (for Samba but it was bound already and I
couldn't reboot. SFTP tests were done in IPv4 though)
1 -> 0 > tcp_sack
1 -> 0 > tcp_dsack
0 -> 1 > tcp_low_latency
2 -> 0 > tcp_frto
cubic -> reno > tcp_congestion_control
3 -> 6 > tcp_retries1
3 -> 10 > tcp_tso_win_divisor
1 -> 0 > tcp_slow_start_after_idle
Then I disabled the following offloads using ethtool:
rx-checksumming on --> off
tx-checksumming on --> off
scatter-gather on --> off
tcp-segmentation-offload: on --> off
generic-segmentation-offload: on --> off
Nothing helped.
This is the SFTP socket when it's running (Server side):
# cat /proc/net/tcp
2: 0F12A8C0:0016 2612A8C0:C0C6 01 00000000:00000000 02:000AB698
00000000 0 0 267816 4 ffff88084b1e6180 24 4 1 11 -1
Later on, when it has been stuck for a while:
# cat /proc/net/tcp
2: 0F12A8C0:0016 2612A8C0:C0C6 01 00000000:00000000 02:0002AC6B
00000000 0 0 267816 2 ffff88084b1e6180 27 4 1 21 -1
(numbers reference:
http://www.mjmwired.net/kernel/Documentation/networking/proc_net_tcp.txt )
as you can see from the last values, the timeout and window have
enlarged a bit, but nothing dramatic. Apparently they are very similar.
This is another SFTP stall after I disabled the offloading via ethtool:
(different transfer, different socket from above)
2: 0F12A8C0:0016 2512A8C0:09BC 01 00000000:00000000 02:000602F4
00000000 0 0 5301100 2 ffff88084ac02080 28 4 0 13 -1
so the values are similar again, apart the ack.pingpong which has become
0 (something to do with offloading?)
This night I decided to actually tcpdump some traffic at the server
side. This dump has a chance of being incorrect/misleading because the
client I could use was already transferring other stuff to the server
via Samba. I used SFTP protocol for this test but due to the
simultaneous Samba transfer I could not obtain a long SFTP hangup. The
longest hangup I obtained is 1 minute. I hope this is really a
manifestation of the bug in question, and not a simple temporary
starvation at server side.
Here is the relevant excerpt:
00:54:36.303031 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.],
seq 12309129:12309229, ack 997119732, win 65535, length 100
00:54:36.303216 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.],
seq 997119732:997120040, ack 12309229, win 64871, length 308
00:54:36.303450 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.],
seq 12309229:12309297, ack 997120040, win 65535, length 68
00:54:36.303566 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.],
seq 997120040:997120228, ack 12309297, win 64803, length 188
00:54:36.303645 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.],
seq 12309297:12309365, ack 997120228, win 65535, length 68
00:54:36.304414 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.],
seq 997120228:997120432, ack 12309365, win 64735, length 204
00:54:36.345583 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
997120432, win 65535, length 0 *** PROBLEM IS HERE ***
00:55:27.171687 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.],
seq 12309365:12309425, ack 997120432, win 65535, length 60
00:55:27.342682 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.],
seq 997120432:997120944, ack 12309425, win 64675, length 512
00:55:27.342726 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
997120944, win 65535, length 0
00:55:27.342775 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [.], seq
997120944:997122404, ack 12309425, win 64675, length 1460
00:55:27.342787 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
997122404, win 65535, length 0
00:55:27.342792 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.],
seq 997122404:997122612, ack 12309425, win 64675, length 208
00:55:27.342795 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
997122612, win 65535, length 0
It seems to me the TCP stack at the Server returns an ack to the client
but waits 1 minutes before being able to give the further 60 bytes to
the client (these are application-generated bytes, right?) which are
probably needed at client-side for the functioning of the SFTP protocol.
I suspect the SFTP-server application receives the last client data just
at the end of the minute of delay (at 00:55:27.1something)
Probably if the dump was made without concomitant Samba activity, the
sftp-server application would have never received such last client bytes
and that's why the hangup lasts forever.
However the TCP socket remains established because the TCP stack works
and acks are sent properly
Could this be a good interpretation of the bug?
I don't know what are those 60 bytes though, can you confirm is it a
part of the SFTP or SSH protocol?
I observed other similar micro-hangups:
...
00:50:59.210115 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
3346661444, win 65535, length 0
00:51:14.137978 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.],
seq 7757313:7757457, ack 3346661444, win 65535, length 144
...
00:52:02.495785 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.],
seq 8612969:8613037, ack 3711382248, win 65535, length 68
00:52:02.507752 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.],
seq 3711382248:3711382452, ack 8613037, win 65399, length 204
00:52:02.545585 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
3711382452, win 65535, length 0
00:52:14.568489 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.],
seq 8613037:8613097, ack 3711382452, win 65535, length 60
00:52:14.572162 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.],
seq 3711382452:3711382964, ack 8613097, win 65339, length 512
00:52:14.572184 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
3711382964, win 65535, length 0
00:52:14.572312 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [.], seq
3711382964:3711384424, ack 8613097, win 65339, length 1460
...
00:55:30.464957 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
1040635920, win 65535, length 0
00:55:32.230205 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.],
seq 12410749:12410809, ack 1040635920, win 65535, length 60
...
00:55:54.404027 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
1362376840, win 65535, length 0
00:56:08.349224 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.],
seq 13157437:13157497, ack 1362376840, win 65535, length 60
...
00:56:28.494333 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
1424904580, win 65535, length 0
00:56:31.450352 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.],
seq 13307333:13307393, ack 1424904580, win 65535, length 60
Update: actually I just found a disk performance problem at Server side
so that the disks couldn't go faster than 70MB/sec (stripe_cache_size
was too low and readahead was too low), and I'm thinking if this could
be the reason for indeed triggering the bug: if there was a starvation
of the TCP/IP stack or network card driver at the wrong time (because
i.e. the CPUs are doing something else like accessing the disks), could
that prevent the 60-bytes packet from ever being transmitted...? (in
case of no other incoming ethernet packets which can give the TCPIP or
e1000 driver a second chance to run)
Thanks for your help
Asdo
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: TCP sockets stalling - help! (long)
2009-11-25 4:25 TCP sockets stalling - help! (long) Asdo
@ 2009-11-25 11:48 ` Ilpo Järvinen
2009-11-25 13:07 ` Asdo
2009-11-26 22:33 ` Frederic Leroy
1 sibling, 1 reply; 11+ messages in thread
From: Ilpo Järvinen @ 2009-11-25 11:48 UTC (permalink / raw)
To: Asdo; +Cc: Netdev, e1000-devel
On Wed, 25 Nov 2009, Asdo wrote:
> Hi netdev and e1000 people,
>
> I have a weird problem here.
>
> We have a Linux "Storage" server acting as a Samba, SCP and SFTP fileserver.
> All clients are Windows: XP and Vista, 32 and 64 bits. The building has
> gigabit network. We transfer up to many TB at a time.
>
> I have tried from many clients to the same Server and there is always the same
> problem: after hours that a transfer is running the TCP socket suddenly
> stalls: 0 bytes/sec. It doesn't slow down, it stalls suddenly. It never
> disconnects though.
>
> From that moment onward, the TCP socket stays in a stalling condition forever
> (I saw myself at least 10 hours) and both sides see it Established forever.
> This is in case of no other networking activity on the same network card.
> In case of other networking activity someone told me that he did see the
> transfer restart (=continue) by itself (it was an SFTP that time). I think I
> saw this phenomenon myself once during the tcpdump at the bottom of this
> email.
>
> This problem happens on every protocol (Samba, SCP, SFTP) and apparently any
> client (for sure I tested on at least 1 Vista 64bit and 1 XP 32bit).
>
> Also happens in the middle of transferring one file, so not necessarily at
> file change. The files are being pushed from the clients to the linux Server.
>
> The gigabit switch between the machines was changed 2 times (the brands also
> were different) without results.
>
> I was able to obtain apparently reliable nonstop transfer if I performed 6
> simultaneous transfers with Samba from the same client to the same server.
> OTOH with just 1 - 3 simultaneous transfers only all of them will eventually
> stall. Note that with 2 transfers the speed of each one is already smaller
> than the speed of a single transfer (approx 40+40MB/sec vs 60MB/sec) however
> with 6 transfers it's obviously much smaller (approx 6x12MB/sec). The bug
> looks like some kind of a race condition to me.
>
> On Samba another strange fact is that if there are multiple transfers (i.e. 2)
> from the same client to the same server (please note that in this case with
> Samba all transfers pass through the same TCP socket) the two transfers stall
> *independently* one to the other. First after a random time you see one
> stalling, and eventually after another random time also the other one stalls.
> This tells that the socket actually kinda works... I can even stop one stalled
> transfer from the client (the stop operation *works* and gets a reply from the
> Samba server: I see messages sniffing the Ethernet), and start it again and it
> would run, and the TCP socket is always the same when I restart the transfer.
>
> I obtained apparently reliable transfers with Samba if I introduced rate
> limiting on our SMC 8024L2 gigabit switch . Even if the rate limiting was set
> so high (930Mbps) that the transfer speed was actually not influenced, the
> Samba transfer (1 single transfer, not multiple) apparently worked reliably.
> Maybe this helps with the race condition by separating packets better. The
> SFTP protocol hanged anyway though, even on rate limiting, however consider
> that SFTP from windows goes very slow at 15Mbps due to a problem of encryption
> performance of WinSCP, so that one is really way lower than the rate limit and
> its packets might not get separated.
>
> The server is a linux Ubuntu 9.10 with kernel ubuntu 2.6.31-14-server . The
> ethernets are two integrated Intels (I tested both ethernet ports):
> Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller
> (Copper) (rev 01)
> the driver is kernel's e1000
>
> This is the stack trace of server processes (Server side) during an SFTP
> stall:
>
> root 327 [CUT] \_ sshd: myuser [priv]
> 1011 380 [CUT] \_ sshd: myuser
> 1011 382 [CUT] \_ /usr/lib/openssh/sftp-server
>
> root@mystorage:/proc# cat /proc/327/stack (this is sshd)
> [<ffffffff814b428a>] unix_stream_data_wait+0xaa/0x110
> [<ffffffff814b50cd>] unix_stream_recvmsg+0x36d/0x570
> [<ffffffff81426399>] sock_aio_read+0x149/0x150
> [<ffffffff8111e232>] do_sync_read+0xf2/0x130
> [<ffffffff8111e8e1>] vfs_read+0x181/0x1a0
> [<ffffffff8111edec>] sys_read+0x4c/0x80
> [<ffffffff81011fc2>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> root@mystorage:/proc# cat /proc/380/stack (this is the second sshd)
> [<ffffffff8112e644>] poll_schedule_timeout+0x34/0x50
> [<ffffffff8112ef4f>] do_select+0x58f/0x6b0
> [<ffffffff8112f8b5>] core_sys_select+0x185/0x2b0
> [<ffffffff8112fc32>] sys_select+0x42/0x110
> [<ffffffff81011fc2>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> root@mystorage:/proc# cat /proc/382/stack (this is sftp-server)
> [<ffffffff8112e644>] poll_schedule_timeout+0x34/0x50
> [<ffffffff8112ef4f>] do_select+0x58f/0x6b0
> [<ffffffff8112f8b5>] core_sys_select+0x185/0x2b0
> [<ffffffff8112fc32>] sys_select+0x42/0x110
> [<ffffffff81011fc2>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> so everything seems normal, it just seems the socket is waiting for data...
>
> I am really puzzled because... I don't know how the Samba protocol works, but
> for SFTP which I believe is a simple protocol, I believe the client is just
> pushing raw bytes onto a simple TCP socket so it should keep doing that while
> it receives Acks. OTOH if it does not receive Acks the socket would
> disconnect. However in my case during a stall the socket stays Established (I
> checked at BOTH sides!) but the client stops pushing bytes to the server.
> (That's for a stall in the middle of a file on SFTP protocol, which *does*
> happen) How on earth can that be?? (update: see tcpdump below, it might
> explain)
>
> I am also thinking at a hardware problem. The server is running on a Tyan
> mainboard S5396WA2NRF and we had another problem with this computer (mobo?),
> apparently hardware related, with 3ware disk controllers, same problem as this
> guy also running another Tyan:
> http://www.webhostingtalk.com/archive/index.php/t-449670.html
> which I thought might be reconducible to a lost interrupt. Even on a lost
> interrupt I can't understand how to solve the Ack puzzle... this is not a TOE
> network card!
>
> This is what I did try up to now (and didn't help):
> Changed some Samba parameters trying various values (clearly not useful for
> SFTP btw). I ended up with a pretty common configuration, which anyway didn't
> help:
> socket options = TCP_NODELAY IPTOS_LOWDELAY SO_RCVBUF=8192 SO_SNDBUF=8192
> read raw = yes
> write raw = yes
> kernel oplocks = yes
> max xmit = 2048
> getwd cache = yes
>
>
> Changed the following parameters in /proc/sys/net:
> 1 -> 0 > ipv6/bindv6only (for Samba but it was bound already and I couldn't
> reboot. SFTP tests were done in IPv4 though)
> 1 -> 0 > tcp_sack
> 1 -> 0 > tcp_dsack
> 0 -> 1 > tcp_low_latency
> 2 -> 0 > tcp_frto
> cubic -> reno > tcp_congestion_control
> 3 -> 6 > tcp_retries1
> 3 -> 10 > tcp_tso_win_divisor
> 1 -> 0 > tcp_slow_start_after_idle
>
>
> Then I disabled the following offloads using ethtool:
> rx-checksumming on --> off
> tx-checksumming on --> off
> scatter-gather on --> off
> tcp-segmentation-offload: on --> off
> generic-segmentation-offload: on --> off
>
> Nothing helped.
>
>
> This is the SFTP socket when it's running (Server side):
> # cat /proc/net/tcp
> 2: 0F12A8C0:0016 2612A8C0:C0C6 01 00000000:00000000 02:000AB698 00000000
> 0 0 267816 4 ffff88084b1e6180 24 4 1 11 -1
>
> Later on, when it has been stuck for a while:
> # cat /proc/net/tcp
> 2: 0F12A8C0:0016 2612A8C0:C0C6 01 00000000:00000000 02:0002AC6B 00000000
> 0 0 267816 2 ffff88084b1e6180 27 4 1 21 -1
> (numbers reference:
> http://www.mjmwired.net/kernel/Documentation/networking/proc_net_tcp.txt )
>
> as you can see from the last values, the timeout and window have enlarged a
> bit, but nothing dramatic. Apparently they are very similar.
This end has no data to send (the numbers after " 01 ", first one is the
size of the send queue), so it is correct behavior from TCP side.
> This is another SFTP stall after I disabled the offloading via ethtool:
> (different transfer, different socket from above)
> 2: 0F12A8C0:0016 2512A8C0:09BC 01 00000000:00000000 02:000602F4 00000000
> 0 0 5301100 2 ffff88084ac02080 28 4 0 13 -1
> so the values are similar again, apart the ack.pingpong which has become 0
> (something to do with offloading?)
Likewise, no data to send here either at the moment you took this snippet.
> This night I decided to actually tcpdump some traffic at the server side. This
> dump has a chance of being incorrect/misleading because the client I could use
> was already transferring other stuff to the server via Samba. I used SFTP
> protocol for this test but due to the simultaneous Samba transfer I could not
> obtain a long SFTP hangup. The longest hangup I obtained is 1 minute. I hope
> this is really a manifestation of the bug in question, and not a simple
> temporary starvation at server side.
>
> Here is the relevant excerpt:
> 00:54:36.303031 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], seq
> 12309129:12309229, ack 997119732, win 65535, length 100
> 00:54:36.303216 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.], seq
> 997119732:997120040, ack 12309229, win 64871, length 308
> 00:54:36.303450 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], seq
> 12309229:12309297, ack 997120040, win 65535, length 68
> 00:54:36.303566 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.], seq
> 997120040:997120228, ack 12309297, win 64803, length 188
> 00:54:36.303645 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], seq
> 12309297:12309365, ack 997120228, win 65535, length 68
> 00:54:36.304414 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.], seq
> 997120228:997120432, ack 12309365, win 64735, length 204
> 00:54:36.345583 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
> 997120432, win 65535, length 0 *** PROBLEM IS HERE ***
> 00:55:27.171687 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], seq
> 12309365:12309425, ack 997120432, win 65535, length 60
> 00:55:27.342682 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.], seq
> 997120432:997120944, ack 12309425, win 64675, length 512
> 00:55:27.342726 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
> 997120944, win 65535, length 0
> 00:55:27.342775 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [.], seq
> 997120944:997122404, ack 12309425, win 64675, length 1460
> 00:55:27.342787 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
> 997122404, win 65535, length 0
> 00:55:27.342792 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.], seq
> 997122404:997122612, ack 12309425, win 64675, length 208
> 00:55:27.342795 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
> 997122612, win 65535, length 0
>
> It seems to me the TCP stack at the Server returns an ack to the client but
> waits 1 minutes before being able to give the further 60 bytes to the client
> (these are application-generated bytes, right?) which are probably needed at
> client-side for the functioning of the SFTP protocol.
> I suspect the SFTP-server application receives the last client data just at
> the end of the minute of delay (at 00:55:27.1something)
> Probably if the dump was made without concomitant Samba activity, the
> sftp-server application would have never received such last client bytes and
> that's why the hangup lasts forever.
> However the TCP socket remains established because the TCP stack works and
> acks are sent properly
> Could this be a good interpretation of the bug?
>
> I don't know what are those 60 bytes though, can you confirm is it a part of
> the SFTP or SSH protocol?
>
> I observed other similar micro-hangups:
> ...
> 00:50:59.210115 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
> 3346661444, win 65535, length 0
> 00:51:14.137978 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], seq
> 7757313:7757457, ack 3346661444, win 65535, length 144
> ...
> 00:52:02.495785 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], seq
> 8612969:8613037, ack 3711382248, win 65535, length 68
> 00:52:02.507752 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.], seq
> 3711382248:3711382452, ack 8613037, win 65399, length 204
> 00:52:02.545585 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
> 3711382452, win 65535, length 0
> 00:52:14.568489 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], seq
> 8613037:8613097, ack 3711382452, win 65535, length 60
> 00:52:14.572162 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.], seq
> 3711382452:3711382964, ack 8613097, win 65339, length 512
> 00:52:14.572184 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
> 3711382964, win 65535, length 0
> 00:52:14.572312 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [.], seq
> 3711382964:3711384424, ack 8613097, win 65339, length 1460
> ...
> 00:55:30.464957 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
> 1040635920, win 65535, length 0
> 00:55:32.230205 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], seq
> 12410749:12410809, ack 1040635920, win 65535, length 60
> ...
> 00:55:54.404027 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
> 1362376840, win 65535, length 0
> 00:56:08.349224 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], seq
> 13157437:13157497, ack 1362376840, win 65535, length 60
> ...
> 00:56:28.494333 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack
> 1424904580, win 65535, length 0
> 00:56:31.450352 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], seq
> 13307333:13307393, ack 1424904580, win 65535, length 60
>
> Update: actually I just found a disk performance problem at Server side so
> that the disks couldn't go faster than 70MB/sec (stripe_cache_size was too low
> and readahead was too low), and I'm thinking if this could be the reason for
> indeed triggering the bug: if there was a starvation of the TCP/IP stack or
> network card driver at the wrong time (because i.e. the CPUs are doing
> something else like accessing the disks), could that prevent the 60-bytes
> packet from ever being transmitted...? (in case of no other incoming ethernet
> packets which can give the TCPIP or e1000 driver a second chance to run)
...I'd next try strace the sftp server to see what it was doing
during the stall.
--
i.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: TCP sockets stalling - help! (long)
2009-11-25 11:48 ` Ilpo Järvinen
@ 2009-11-25 13:07 ` Asdo
2009-11-25 15:36 ` Asdo
0 siblings, 1 reply; 11+ messages in thread
From: Asdo @ 2009-11-25 13:07 UTC (permalink / raw)
To: Ilpo Järvinen; +Cc: e1000-devel, Netdev
Ilpo Järvinen wrote:
> ...I'd next try strace the sftp server to see what it was doing
> during the stall.
>
Thanks for your help Ilpo
Isn't the strace equivalent to the stack trace I obtained via cat
/proc/pid/stack reported previously? That was at the time of the stall
I'm thinking the strace would slow down sftp-server very deeply...
------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now. http://p.sf.net/sfu/bobj-july
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: TCP sockets stalling - help! (long)
2009-11-25 13:07 ` Asdo
@ 2009-11-25 15:36 ` Asdo
2009-11-25 16:38 ` Asdo
0 siblings, 1 reply; 11+ messages in thread
From: Asdo @ 2009-11-25 15:36 UTC (permalink / raw)
To: Ilpo Järvinen; +Cc: e1000-devel, Netdev
[-- Attachment #1.1: Type: text/plain, Size: 3381 bytes --]
Asdo wrote:
> Ilpo Järvinen wrote:
>
>> ...I'd next try strace the sftp server to see what it was doing
>> during the stall.
>>
>>
> Thanks for your help Ilpo
>
> Isn't the strace equivalent to the stack trace I obtained via cat
> /proc/pid/stack reported previously? That was at the time of the stall
>
> I'm thinking the strace would slow down sftp-server very deeply...
>
I found out that if I attach the strace I can see at least the last
function call.
The SFTP it's hanged right now so I did that:
root@mystorage:/root# strace -p 11475
Process 11475 attached - interrupt to quit
select(5, [3], [], NULL, NULL
(stuck here forever... doesn't move)
(it's strange the first option of select is 5, shouldn't it be 4 from
man select? A bug of strace maybe?)
root@mystorage:/root# cat /proc/11475/stack
[<ffffffff8112e644>] poll_schedule_timeout+0x34/0x50
[<ffffffff8112ef4f>] do_select+0x58f/0x6b0
[<ffffffff8112f8b5>] core_sys_select+0x185/0x2b0
[<ffffffff8112fc32>] sys_select+0x42/0x110
[<ffffffff8101225b>] tracesys+0xd9/0xde
[<ffffffffffffffff>] 0xffffffffffffffff
And this is from cat /proc/net/tcp
2: 0F12A8C0:0016 2512A8C0:0FBD 01 00000000:00000000 02:00009144
00000000 0 0 5326251 2 ffff88085408ce00 26 4 1 9 4
The select refers to open files so here they are:
root@mystorage:/proc/11475/fd# ll
total 0
lr-x------ 1 ccosentino wetlab 64 2009-11-25 14:43 0 -> pipe:[5326309]
l-wx------ 1 ccosentino wetlab 64 2009-11-25 14:43 1 -> pipe:[5326310]
l-wx------ 1 ccosentino wetlab 64 2009-11-25 14:43 2 -> pipe:[5326311]
lr-x------ 1 ccosentino wetlab 64 2009-11-25 14:43 3 -> pipe:[5326309]
l-wx------ 1 ccosentino wetlab 64 2009-11-25 14:43 4 -> pipe:[5326310]
l-wx------ 1 ccosentino wetlab 64 2009-11-25 14:43 5 ->
/path/to/file_being_saved.filepart
I tried to send SIGSTOP and then SIGCONT to see if I could make it make
a loop and then reenter into the select. I'm not sure it really did
that, what do you think? This is the strace:
root@mystorage:/root# strace -p 11475 2>&1 | tee sftpstrace.dmp
Process 11475 attached - interrupt to quit
select(5, [3], [], NULL, NULL) = ? ERESTARTNOHAND (To be restarted)
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
select(5, [3], [], NULL, NULL) = ? ERESTARTNOHAND (To be restarted)
--- SIGCONT (Continued) @ 0 (0) ---
select(5, [3], [], NULL, NULL
(hanged again here)
Do you think this info is enough or I really have to strace it since the
beginning?
If it is a race condition it might not happen if the sftp-server is
deeply slowed down by the strace.
If I had a way to make it continue right now we could get the rest of
the strace... But it's not so easy, I tried starting a Samba transfer
but it did not unlock the SFTP this time. SIGSTOP + SIGCONT also didn't
work.
BTW people using the Storage also experienced data loss while pushing
files in it: appartently data disappeared from the middle of a file they
were saving to the Storage.
To me looks like another hint that application-level data which has been
received via network by TCP stack is trapped there and not being pushed
to the application.
Or the data might even be trapped into the anonymous sockets between
sshd and sftp-server.
Thanks for your help
[-- Attachment #2: Type: text/plain, Size: 354 bytes --]
------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now. http://p.sf.net/sfu/bobj-july
[-- Attachment #3: Type: text/plain, Size: 164 bytes --]
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: TCP sockets stalling - help! (long)
2009-11-25 15:36 ` Asdo
@ 2009-11-25 16:38 ` Asdo
2009-11-25 22:29 ` [E1000-devel] " Ilpo Järvinen
0 siblings, 1 reply; 11+ messages in thread
From: Asdo @ 2009-11-25 16:38 UTC (permalink / raw)
To: Ilpo Järvinen; +Cc: e1000-devel, Netdev
Asdo wrote:
> Asdo wrote:
>> Ilpo Järvinen wrote:
>>
>>> ...I'd next try strace the sftp server to see what it was doing
>>> during the stall.
>>>
>> Thanks for your help Ilpo
>>
>> Isn't the strace equivalent to the stack trace I obtained via cat
>> /proc/pid/stack reported previously? That was at the time of the stall
>>
>> I'm thinking the strace would slow down sftp-server very deeply...
>>
>
> I found out that if I attach the strace I can see at least the last
> function call.
> The SFTP it's hanged right now so I did that:
>
> root@mystorage:/root# strace -p 11475
> Process 11475 attached - interrupt to quit
> select(5, [3], [], NULL, NULL
>
> (stuck here forever... doesn't move)
> (it's strange the first option of select is 5, shouldn't it be 4 from
> man select? A bug of strace maybe?)
>
> root@mystorage:/root# cat /proc/11475/stack
> [<ffffffff8112e644>] poll_schedule_timeout+0x34/0x50
> [<ffffffff8112ef4f>] do_select+0x58f/0x6b0
> [<ffffffff8112f8b5>] core_sys_select+0x185/0x2b0
> [<ffffffff8112fc32>] sys_select+0x42/0x110
> [<ffffffff8101225b>] tracesys+0xd9/0xde
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> And this is from cat /proc/net/tcp 2: 0F12A8C0:0016 2512A8C0:0FBD 01
> 00000000:00000000 02:00009144 00000000 0 0 5326251 2
> ffff88085408ce00 26 4 1 9 4
>
> The select refers to open files so here they are:
>
>
> root@mystorage:/proc/11475/fd# ll
> total 0
> lr-x------ 1 ccosentino wetlab 64 2009-11-25 14:43 0 -> pipe:[5326309]
> l-wx------ 1 ccosentino wetlab 64 2009-11-25 14:43 1 -> pipe:[5326310]
> l-wx------ 1 ccosentino wetlab 64 2009-11-25 14:43 2 -> pipe:[5326311]
> lr-x------ 1 ccosentino wetlab 64 2009-11-25 14:43 3 -> pipe:[5326309]
> l-wx------ 1 ccosentino wetlab 64 2009-11-25 14:43 4 -> pipe:[5326310]
> l-wx------ 1 ccosentino wetlab 64 2009-11-25 14:43 5 ->
> /path/to/file_being_saved.filepart
>
> I tried to send SIGSTOP and then SIGCONT to see if I could make it
> make a loop and then reenter into the select. I'm not sure it really
> did that, what do you think? This is the strace:
> root@mystorage:/root# strace -p 11475 2>&1 | tee sftpstrace.dmp
> Process 11475 attached - interrupt to quit
> select(5, [3], [], NULL, NULL) = ? ERESTARTNOHAND (To be
> restarted)
> --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
> --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
> select(5, [3], [], NULL, NULL) = ? ERESTARTNOHAND (To be
> restarted)
> --- SIGCONT (Continued) @ 0 (0) ---
> select(5, [3], [], NULL, NULL
>
> (hanged again here)
>
>
> Do you think this info is enough or I really have to strace it since
> the beginning?
> If it is a race condition it might not happen if the sftp-server is
> deeply slowed down by the strace.
> If I had a way to make it continue right now we could get the rest of
> the strace... But it's not so easy, I tried starting a Samba transfer
> but it did not unlock the SFTP this time. SIGSTOP + SIGCONT also
> didn't work.
>
> BTW people using the Storage also experienced data loss while pushing
> files in it: appartently data disappeared from the middle of a file
> they were saving to the Storage.
> To me looks like another hint that application-level data which has
> been received via network by TCP stack is trapped there and not being
> pushed to the application.
> Or the data might even be trapped into the anonymous sockets between
> sshd and sftp-server.
>
> Thanks for your help
I thought that it would be more meaningful to try the cat stack, the
strace and the SIGSTOP SIGCONT tricks on the sshd which actually holds
the TCP socket, instead of the sftp-server. Here they are, on the sshd:
root@mystorage:/root# strace -p 11449 2>&1 | tee -a sshd1strace1.dmp
Process 11449 attached - interrupt to quit
read(5, 0x7fff4e956220, 4) = ? ERESTARTSYS (To be restarted)
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
read(5, 0x7fff4e956220, 4) = ? ERESTARTSYS (To be restarted)
--- SIGCONT (Continued) @ 0 (0) ---
read(5,
(hanged again here after the SIGSTOP + SIGCONT, unfortunately)
root@mystorage:/root# cat /proc/11449/stack
[<ffffffff814b428a>] unix_stream_data_wait+0xaa/0x110
[<ffffffff814b50cd>] unix_stream_recvmsg+0x36d/0x570
[<ffffffff81426399>] sock_aio_read+0x149/0x150
[<ffffffff8111e232>] do_sync_read+0xf2/0x130
[<ffffffff8111e8e1>] vfs_read+0x181/0x1a0
[<ffffffff8111edec>] sys_read+0x4c/0x80
[<ffffffff8101225b>] tracesys+0xd9/0xde
[<ffffffffffffffff>] 0xffffffffffffffff
root@mystorage:/proc/11449/fd# ll
total 0
lrwx------ 1 root root 64 2009-11-25 14:43 0 -> /dev/null
lrwx------ 1 root root 64 2009-11-25 14:43 1 -> /dev/null
lrwx------ 1 root root 64 2009-11-25 14:43 2 -> /dev/null
lrwx------ 1 root root 64 2009-11-25 14:43 3 -> socket:[5326251]
lrwx------ 1 root root 64 2009-11-25 14:43 5 -> socket:[5326303]
I would like to try to attach a program of mine (a python interactive)
to "socket:[5326303]" and try to read data from there to see if
something comes out of the TCP stack, but I don't know where to find the
socket:[5326303] on the filesystem. Is it mapped to a file anywhere in
Linux?
Thank you
------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now. http://p.sf.net/sfu/bobj-july
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [E1000-devel] TCP sockets stalling - help! (long)
2009-11-25 16:38 ` Asdo
@ 2009-11-25 22:29 ` Ilpo Järvinen
0 siblings, 0 replies; 11+ messages in thread
From: Ilpo Järvinen @ 2009-11-25 22:29 UTC (permalink / raw)
To: Asdo; +Cc: e1000-devel, Netdev
[-- Attachment #1: Type: TEXT/PLAIN, Size: 6731 bytes --]
On Wed, 25 Nov 2009, Asdo wrote:
> Asdo wrote:
> > Asdo wrote:
> > > Ilpo Järvinen wrote:
> > >
> > > > ...I'd next try strace the sftp server to see what it was doing during
> > > > the stall.
> > > >
> > > Thanks for your help Ilpo
> > >
> > > Isn't the strace equivalent to the stack trace I obtained via cat
> > > /proc/pid/stack reported previously? That was at the time of the stall
> > >
> > > I'm thinking the strace would slow down sftp-server very deeply...
> > >
> >
> > I found out that if I attach the strace I can see at least the last function
> > call.
> > The SFTP it's hanged right now so I did that:
> >
> > root@mystorage:/root# strace -p 11475
> > Process 11475 attached - interrupt to quit
> > select(5, [3], [], NULL, NULL
> >
> > (stuck here forever... doesn't move)
> > (it's strange the first option of select is 5, shouldn't it be 4 from man
> > select? A bug of strace maybe?)
> >
> > root@mystorage:/root# cat /proc/11475/stack
> > [<ffffffff8112e644>] poll_schedule_timeout+0x34/0x50
> > [<ffffffff8112ef4f>] do_select+0x58f/0x6b0
> > [<ffffffff8112f8b5>] core_sys_select+0x185/0x2b0
> > [<ffffffff8112fc32>] sys_select+0x42/0x110
> > [<ffffffff8101225b>] tracesys+0xd9/0xde
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > And this is from cat /proc/net/tcp 2: 0F12A8C0:0016 2512A8C0:0FBD 01
> > 00000000:00000000 02:00009144 00000000 0 0 5326251 2
> > ffff88085408ce00 26 4 1 9 4
> >
> > The select refers to open files so here they are:
> >
> >
> > root@mystorage:/proc/11475/fd# ll
> > total 0
> > lr-x------ 1 ccosentino wetlab 64 2009-11-25 14:43 0 -> pipe:[5326309]
> > l-wx------ 1 ccosentino wetlab 64 2009-11-25 14:43 1 -> pipe:[5326310]
> > l-wx------ 1 ccosentino wetlab 64 2009-11-25 14:43 2 -> pipe:[5326311]
> > lr-x------ 1 ccosentino wetlab 64 2009-11-25 14:43 3 -> pipe:[5326309]
> > l-wx------ 1 ccosentino wetlab 64 2009-11-25 14:43 4 -> pipe:[5326310]
> > l-wx------ 1 ccosentino wetlab 64 2009-11-25 14:43 5 ->
> > /path/to/file_being_saved.filepart
> >
> > I tried to send SIGSTOP and then SIGCONT to see if I could make it make a
> > loop and then reenter into the select. I'm not sure it really did that, what
> > do you think? This is the strace:
> > root@mystorage:/root# strace -p 11475 2>&1 | tee sftpstrace.dmp
> > Process 11475 attached - interrupt to quit
> > select(5, [3], [], NULL, NULL) = ? ERESTARTNOHAND (To be restarted)
> > --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
> > --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
> > select(5, [3], [], NULL, NULL) = ? ERESTARTNOHAND (To be restarted)
> > --- SIGCONT (Continued) @ 0 (0) ---
> > select(5, [3], [], NULL, NULL
> >
> > (hanged again here)
> >
> >
> > Do you think this info is enough or I really have to strace it since the
> > beginning?
> > If it is a race condition it might not happen if the sftp-server is deeply
> > slowed down by the strace.
> > If I had a way to make it continue right now we could get the rest of the
> > strace... But it's not so easy, I tried starting a Samba transfer but it did
> > not unlock the SFTP this time. SIGSTOP + SIGCONT also didn't work.
> >
> > BTW people using the Storage also experienced data loss while pushing files
> > in it: appartently data disappeared from the middle of a file they were
> > saving to the Storage.
> > To me looks like another hint that application-level data which has been
> > received via network by TCP stack is trapped there and not being pushed to
> > the application.
> > Or the data might even be trapped into the anonymous sockets between sshd
> > and sftp-server.
> >
> > Thanks for your help
>
> I thought that it would be more meaningful to try the cat stack, the strace
> and the SIGSTOP SIGCONT tricks on the sshd which actually holds the TCP
> socket, instead of the sftp-server. Here they are, on the sshd:
>
> root@mystorage:/root# strace -p 11449 2>&1 | tee -a sshd1strace1.dmp
> Process 11449 attached - interrupt to quit
> read(5, 0x7fff4e956220, 4) = ? ERESTARTSYS (To be restarted)
> --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
> --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
> read(5, 0x7fff4e956220, 4) = ? ERESTARTSYS (To be restarted)
> --- SIGCONT (Continued) @ 0 (0) ---
> read(5,
>
> (hanged again here after the SIGSTOP + SIGCONT, unfortunately)
>
> root@mystorage:/root# cat /proc/11449/stack
> [<ffffffff814b428a>] unix_stream_data_wait+0xaa/0x110
> [<ffffffff814b50cd>] unix_stream_recvmsg+0x36d/0x570
> [<ffffffff81426399>] sock_aio_read+0x149/0x150
> [<ffffffff8111e232>] do_sync_read+0xf2/0x130
> [<ffffffff8111e8e1>] vfs_read+0x181/0x1a0
> [<ffffffff8111edec>] sys_read+0x4c/0x80
> [<ffffffff8101225b>] tracesys+0xd9/0xde
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> root@mystorage:/proc/11449/fd# ll
> total 0
> lrwx------ 1 root root 64 2009-11-25 14:43 0 -> /dev/null
> lrwx------ 1 root root 64 2009-11-25 14:43 1 -> /dev/null
> lrwx------ 1 root root 64 2009-11-25 14:43 2 -> /dev/null
> lrwx------ 1 root root 64 2009-11-25 14:43 3 -> socket:[5326251]
> lrwx------ 1 root root 64 2009-11-25 14:43 5 -> socket:[5326303]
>
>
> I would like to try to attach a program of mine (a python interactive) to
> "socket:[5326303]" and try to read data from there to see if something comes
> out of the TCP stack, but I don't know where to find the socket:[5326303] on
> the filesystem. Is it mapped to a file anywhere in Linux?
This line still tells that there is nothing to read from the socket:
> > And this is from cat /proc/net/tcp 2: 0F12A8C0:0016 2512A8C0:0FBD 01
> > 00000000:00000000 02:00009144 00000000 0 0 5326251 2
Let me reiterate, the first zeros here are the write side queue, the
latter are for the receive (not including ofo queue but if tcpdump doesn't
show ofo that's not significant detail here).
> > ffff88085408ce00 26 4 1 9 4
To me it seems that this particular problem so far shows very little
problem on TCP side. Ie., TCP just "stalls" because it is not fed with
anything.
What I would want to make absolutely sure, is that sftp-server was waiting
for a pipe which comes from sshd, I don't think we have confirmed that
just yet.
Also, I'd want to get a tcpdump, strace (the latter attach might be ok
but I'm not sure just yet) and /proc/net/tcp from a consistent case rather
than trying to mix multiple captures from different stalls. This is very
important since your captures are already in disagreement on what is
happening.
In addition, I want to see TCP timestamps (probably you need more -vvv to
tcpdump to show them), however, if there are no timestamps in use at all
and 1gpbs speeds, it could potentially explain the missing data already.
--
i.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: TCP sockets stalling - help! (long)
2009-11-25 4:25 TCP sockets stalling - help! (long) Asdo
2009-11-25 11:48 ` Ilpo Järvinen
@ 2009-11-26 22:33 ` Frederic Leroy
2009-11-27 9:10 ` Ilpo Järvinen
2009-11-27 12:56 ` Asdo
1 sibling, 2 replies; 11+ messages in thread
From: Frederic Leroy @ 2009-11-26 22:33 UTC (permalink / raw)
To: Asdo; +Cc: netdev, e1000-devel
Le Wed, 25 Nov 2009 05:25:28 +0100,
Asdo <asdo@shiftmail.org> a écrit :
> Hi netdev and e1000 people,
>
> I have a weird problem here.
> [...]
I have the same kind of problem at home. I can't copy big files over
ssh unless I limit the rate ( ssh -l 8000 ... )
After reading your mail, I make some quick tests ( see below).
Results in short :
- netcat always work.
- ssh stalls, nearly immediately.
I thought fault cames from a kernel update or my wifi card update.
Apparently, I was wrong for the hardware.
Here are the test I just made :
network:
SG ----- eth_100_Mbps ----- ISV_BOX ))) wifi_~1.5Mb ((( HOUBA
host HOUBA:
----
gentoo - updated frequently (once a week)
kernel vanilla : Linux houba 2.6.32-rc6-fredo4-04866-g47caac1 #5
PREEMPT Fri Nov 6 14:20:20 CET 2009 i686 Intel(R) Pentium(R) M
processor 1400MHz GenuineIntel GNU/Linux
host SG:
----
debian stable - not updated recently
kernel vanilla : Linux sg 2.6.32-rc5-fredo3 #2 Mon Oct 19 02:25:29 CEST
2009 i586 GNU/Linux
/dev/sda is a USB disk mounted on media/big
TESTS
-----
HOUBA command for tests :
netcat -l -p 2000 | pv > /dev/null
TEST1 - OK : ~5min
fredo@sg:~/remote_wake$ cat /dev/null | netcat houba 2000
TEST2 - OK : ~5min
fredo@sg:~/remote_wake$ cat /dev/sda | netcat houba 2000
TEST3 - OK :
fredo@sg:~/remote_wake$
cat /media/big/home/fredo/cdanslair/cdanslair_20091125.wmv | netcat -q
0 houba 2000
result : 291MO 0:03:32 [1,37MO/s]
TEST4 - FAILED
fredo@sg:~/remote_wake$
scp /media/big/home/fredo/cdanslair/cdanslair_20091125.wmv houba:
Stalled within seconds (5216Kb transferred) :
cdanslair_20091125.wmv 1% 5216KB 0.0KB/s - stalled -
--
Frédéric Leroy
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: TCP sockets stalling - help! (long)
2009-11-26 22:33 ` Frederic Leroy
@ 2009-11-27 9:10 ` Ilpo Järvinen
2009-11-27 12:56 ` Asdo
1 sibling, 0 replies; 11+ messages in thread
From: Ilpo Järvinen @ 2009-11-27 9:10 UTC (permalink / raw)
To: Frederic Leroy; +Cc: Asdo, Netdev, e1000-devel
[-- Attachment #1: Type: TEXT/PLAIN, Size: 782 bytes --]
On Thu, 26 Nov 2009, Frederic Leroy wrote:
> Le Wed, 25 Nov 2009 05:25:28 +0100,
> Asdo <asdo@shiftmail.org> a écrit :
>
> > Hi netdev and e1000 people,
> >
> > I have a weird problem here.
> > [...]
>
> I have the same kind of problem at home. I can't copy big files over
> ssh unless I limit the rate ( ssh -l 8000 ... )
> After reading your mail, I make some quick tests ( see below).
>
> Results in short :
> - netcat always work.
> - ssh stalls, nearly immediately.
One important result is missing though... could you perhaps get a tcpdump
from the stall and get also the info available in /proc/net/tcp for the
connection (during the stall), that would help some in figuring out what
is going on... ...Preferably in new/renamed thread, don't forget to cc me.
--
i.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: TCP sockets stalling - help! (long)
2009-11-26 22:33 ` Frederic Leroy
2009-11-27 9:10 ` Ilpo Järvinen
@ 2009-11-27 12:56 ` Asdo
2009-11-27 20:53 ` Frederic Leroy
1 sibling, 1 reply; 11+ messages in thread
From: Asdo @ 2009-11-27 12:56 UTC (permalink / raw)
To: Frederic Leroy; +Cc: e1000-devel, netdev, Ilpo Järvinen
[-- Attachment #1.1: Type: text/plain, Size: 1423 bytes --]
Frederic Leroy wrote:
> Le Wed, 25 Nov 2009 05:25:28 +0100,
> Asdo <asdo@shiftmail.org> a écrit :
>
>
>> Hi netdev and e1000 people,
>>
>> I have a weird problem here.
>> [...]
>>
>
> I have the same kind of problem at home. I can't copy big files over
> ssh unless I limit the rate ( ssh -l 8000 ... )
> After reading your mail, I make some quick tests ( see below).
>
> Results in short :
> - netcat always work.
> - ssh stalls, nearly immediately.
>
Good God it's true then!
I think I have an idea for why Netcat works while scp doesn't:
probably SCP pushing data requires replies from the application layer
(the 60 bytes I was telling about for SFTP) while netcat doesnt, it just
needs the Acks to keep pushing data.
Still if I am right, with Netcat you might end up with a corrupted file
on the other side (shorter). Or maybe not, depending on how exactly the
bug is made. Could you check this? Filesize and md5sum.
I think I just reproduced the bug on another machine, different hardware
(Supermicro mainboard), same Ubuntu, same kernel, same integrated Intel
network card. It just takes 5 hours or so for me to trigger.
Do you have Intel network card, with e1000e driver on SG? (see lsmod)
(sorry I was wrong before, it's e1000e in my case not e1000) Otherwise
it's probably the TCP layer...
Would you paste here your lspci and lsmod?
Thank you
[-- Attachment #2: Type: text/plain, Size: 354 bytes --]
------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now. http://p.sf.net/sfu/bobj-july
[-- Attachment #3: Type: text/plain, Size: 164 bytes --]
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: TCP sockets stalling - help! (long)
2009-11-27 12:56 ` Asdo
@ 2009-11-27 20:53 ` Frederic Leroy
2009-11-27 21:00 ` Ilpo Järvinen
0 siblings, 1 reply; 11+ messages in thread
From: Frederic Leroy @ 2009-11-27 20:53 UTC (permalink / raw)
To: Asdo; +Cc: netdev, e1000-devel, Ilpo Järvinen
Le Fri, 27 Nov 2009 13:56:13 +0100,
Asdo <asdo@shiftmail.org> a écrit :
> Frederic Leroy wrote:
> > Le Wed, 25 Nov 2009 05:25:28 +0100,
> > Asdo <asdo@shiftmail.org> a écrit :
> >
> >
> >> Hi netdev and e1000 people,
> >>
> >> I have a weird problem here.
> >> [...]
> > I have the same kind of problem at home. I can't copy big files over
> > [...]
> Good God it's true then!
Yes :)
> Still if I am right, with Netcat you might end up with a corrupted
> file on the other side (shorter). Or maybe not, depending on how
> exactly the bug is made. Could you check this? Filesize and md5sum.
md5sum are ok.
I can't reproduce the problem on wire with a switch ( using both
10 and 100Mbp FD )
> Do you have Intel network card, with e1000e driver on SG? (see lsmod)
> (sorry I was wrong before, it's e1000e in my case not e1000)
> Otherwise it's probably the TCP layer...
Sorry, I forgot to precise the network hardware.
I forgot a noname 100Mbp switch between the isv box and sg host.
SG:
00:0a.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL-8139/8139C/8139C+ (rev 10)
using module 8139cp
Devel:
02:02.0 Network controller: Broadcom Corporation BCM4306 802.11b/g
Wireless LAN Controller (rev 03)
using b43 driver.
02:08.0 Ethernet controller: Intel Corporation 82801DB PRO/100 VE (MOB)
Ethernet Controller (rev 81)
using e100 in kernel ( no module )
My isv box is broadcom based, but I don't know which chipset for
wireless.
Since our hardware are different, I don't think it's coming from
hardware driver. I'm preparing traces for Ilpo
--
Frédéric Leroy
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: TCP sockets stalling - help! (long)
2009-11-27 20:53 ` Frederic Leroy
@ 2009-11-27 21:00 ` Ilpo Järvinen
0 siblings, 0 replies; 11+ messages in thread
From: Ilpo Järvinen @ 2009-11-27 21:00 UTC (permalink / raw)
To: Frederic Leroy; +Cc: Asdo, Netdev, e1000-devel
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1910 bytes --]
On Fri, 27 Nov 2009, Frederic Leroy wrote:
> Le Fri, 27 Nov 2009 13:56:13 +0100,
> Asdo <asdo@shiftmail.org> a écrit :
>
> > Frederic Leroy wrote:
> > > Le Wed, 25 Nov 2009 05:25:28 +0100,
> > > Asdo <asdo@shiftmail.org> a écrit :
> > >
> > >
> > >> Hi netdev and e1000 people,
> > >>
> > >> I have a weird problem here.
> > >> [...]
> > > I have the same kind of problem at home. I can't copy big files over
> > > [...]
> > Good God it's true then!
>
> Yes :)
>
> > Still if I am right, with Netcat you might end up with a corrupted
> > file on the other side (shorter). Or maybe not, depending on how
> > exactly the bug is made. Could you check this? Filesize and md5sum.
>
> md5sum are ok.
> I can't reproduce the problem on wire with a switch ( using both
> 10 and 100Mbp FD )
>
> > Do you have Intel network card, with e1000e driver on SG? (see lsmod)
> > (sorry I was wrong before, it's e1000e in my case not e1000)
> > Otherwise it's probably the TCP layer...
>
> Sorry, I forgot to precise the network hardware.
> I forgot a noname 100Mbp switch between the isv box and sg host.
>
> SG:
>
> 00:0a.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> RTL-8139/8139C/8139C+ (rev 10)
> using module 8139cp
>
> Devel:
>
> 02:02.0 Network controller: Broadcom Corporation BCM4306 802.11b/g
> Wireless LAN Controller (rev 03)
> using b43 driver.
>
> 02:08.0 Ethernet controller: Intel Corporation 82801DB PRO/100 VE (MOB)
> Ethernet Controller (rev 81)
> using e100 in kernel ( no module )
>
> My isv box is broadcom based, but I don't know which chipset for
> wireless.
>
> Since our hardware are different, I don't think it's coming from
> hardware driver. I'm preparing traces for Ilpo
One worth of checking would be ssh versions if they match (if they have
broken something, though it's unlikely). I'm also relatively convinced
that hw issues are not what'll find.
--
i.
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2009-11-27 21:00 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-11-25 4:25 TCP sockets stalling - help! (long) Asdo
2009-11-25 11:48 ` Ilpo Järvinen
2009-11-25 13:07 ` Asdo
2009-11-25 15:36 ` Asdo
2009-11-25 16:38 ` Asdo
2009-11-25 22:29 ` [E1000-devel] " Ilpo Järvinen
2009-11-26 22:33 ` Frederic Leroy
2009-11-27 9:10 ` Ilpo Järvinen
2009-11-27 12:56 ` Asdo
2009-11-27 20:53 ` Frederic Leroy
2009-11-27 21:00 ` Ilpo Järvinen
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).