netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
@ 2008-10-01 12:52 Ilpo Järvinen
  2008-10-01 16:27 ` Dâniel Fraga
  0 siblings, 1 reply; 39+ messages in thread
From: Ilpo Järvinen @ 2008-10-01 12:52 UTC (permalink / raw)
  To: Dâniel Fraga, tglx, David Miller; +Cc: Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1059 bytes --]

Hi Daniel,

I forward part of the details to public knowledge (and for Thomas mainly).

I also put the epoll & accept only log available for him at
http://www.cs.helsinki.fi/u/ijjarvin/tcp/epoll_accept.txt
as it won't contain encryption key, etc. related bits.

Can you Daniel please confirm what exactly was the status about the effect 
of disabling ntpd?

-- 
 i.

---------- Forwarded message ----------
From: Dâniel Fraga <fragabr@gmail.com>
To: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Subject: Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround
Date: Tue, 23 Sep 2008 00:56:17 -0300

On Mon, 22 Sep 2008 14:22:12 +0300 (EEST)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> When using nmap to resolve, take note on exact timestamp (including 
> seconds). E.g., 
> $ date > nmap.ts; nmap ...

	Hi Ilpo, so we can hurry up this debug, I'm sending before,
during and after the stall, just to you:

	Obs.: the stall happened at +- 13:06:47... you'll notice lots of epoll_wait.

	The nmap was done at +- 13:07:15.

	Thanks again! :)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-01 12:52 [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd) Ilpo Järvinen
@ 2008-10-01 16:27 ` Dâniel Fraga
  2008-10-01 20:05   ` Thomas Gleixner
  0 siblings, 1 reply; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-01 16:27 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: tglx, David Miller, Netdev

On Wed, 1 Oct 2008 15:52:19 +0300 (EEST)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> Hi Daniel,
> 
> I forward part of the details to public knowledge (and for Thomas mainly).
> 
> I also put the epoll & accept only log available for him at
> http://www.cs.helsinki.fi/u/ijjarvin/tcp/epoll_accept.txt
> as it won't contain encryption key, etc. related bits.
> 
> Can you Daniel please confirm what exactly was the status about the effect 
> of disabling ntpd?

	Hi Ilpo, ok, thanks. Following suggestion of kernel developers
from bugzilla, I tested with 2.6.27-rc7 and 2.6.27-rc8. The problem
happens less frequently, but still happens.

	Disabling ntpd helps to not "trigger" the problem often. With
ntpd enabled the problem happens at least once a day. Without ntpd, it
doesn't happen anymore or seldom happens. Disabling high resolution
timers helps too.

	I'll follow your previous suggestions, but I didn't have time
yet. If anyone needs more info, just ask. Thanks.

	Ps: just to clarify for Thomas. The problem has started in
2.6.25 kernel and remain in 2.6.26 until now (2.6.27-rc8). Something
(probably related to timers) changed in 2.6.25 which causes this,
although the effect is seem on network stalling.


-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-01 16:27 ` Dâniel Fraga
@ 2008-10-01 20:05   ` Thomas Gleixner
  2008-10-01 20:46     ` Dâniel Fraga
  2008-10-01 21:14     ` Dâniel Fraga
  0 siblings, 2 replies; 39+ messages in thread
From: Thomas Gleixner @ 2008-10-01 20:05 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Ilpo Järvinen, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2606 bytes --]

On Wed, 1 Oct 2008, Dâniel Fraga wrote:
> On Wed, 1 Oct 2008 15:52:19 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> 
> > Hi Daniel,
> > 
> > I forward part of the details to public knowledge (and for Thomas mainly).
> > 
> > I also put the epoll & accept only log available for him at
> > http://www.cs.helsinki.fi/u/ijjarvin/tcp/epoll_accept.txt
> > as it won't contain encryption key, etc. related bits.
> > 
> > Can you Daniel please confirm what exactly was the status about the effect 
> > of disabling ntpd?
> 
> 	Hi Ilpo, ok, thanks. Following suggestion of kernel developers
> from bugzilla, I tested with 2.6.27-rc7 and 2.6.27-rc8. The problem
> happens less frequently, but still happens.
> 
> 	Disabling ntpd helps to not "trigger" the problem often. With
> ntpd enabled the problem happens at least once a day. Without ntpd, it
> doesn't happen anymore or seldom happens. Disabling high resolution
> timers helps too.
> 
> 	I'll follow your previous suggestions, but I didn't have time
> yet. If anyone needs more info, just ask. Thanks.
> 
> 	Ps: just to clarify for Thomas. The problem has started in
> 2.6.25 kernel and remain in 2.6.26 until now (2.6.27-rc8). Something
> (probably related to timers) changed in 2.6.25 which causes this,
> although the effect is seem on network stalling.

Lots of things related to timers changed over the last kernel
versions, but the big changes were 2.6.21 for 32bit and 2.6.24 for
64bit, where the high resolution timer were enabled. Since then we
have only bigfixes and improvements in the timer related code. 2.6.25
has no fundamental changes in the timer code at all.

I think your observation vs. ntpd and hrtimer is just a red
herring. It influences the visibility of the problem by shifting
timings around, but it does not pinpoint them as the root cause.

I might be wrong as usual, but in that case I insist on "in dubio pro
reo". :)

I looked at the strace output and the point where you said it stalled.
We can see the epoll_wait() calls of all worker threads not come back
for 30 seconds. At the point where you said you poked with nmap at the
system they all come back out of the blue with exactly 1 event set each.

I really can not connect the timer system to that behaviour at all,
except there is some timer interaction deep inside of the networking
code causing this, but I have no clue where I should start digging.

One possibility to get deeper insight into this problem is to use the
function tracer and stop it once we notice the wreckage. 
listenoverflow should be a good point to stop it.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-01 20:05   ` Thomas Gleixner
@ 2008-10-01 20:46     ` Dâniel Fraga
  2008-10-01 21:14     ` Dâniel Fraga
  1 sibling, 0 replies; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-01 20:46 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Ilpo Järvinen, David Miller, Netdev

On Wed, 1 Oct 2008 22:05:09 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> Lots of things related to timers changed over the last kernel
> versions, but the big changes were 2.6.21 for 32bit and 2.6.24 for
> 64bit, where the high resolution timer were enabled. Since then we
> have only bigfixes and improvements in the timer related code. 2.6.25
> has no fundamental changes in the timer code at all.
> 
> I think your observation vs. ntpd and hrtimer is just a red
> herring. It influences the visibility of the problem by shifting
> timings around, but it does not pinpoint them as the root cause.
> 
> I might be wrong as usual, but in that case I insist on "in dubio pro
> reo". :)

	Ok :) I understand your point ;).

> One possibility to get deeper insight into this problem is to use the
> function tracer and stop it once we notice the wreckage. 
> listenoverflow should be a good point to stop it.

	Ok. I'll do it and return the results.

[*] Kernel Function Tracer

	Thank you very much! :)

	Ps: if we figure it out soon, maybe we can solve this before
2.6.27 final release since Linus said that rc8 would be the last one.

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-01 20:05   ` Thomas Gleixner
  2008-10-01 20:46     ` Dâniel Fraga
@ 2008-10-01 21:14     ` Dâniel Fraga
  2008-10-02 10:42       ` Thomas Gleixner
  1 sibling, 1 reply; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-01 21:14 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Ilpo Järvinen, David Miller, Netdev

On Wed, 1 Oct 2008 22:05:09 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> One possibility to get deeper insight into this problem is to use the
> function tracer and stop it once we notice the wreckage. 
> listenoverflow should be a good point to stop it.

	Ok, I'm already running function tracer.

	I did the following:

echo ftrace > /debug/tracing/current_tracer
echo print-parent > /debug/tracing/iter_ctrl

	Is it enough to get what you suggest?

	If so, I'll leave it enabled:

echo 1 > /debug/tracing/tracing_enabled

	and stop and get the data just when the problem happens... then
I should know what I should grep, what's relevant.

	thank you again.


-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-01 21:14     ` Dâniel Fraga
@ 2008-10-02 10:42       ` Thomas Gleixner
  2008-10-02 11:17         ` Dâniel Fraga
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2008-10-02 10:42 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Ilpo Järvinen, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2427 bytes --]

On Wed, 1 Oct 2008, Dâniel Fraga wrote:
> On Wed, 1 Oct 2008 22:05:09 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > One possibility to get deeper insight into this problem is to use the
> > function tracer and stop it once we notice the wreckage. 
> > listenoverflow should be a good point to stop it.
> 
> 	Ok, I'm already running function tracer.
> 
> 	I did the following:
> 
> echo ftrace > /debug/tracing/current_tracer
> echo print-parent > /debug/tracing/iter_ctrl
> 
> 	Is it enough to get what you suggest?

Please add:

  echo sym-offset > /debug/tracing/iter_ctrl
  echo sym-addr > /debug/tracing/iter_ctrl
  echo sym-verbose > /debug/tracing/iter_ctrl

> 	If so, I'll leave it enabled:
> 
> echo 1 > /debug/tracing/tracing_enabled
> 
> 	and stop and get the data just when the problem happens... then
> I should know what I should grep, what's relevant.

I'm not sure, whether manual stop will be sufficient as the relevant
traces might have been gone by that time. 

Find below a patch which disables the tracer when the listen overflow
is detected. It could occassionally be hit in a legimate case as well,
but we did not come up with any better idea.

If it triggers, please cat /debug/tracing/trace to a file. Can you
bzip2 and upload it to some place. The trace does not contain any
sensible private information.

Thanks,

	tglx
---
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index f6e3af3..a1cecb0 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -68,6 +68,12 @@ static void ftrace_list_func(unsigned long ip, unsigned long parent_ip)
 	};
 }
 
+void stop_event_tracer_now(void)
+{
+	tracer_enabled = 0;
+	printk(KERN_INFO "Tracer stopped\n");
+}
+
 /**
  * clear_ftrace_function - reset the ftrace function
  *
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f976fc5..527dd45 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -485,6 +485,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 	return newsk;
 }
 
+extern void stop_event_tracer_now(void);
+
 /*
  *	Process an incoming packet for SYN_RECV sockets represented
  *	as a request_sock.
@@ -687,6 +689,7 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
 	return child;
 
 listen_overflow:
+	stop_event_tracer_now();
 	if (!sysctl_tcp_abort_on_overflow) {
 		inet_rsk(req)->acked = 1;
 		return NULL;

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-02 10:42       ` Thomas Gleixner
@ 2008-10-02 11:17         ` Dâniel Fraga
  2008-10-02 12:08           ` Thomas Gleixner
  0 siblings, 1 reply; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-02 11:17 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Ilpo Järvinen, David Miller, Netdev

On Thu, 2 Oct 2008 12:42:44 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

>   echo sym-verbose > /debug/tracing/iter_ctrl

	Ok, this one I changed to:

echo verbose > /debug/tracing/iter_ctrl

	;) since there's no sym-verbose.

> Find below a patch which disables the tracer when the listen overflow
> is detected. It could occassionally be hit in a legimate case as well,
> but we did not come up with any better idea.

	Ok, I applied the patch, but it returns this error:

  CC      kernel/trace/ftrace.o
kernel/trace/ftrace.c: In function 'stop_event_tracer_now':
kernel/trace/ftrace.c:73: error: 'tracer_enabled' undeclared (first use in this function)
kernel/trace/ftrace.c:73: error: (Each undeclared identifier is reported only once
kernel/trace/ftrace.c:73: error: for each function it appears in.)
make[2]: *** [kernel/trace/ftrace.o] Error 1
make[1]: *** [kernel/trace] Error 2
make: *** [kernel] Error 2

> If it triggers, please cat /debug/tracing/trace to a file. Can you
> bzip2 and upload it to some place. The trace does not contain any
> sensible private information.

	Ok. Thanks.
-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-02 11:17         ` Dâniel Fraga
@ 2008-10-02 12:08           ` Thomas Gleixner
  2008-10-02 14:09             ` Dâniel Fraga
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2008-10-02 12:08 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Ilpo Järvinen, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1099 bytes --]

On Thu, 2 Oct 2008, Dâniel Fraga wrote:

> On Thu, 2 Oct 2008 12:42:44 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> >   echo sym-verbose > /debug/tracing/iter_ctrl
> 
> 	Ok, this one I changed to:
> 
> echo verbose > /debug/tracing/iter_ctrl
> 
> 	;) since there's no sym-verbose.
> 
> > Find below a patch which disables the tracer when the listen overflow
> > is detected. It could occassionally be hit in a legimate case as well,
> > but we did not come up with any better idea.
> 
> 	Ok, I applied the patch, but it returns this error:
> 
>   CC      kernel/trace/ftrace.o
> kernel/trace/ftrace.c: In function 'stop_event_tracer_now':
> kernel/trace/ftrace.c:73: error: 'tracer_enabled' undeclared (first use in this function)
> kernel/trace/ftrace.c:73: error: (Each undeclared identifier is reported only once
> kernel/trace/ftrace.c:73: error: for each function it appears in.)
> make[2]: *** [kernel/trace/ftrace.o] Error 1
> make[1]: *** [kernel/trace] Error 2
> make: *** [kernel] Error 2

grmbl. tracer_enabled should have been ftrace_enabled of course.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-02 12:08           ` Thomas Gleixner
@ 2008-10-02 14:09             ` Dâniel Fraga
  2008-10-02 19:16               ` Thomas Gleixner
  0 siblings, 1 reply; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-02 14:09 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Ilpo Järvinen, David Miller, Netdev

On Thu, 2 Oct 2008 14:08:14 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> grmbl. tracer_enabled should have been ftrace_enabled of course.

	Ok Thomas. I got a stall on my desktop.

	It happens when I try to use sudo and it just stay stalled (I
think trying to access /dev/log or something like that).

	I got the trace here:

http://www.abusar.org/ftrace/trace.txt.bz2

	after the stall.

	Could you please check if there's something wrong?

	I'd prefer if the stall happened in the server, but it still
didn't happen. Anyway, it's the same problem.

	Ps: this time I couldn't solve it with nmap (sometimes nmap
solves, sometimes it doesn't).

	Ps2: in this stall, the tracing wasn't disabled automatically
by your patch, because it happened in a different situation (when I
tried to use sudo). But I know it's the same problem, because the
"solution" I use is always the same (nmap) and until it's not stalled
anymore, I can't connect to any port on my machine.

	Ps3: if this trace is useless, wait until I get something on
the server.

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-02 14:09             ` Dâniel Fraga
@ 2008-10-02 19:16               ` Thomas Gleixner
  2008-10-02 20:31                 ` Ilpo Järvinen
  2008-10-02 20:56                 ` [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd) Dâniel Fraga
  0 siblings, 2 replies; 39+ messages in thread
From: Thomas Gleixner @ 2008-10-02 19:16 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Ilpo Järvinen, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 998 bytes --]

On Thu, 2 Oct 2008, Dâniel Fraga wrote:

> On Thu, 2 Oct 2008 14:08:14 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > grmbl. tracer_enabled should have been ftrace_enabled of course.
> 
> 	Ok Thomas. I got a stall on my desktop.
> 
> 	It happens when I try to use sudo and it just stay stalled (I
> think trying to access /dev/log or something like that).
> 
> 	I got the trace here:
> 
> http://www.abusar.org/ftrace/trace.txt.bz2
> 
> 	after the stall.
> 
> 	Could you please check if there's something wrong?
> 
> 	I'd prefer if the stall happened in the server, but it still
> didn't happen. Anyway, it's the same problem.

Hmm, not sure. sudo is a different beast. All I can see in the trace
is that reiserfs is turning itself in circles (which is nothing new),
but there is nothing which might tell us about a stall. At the point
where you stopped the trace the relevant info was already replaced by
new stuff. It's a circular buffer which gets overwritten.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-02 19:16               ` Thomas Gleixner
@ 2008-10-02 20:31                 ` Ilpo Järvinen
  2008-10-02 20:59                   ` Dâniel Fraga
  2008-10-02 20:56                 ` [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd) Dâniel Fraga
  1 sibling, 1 reply; 39+ messages in thread
From: Ilpo Järvinen @ 2008-10-02 20:31 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Thomas Gleixner, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1447 bytes --]

On Thu, 2 Oct 2008, Thomas Gleixner wrote:

> On Thu, 2 Oct 2008, Dâniel Fraga wrote:
> 
> > On Thu, 2 Oct 2008 14:08:14 +0200 (CEST)
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > > grmbl. tracer_enabled should have been ftrace_enabled of course.
> > 
> > 	Ok Thomas. I got a stall on my desktop.
> > 
> > 	It happens when I try to use sudo and it just stay stalled (I
> > think trying to access /dev/log or something like that).

Ah, I was kind of thinking that you were doing something net related 
using some raw socket thing (such as traceroute) when you earlier 
mentioned sudo (iirc, we never really discussed what you usually do with 
it when you see the stall).

> > 	I got the trace here:
> > 
> > http://www.abusar.org/ftrace/trace.txt.bz2
> > 
> > 	after the stall.
> > 
> > 	Could you please check if there's something wrong?
> > 
> > 	I'd prefer if the stall happened in the server, but it still
> > didn't happen. Anyway, it's the same problem.
> 
> Hmm, not sure. sudo is a different beast. All I can see in the trace
> is that reiserfs is turning itself in circles (which is nothing new),
> but there is nothing which might tell us about a stall. At the point
> where you stopped the trace the relevant info was already replaced by
> new stuff. It's a circular buffer which gets overwritten.

We might have had a change if there would have been a (failed) connection 
attempt in it but there seemed to be none.


-- 
 i.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-02 19:16               ` Thomas Gleixner
  2008-10-02 20:31                 ` Ilpo Järvinen
@ 2008-10-02 20:56                 ` Dâniel Fraga
  1 sibling, 0 replies; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-02 20:56 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Ilpo Järvinen, David Miller, Netdev

On Thu, 2 Oct 2008 21:16:32 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> Hmm, not sure. sudo is a different beast. All I can see in the trace
> is that reiserfs is turning itself in circles (which is nothing new),
> but there is nothing which might tell us about a stall. At the point
> where you stopped the trace the relevant info was already replaced by
> new stuff. It's a circular buffer which gets overwritten.

	Ok, no problem.

	I'll keep waiting for a stall on the server-side..

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-02 20:31                 ` Ilpo Järvinen
@ 2008-10-02 20:59                   ` Dâniel Fraga
  2008-10-02 21:19                     ` Ilpo Järvinen
  0 siblings, 1 reply; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-02 20:59 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Thomas Gleixner, David Miller, Netdev

On Thu, 2 Oct 2008 23:31:28 +0300 (EEST)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> Ah, I was kind of thinking that you were doing something net related 
> using some raw socket thing (such as traceroute) when you earlier 
> mentioned sudo (iirc, we never really discussed what you usually do with 
> it when you see the stall).

	I read nntp groups everyday and I use leafnode. The stall on my
desktop happens when I run fetchnews (which must be run as user "news"
so I have to use sudo). Usually the stall on my desktop happens when
I'm using fetchnews or some other command run by sudo.

> We might have had a change if there would have been a (failed) connection 
> attempt in it but there seemed to be none.

	Ok, I'll wait for the breakpoint in the patch Thomas provided.
Now we need just a bit of patience ;).

	Thanks everybody.


-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-02 20:59                   ` Dâniel Fraga
@ 2008-10-02 21:19                     ` Ilpo Järvinen
  2008-10-02 21:39                       ` Dâniel Fraga
  2008-10-05 17:52                       ` Dâniel Fraga
  0 siblings, 2 replies; 39+ messages in thread
From: Ilpo Järvinen @ 2008-10-02 21:19 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Thomas Gleixner, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 873 bytes --]

On Thu, 2 Oct 2008, Dâniel Fraga wrote:

> On Thu, 2 Oct 2008 23:31:28 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> 
> > Ah, I was kind of thinking that you were doing something net related 
> > using some raw socket thing (such as traceroute) when you earlier 
> > mentioned sudo (iirc, we never really discussed what you usually do with 
> > it when you see the stall).
> 
> 	I read nntp groups everyday and I use leafnode. The stall on my
> desktop happens when I run fetchnews (which must be run as user "news"
> so I have to use sudo). Usually the stall on my desktop happens when
> I'm using fetchnews or some other command run by sudo.

Yes yes, I'd be specificly interested to have some rough idea what that
"some other command" includes, especially as some seem to be non-net 
related since you mentioned e.g., that /dev/log thing?


-- 
 i.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-02 21:19                     ` Ilpo Järvinen
@ 2008-10-02 21:39                       ` Dâniel Fraga
  2008-10-05 17:52                       ` Dâniel Fraga
  1 sibling, 0 replies; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-02 21:39 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Thomas Gleixner, David Miller, Netdev

On Fri, 3 Oct 2008 00:19:08 +0300 (EEST)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> Yes yes, I'd be specificly interested to have some rough idea what that
> "some other command" includes, especially as some seem to be non-net 
> related since you mentioned e.g., that /dev/log thing?

	Ah ok! :)

	Yes, please, take a look at this thread:

http://www.sudo.ws/pipermail/sudo-workers/2008-September/000626.html

	We use strace too and I noticed that it stalled when trying to
access /dev/log (at the end). Does it make sense?

	The sudo guy says it's an UDP connection, so this could explain
why I can't restore it with nmap -sS... 

	At the end, he thinks it could be a kernel issue... (what I
think too, because it never happened before 2.6.25).

	Any ideas?


-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-02 21:19                     ` Ilpo Järvinen
  2008-10-02 21:39                       ` Dâniel Fraga
@ 2008-10-05 17:52                       ` Dâniel Fraga
  2008-10-05 18:09                         ` Thomas Gleixner
  1 sibling, 1 reply; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-05 17:52 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Thomas Gleixner, David Miller, Netdev

On Fri, 3 Oct 2008 00:19:08 +0300 (EEST)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> On Thu, 2 Oct 2008, Dâniel Fraga wrote:
> 
> > On Thu, 2 Oct 2008 23:31:28 +0300 (EEST)
> > "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> > 
> > > Ah, I was kind of thinking that you were doing something net related 
> > > using some raw socket thing (such as traceroute) when you earlier 
> > > mentioned sudo (iirc, we never really discussed what you usually do with 
> > > it when you see the stall).
> > 
> > 	I read nntp groups everyday and I use leafnode. The stall on my
> > desktop happens when I run fetchnews (which must be run as user "news"
> > so I have to use sudo). Usually the stall on my desktop happens when
> > I'm using fetchnews or some other command run by sudo.
> 
> Yes yes, I'd be specificly interested to have some rough idea what that
> "some other command" includes, especially as some seem to be non-net 
> related since you mentioned e.g., that /dev/log thing?

	Here we go folks, good and bad news:

1) the good news is that the Thomas' patch detected the stall, but...

2)

Oct  5 14:23:03 teleporto vmunix: Tracer stopped
Oct  5 14:23:13 teleporto last message repeated 2 times

	if it had really stopped, the message wouldn't be repeated 2
times.

	and

fraga@teleporto ~$ cat /debug/tracing/tracing_enabled 
1

	In other words, the tracing wasn't really stopped.

	So I couldn't get what Thomas want. Should I activated the
initial auto-checking of ftrace to see if it's really working? Very
strange the fact it wasn't stopped.

	***

	I was waiting for 2 days to a stall and I managed to get the
stall by using "sudo" multiple times. It seems that sudo really makes
this bug hit.
	
	And now? Should we revise Thomas patch? Maybe a bug in ftrace
code that prevents it from being stopped? It seems that:

ftrace_enabled = 0;

	isn't really stopping the tracing.

	Thank you.

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-05 17:52                       ` Dâniel Fraga
@ 2008-10-05 18:09                         ` Thomas Gleixner
  2008-10-05 19:31                           ` Dâniel Fraga
                                             ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Thomas Gleixner @ 2008-10-05 18:09 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Ilpo Järvinen, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2221 bytes --]

n Sun, 5 Oct 2008, Dâniel Fraga wrote:

> On Fri, 3 Oct 2008 00:19:08 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> 
> > On Thu, 2 Oct 2008, Dâniel Fraga wrote:
> > 
> > > On Thu, 2 Oct 2008 23:31:28 +0300 (EEST)
> > > "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> > > 
> > > > Ah, I was kind of thinking that you were doing something net related 
> > > > using some raw socket thing (such as traceroute) when you earlier 
> > > > mentioned sudo (iirc, we never really discussed what you usually do with 
> > > > it when you see the stall).
> > > 
> > > 	I read nntp groups everyday and I use leafnode. The stall on my
> > > desktop happens when I run fetchnews (which must be run as user "news"
> > > so I have to use sudo). Usually the stall on my desktop happens when
> > > I'm using fetchnews or some other command run by sudo.
> > 
> > Yes yes, I'd be specificly interested to have some rough idea what that
> > "some other command" includes, especially as some seem to be non-net 
> > related since you mentioned e.g., that /dev/log thing?
> 
> 	Here we go folks, good and bad news:
> 
> 1) the good news is that the Thomas' patch detected the stall, but...
> 
> 2)
> 
> Oct  5 14:23:03 teleporto vmunix: Tracer stopped
> Oct  5 14:23:13 teleporto last message repeated 2 times
> 
> 	if it had really stopped, the message wouldn't be repeated 2
> times.
> 
> 	and
> 
> fraga@teleporto ~$ cat /debug/tracing/tracing_enabled 
> 1
> 
> 	In other words, the tracing wasn't really stopped.
> 
> 	So I couldn't get what Thomas want. Should I activated the
> initial auto-checking of ftrace to see if it's really working? Very
> strange the fact it wasn't stopped.
> 
> 	***
> 
> 	I was waiting for 2 days to a stall and I managed to get the
> stall by using "sudo" multiple times. It seems that sudo really makes
> this bug hit.
> 	
> 	And now? Should we revise Thomas patch? Maybe a bug in ftrace
> code that prevents it from being stopped? It seems that:
> 
> ftrace_enabled = 0;
> 
> 	isn't really stopping the tracing.

did you get the content of /debug/tracing/trace nevertheless ?

The cat /debug/tracing/tracing_enabled == 1 might be just a user space
interface thingy.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-05 18:09                         ` Thomas Gleixner
@ 2008-10-05 19:31                           ` Dâniel Fraga
  2008-10-06  1:27                           ` Dâniel Fraga
  2008-10-08 19:34                           ` Dâniel Fraga
  2 siblings, 0 replies; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-05 19:31 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Ilpo Järvinen, David Miller, Netdev

On Sun, 5 Oct 2008 20:09:22 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> did you get the content of /debug/tracing/trace nevertheless ?

	Yes! I got several... maybe one is valid....

> The cat /debug/tracing/tracing_enabled == 1 might be just a user space
> interface thingy.

	Ok, here they are (I hope one of them is correct):

http://www.abusar.org/ftrace/

	These ones I get after I saw the kernel message "Tracer
stopped".

http://www.abusar.org/ftrace/trace.txt.bz2

http://www.abusar.org/ftrace/trace2.txt.bz2

http://www.abusar.org/ftrace/trace3.txt.bz2

http://www.abusar.org/ftrace/trace4.txt.bz2

http://www.abusar.org/ftrace/trace5.txt.bz2

	These ones were during a stall and after a nmap:

http://www.abusar.org/ftrace/trace-stop.txt.bz2

http://www.abusar.org/ftrace/trace-stop2.txt.bz2

	If none of these is valid, I will use monit to
check for Tracer stopped and force the tracing to be disabled.

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-05 18:09                         ` Thomas Gleixner
  2008-10-05 19:31                           ` Dâniel Fraga
@ 2008-10-06  1:27                           ` Dâniel Fraga
  2008-10-08 19:34                           ` Dâniel Fraga
  2 siblings, 0 replies; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-06  1:27 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Ilpo Järvinen, David Miller, Netdev

On Sun, 5 Oct 2008 20:09:22 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> did you get the content of /debug/tracing/trace nevertheless ?
> 
> The cat /debug/tracing/tracing_enabled == 1 might be just a user space
> interface thingy.

	Hi Thomas, I added this one too (from my desktop):

http://www.abusar.org/ftrace/trace-desktop.txt.bz2

	I don't know if it will help, but if you could look the entries
generated by sudo and fetchnews, they were stalled. sudo has a lot of:

            sudo-1081  [01] 34460.250986: _spin_lock+0x4/0x20 <ffffffff80453754> <-sys_close+0x3b/0x100 <ffffffff8029e06b>
            sudo-1081  [01] 34460.250986: sys_close+0x14/0x100 <ffffffff8029e044> <-system_call_fastpath+0x16/0x1b <ffffffff8020bd1b>
            sudo-1081  [01] 34460.250987: _spin_lock+0x4/0x20 <ffffffff80453754> <-sys_close+0x3b/0x100 <ffffffff8029e06b>
            sudo-1081  [01] 34460.250987: sys_close+0x14/0x100 <ffffffff8029e044> <-system_call_fastpath+0x16/0x1b <ffffffff8020bd1b>
            sudo-1081  [01] 34460.250987: _spin_lock+0x4/0x20 <ffffffff80453754> <-sys_close+0x3b/0x100 <ffffffff8029e06b>
            sudo-1081  [01] 34460.250987: sys_close+0x14/0x100 <ffffffff8029e044> <-system_call_fastpath+0x16/0x1b <ffffffff8020bd1b>
            sudo-1081  [01] 34460.250987: _spin_lock+0x4/0x20 <ffffffff80453754> <-sys_close+0x3b/0x100 <ffffffff8029e06b>
            sudo-1081  [01] 34460.250988: sys_close+0x14/0x100 <ffffffff8029e044> <-system_call_fastpath+0x16/0x1b <ffffffff8020bd1b>
            sudo-1081  [01] 34460.250988: _spin_lock+0x4/0x20 <ffffffff80453754> <-sys_close+0x3b/0x100 <ffffffff8029e06b>

	I don't know if it's normal...

	Thank you.


-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-05 18:09                         ` Thomas Gleixner
  2008-10-05 19:31                           ` Dâniel Fraga
  2008-10-06  1:27                           ` Dâniel Fraga
@ 2008-10-08 19:34                           ` Dâniel Fraga
  2008-10-08 19:56                             ` Thomas Gleixner
  2 siblings, 1 reply; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-08 19:34 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Ilpo Järvinen, David Miller, Netdev

On Sun, 5 Oct 2008 20:09:22 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> The cat /debug/tracing/tracing_enabled == 1 might be just a user space
> interface thingy.
> 
> Thanks,

	Thomas, Ilpo etc, I got a trace exactly after the patch
provided by Thomas detected the stall:

http://www.abusar.org/ftrace/trace-final.txt.bz2

	Thomas, since you didn't reply to the previous messages, I
assume the previous traces were invalid. But could you please check
this one?

	It would be wise to search for "monit" in the trace, since
monit activated the script "stop-tracing.sh" which stops the trace and
dump the trace output to the file. So what probably interest you is
what comes *before* "monit".

	"monit" appears for the first time in line 16921.

	To make it more clear, I configured monit with the following:

check file syslog with path /var/adm/messages
	if match "Tracer stopped" then exec "/home/fraga/stop-tracing.sh"

	And the stop-tracing.sh script is just that:

#!/bin/sh
/usr/local/bin/echo 0 > /debug/tracing/tracing_enabled
/usr/local/bin/cat /debug/tracing/trace > /home/fraga/trace.txt

	So I expect that this trace is right now.

	If monit was too slow in detecting the "Tracer stopped" log entry, I'll 
use a Perl script which is faster.

	Thanks, I wait for your reply.

	Ps: flood_mail.pl is just a perl script I use to prevent people from flooding
my smtp server with mail.

	Ps2: I just don't know if it would be better to stop the tracing *before*
dump the trace or it would be better to dump the trace without stopping it... I noticed
that even the "echo" I use to stop the tracing, generated a lot of useless tracing information.

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-08 19:34                           ` Dâniel Fraga
@ 2008-10-08 19:56                             ` Thomas Gleixner
  2008-10-08 20:44                               ` Ilpo Järvinen
  2008-10-13  4:08                               ` Dâniel Fraga
  0 siblings, 2 replies; 39+ messages in thread
From: Thomas Gleixner @ 2008-10-08 19:56 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Ilpo Järvinen, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 766 bytes --]

On Wed, 8 Oct 2008, Dâniel Fraga wrote:

> On Sun, 5 Oct 2008 20:09:22 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > The cat /debug/tracing/tracing_enabled == 1 might be just a user space
> > interface thingy.
> > 
> > Thanks,
> 
> 	Thomas, Ilpo etc, I got a trace exactly after the patch
> provided by Thomas detected the stall:
> 
> http://www.abusar.org/ftrace/trace-final.txt.bz2
> 
> 	Thomas, since you didn't reply to the previous messages, I
> assume the previous traces were invalid. But could you please check
> this one?

Will do. Sorry was distracted. Steven Rostedt posted a patch to stop
the tracer really from the kernel, I need to verify that it works on
.27. Will do tomorrow when my brain is more awake than now.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-08 19:56                             ` Thomas Gleixner
@ 2008-10-08 20:44                               ` Ilpo Järvinen
  2008-10-10  4:20                                 ` Dâniel Fraga
  2008-10-13  4:08                               ` Dâniel Fraga
  1 sibling, 1 reply; 39+ messages in thread
From: Ilpo Järvinen @ 2008-10-08 20:44 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Dâniel Fraga, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2002 bytes --]

On Wed, 8 Oct 2008, Thomas Gleixner wrote:

> On Wed, 8 Oct 2008, Dâniel Fraga wrote:
> 
> > On Sun, 5 Oct 2008 20:09:22 +0200 (CEST)
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > > The cat /debug/tracing/tracing_enabled == 1 might be just a user space
> > > interface thingy.
> > > 
> > > Thanks,
> > 
> > 	Thomas, Ilpo etc, I got a trace exactly after the patch
> > provided by Thomas detected the stall:
> > 
> > http://www.abusar.org/ftrace/trace-final.txt.bz2
> > 
> > 	Thomas, since you didn't reply to the previous messages, I
> > assume the previous traces were invalid. But could you please check
> > this one?
>
> Will do. Sorry was distracted. Steven Rostedt posted a patch to stop
> the tracer really from the kernel, I need to verify that it works on
> .27. Will do tomorrow when my brain is more awake than now.

I hope we're not chasing a partial ghost btw since there seem to be,
after all, two "independent" failures which look similar, one at the 
server and the other at the client. So the strace we were looking earlier 
_might_ actually be from a case where the client was stalling instead of 
the server (we don't have /proc/net/tcp of that case so I cannot 
decide for sure though I might as well be wrong and it was a valid 
server case straced)... Anyway, this ftrace may still reveal something as 
it's clearly triggering due to server side listenoverflow which is known 
to happen during the server side event but the thing we want to find might 
be something else than what strace was giving.

Is there some compatible kd tool for ftrace that got included? I only 
found one from a kft patch but I suppose the format is different or 
something as I got no output from it, or I'm just stupid and don't 
understand how I'm supposed to use it with the logs Daniel gathered :-).

I'll try to take a look as well tomorrow or Friday but I'm pretty busy
already and will be travelling starting from Sunday to Wednesday so my 
responses might get a bit delayed.


-- 
 i.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-08 20:44                               ` Ilpo Järvinen
@ 2008-10-10  4:20                                 ` Dâniel Fraga
  2008-10-10  8:10                                   ` Thomas Gleixner
  2008-10-11  8:27                                   ` Ilpo Järvinen
  0 siblings, 2 replies; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-10  4:20 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Thomas Gleixner, David Miller, Netdev

On Wed, 8 Oct 2008 23:44:39 +0300 (EEST)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> the server (we don't have /proc/net/tcp of that case so I cannot 

	Ok, now I got a trace that I think is the best one I got so far:

http://www.abusar.org/ftrace/trace-best.txt.bz2

	and the requested /proc/net/tcp by the time the Tracer was
stopped:

http://www.abusar.org/ftrace/trace-tcp.txt

	I hope that now it helps.

	As always, if you need more information, just ask.

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-10  4:20                                 ` Dâniel Fraga
@ 2008-10-10  8:10                                   ` Thomas Gleixner
  2008-10-11  8:27                                   ` Ilpo Järvinen
  1 sibling, 0 replies; 39+ messages in thread
From: Thomas Gleixner @ 2008-10-10  8:10 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Ilpo Järvinen, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 558 bytes --]

On Fri, 10 Oct 2008, Dâniel Fraga wrote:

> On Wed, 8 Oct 2008 23:44:39 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> 
> > the server (we don't have /proc/net/tcp of that case so I cannot 
> 
> 	Ok, now I got a trace that I think is the best one I got so far:
> 
> http://www.abusar.org/ftrace/trace-best.txt.bz2

Nothing which helps us really :( We stop the trace too late
and probably we need some dedicated tracepoints to see something
useful. I'm away over the weekend, but I might have some time to look
into that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-10  4:20                                 ` Dâniel Fraga
  2008-10-10  8:10                                   ` Thomas Gleixner
@ 2008-10-11  8:27                                   ` Ilpo Järvinen
  2008-10-11 13:38                                     ` Dâniel Fraga
  1 sibling, 1 reply; 39+ messages in thread
From: Ilpo Järvinen @ 2008-10-11  8:27 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Thomas Gleixner, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 845 bytes --]

On Fri, 10 Oct 2008, Dâniel Fraga wrote:

> On Wed, 8 Oct 2008 23:44:39 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> 
> > the server (we don't have /proc/net/tcp of that case so I cannot 
> 
> 	Ok, now I got a trace that I think is the best one I got so far:
> 
> http://www.abusar.org/ftrace/trace-best.txt.bz2
> 
> 	and the requested /proc/net/tcp by the time the Tracer was
> stopped:
> 
> http://www.abusar.org/ftrace/trace-tcp.txt
> 
> 	I hope that now it helps.
> 
> 	As always, if you need more information, just ask.

This time there isn't any listening socket with queue to be accepted, not 
connections. So it's not exactly the same situation on the server, I'm 
sort of thinking that the accept() stuff we found earlier is just 
coincidence, I guess it could as well hang somewhere else on other
times.

-- 
 i.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-11  8:27                                   ` Ilpo Järvinen
@ 2008-10-11 13:38                                     ` Dâniel Fraga
  0 siblings, 0 replies; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-11 13:38 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Thomas Gleixner, David Miller, Netdev

On Sat, 11 Oct 2008 11:27:50 +0300 (EEST)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> This time there isn't any listening socket with queue to be accepted, not 
> connections. So it's not exactly the same situation on the server, I'm 
> sort of thinking that the accept() stuff we found earlier is just 
> coincidence, I guess it could as well hang somewhere else on other
> times.

	Ok, this was with 2.6.27-rc8. Now I'm using the final 2.6.27
version and, at least until now, no stall happened. 2.6.27 seems much
better than 2.6.25 and 2.6.26.

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-08 19:56                             ` Thomas Gleixner
  2008-10-08 20:44                               ` Ilpo Järvinen
@ 2008-10-13  4:08                               ` Dâniel Fraga
  2008-10-13 13:35                                 ` Ilpo Järvinen
  1 sibling, 1 reply; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-13  4:08 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Ilpo Järvinen, David Miller, Netdev

On Wed, 8 Oct 2008 21:56:00 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> Will do. Sorry was distracted. Steven Rostedt posted a patch to stop
> the tracer really from the kernel, I need to verify that it works on
> .27. Will do tomorrow when my brain is more awake than now.

	Hi Thomas, did you have time to verify that patch? If so, I'd
like to test it. Thanks.

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-13  4:08                               ` Dâniel Fraga
@ 2008-10-13 13:35                                 ` Ilpo Järvinen
  2008-10-13 18:32                                   ` Dâniel Fraga
  2008-10-17 23:51                                   ` Dâniel Fraga
  0 siblings, 2 replies; 39+ messages in thread
From: Ilpo Järvinen @ 2008-10-13 13:35 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Thomas Gleixner, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 639 bytes --]

On Mon, 13 Oct 2008, Dâniel Fraga wrote:

> On Wed, 8 Oct 2008 21:56:00 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > Will do. Sorry was distracted. Steven Rostedt posted a patch to stop
> > the tracer really from the kernel, I need to verify that it works on
> > .27. Will do tomorrow when my brain is more awake than now.
> 
> 	Hi Thomas, did you have time to verify that patch? If so, I'd
> like to test it. Thanks.

http://marc.info/?t=122365716200001&r=1&w=2

...might turn out related. It has somewhat similar problem though 
different syscall. Not enough info on what kernel version etc. 
there though.


-- 
 i.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-13 13:35                                 ` Ilpo Järvinen
@ 2008-10-13 18:32                                   ` Dâniel Fraga
  2008-10-13 18:48                                     ` Nicolas Cannasse
  2008-10-17 23:51                                   ` Dâniel Fraga
  1 sibling, 1 reply; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-13 18:32 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Thomas Gleixner, David Miller, Netdev, Nicolas Cannasse

On Mon, 13 Oct 2008 16:35:08 +0300 (EEST)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> http://marc.info/?t=122365716200001&r=1&w=2
> 
> ...might turn out related. It has somewhat similar problem though 
> different syscall. Not enough info on what kernel version etc. 
> there though.

	Interesting. I CCed Nicolas Cannasse.

	Nicolas, could you please tell us what kernel version were you
using when you deal with this problem (from the link above: "recv()
hangs until SIGCHLD ?")?

	Thank you.

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-13 18:32                                   ` Dâniel Fraga
@ 2008-10-13 18:48                                     ` Nicolas Cannasse
  0 siblings, 0 replies; 39+ messages in thread
From: Nicolas Cannasse @ 2008-10-13 18:48 UTC (permalink / raw)
  To: Dâniel Fraga
  Cc: Ilpo Järvinen, Thomas Gleixner, David Miller, Netdev

Dâniel Fraga a écrit :
> On Mon, 13 Oct 2008 16:35:08 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> 
>> http://marc.info/?t=122365716200001&r=1&w=2
>>
>> ...might turn out related. It has somewhat similar problem though 
>> different syscall. Not enough info on what kernel version etc. 
>> there though.
> 
> 	Interesting. I CCed Nicolas Cannasse.
> 
> 	Nicolas, could you please tell us what kernel version were you
> using when you deal with this problem (from the link above: "recv()
> hangs until SIGCHLD ?")?

2.6.20-15-server #2 SMP

Best,
Nicolas



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-13 13:35                                 ` Ilpo Järvinen
  2008-10-13 18:32                                   ` Dâniel Fraga
@ 2008-10-17 23:51                                   ` Dâniel Fraga
  2008-10-20 21:48                                     ` Ilpo Järvinen
  1 sibling, 1 reply; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-17 23:51 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Thomas Gleixner, David Miller, Netdev

On Mon, 13 Oct 2008 16:35:08 +0300 (EEST)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

	Just reporting another stall event... Who knows it helps in some way...

> ...might turn out related. It has somewhat similar problem though 
> different syscall. Not enough info on what kernel version etc. 
> there though.

	Well, I can be wrong, but I think those stalls are related to USB (or USB helped to 
"solve" the stall temporarily)... I explain below (I have a monitor with usb hub connected):

Oct 17 19:35:55 tux sudo: pam_unix(sudo:session): session opened for user root by fraga(uid=0)
Oct 17 19:35:55 tux sudo: pam_unix(sudo:session): session closed for user root

	^---------- sudo opened the session at 19:35

Oct 17 19:38:16 tux sudo:    fraga : TTY=pts/2 ; PWD=/usr/local/lib ; USER=root ; COMMAND=/usr/local/bin/rm libdvdread.a libdvdread.la libdvdread.so libdvdread.so.3 libdvdread.so.3.2.1 libdvdread.so.4 libdvdread
.so.4.1.2

	^---------- I tried to run this command with sudo. And it stalled. Notice that it was logged only on 19:38.

Oct 17 19:38:16 tux sudo: pam_unix(sudo:session): session opened for user root by fraga(uid=0)
Oct 17 19:38:16 tux sudo: pam_unix(sudo:session): session closed for user root
Oct 17 19:40:20 tux vmunix: ci000/000:21ub/-/-./-../-..:./nu/nu7

	^----------- Here is the stall. You can see that the log is corrupted, every other line.

Oct 17 19:40:20 tux vmunix: 6ipt S I 11 os Lgtc pia S os]o s-000:21112

	^----------- Every other line. Here is when I turned off my monitor with USB... Unless it was an extreme coincidence, I'm sure this event
of turning off and...

Oct 17 19:40:25 tux vmunix: sd 9:0:0:0: [sdb] Attached SCSI removable disk
Oct 17 19:40:25 tux scsi.agent[5498]: disk at /devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host9/target9:0:0/9:0:0:0

	^---------- ...turning on again, helped to "solve" the stall. The stall ended when I turned off the monitor.

Oct 17 19:40:28 tux sudo:    fraga : TTY=pts/2 ; PWD=/usr/local/lib ; USER=root ; COMMAND=/usr/local/bin/ls
Oct 17 19:40:28 tux sudo: pam_unix(sudo:session): session opened for user root by fraga(uid=0)
Oct 17 19:40:28 tux sudo: pam_unix(sudo:session): session closed for user root
Oct 17 19:40:30 tux scsi.agent[5310]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host9/type does not exist
Oct 17 19:40:35 tux scsi.agent[5493]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host9/target9:0:0/type does not exist
Oct 17 20:12:37 tux vmunix: usb-storage: device found at 14
Oct 17 20:12:37 tux vmunix: usb-storage: waiting for device to settle before scanning
Oct 17 20:12:42 tux vmunix: scsi 10:0:0:0: Direct-Access     Generic  ls SC   53 Q  NI 
Oct 17 20:12:42 tux vmunix: k
Oct 17 20:12:42 tux scsi.agent[7439]: disk at /devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host10/target10:0:0/10:0:0:0
Oct 17 20:12:47 tux vmunix: DROP INPUT: IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:15:e9:e2:00:db:08:00 SRC=192.168.0.1 DST=255.255.255.255 LEN=334 TOS=0x00 PREC=0x00 TTL=64 ID=6668 PROTO=UDP SPT=67 DPT=68 LEN=314 
Oct 17 20:12:47 tux scsi.agent[7252]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host10/type does not exist
Oct 17 20:12:53 tux scsi.agent[7435]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host10/target10:0:0/type does not exist

	***
	
	At the server I have an usb device connected too...

	OR at eh end it's just a symptom of the problem... I don't
know. The interesting thing is that as soon as I turned off the monitor 
(at the exact moment), the speaker beeped, because I used TAB to auto
complete my command, and I knew that sudo wasn't stalled
anymore.

	Maybe what the kernel does when we remove a usb hub could give us a hint about the stalling issue...

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-17 23:51                                   ` Dâniel Fraga
@ 2008-10-20 21:48                                     ` Ilpo Järvinen
  2008-10-20 21:49                                       ` Ilpo Järvinen
  2008-10-22  2:12                                       ` Dâniel Fraga
  0 siblings, 2 replies; 39+ messages in thread
From: Ilpo Järvinen @ 2008-10-20 21:48 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Thomas Gleixner, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4604 bytes --]

On Fri, 17 Oct 2008, Dâniel Fraga wrote:

> On Mon, 13 Oct 2008 16:35:08 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> 
> 	Just reporting another stall event... Who knows it helps in some way...
> 
> > ...might turn out related. It has somewhat similar problem though 
> > different syscall. Not enough info on what kernel version etc. 
> > there though.
> 
> 	Well, I can be wrong, but I think those stalls are related to USB (or USB helped to 
> "solve" the stall temporarily)... I explain below (I have a monitor with usb hub connected):
> 
> Oct 17 19:35:55 tux sudo: pam_unix(sudo:session): session opened for user root by fraga(uid=0)
> Oct 17 19:35:55 tux sudo: pam_unix(sudo:session): session closed for user root
> 
> 	^---------- sudo opened the session at 19:35
> 
> Oct 17 19:38:16 tux sudo:    fraga : TTY=pts/2 ; PWD=/usr/local/lib ; USER=root ; COMMAND=/usr/local/bin/rm libdvdread.a libdvdread.la libdvdread.so libdvdread.so.3 libdvdread.so.3.2.1 libdvdread.so.4 libdvdread
> .so.4.1.2
> 
> 	^---------- I tried to run this command with sudo. And it stalled. Notice that it was logged only on 19:38.
> 
> Oct 17 19:38:16 tux sudo: pam_unix(sudo:session): session opened for user root by fraga(uid=0)
> Oct 17 19:38:16 tux sudo: pam_unix(sudo:session): session closed for user root
> Oct 17 19:40:20 tux vmunix: ci000/000:21ub/-/-./-../-..:./nu/nu7
> 
> 	^----------- Here is the stall. You can see that the log is corrupted, every other line.
> 
> Oct 17 19:40:20 tux vmunix: 6ipt S I 11 os Lgtc pia S os]o s-000:21112
> 
> 	^----------- Every other line. Here is when I turned off my monitor with USB... Unless it was an extreme coincidence, I'm sure this event
> of turning off and...

You probably meant every other character... :-) The same as with the 
netfilter logs... I've two open question related to it, where did the 
every other letter go...? and why not a larger block but only a single 
char is missing (at a time)? I've no idea about printk's internals, maybe 
I'll try to ping somebody tomorrow about it.

> Oct 17 19:40:25 tux vmunix: sd 9:0:0:0: [sdb] Attached SCSI removable disk
> Oct 17 19:40:25 tux scsi.agent[5498]: disk at /devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host9/target9:0:0/9:0:0:0
> 
> 	^---------- ...turning on again, helped to "solve" the stall. The stall ended when I turned off the monitor.
> 
> Oct 17 19:40:28 tux sudo:    fraga : TTY=pts/2 ; PWD=/usr/local/lib ; USER=root ; COMMAND=/usr/local/bin/ls
> Oct 17 19:40:28 tux sudo: pam_unix(sudo:session): session opened for user root by fraga(uid=0)
> Oct 17 19:40:28 tux sudo: pam_unix(sudo:session): session closed for user root
> Oct 17 19:40:30 tux scsi.agent[5310]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host9/type does not exist
> Oct 17 19:40:35 tux scsi.agent[5493]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host9/target9:0:0/type does not exist
> Oct 17 20:12:37 tux vmunix: usb-storage: device found at 14
> Oct 17 20:12:37 tux vmunix: usb-storage: waiting for device to settle before scanning
> Oct 17 20:12:42 tux vmunix: scsi 10:0:0:0: Direct-Access     Generic  ls SC   53 Q  NI 
> Oct 17 20:12:42 tux vmunix: k
> Oct 17 20:12:42 tux scsi.agent[7439]: disk at /devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host10/target10:0:0/10:0:0:0
> Oct 17 20:12:47 tux vmunix: DROP INPUT: IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:15:e9:e2:00:db:08:00 SRC=192.168.0.1 DST=255.255.255.255 LEN=334 TOS=0x00 PREC=0x00 TTL=64 ID=6668 PROTO=UDP SPT=67 DPT=68 LEN=314 
> Oct 17 20:12:47 tux scsi.agent[7252]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host10/type does not exist
> Oct 17 20:12:53 tux scsi.agent[7435]: Attribute /sys/devices/pci0000:00/0000:00:02.1/usb2/2-1/2-1.1/2-1.1.1/2-1.1.1:1.0/host10/target10:0:0/type does not exist
> 
> 	***
> 	
> 	At the server I have an usb device connected too...
> 
> 	OR at eh end it's just a symptom of the problem... I don't
> know. The interesting thing is that as soon as I turned off the monitor 
> (at the exact moment), the speaker beeped, because I used TAB to auto
> complete my command, and I knew that sudo wasn't stalled
> anymore.
> 
> 	Maybe what the kernel does when we remove a usb hub could give us a hint about the stalling issue...

I doubt we'll find the cause from networking...

Trying something a bit more generic:

Try running vmstat 1 > somelog.file to see if there's something 
particularly interesting behavior during the stall.

-- 
 i.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-20 21:48                                     ` Ilpo Järvinen
@ 2008-10-20 21:49                                       ` Ilpo Järvinen
  2008-10-22  2:12                                       ` Dâniel Fraga
  1 sibling, 0 replies; 39+ messages in thread
From: Ilpo Järvinen @ 2008-10-20 21:49 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Thomas Gleixner, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 278 bytes --]

On Tue, 21 Oct 2008, Ilpo Järvinen wrote:

> Trying something a bit more generic:
> 
> Try running vmstat 1 > somelog.file to see if there's something 
> particularly interesting behavior during the stall.

Won't hurt doing this on both the server and the client... :-)

-- 
 i.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-20 21:48                                     ` Ilpo Järvinen
  2008-10-20 21:49                                       ` Ilpo Järvinen
@ 2008-10-22  2:12                                       ` Dâniel Fraga
  2008-10-30 10:43                                         ` Ilpo Järvinen
  1 sibling, 1 reply; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-22  2:12 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Thomas Gleixner, David Miller, Netdev

On Tue, 21 Oct 2008 00:48:16 +0300 (EEST)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> Try running vmstat 1 > somelog.file to see if there's something 
> particularly interesting behavior during the stall.

	Here it is. I used nmap more or less in the middle of this log.
The stall happened just on my client machine (stalled sudo) so there's only one 
log ok? Thanks.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0  52540  69928  63832 1001580    0    1    17    48   30  119 18  1 80  1
 0  0  52540  69844  63832 1001580    0    0     0     0  633  921  0  0 100  0
 0  0  52540  69844  63832 1001580    0    0     0   100  621  916  0  0 99  0
 1  0  52540  69844  63832 1001580    0    0     0     0  611  893  0  0 100  0
 0  0  52540  69716  63832 1001580    0    0     0     0  671  961  0  0 100  0
 1  0  52540  69552  63832 1001580    0    0     0     0  813  975  4  0 95  0
 0  0  52540  69596  63832 1001580    0    0     0     0  716  947  2  0 98  0
 0  0  52540  69596  63832 1001580    0    0     0    36  866 2047  2  2 96  0
 0  0  52540  69596  63832 1001580    0    0     0     0  688  964  0  0 100  0
 0  0  52540  69596  63832 1001580    0    0     0     0  721 1005  0  0 99  0
 0  0  52540  69720  63832 1001580    0    0     0     0  679  958  0  0 99  0
 0  0  52540  69720  63832 1001580    0    0     4     0  651  935  0  0 99  1
 0  0  52540  69760  63832 1001584    0    0     0    36  990 2105  2  2 96  0
 0  0  52540  69720  63832 1001584    0    0     0     0  869 1187  0  0 99  0
 0  0  52540  69720  63832 1001584    0    0     0     0  937 1272  0  0 100  0
 0  0  52540  69472  63832 1001584    0    0     0     0 1486 1702 17  3 79  0
 1  0  52540  68092  63832 1001604    0    0     0     0 2263 2553 36 13 50  0
 0  0  52540  68092  63832 1001604    0    0     0   336 1214 1223  3  4 92  0
 3  0  52540  67752  63832 1001604    0    0     0     0 1031 1192  1  0 99  0
 0  0  52540  67712  63832 1001604    0    0     0     0  988 1149  1  0 99  0
 0  0  52540  67712  63832 1001608    0    0     0     0  989 1127  0  0 100  0
 1  0  52540  67712  63832 1001608    0    0     0     0  985 1124  0  0 100  0
 0  0  52540  67752  63832 1001608    0    0     0   512 1149 1106  0  0 99  0
 2  0  52540  67544  63832 1001608    0    0     0     0 1463 1329  7  7 86  0
 2  0  52540  66496  63832 1001612    0    0     0     0 1755 2439 21  5 74  0
 0  0  52540  66332  63832 1001620    0    0     0     0 1597 1534 23 12 65  0
 0  0  52540  65992  63836 1001620    0    0    72     0  861 1155  0  0 97  2
 0  0  52540  65952  63836 1001688    0    0     0   304  819 1111  1  0 98  0
 0  0  52540  65704  63840 1001696    0    0    12     0  847 1194  1  0 97  1
 0  0  52540  65704  63840 1001696    0    0     0     0  797 1123  0  0 99  0
 2  0  52540  65620  63840 1001696    0    0     0     0  827 1155  1  0 99  0
 0  0  52540  65580  63840 1001696    0    0     0     0  747 1033  0  0 100  0
 0  0  52540  65496  63840 1001696    0    0     0    84  969 2128  2  3 95  0
 0  0  52540  64368  63924 1002608  480    0  1444   148 1593 1471 10  5 57 28
 0  0  52540  64480  63924 1002600    0    0     0     0  854 1141  1  1 98  0
 1  0  52540  64480  63924 1002604    0    0    12     0 1517 1409 18  1 81  0
 0  0  52540  64476  63924 1002624    0    0     0     0 1717 1331 27  1 72  0
 0  0  52540  64268  63980 1002600    0    0    56   300 1816 1543 17 14 66  3
 0  0  52540  64240  64004 1002616    0    0    24     0  805 1063  2  0 93  4
 0  0  52540  64240  64020 1002628    0    0    16     0  877 1033  1  0 96  3
 1  0  52540  64272  64048 1002632    0    0    28     0 1202 2408  3  3 90  3
 0  0  52540  64232  64068 1002640    0    0    20     0  961 1121  2  0 93  4
 0  0  52540  64232  64076 1002664    0    0     8   288  975 1197  1  0 97  2
 1  0  52540  64232  64100 1002668    0    0    24     0  974 1247  1  0 93  6
 0  0  52540  64232  64100 1002676    0    0     0     0  961 1177  1  0 98  0
 0  0  52540  64108  64300 1002684    0    0   200     0 1157 1337  4  0 74 21
 1  0  52540  64108  64328 1002668    0    0    28     0  909 1156  1  1 92  6
 0  0  52540  64108  64352 1002688    0    0    24   244  879 1117  1  0 94  4
 0  1  52540  64024  64376 1002712    0    0    24     0  932 1171  1  0 94  4
 0  0  52540  63984  64396 1002728    0    0    20     0  974 1190  3  1 91  4
 1  0  52540  63900  64412 1002732    0    0    16     0  908 1143  1  0 96  3
 0  0  52540  63860  64436 1002760    0    0    24     0  857 1105  0  0 92  7
 0  0  52540  63608  64444 1002796    0    0     4   380  961 1129  1  1 97  1
 0  0  52540  63612  64456 1002824    0    0    12     0  881 1088  2  0 95  3

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-22  2:12                                       ` Dâniel Fraga
@ 2008-10-30 10:43                                         ` Ilpo Järvinen
  2008-10-30 18:16                                           ` Dâniel Fraga
  2008-11-02  5:56                                           ` [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd) [SOLVED] Dâniel Fraga
  0 siblings, 2 replies; 39+ messages in thread
From: Ilpo Järvinen @ 2008-10-30 10:43 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Thomas Gleixner, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1487 bytes --]

On Wed, 22 Oct 2008, Dâniel Fraga wrote:

> On Tue, 21 Oct 2008 00:48:16 +0300 (EEST)
> "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> 
> > Try running vmstat 1 > somelog.file to see if there's something 
> > particularly interesting behavior during the stall.
> 
> 	Here it is. I used nmap more or less in the middle of this log.
> The stall happened just on my client machine (stalled sudo) so there's only one 
> log ok? Thanks.

Nothing particularly interesting in it (sorry for the delay, there were 
some other bugs in the meantime to fix and partially because I was out 
of ideas). Minor increase in cpu/wa after nmap, before that it was 0-2, 
mostly zero.

Another thing that occurred to me today is that all these events I know of 
are related to syslog/klogd (I don't know too well how they relate to each 
other and what functionality is provided by each, thus I refer them here 
as one :-)).

- sudo sendmsg'ed to /dev/log and stalled
- every-2nd-char missing in printk relates to it
- usb does cause printks
- server processes likely do some logging (and can then hang on waiting
  until that finishes and therefore eventually no worker is available to 
  process requests)

...Some other type of case I'm missing here?

Perhaps we could try to solve it though stracing syslogd...

Did you ever actually try btw with something else than NO_TICK, we 
discussed it at some point of time but I'm not sure what was the end 
result of what got tested and what not?

-- 
 i.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd)
  2008-10-30 10:43                                         ` Ilpo Järvinen
@ 2008-10-30 18:16                                           ` Dâniel Fraga
  2008-11-02  5:56                                           ` [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd) [SOLVED] Dâniel Fraga
  1 sibling, 0 replies; 39+ messages in thread
From: Dâniel Fraga @ 2008-10-30 18:16 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Thomas Gleixner, David Miller, Netdev

On Thu, 30 Oct 2008 12:43:05 +0200 (EET)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> Nothing particularly interesting in it (sorry for the delay, there were 
> some other bugs in the meantime to fix and partially because I was out 
> of ideas). Minor increase in cpu/wa after nmap, before that it was 0-2, 
> mostly zero.

	Hi Ilpo. No problem, it's better late than never ;)

> Another thing that occurred to me today is that all these events I know of 
> are related to syslog/klogd (I don't know too well how they relate to each 
> other and what functionality is provided by each, thus I refer them here 
> as one :-)).
> 
> - sudo sendmsg'ed to /dev/log and stalled
> - every-2nd-char missing in printk relates to it
> - usb does cause printks
> - server processes likely do some logging (and can then hang on waiting
>   until that finishes and therefore eventually no worker is available to 
>   process requests)
> 
> ...Some other type of case I'm missing here?

	Yes, it is that. Anyway I'm talking directly to netfilter
people and they suggested that I use ulogd to capture the exact packet
when the stall happens using pcap. I'll try that. If they can figure it
out what happens to make the log prints every other char, maybe we have
a hint.

> Perhaps we could try to solve it though stracing syslogd...

	Yes I can trace it.

> Did you ever actually try btw with something else than NO_TICK, we 
> discussed it at some point of time but I'm not sure what was the end 
> result of what got tested and what not?

	I tried with many options, with TICK enabled and disabled, high
res timer enabled and disabled and other options in the kernel, but it
didn't make any difference.

	I'll try with netfilter people and if i get something, I
will reply ok?

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd) [SOLVED]
  2008-10-30 10:43                                         ` Ilpo Järvinen
  2008-10-30 18:16                                           ` Dâniel Fraga
@ 2008-11-02  5:56                                           ` Dâniel Fraga
  2008-11-03 15:37                                             ` Ilpo Järvinen
  1 sibling, 1 reply; 39+ messages in thread
From: Dâniel Fraga @ 2008-11-02  5:56 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Thomas Gleixner, David Miller, Netdev

On Thu, 30 Oct 2008 12:43:05 +0200 (EET)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> Perhaps we could try to solve it though stracing syslogd...

	Well Ilpo, you're right, what I'm about to write here will make
me very ashamed, but the truth must be told! The culprit was syslogd!
Almost unbeliavable, but I had been using and old syslogd version for
about 5 years!

	How can I'm sure that it's syslogd's fault? Simply, because I
had a stall today and when I killed syslogd everything was back to
normal.

	Well, I reinstalled GNU inetutils 1.5 (which I had already
installed before), but I don't know why it put syslogd
in /usr/local/libexec directory.

	But no problem. I'll just wait a few more days to test if
syslogd is the only responsible for this, but I'm 90% sure it is.

	So, just posting this, so if someone, who knows, some day, have
a similar problem, can read this message and avoid all the problems I
had.

	I apologize for thinking that it was a kernel fault. Anyway,
one more lesson I learned: do not keep old binaries lying around... ;)

	Thanks everyone, mainly Ilpo for giving me all tools to reach 
to this point.

	Ps: just for curiosity, I was using a syslogd binary from Mar,
3, 2003! Extremely old! This is so old, it was compiled for Linux
2.2.5. Or maybe I was too lazy and copied it from another machine...

syslogd: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for
GNU/Linux 2.2.5, dynamically linked (uses shared libs), not stripped

	Ps2: I'll close the bug I opened on bugzilla.

	Ps3: anyway, it's interesting how a small piece of the system
(syslogd) can generate those kinds of problems... I mean, a simple
error on syslogd could lead to a complete stall on connections, just
because everything is waiting for it to log through /dev/log. Of course
the problem was the binary, but it could have a time out, so even if it
was in fact a buggy syslogd, it won't cause such a stall on the
system. I really don't know what changed from 2.6.24 to 2.6.25, but
maybe 2.6.24 had such a timeout? Maybe I'm just silly writing that...
you guys know much more than me.

	Ps4: maybe now we can understand why nmap solved the issue...

-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd) [SOLVED]
  2008-11-02  5:56                                           ` [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd) [SOLVED] Dâniel Fraga
@ 2008-11-03 15:37                                             ` Ilpo Järvinen
  2008-11-03 17:03                                               ` Dâniel Fraga
  0 siblings, 1 reply; 39+ messages in thread
From: Ilpo Järvinen @ 2008-11-03 15:37 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Thomas Gleixner, David Miller, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3415 bytes --]

On Sun, 2 Nov 2008, Dâniel Fraga wrote:

> On Thu, 30 Oct 2008 12:43:05 +0200 (EET)
> "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> 
> > Perhaps we could try to solve it though stracing syslogd...
> 
> 	Well Ilpo, you're right, what I'm about to write here will make
> me very ashamed, but the truth must be told! The culprit was syslogd!
> Almost unbeliavable, but I had been using and old syslogd version for
> about 5 years!
> 
> 	How can I'm sure that it's syslogd's fault? Simply, because I
> had a stall today and when I killed syslogd everything was back to
> normal.

Once there's any kind of flow control, anything jamming downstream will 
eventually make upstream to stall as well (or to appear as not working 
as expected. Sadly, it's exactly opposite from correctness point of view 
as flow control is a feature in TCP, not a bug :-)). Thus I occassionally 
run to these tcp with flow control not working reports which turn to be 
totally unrelated.

This still doesn't explain everything though afaik... E.g., why did the 
sendto() to SOCK_DGRAM socket hung.

> 	But no problem. I'll just wait a few more days to test if
> syslogd is the only responsible for this, but I'm 90% sure it is.

And you had the same old syslogd on both hosts?

In any case the loss of every other character deterministically sounds 
like a real bug in the syslogd since it doesn't make too much sense to 
happen in kernel->syslogd communication (where I'd expect it to not show 
up in such consistent pattern but would cause more randomness).

> 	I apologize for thinking that it was a kernel fault.

It's not clear what caused this to happen _now_, nor the exact mechanism.

> 	Ps: just for curiosity, I was using a syslogd binary from Mar,
> 3, 2003! Extremely old! This is so old, it was compiled for Linux
> 2.2.5. Or maybe I was too lazy and copied it from another machine...

In theory this shouldn't be too big problem, but I'm hardly an expert of 
those things and syslogd is anyway more thightly coupled to kernel than 
some random app.

> 	Ps3: anyway, it's interesting how a small piece of the system
> (syslogd) can generate those kinds of problems... I mean, a simple
> error on syslogd could lead to a complete stall on connections, just
> because everything is waiting for it to log through /dev/log.

This is more of a philosophical question than something else... it's 
always balancing between data loss (=possibly losing a logline of an 
important event) or possibility of a stall. But this shouldn't be a 
concern in the case where SOCK_DGRAM was used by the sudo (like in the 
strace you sent to sudo people), in general UDP doesn't guarantee 
reliability so not delivering wouldn't be a problem but I don't know if 
PF_FILE domain does something otherwise in there.

> Of course
> the problem was the binary, but it could have a time out, so even if it
> was in fact a buggy syslogd, it won't cause such a stall on the
> system. I really don't know what changed from 2.6.24 to 2.6.25, but
> maybe 2.6.24 had such a timeout? Maybe I'm just silly writing that...
> you guys know much more than me.

Until we know more details than that killing syslogd helped it's hard to 
tell what is the actual cause. And I have no clue about semantics of 
/dev/log anyway.

> 	Ps4: maybe now we can understand why nmap solved the issue...

Not very clear but at least sudo does some writing there too.

-- 
 i.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd) [SOLVED]
  2008-11-03 15:37                                             ` Ilpo Järvinen
@ 2008-11-03 17:03                                               ` Dâniel Fraga
  0 siblings, 0 replies; 39+ messages in thread
From: Dâniel Fraga @ 2008-11-03 17:03 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Thomas Gleixner, David Miller, Netdev

On Mon, 3 Nov 2008 17:37:09 +0200 (EET)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> Once there's any kind of flow control, anything jamming downstream will 
> eventually make upstream to stall as well (or to appear as not working 
> as expected. Sadly, it's exactly opposite from correctness point of view 
> as flow control is a feature in TCP, not a bug :-)). Thus I occassionally 
> run to these tcp with flow control not working reports which turn to be 
> totally unrelated.
> 
> This still doesn't explain everything though afaik... E.g., why did the 
> sendto() to SOCK_DGRAM socket hung.

	Well, the fact that the problem happened since 2.6.25 kernel
make me believe that it could exist a possible kernel issue too, but I
think that most part was caused by syslogd.

> And you had the same old syslogd on both hosts?

	Yes. My desktop and server have the same installation.

> In any case the loss of every other character deterministically sounds 
> like a real bug in the syslogd since it doesn't make too much sense to 
> happen in kernel->syslogd communication (where I'd expect it to not show 
> up in such consistent pattern but would cause more randomness).

	Yes. With the new compiled syslogd it doesn't happen anymore.
And I don't have stall too.

> It's not clear what caused this to happen _now_, nor the exact mechanism.

	Ok.

> This is more of a philosophical question than something else... it's 
> always balancing between data loss (=possibly losing a logline of an 
> important event) or possibility of a stall. But this shouldn't be a 
> concern in the case where SOCK_DGRAM was used by the sudo (like in the 
> strace you sent to sudo people), in general UDP doesn't guarantee 
> reliability so not delivering wouldn't be a problem but I don't know if 
> PF_FILE domain does something otherwise in there.

	I see.

> Until we know more details than that killing syslogd helped it's hard to 
> tell what is the actual cause. And I have no clue about semantics of 
> /dev/log anyway.

	Ok. Anyway, at least the problem was registered and if in the
future we have something related, maybe this can help someone.


-- 

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2008-11-03 17:03 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-10-01 12:52 [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd) Ilpo Järvinen
2008-10-01 16:27 ` Dâniel Fraga
2008-10-01 20:05   ` Thomas Gleixner
2008-10-01 20:46     ` Dâniel Fraga
2008-10-01 21:14     ` Dâniel Fraga
2008-10-02 10:42       ` Thomas Gleixner
2008-10-02 11:17         ` Dâniel Fraga
2008-10-02 12:08           ` Thomas Gleixner
2008-10-02 14:09             ` Dâniel Fraga
2008-10-02 19:16               ` Thomas Gleixner
2008-10-02 20:31                 ` Ilpo Järvinen
2008-10-02 20:59                   ` Dâniel Fraga
2008-10-02 21:19                     ` Ilpo Järvinen
2008-10-02 21:39                       ` Dâniel Fraga
2008-10-05 17:52                       ` Dâniel Fraga
2008-10-05 18:09                         ` Thomas Gleixner
2008-10-05 19:31                           ` Dâniel Fraga
2008-10-06  1:27                           ` Dâniel Fraga
2008-10-08 19:34                           ` Dâniel Fraga
2008-10-08 19:56                             ` Thomas Gleixner
2008-10-08 20:44                               ` Ilpo Järvinen
2008-10-10  4:20                                 ` Dâniel Fraga
2008-10-10  8:10                                   ` Thomas Gleixner
2008-10-11  8:27                                   ` Ilpo Järvinen
2008-10-11 13:38                                     ` Dâniel Fraga
2008-10-13  4:08                               ` Dâniel Fraga
2008-10-13 13:35                                 ` Ilpo Järvinen
2008-10-13 18:32                                   ` Dâniel Fraga
2008-10-13 18:48                                     ` Nicolas Cannasse
2008-10-17 23:51                                   ` Dâniel Fraga
2008-10-20 21:48                                     ` Ilpo Järvinen
2008-10-20 21:49                                       ` Ilpo Järvinen
2008-10-22  2:12                                       ` Dâniel Fraga
2008-10-30 10:43                                         ` Ilpo Järvinen
2008-10-30 18:16                                           ` Dâniel Fraga
2008-11-02  5:56                                           ` [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd) [SOLVED] Dâniel Fraga
2008-11-03 15:37                                             ` Ilpo Järvinen
2008-11-03 17:03                                               ` Dâniel Fraga
2008-10-02 20:56                 ` [PATCH] tcp FRTO: in-order-only "TCP proxy" fragility workaround (fwd) Dâniel Fraga

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).