panic on 2.6.24rc5

* panic on 2.6.24rc5
@ 2007-12-27 23:16 Tomasz Grobelny
  2007-12-30 15:18 ` Tomasz Grobelny
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: Tomasz Grobelny @ 2007-12-27 23:16 UTC (permalink / raw)
  To: dccp

On Wednesday 26 December 2007, you wrote:
> What are the panics you are getting? It might be worth posting them to the
> list.
>
Here is the screenshot I captured a few days ago. Details:
 - kernel-vanilla 2.6.24rc5,
 - netem+tbf limited traffic on lo interface,
 - the panic was preceded by several dmesg entries like this:
BUG: err=1 after ccid_hc_tx_packet_sent 
at /home/users/tomek/samba/dccp24rc5/output.c:280/dccp_write_xmit()
Pid: 1637, comm: client Not tainted 2.6.24_vanilla-1smp #1
 [<e08ae01b>] dccp_write_xmit+0x17b/0x2b0 [dccp]
 [<e08afc6d>] dccp_sendmsg+0x12d/0x190 [dccp]
 [<c02ea257>] inet_sendmsg+0x47/0x60
 [<c0293591>] sock_sendmsg+0xe1/0xf0
 [<c0123dcf>] try_to_wake_up+0x1bf/0x2e0
 [<c0167a00>] get_page_from_freelist+0x80/0xc0
 [<c0140290>] autoremove_wake_function+0x0/0x50
 [<c0120dad>] enqueue_entity+0x3d/0x70
 [<c0120fe1>] enqueue_task_fair+0x21/0x30
 [<c020a6e4>] copy_from_user+0x34/0x70
 [<c0295005>] sys_sendmsg+0x155/0x1b0
 [<c025b033>] tty_wakeup+0x33/0x70
 [<c014010a>] remove_wait_queue+0x1a/0x50
 [<c0260479>] read_chan+0x3b9/0x5f0
 [<c0125e6e>] __wake_up+0x3e/0x60
 [<c0295489>] sys_socketcall+0x259/0x280
 [<c0185720>] vfs_read+0x110/0x140
 [<c01859e7>] sys_read+0x47/0x80
 [<c0104196>] sysenter_past_esp+0x5f/0x85
 - dccp seemed to be painfully slow at sending packets from the queue (but I
have no numbers to prove that),
 - the crash is pretty much reproducible.
As soon as I have a kernel with the newest patches compiled, I'll report
whether the panic is gone.
-- 
Regards,
Tomasz Grobelny

[-- Attachment #2: panic.png --]
[-- Type: image/png, Size: 21801 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: panic on 2.6.24rc5
  2007-12-27 23:16 panic on 2.6.24rc5 Tomasz Grobelny
@ 2007-12-30 15:18 ` Tomasz Grobelny
  2008-01-01 19:04 ` Arnaldo Carvalho de Melo
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Tomasz Grobelny @ 2007-12-30 15:18 UTC (permalink / raw)
  To: dccp

On Friday 28 December 2007, I wrote:
> On Wednesday 26 December 2007, you wrote:
> > What are the panics you are getting? It might be worth posting them to
> > the list.
>
> Here is the screenshot I captured a few days ago. Details:
>  - kernel-vanilla 2.6.24rc5,
Now I'm using kernel as described in Arnaldo's mail (davem/net-2.6.25 + 
patches 0001 to 0051).

>  - netem+tbf limited traffic on lo interface,
>  - the panic was preceded by several dmesg entries like this:
When netem is not used there are no BUG entries in dmesg; when it is, I get:
BUG: err=1 after ccid_hc_tx_packet_sent 
at /home/users/tomek/samba/dccp_exp/output.c:284/dccp_write_xmit()
Pid: 1745, comm: sclient Not tainted 2.6.24_vanilla-0.1smp #1
 [<e08afe4a>] dccp_write_xmit+0x14a/0x220 [dccp]
 [<e08b26f0>] dccp_write_xmit_timer+0x0/0x50 [dccp]
 [<e08b2739>] dccp_write_xmit_timer+0x49/0x50 [dccp]
 [<c013519b>] run_timer_softirq+0xbb/0x190
 [<c0143691>] hrtimer_interrupt+0x131/0x1e0
 [<c0130ec4>] __do_softirq+0xd4/0xf0
 [<c0130f18>] do_softirq+0x38/0x40
 [<c0130fc5>] irq_exit+0x75/0x80
 [<c0117c4a>] smp_apic_timer_interrupt+0x2a/0x40
 [<c0104c64>] apic_timer_interrupt+0x28/0x30
 [<c030c648>] _spin_unlock_irqrestore+0x8/0x10
 [<c025f5de>] n_tty_receive_buf+0x20e/0xaf0
 [<c01403f0>] autoremove_wake_function+0x0/0x50
 [<c016704a>] prep_new_page+0xca/0x120
 [<c0294dd4>] sys_sendto+0xc4/0xf0
 [<c0261c15>] pty_write+0x45/0x50
 [<c025edaf>] opost_block+0x7f/0x110
 [<c0260aed>] write_chan+0x18d/0x220
 [<c0125da0>] default_wake_function+0x0/0x10
 [<c0142b30>] lock_hrtimer_base+0x20/0x50
 [<c0125da0>] default_wake_function+0x0/0x10
 [<c025bde5>] tty_write+0x185/0x1f0
 [<c01439f1>] hrtimer_nanosleep+0x41/0xc0
 [<c0260960>] write_chan+0x0/0x220
 [<c0185acf>] vfs_write+0xbf/0x150
 [<c0185c27>] sys_write+0x47/0x80
 [<c0104196>] sysenter_past_esp+0x5f/0x85
 ===========
>  - dccp seemed to be painfully slow at sending packets from queue (but I
> have no numbers to prove that),
Ok, now I do have numbers. I wrote a program (sclient) which sends an 80-byte
packet every 100 ms. Here are the times of arrival (i.e. time(0)) on the other
end of the connection (note that this is on the loopback interface with no
limiting):
1199023603
1199023603
1199023603
1199023603
1199023604
1199023604
1199023604
1199023604
1199023604
1199023604
1199023604
1199023604
1199023604
1199023604
1199023605
1199023605
1199023605
1199023605
1199023605
1199023605
1199023605
1199023605
1199023605
1199023605
1199023606
1199023606
1199023606
1199023606
1199023606
1199023606
1199023606
1199023606
1199023606
1199023606
1199023607
1199023607
1199023607
1199023607
1199023607
1199023608
1199023608
1199023609
1199023610
1199023613
1199023615
1199023619
1199023624
1199023633
1199023642
1199023659
1199023677
1199023713
1199023749
1199023813
1199023877
1199023941
1199024005
1199024069
1199024133
1199024197
1199024261
1199024325

during this time I get 4 lines in dmesg:
dccp_hdlr_ack_ratio: Not changing RX Ack Ratio from 1 to 0
dccp_hdlr_ack_ratio: Not changing TX Ack Ratio from 1 to 2
dccp_hdlr_ack_ratio: Not changing RX Ack Ratio from 1 to 2
dccp_hdlr_ack_ratio: Not changing TX Ack Ratio from 1 to 0

As you can see, for the first few seconds all is fine (packets arrive at 9-10
per second), but then the rate drops to 1 packet every 64 seconds. Can anybody
reproduce that? Or what might I be doing wrong?
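
As a rough illustration of how the arrival list above can be summarised, here is a minimal Python sketch that counts packets per wall-clock second and computes the gaps between consecutive arrivals. The function names are hypothetical and not part of sclient; the sample values are copied from the list above.

```python
from collections import Counter


def per_second_counts(arrivals):
    """Packets received in each wall-clock second, in time order."""
    return sorted(Counter(arrivals).items())


def gaps(arrivals):
    """Seconds between consecutive arrivals."""
    return [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]


# First two seconds of the list above: 4 arrivals, then 10.
head = [1199023603] * 4 + [1199023604] * 10
# Last ten timestamps of the list above.
tail = [1199023749, 1199023813, 1199023877, 1199023941, 1199024005,
        1199024069, 1199024133, 1199024197, 1199024261, 1199024325]

print(per_second_counts(head))
print(gaps(tail))  # every gap in the tail is 64 seconds
```

Looking at the gaps rather than the raw timestamps makes the transition from about 10 packets per second to one packet per 64 seconds easy to spot.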

>  - the crash is pretty much reproducible.
The crash seems to be gone. But for me dccp is still unusable...
-- 
Regards,
Tomasz Grobelny

* Re: panic on 2.6.24rc5
  2007-12-27 23:16 panic on 2.6.24rc5 Tomasz Grobelny
  2007-12-30 15:18 ` Tomasz Grobelny
@ 2008-01-01 19:04 ` Arnaldo Carvalho de Melo
  2008-01-01 21:30 ` Tomasz Grobelny
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Arnaldo Carvalho de Melo @ 2008-01-01 19:04 UTC (permalink / raw)
  To: dccp

On Sun, Dec 30, 2007 at 04:18:36PM +0100, Tomasz Grobelny wrote:
> On Friday 28 December 2007, I wrote:
> > On Wednesday 26 December 2007, you wrote:
> > > What are the panics you are getting? It might be worth posting them to
> > > the list.
> >
> > Here is the screenshot I captured a few days ago. Details:
> >  - kernel-vanilla 2.6.24rc5,
> Now I'm using kernel as described in Arnaldo's mail (davem/net-2.6.25 + 
> patches 0001 to 0051).

dccp_hdlr_ack_ratio is not in net-2.6.25, which means it comes from one of
the 0001 to 0051 patches from Gerrit. So, to help us understand where the
problem is, you could try building a kernel without applying any of the
0001 to 0051 patches.

Could you do this and report the results?

I'm also assuming you are using CCID2, either by explicitly using
feature negotiation setsockopt calls or by using the default, which is
CCID2. If this is the case, it would also be interesting, before
rebuilding the kernel, to try CCID3, as the problem you're
experiencing when using netem is exactly in the interface between the
core DCCP code and the CCID being used.

- Arnaldo

* Re: panic on 2.6.24rc5
  2007-12-27 23:16 panic on 2.6.24rc5 Tomasz Grobelny
  2007-12-30 15:18 ` Tomasz Grobelny
  2008-01-01 19:04 ` Arnaldo Carvalho de Melo
@ 2008-01-01 21:30 ` Tomasz Grobelny
  2008-01-02  0:03 ` Arnaldo Carvalho de Melo
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Tomasz Grobelny @ 2008-01-01 21:30 UTC (permalink / raw)
  To: dccp

On Tuesday 01 January 2008, Arnaldo Carvalho de Melo wrote:
> On Sun, Dec 30, 2007 at 04:18:36PM +0100, Tomasz Grobelny wrote:
> > On Friday 28 December 2007, I wrote:
> > > On Wednesday 26 December 2007, you wrote:
> > > > What are the panics you are getting? It might be worth posting them
> > > > to the list.
> > >
> > > Here is the screenshot I captured a few days ago. Details:
> > >  - kernel-vanilla 2.6.24rc5,
> >
> > Now I'm using kernel as described in Arnaldo's mail (davem/net-2.6.25 +
> > patches 0001 to 0051).
>
> dccp_hdlr_ack_ratio is not in net-2.6.25, which means it comes from one of
> the 0001 to 0051 patches from Gerrit. So, to help us understand where the
> problem is, you could try building a kernel without applying any of the
> 0001 to 0051 patches.
>
> Could you do this and report the results?
>
But what exactly should I test? Just whether the delays are gone, or something
more? I'll try when I have some time (hopefully during the weekend).

> I'm also assuming you are using CCID2, either by explicitly using
> feature negotiation setsockopt calls or by using the default, which is
In fact I was using ccid3. When I switched to ccid2 it started to work more or
less ok. It seems that for whatever reason ccid_hc_tx_send_packet is
returning values that are too large (up to 64000).

> CCID2. If this is the case, it would also be interesting, before
> rebuilding the kernel, to try CCID3, as the problem you're
> experiencing when using netem is exactly in the interface between the
> core DCCP code and the CCID being used.
>
The problem with netem exists with both ccid2 and ccid3. I suspect that when
all three elements of the connection (server, client and netem) are on one
host, netem is able to communicate packet loss by returning an error. If netem
were on a different host, the packet would be sent correctly (no BUG: err=1
after ccid_hc_tx_packet_sent) but dropped on the other host. I think that in
this situation dccp should behave as if the packet was simply dropped.
-- 
Regards,
Tomasz Grobelny

* Re: panic on 2.6.24rc5
  2007-12-27 23:16 panic on 2.6.24rc5 Tomasz Grobelny
                   ` (2 preceding siblings ...)
  2008-01-01 21:30 ` Tomasz Grobelny
@ 2008-01-02  0:03 ` Arnaldo Carvalho de Melo
  2008-01-02  0:57 ` Tomasz Grobelny
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Arnaldo Carvalho de Melo @ 2008-01-02  0:03 UTC (permalink / raw)
  To: dccp

On Tue, Jan 01, 2008 at 10:30:56PM +0100, Tomasz Grobelny wrote:
> On Tuesday 01 January 2008, Arnaldo Carvalho de Melo wrote:
> > On Sun, Dec 30, 2007 at 04:18:36PM +0100, Tomasz Grobelny wrote:
> > > On Friday 28 December 2007, I wrote:
> > > > On Wednesday 26 December 2007, you wrote:
> > > > > What are the panics you are getting? It might be worth posting them
> > > > > to the list.
> > > >
> > > > Here is the screenshot I captured a few days ago. Details:
> > > >  - kernel-vanilla 2.6.24rc5,
> > >
> > > Now I'm using kernel as described in Arnaldo's mail (davem/net-2.6.25 +
> > > patches 0001 to 0051).
> >
> > dccp_hdlr_ack_ratio is not in net-2.6.25, which means it comes from one of
> > the 0001 to 0051 patches from Gerrit. So, to help us understand where the
> > problem is, you could try building a kernel without applying any of the
> > 0001 to 0051 patches.
> >
> > Could you do this and report the results?
> >
> But what exactly should I test? Just whether the delays are gone, or something
> more? I'll try when I have some time (hopefully during the weekend).

Whether the kernel oopses, and whether the results are the same or the problem
was introduced by the patches from Gerrit. I.e. you would help us narrow
down the problem by doing a binary search over kernels built from the
changeset history.

Please take a look at Documentation/BUG-HUNTING in the kernel sources.
The process is somewhat time-consuming and it's understandable if you
can't perform it; your reports are already of great help. But if you can
help us narrow down exactly when the bugs you notice
appeared, or whether they were always present, we'd
be really grateful :-)
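
The binary search over a patch series can be sketched as follows. This is only an illustration: `first_bad` and the `is_bad` predicate are hypothetical names, the predicate stands in for "build a kernel with the series applied up to this patch and run the test", and the culprit number used in the example is arbitrary. It assumes the last patch is known bad and that the bug, once introduced, stays present.

```python
def first_bad(candidates, is_bad):
    """Return the first candidate for which is_bad() is true.

    Assumes candidates are ordered oldest to newest, that the last one is
    known to be bad, and that every candidate after the first bad one is
    bad as well.
    """
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid        # the culprit is mid or earlier
        else:
            lo = mid + 1    # the culprit is after mid
    return candidates[lo]


patches = ["%04d" % n for n in range(1, 52)]  # 0001 .. 0051
# Stand-in predicate: pretend the problem first appears with patch 0037.
print(first_bad(patches, lambda p: p >= "0037"))
```

With 51 patches this needs at most six build-and-test rounds instead of 51.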
 
> > I'm also assuming you are using CCID2, either by explicitly using
> > feature negotiation setsockopt calls or by using the default, which is

> In fact I was using ccid3. When I switched to ccid2 it started to work more or
> less ok. It seems that for whatever reason ccid_hc_tx_send_packet is
> returning values that are too large (up to 64000).

That is an excellent data point. The ccid3 code is far more complex than
ccid2, so trying both is always valuable.
 
> > CCID2. If this is the case, it would also be interesting, before
> > rebuilding the kernel, to try CCID3, as the problem you're
> > experiencing when using netem is exactly in the interface between the
> > core DCCP code and the CCID being used.

> The problem with netem exists with both ccid2 and ccid3. I suspect that when
> all three elements of the connection (server, client and netem) are on one
> host, netem is able to communicate packet loss by returning an error. If netem
> were on a different host, the packet would be sent correctly (no BUG: err=1
> after ccid_hc_tx_packet_sent) but dropped on the other host. I think that in
> this situation dccp should behave as if the packet was simply dropped.

I can't work on this right now, will look at it tomorrow, but thanks for
the data points!

- Arnaldo

* Re: panic on 2.6.24rc5
  2007-12-27 23:16 panic on 2.6.24rc5 Tomasz Grobelny
                   ` (3 preceding siblings ...)
  2008-01-02  0:03 ` Arnaldo Carvalho de Melo
@ 2008-01-02  0:57 ` Tomasz Grobelny
  2008-01-02  2:40 ` Arnaldo Carvalho de Melo
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Tomasz Grobelny @ 2008-01-02  0:57 UTC (permalink / raw)
  To: dccp

On Wednesday 02 January 2008, Arnaldo Carvalho de Melo wrote:
> Whether the kernel oopses, and whether the results are the same or the problem
> was introduced by the patches from Gerrit. I.e. you would help us narrow
> down the problem by doing a binary search over kernels built from the
> changeset history.
Oh, and by the way: does there exist any set of automated tests for dccp? It 
would be nice to have one, wouldn't it? Otherwise accepting any patch is 
quite risky...
-- 
Regards,
Tomasz Grobelny

* Re: panic on 2.6.24rc5
  2007-12-27 23:16 panic on 2.6.24rc5 Tomasz Grobelny
                   ` (4 preceding siblings ...)
  2008-01-02  0:57 ` Tomasz Grobelny
@ 2008-01-02  2:40 ` Arnaldo Carvalho de Melo
  2008-01-07 13:41 ` Gerrit Renker
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Arnaldo Carvalho de Melo @ 2008-01-02  2:40 UTC (permalink / raw)
  To: dccp

On Wed, Jan 02, 2008 at 01:57:14AM +0100, Tomasz Grobelny wrote:
> On Wednesday 02 January 2008, Arnaldo Carvalho de Melo wrote:
> > Whether the kernel oopses, and whether the results are the same or the problem
> > was introduced by the patches from Gerrit. I.e. you would help us narrow
> > down the problem by doing a binary search over kernels built from the
> > changeset history.

> Oh, and by the way: does there exist any set of automated tests for dccp? It 
> would be nice to have one, wouldn't it? Otherwise accepting any patch is 
> quite risky...

There are test programs, documented in the wiki, and there is peer
review too :-)

And DCCP on Linux was written in such a way that a large part of its
core engine is actually shared with TCP, benefiting from a much bigger
set of developers and testers.

But please feel free to add more automated tests, it'll benefit us all.

- Arnaldo

* Re: panic on 2.6.24rc5
  2007-12-27 23:16 panic on 2.6.24rc5 Tomasz Grobelny
                   ` (5 preceding siblings ...)
  2008-01-02  2:40 ` Arnaldo Carvalho de Melo
@ 2008-01-07 13:41 ` Gerrit Renker
  2008-01-07 16:56 ` Gerrit Renker
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Gerrit Renker @ 2008-01-07 13:41 UTC (permalink / raw)
  To: dccp

Quoting Tomasz Grobelny:
| On Friday 28 December 2007, I wrote:
| > On Wednesday 26 December 2007, you wrote:
| > > What are the panics you are getting? It might be worth posting them to
| > > the list.
| >
| > Here is the screenshot I captured a few days ago. Details:
| >  - kernel-vanilla 2.6.24rc5,
| Now I'm using kernel as described in Arnaldo's mail (davem/net-2.6.25 + 
| patches 0001 to 0051).
| 
| >  - netem+tbf limited traffic on lo interface,
| >  - the panic was preceded by several dmesg entries like this:
| When netem is not used there are no BUG entries in dmesg; when it is, I get:
| BUG: err=1 after ccid_hc_tx_packet_sent 
The `err' refers to the return value of dccp_transmit_skb(); the
corresponding errno number is in include/asm-generic/errno-base.h:

#define EPERM            1      /* Operation not permitted */

Are you running a security-enhanced Linux (SELinux), or running as non-root?

| 
| >  - dccp seemed to be painfully slow at sending packets from queue (but I
| > have no numbers to prove that),
| Ok, now I do have numbers. I wrote a program (sclient) which sends an 80-byte
| packet every 100 ms. Here are the times of arrival (i.e. time(0)) on the other
| end of the connection (note that this is on the loopback interface with no
| limiting):
| 1199023603
| 1199023603
| 1199023603
| 1199023603
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023607
| 1199023607
| 1199023607
| 1199023607
| 1199023607
| 1199023608
| 1199023608
| 1199023609
| 1199023610
| 1199023613
| 1199023615
| 1199023619
| 1199023624
| 1199023633
| 1199023642
| 1199023659
| 1199023677
| 1199023713
| 1199023749
| 1199023813
| 1199023877
| 1199023941
| 1199024005
| 1199024069
| 1199024133
| 1199024197
| 1199024261
| 1199024325
| 
| during this time I get 4 lines in dmesg:
| dccp_hdlr_ack_ratio: Not changing RX Ack Ratio from 1 to 0
| dccp_hdlr_ack_ratio: Not changing TX Ack Ratio from 1 to 2
| dccp_hdlr_ack_ratio: Not changing RX Ack Ratio from 1 to 2
| dccp_hdlr_ack_ratio: Not changing TX Ack Ratio from 1 to 0
| 
Did you set net.dccp.default.ack_ratio=1? It looks like it. In any
case, this output looks good, since
 * the CCID3 sender disables Ack Ratio (Ack Ratio is only used by CCID2)
 * since you are using loopback, you get both sender/receiver messages
   (i.e. one sender message is always paired with one receiver message)
 * the CCID3 receiver does not touch Ack Ratio and leaves it at the default
   value for Ack Ratio (2)
 * the "Not changing {R,T}X Ack Ratio" messages are due to the fact that
   the Ack Ratio handler is currently disabled. This is a precaution
   since CCID2 is not well-behaved yet with Ack Ratios different from 1
 * the Ack Ratio messages should in general not affect CCID3 performance
   since the CCID3 code ignores Ack Ratio (i.e. they are only informative)

| As you can see for the first few seconds all is fine (packets arrive 9-10 a 
| second), but then the speed drops to 1 packet every 64 seconds. Can anybody 
| reproduce that? Or what may I be doing wrong?
I will also try the setting you described. Is it possible for you to run
your program between two computers (on the testbeds here it works ok)?
The 64 seconds means that CCID3 is dying very badly: 64 seconds is the
maximum inter-packet backoff interval (t_mbi in RFC 3448), so there is
something really strange going on. In combination with the -EPERM error,
it may be that the loss rate is very, very high.
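
To illustrate how a sender can end up at one packet per 64 seconds, here is a minimal sketch of a simplified backoff rule in the spirit of RFC 3448: each expiry of the nofeedback timer halves the allowed rate, bounded below by one packet per t_mbi. It deliberately ignores X_recv and X_calc, the function name is hypothetical, and the starting rate of 800 bytes/s is only derived from the reported 80-byte packet every 100 ms. It is not the kernel's CCID3 code.

```python
T_MBI = 64.0  # seconds, maximum inter-packet backoff interval (RFC 3448)


def backoff_intervals(packet_size, initial_rate):
    """Inter-packet interval (seconds) after each successive timer expiry.

    packet_size is in bytes, initial_rate in bytes per second. The rate is
    halved on every expiry until it reaches packet_size / T_MBI.
    """
    floor = packet_size / T_MBI
    rate = initial_rate
    intervals = []
    while rate > floor:
        rate = max(rate / 2.0, floor)
        intervals.append(packet_size / rate)
    return intervals


steps = backoff_intervals(80, 800.0)
print(len(steps))   # number of expiries needed to reach the floor
print(steps[-1])    # final spacing between packets, in seconds
```

Under these assumptions the spacing doubles on every expiry and settles at 64 seconds, which is the same general shape as the arrival times quoted above.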

We would really be grateful for any further hints that you could give us.

Here are a few more things to test
 * you can see your setting in the Request/Response handshake, the set
   of features negotiated for the connection is in the Response

 * many CCID3 parameters are printed out in dccp_probe, but I don't know
   how well this works when using loopback. Some scripts are on
   http://www.erg.abdn.ac.uk/users/gerrit/dccp/testing_dccp/

 * wireshark (source version) is also able to decode the receiver reports
   by the CCID3 receiver (X_recv/loss p)

* Re: panic on 2.6.24rc5
  2007-12-27 23:16 panic on 2.6.24rc5 Tomasz Grobelny
                   ` (6 preceding siblings ...)
  2008-01-07 13:41 ` Gerrit Renker
@ 2008-01-07 16:56 ` Gerrit Renker
  2008-03-06 20:57 ` Tomasz Grobelny
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Gerrit Renker @ 2008-01-07 16:56 UTC (permalink / raw)
  To: dccp

| > But what should I exactly test? Just whether the delays are gone or something 
| > more? I'll try to when I have some time (hopefully during weekend).
| 
| Whether the kernel oopses, and whether the results are the same or the problem
| was introduced by the patches from Gerrit. I.e. you would help us narrow
| down the problem by doing a binary search over kernels built from the
| changeset history.
| 
| Please take a look at Documentation/BUG-HUNTING in the kernel sources.
| The process is somewhat time-consuming and it's understandable if you
| can't perform it; your reports are already of great help. But if you can
| help us narrow down exactly when the bugs you notice
| appeared, or whether they were always present, we'd
| be really grateful :-)
I have tried to reproduce the oops reported by Tomasz, but could not
trigger it. I used iperf on the loopback (127.0.0.1) and got 66.4
Mbits/sec out of it.
I would also like to ask for more information, if at all possible.

* Re: panic on 2.6.24rc5
  2007-12-27 23:16 panic on 2.6.24rc5 Tomasz Grobelny
                   ` (7 preceding siblings ...)
  2008-01-07 16:56 ` Gerrit Renker
@ 2008-03-06 20:57 ` Tomasz Grobelny
  2008-03-07 12:16 ` Gerrit Renker
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Tomasz Grobelny @ 2008-03-06 20:57 UTC (permalink / raw)
  To: dccp

On Tuesday 01 January 2008, Arnaldo Carvalho de Melo wrote:
> Could you do this and report the results?
Ok, back to the old thread. I found that the commit after which dccp over
loopback (no limiting) has huge delays (as reported previously) is
52515e77a7a69867c479db4c9efb6be832b82179. This is for CCID3 only, no matter
whether the client and server programs are run by root or not. dmesg shows:
CCID: Registered CCID 3 (TCP-Friendly Rate Control)
dccp_sample_rtt: unusable RTT sample -172, using min
ccid3_hc_tx_packet_recv: client(cfb52740): ACK with bogus ACK-125773746264929

Please feel free to ask for more info. I should now have more time to test and
experiment.
-- 
Regards,
Tomasz Grobelny

* Re: panic on 2.6.24rc5
  2007-12-27 23:16 panic on 2.6.24rc5 Tomasz Grobelny
                   ` (8 preceding siblings ...)
  2008-03-06 20:57 ` Tomasz Grobelny
@ 2008-03-07 12:16 ` Gerrit Renker
  2008-03-07 14:52 ` Tomasz Grobelny
  2008-03-07 16:09 ` Gerrit Renker
  11 siblings, 0 replies; 13+ messages in thread
From: Gerrit Renker @ 2008-03-07 12:16 UTC (permalink / raw)
  To: dccp

| Ok, back to the old thread. I found out that commit after which dccp over
| loopback (no limiting) has huge delays (as reported previously) is
| 52515e77a7a69867c479db4c9efb6be832b82179. This is for CCID3 only no matter if
| client and server programs are run by root or not. dmesg shows:
| CCID: Registered CCID 3 (TCP-Friendly Rate Control)
| dccp_sample_rtt: unusable RTT sample -172, using min
| ccid3_hc_tx_packet_recv: client(cfb52740): ACK with bogus ACK-125773746264929
|
| Please feel free to ask for more info. Now I should have more time to test and
| make experiments.
Thanks. First I need to clarify something from my earlier email: the
reported error ("err=1 after tx_packet_sent") has nothing to do with errno=EPERM.
That was my mistake: the `1' is generated by the device output routine,
which returns NET_XMIT_DROP when the Qdisc decides not to send the
packet (see linux/netdevice.h).
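
The confusion is easy to make because both constants have the value 1. A minimal sketch of the distinction, assuming the usual convention that negative return values are -errno codes and positive ones are qdisc verdicts; the NET_XMIT_* values are copied from linux/netdevice.h, and `describe_xmit_result` is a hypothetical helper, not a kernel function:

```python
import errno

# Values as defined in linux/netdevice.h.
NET_XMIT_SUCCESS = 0x00
NET_XMIT_DROP = 0x01   # packet dropped by the qdisc
NET_XMIT_CN = 0x02     # congestion notification

_VERDICTS = {
    NET_XMIT_SUCCESS: "NET_XMIT_SUCCESS",
    NET_XMIT_DROP: "NET_XMIT_DROP",
    NET_XMIT_CN: "NET_XMIT_CN",
}


def describe_xmit_result(err):
    """Name a transmit return value: negative is -errno, otherwise a verdict."""
    if err < 0:
        return "errno " + errno.errorcode.get(-err, str(-err))
    return _VERDICTS.get(err, "unknown verdict %d" % err)


print(describe_xmit_result(1))             # NET_XMIT_DROP
print(describe_xmit_result(-errno.EPERM))  # errno EPERM
```

So a logged "err=1" is a positive value and reads as a qdisc drop, whereas a permission failure would show up as -1.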

Now to the bugs: the original report was "crash on loopback"; the
above is a different condition, so let us take them one after the other.

Firstly, irrespective of whether loopback is a representative environment or not,
if a crash happens on loopback then it must be fixed. Since your email I have heard
(privately) from at least one person who also encountered a crash on loopback. I do
occasional tests on loopback, but use two or three computers to do the
real testing. So far I have been unable to reproduce the bug and hence
any further information would be great.

I looked up your previous email on
http://www.mail-archive.com/dccp@vger.kernel.org/msg03129.html
and tried to guess what is happening with regard to "huge delays". If
the problem you are referring to is the same (i.e. CCID-3 switches to a
mode of sending 1 packet in 64 seconds), then the above error messages
are the consequence, but not the cause. 

Hence can you please clarify whether CCID-3 is getting into the mode of
sending once per 64 seconds?

Now lastly, ranting about loopback: this link is not very
representative, it has an extremely small RTT, a large MTU and actually
is a kind of virtual interface. Hence it is possible to run into
problems due to the nature of the link, but not the CCID-3 module.

In particular, the following problems are likely:
 * due to the small RTT, the nofeedback timer triggers very often,
   quickly reducing the sending rate towards 1 in 64 seconds
 * this is because the nofeedback timer is triggered every 4*RTT,
   a loopback RTT is < 50 usec, so that would be ~200 usec
 * to avoid this, there is a kernel configuration option for CCID-3
   that sets a lower bound (a higher floor) on this timer interval.
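
The arithmetic in the list above can be sketched as follows. This follows the description in this message only (timer interval of 4*RTT, optionally held above a configured floor); it omits the send-interval term that RFC 3448 also includes, and `floor_ms` is a hypothetical parameter standing in for the configuration option, with 100 ms used purely as an example value.

```python
def nofeedback_timeout_us(rtt_us, floor_ms=0):
    """Nofeedback timer interval in microseconds.

    The interval is 4 * RTT, but never less than floor_ms milliseconds.
    """
    return max(4 * rtt_us, floor_ms * 1000)


def expiries_per_second(timeout_us):
    """How often the timer could fire per second if no feedback arrives."""
    return 1_000_000 / timeout_us


loopback = nofeedback_timeout_us(50)        # 50 usec RTT, no floor
bounded = nofeedback_timeout_us(50, 100)    # same RTT, 100 ms floor
print(loopback, expiries_per_second(loopback))
print(bounded, expiries_per_second(bounded))
```

With a 50 usec RTT and no floor the timer can fire thousands of times per second, which is why a floor in the millisecond range changes the behaviour so much.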

Please let us know if that diagnosis matches the case.

Gerrit

* Re: panic on 2.6.24rc5
  2007-12-27 23:16 panic on 2.6.24rc5 Tomasz Grobelny
                   ` (9 preceding siblings ...)
  2008-03-07 12:16 ` Gerrit Renker
@ 2008-03-07 14:52 ` Tomasz Grobelny
  2008-03-07 16:09 ` Gerrit Renker
  11 siblings, 0 replies; 13+ messages in thread
From: Tomasz Grobelny @ 2008-03-07 14:52 UTC (permalink / raw)
  To: dccp

On Friday 07 March 2008, Gerrit Renker wrote:
> | Ok, back to the old thread. I found out that commit after which dccp over
> | loopback (no limiting) has huge delays (as reported previously) is
> | 52515e77a7a69867c479db4c9efb6be832b82179. This is for CCID3 only no
> | matter if client and server programs are run by root or not. dmesg shows:
> | CCID: Registered CCID 3 (TCP-Friendly Rate Control)
> | dccp_sample_rtt: unusable RTT sample -172, using min
> | ccid3_hc_tx_packet_recv: client(cfb52740): ACK with bogus
> | ACK-125773746264929
> |
> | Please feel free to ask for more info. Now I should have more time to
> | test and make experiments.
>
> Thanks. I need to first make a clarification with regard to earlier email:
> the reported error ("err=1 after tx_packet_sent") has nothing to do with
> errno=EPERM. That was my mistake, the `1' is generated by the device output
> routine, which generates NET_XMIT_DROP when the Qdisc decides not to send
> the packet (linux/netdevice.h).
>
And so I guess that this NET_XMIT_DROP should be ignored by the dccp code?

> Now to the bugs: the original error message was "crash on loopback", the
> above is a different condition, so one after the other.
>
As I wrote previously, I cannot reproduce the crash anymore, probably due
to the newer kernel version.

> Firstly, irrespective of whether loopback is a representative environment
> or not, if a crash happens on loopback then it must be fixed. Since your
> email I have heard (privately) from at least one person who also
> encountered a crash on loopback. I do occasional tests on loopback, but use
> two or three computers to do the real testing. So far I have been unable to
> reproduce the bug and hence any further information would be great.
>
Hmm... I could provide you with a vmware image, but it is quite big: about 300M.

> I looked up your previous email on
> http://www.mail-archive.com/dccp@vger.kernel.org/msg03129.html
> and tried to guess what is happening with regard to "huge delays". If
> the problem you are referring to is the same (i.e. CCID-3 switches to a
> mode of sending 1 packet in 64 seconds), then the above error messages
> are the consequence, but not the cause.
>
> Hence can you please clarify whether CCID-3 is getting into the mode of
> sending once per 64 seconds?
>
Yes, that's the bug I'm referring to.

> Now lastly, ranting about loopback: this link is not very
> representative, it has an extremely small RTT, a large MTU and actually
> is a kind of virtual interface. Hence it is possible to run into
> problems due to the nature of the link, but not the CCID-3 module.
>
But still it is a very convenient testbed for new applications. So it would be 
nice if it worked so as not to scare off new developers ;-)

> In particular, the following problems are likely:
>  * due to the small RTT, the nofeedback timer triggers very often,
>    quickly reducing the sending rate towards 1 in 64 seconds
>  * this is because the nofeedback timer is triggered every 4*RTT,
>    a loopback RTT is < 50 usec, so that would be ~200 usec
>  * to avoid this, there is a kernel configuration option of CCID-3
>    to set an upper bound for this.
How do I set it?

>
> Please let us know if that diagnosis matches the case.
>
How can I test it? The best I came up with was adding delay to the interface
(tc qdisc add dev lo root netem delay 1ms) and testing again. And it works ok
now, no slowing down.
So it is probably a matter of setting a sensible default in the kernel.
-- 
Regards,
Tomasz Grobelny

* Re: panic on 2.6.24rc5
  2007-12-27 23:16 panic on 2.6.24rc5 Tomasz Grobelny
                   ` (10 preceding siblings ...)
  2008-03-07 14:52 ` Tomasz Grobelny
@ 2008-03-07 16:09 ` Gerrit Renker
  11 siblings, 0 replies; 13+ messages in thread
From: Gerrit Renker @ 2008-03-07 16:09 UTC (permalink / raw)
  To: dccp

| > errno=EPERM. That was my mistake, the `1' is generated by the device output
| > routine, which generates NET_XMIT_DROP when the Qdisc decides not to send
| > the packet (linux/netdevice.h).
| >
| And so I guess that this NET_XMIT_DROP should be ignored by dccp code?
| 
In the test tree it is now (almost) ignored. Since the time you reported
this problem I have changed the DCCP_BUG to a DCCP_WARN, so that the
drop will still be logged, but there will be fewer such warnings in the
log now (in DCCP_WARN, printk is rate-limited).

| > Firstly, irrespective of whether loopback is a representative environment
| > or not, if a crash happens on loopback then it must be fixed. Since your
| > email I have heard (privately) from at least one person who also
| > encountered a crash on loopback. I do occasional tests on loopback, but use
| > two or three computers to do the real testing. So far I have been unable to
| > reproduce the bug and hence any further information would be great.
| >
| Hmm... I could provide you with vmware image but it is quite big. About 300M.
| 
This is interesting -- so you are running DCCP under virtualisation?
Arnaldo used to do this with QEMU, I tried this recently also but am no
fan of virtual networks. Yes, if the crash persisted, any information
would help.

| > Now lastly, ranting about loopback: this link is not very
| > representative, it has an extremely small RTT, a large MTU and actually
| > is a kind of virtual interface. Hence it is possible to run into
| > problems due to the nature of the link, but not the CCID-3 module.
| >
| But still it is a very convenient testbed for new applications. So it would be 
| nice if it worked so as not to scare off new developers ;-)
| 
Yes you are right - I agree.

| >  * to avoid this, there is a kernel configuration option of CCID-3
| >    to set an upper bound for this.
| How do I set it?
| 
In the menu under
 Networking -> Network Options -> The DCCP Protocol (EXPERIMENTAL) 
 -> DCCP CCIDs Configuration (EXPERIMENTAL) -> CCID3 
 -> (100) Use higher bound for nofeedback timer
Ah - just remembered -- the default is 100 milliseconds, so this will
probably have caught the problems with the low RTT.

| >
| > Please let us know if that diagnosis matches the case.
| >
| How can I test it? The best I came up with was adding delay to the interface 
| (tc qdisc add dev lo root netem delay 1ms) and test again. And it works ok 
| now, no slowing down.
So from what you wrote I read
 * without additional delay the described problem occurs and CCID-3
   gets into 1-packet-per-64-seconds mode
 * when you add 1 millisecond of delay to the interface then it works ok.

Then this means that the problems are indeed due to an extremely low
RTT. If this is the case, the problem is resolved.

| So probably it is a matter of setting sensible default in the kernel.
There are arguments which say that on LANs with such low RTTs
congestion control should be skipped as this generates "noise".

If the above reasoning holds, this should be marked as a ToDo item, e.g. by
using a higher threshold for the RTTs (the current minimum is 100 usec).
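
The idea of a higher RTT threshold can be sketched as a simple clamp. This is an illustration of the suggestion only, not the kernel's RTT sampling code: `sanitize_rtt_us` is a hypothetical helper, the 100 usec default comes from the figure mentioned above, and the 1000 usec alternative is an arbitrary example.

```python
def sanitize_rtt_us(sample_us, min_us=100):
    """Clamp an RTT sample (microseconds) to at least min_us.

    Non-positive samples, such as the -172 seen in dmesg, are unusable and
    are replaced by the minimum as well.
    """
    if sample_us <= 0:
        return min_us
    return max(sample_us, min_us)


print(sanitize_rtt_us(-172))              # unusable sample, falls back to 100
print(sanitize_rtt_us(50))                # loopback-sized sample, raised to 100
print(sanitize_rtt_us(50, min_us=1000))   # with a higher threshold
print(sanitize_rtt_us(5000))              # ordinary sample passes through
```

Raising `min_us` has an effect similar to adding artificial delay with netem: the timers derived from the RTT no longer run at loopback speed.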

You can plot the CCID-3 RTTs using dccp_probe, scripts are on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/testing_dccp/

(at the bottom of the page).

