public inbox for b.a.t.m.a.n@lists.open-mesh.org
* [B.A.T.M.A.N.] Batman gateway lock ups
@ 2008-09-05 15:01 Outback Dingo
  2008-09-08  7:08 ` Sven Eckelmann
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Outback Dingo @ 2008-09-05 15:01 UTC
  To: b.a.t.m.a.n

[-- Attachment #1: Type: text/plain, Size: 2297 bytes --]

 see pastebin

http://www.pastebin.ca/1194874

pertinent info
dmesg | grep 'batgat loaded'
batgat: [init_module:96] batgat loaded  rv1025
uname -a
Linux nightwing 2.6.23.16 #16 Tue Apr 22 20:00:17 ART 2008 mips unknown
root@nightwing:~# batmand -v
WARNING: You are using the unstable batman branch. If you are interested in
*using* batman get the latest stable release !
B.A.T.M.A.N. 0.3-beta (compatibility version 5)
lsmod

Module                  Size  Used by    Tainted: P
sch_htb                14048  2
ath_ahb               103616  0
wlan_xauth               480  0
wlan_wep                4000  0
wlan_tkip               9856  0
wlan_ccmp               5440  2
wlan_acl                1920  0
ath_rate_minstrel       8352  1
ath_hal               136832  3 ath_ahb,ath_rate_minstrel
wlan_scan_sta           8768  1
wlan_scan_ap            6656  0
wlan                  152464  10 ath_ahb,wlan_xauth,wlan_wep,wlan_tkip,wlan_ccmp,wlan_acl,ath_rate_minstrel,wlan_scan_sta,wlan_scan_ap
batgat                 10944  1
ipt_iprange              672  0
ipt_TOS                  832  0
ipt_TTL                  928  0
xt_MARK                  960  3
ipt_ECN                 1472  0
xt_CLASSIFY              640  0
ipt_ttl                  704  0
ipt_tos                  544  0
ipt_time                1568  0
xt_tcpmss               1088  0
xt_statistic             832  0
xt_mark                  672  7
xt_mac                   736  3
xt_length                736  0
ipt_ecn                 1024  0
xt_DSCP                 1056  0
xt_dscp                  832  0
imq                     2096  0
ipt_IMQ                  672  2
xt_string                896  0
xt_layer7               9840  0
ipt_ipp2p               6784  0
ipt_LOG                 4640  0
xt_CHAOS                1792  0
xt_DELUDE               2624  1
xt_TARPIT               2816  1
xt_quota                 800  0
xt_portscan             2016  0
xt_pkttype               704  0
xt_physdev              1488  0
ipt_owner                800  0
iptable_raw              832  0
xt_NOTRACK               832  0
xt_CONNMARK             1088  0
ipt_recent              4992  0
xt_helper                992  0
xt_conntrack            1312  0
xt_connmark              832  0
xt_connbytes            1312  0
tun                     6592  0


* Re: [B.A.T.M.A.N.] Batman gateway lock ups
  2008-09-05 15:01 [B.A.T.M.A.N.] Batman gateway lock ups Outback Dingo
@ 2008-09-08  7:08 ` Sven Eckelmann
  2008-09-08 21:18 ` Sven Eckelmann
  2009-07-21 20:48 ` Simon Wunderlich
  2 siblings, 0 replies; 9+ messages in thread
From: Sven Eckelmann @ 2008-09-08  7:08 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

[-- Attachment #1: Type: text/plain, Size: 1742 bytes --]

On Friday 05 September 2008 17:01:24 Outback Dingo wrote:
> [...]
> http://www.pastebin.ca/1194874
>
> pertinent info
> dmesg | grep 'batgat loaded'
> batgat: [init_module:96] batgat loaded  rv1025
> uname -a
> Linux nightwing 2.6.23.16 #16 Tue Apr 22 20:00:17 ART 2008 mips unknown
> root@nightwing:~# batmand -v
> WARNING: You are using the unstable batman branch. If you are interested in
> *using* batman get the latest stable release !
> B.A.T.M.A.N. 0.3-beta (compatibility version 5)
> [..]
Thx for your report. I looked a little bit at the call stack, but it is 
really hard to guess the functions when you don't have the symbol table. It 
looks like most of the functions are kernel thread/scheduling related. Only 
one function, <c00c8650>, is from a module. Can you please send the 
/proc/modules file so we can find out which module it belongs to and check 
whether this could be batgat related? Further investigation could be done by 
checking the symbols of batgat.ko with nm - but this would not help very much 
because the module is stripped. Maybe someone else has a good idea.

Now to something you said on the nightwing mailing list:
> i think the version of batgat being used has a bug as confirmed by someone in
> the batman irc room, note only gateways have this issue, clients do not
> crash at all.
If you mean me (Lazhur on irc) then I have to say that I am neither a 
developer nor a spokesman of b.a.t.m.a.n., and I never confirmed that it is 
batgat related - only that something crashed/locked up.
I cannot find any crash-related bug fixes in the batgat trunk directory - so 
it doesn't seem to be a known problem (if it is a batgat problem at all).

Best regards
	Sven Eckelmann


* Re: [B.A.T.M.A.N.] Batman gateway lock ups
  2008-09-05 15:01 [B.A.T.M.A.N.] Batman gateway lock ups Outback Dingo
  2008-09-08  7:08 ` Sven Eckelmann
@ 2008-09-08 21:18 ` Sven Eckelmann
  2008-09-08 21:45   ` Sven Eckelmann
  2008-09-09 11:26   ` Simon Wunderlich
  2009-07-21 20:48 ` Simon Wunderlich
  2 siblings, 2 replies; 9+ messages in thread
From: Sven Eckelmann @ 2008-09-08 21:18 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

[-- Attachment #1: Type: text/plain, Size: 2135 bytes --]

Ok, I got the /proc/modules file now. The current situation is as follows: it 
crashes inside the batgat module at position 0x00000aa4

    a60:	3c020000 	lui	v0,0x0
     a64:	8c500024 	lw	s0,36(v0)
     a68:	24420024 	addiu	v0,v0,36
     a6c:	12020014 	beq	s0,v0,ac0 <cleanup_module+0x610>
     a70:	3c040000 	lui	a0,0x0
     a74:	3c050000 	lui	a1,0x0
     a78:	3c020000 	lui	v0,0x0
     a7c:	24840000 	addiu	a0,a0,0
     a80:	24a50088 	addiu	a1,a1,136
     a84:	24420000 	addiu	v0,v0,0
     a88:	0040f809 	jalr	v0
     a8c:	24060283 	li	a2,643
     a90:	8e040004 	lw	a0,4(s0)
     a94:	8e030000 	lw	v1,0(s0)
     a98:	3c020010 	lui	v0,0x10
     a9c:	34420100 	ori	v0,v0,0x100
     aa0:	8e110008 	lw	s1,8(s0)
     aa4:	ac830000 	sw	v1,0(a0)
     aa8:	ae020000 	sw	v0,0(s0)
     aac:	3c020020 	lui	v0,0x20
     ab0:	34420200 	ori	v0,v0,0x200
     ab4:	ac640004 	sw	a0,4(v1)

This is part of the compiled version of packet_recv_thread. Due to the 
optimizations done I cannot say where exactly the problem lies.

I think the code of get_ip_addr() got inlined in packet_recv_thread and we 
need to search for the crash inside of it at list_del(&entry->list);
I would also say that the real crash is inside __list_del, where prev and 
next are set. To check it, look at LIST_POISON1 and LIST_POISON2 inside 
poison.h of the current linux kernel. You will notice that the values are 
0x00100100 and 0x00200200 == address of the failed paging request. The list 
poisoning is done in list_del after calling __list_del (it is the sequence 
lui, ori, sw in the asm snippet). So could it be that we have a poisoned 
entry inside the list?
This could for example happen when we get scheduled (please notice that the 
optimizer reordered many instructions) while another part of the program is 
deleting entries. I haven't checked the rest of the code to see if that 
really could happen, but that is my current idea.

So for better readability the callstack:
- packet_recv_thread
- get_ip_addr from gateway.c:401
- list_del from gateway.c:645
- __list_del

Best regards
	Sven Eckelmann


* Re: [B.A.T.M.A.N.] Batman gateway lock ups
  2008-09-08 21:18 ` Sven Eckelmann
@ 2008-09-08 21:45   ` Sven Eckelmann
  2008-09-09  8:03     ` Sven Eckelmann
  2008-09-09 11:26   ` Simon Wunderlich
  1 sibling, 1 reply; 9+ messages in thread
From: Sven Eckelmann @ 2008-09-08 21:45 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking


[-- Attachment #1.1: Type: text/plain, Size: 502 bytes --]

On Monday 08 September 2008 23:18:42 Sven Eckelmann wrote:
> Ok, I got the /proc/modules file now. The current situation is as follows:
> it crashes inside the batgat module at position 0x00000aa4
> [..]
I got the System.map just now, so we can convert the kernel oops into 
something more readable. It doesn't help that much now, but... just for the 
sake of completeness.

END_OF_CODE+3fe1d020 (address c00c8650) turns out to be inside batgat.ko when 
we search for it in /proc/modules (batgat is loaded at 0xc00c8000).
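
The lookup can be sketched in a few lines of C. This is only an
illustration: the table holds three rows copied from the attached
proc_modules file, and the names module_range, find_module() and
module_offset() are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of /proc/modules: name, size in bytes, load address. */
struct module_range {
	const char *name;
	uint32_t size;
	uint32_t base;
};

/* A few rows copied from the proc_modules attachment. */
static const struct module_range modules[] = {
	{ "wlan_scan_ap", 6656,  0xc00c5000u },
	{ "batgat",       10944, 0xc00c8000u },
	{ "wlan_wep",     4000,  0xc00cf000u },
};

/* Return the module whose [base, base + size) range contains addr, or NULL. */
const struct module_range *find_module(uint32_t addr)
{
	size_t i;

	for (i = 0; i < sizeof(modules) / sizeof(modules[0]); i++) {
		if (addr >= modules[i].base &&
		    addr - modules[i].base < modules[i].size)
			return &modules[i];
	}
	return NULL;
}

/* Offset of addr inside its module, or -1 if no listed module contains it. */
long module_offset(uint32_t addr)
{
	const struct module_range *m = find_module(addr);

	assert(m == NULL || addr >= m->base);
	return m ? (long)(addr - m->base) : -1L;
}
```

With these rows, module_offset(0xc00c8aa4) is 0xaa4, the position of the 
faulting instruction in the disassembly, and the call-trace entry c00c8650 
lands at offset 0x650 of batgat.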

Best regards
	Sven Eckelmann

[-- Attachment #1.2: informative_ooops.txt --]
[-- Type: text/plain, Size: 1672 bytes --]

CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc == c00c8aa4, ra == c00c8a90
Cpu 0
$ 0   : 00000000 10009c00 00100100 802bb600
$ 4   : 00000000 00000001 00000000 00000000
$ 8   : 00000000 806e7a28 00000007 a106af00
$12   : 00000007 270e0000 000006d0 7d96d080
$16   : 80a23500 00000000 c00c9a28 00000064
$20   : c00d0000 00000000 000b0f4f 8071193d
$24   : 80711730 00008000
$28   : 80710000 80711890 00000000 c00c8a90
Hi    : 00000140
Lo    : 68fdd3c0
epc   : c00c8aa4     Tainted: P
Cause : 3080000c
        807119a0 000005dc 8071193d 000004c0 000210d2 05a82b1a 00000000 00000000
        00020000 185352d0 2ac668cc 2ac668d4 000210d2 00000000 2ac668dc 2ac668e4
        00000000 806e79f8 8007d41c 807118fc 807118fc 807118c0 00000010 807118b0
        00000001 00000000 00000000 00004040 807118d0 00000010 807118b8 00000001
Call Trace:[<8007d41c>][<8005e5f0>][<8005cd04>][<8005e8d4>][<8005e0e4>][<8005cd38>][<8005d578>][<8007d0b0>][<8022656c>][<c00c8650>][<8007d108>][<8007d0e8>][<80045698>][<80045688>]
Code: 3c020010  34420100  8e110008 <ac830000> ae020000  3c020020  34420200  ac640004  16200011

Trace; 8007d41c <autoremove_wake_function+0/44>
Trace; 8005e5f0 <enqueue_entity+2fc/33c>
Trace; 8005cd04 <enqueue_task+1c/34>
Trace; 8005e8d4 <dequeue_entity+98/d8>
Trace; 8005e0e4 <try_to_wake_up+84/d8>
Trace; 8005cd38 <dequeue_task+1c/30>
Trace; 8005d578 <pick_next_task_fair+38/78>
Trace; 8007d0b0 <kthread+0/b0>
Trace; 8022656c <schedule+1e0/7d4>
Trace; c00c8650 <END_OF_CODE+3fe1d020/????>
Trace; 8007d108 <kthread+58/b0>
Trace; 8007d0e8 <kthread+38/b0>
Trace; 80045698 <kernel_thread_helper+10/18>
Trace; 80045688 <kernel_thread_helper+0/18>

[-- Attachment #1.3: proc_modules --]
[-- Type: text/plain, Size: 1954 bytes --]

sch_htb 14048 2 - Live 0xc00e5000
ath_ahb 103616 0 - Live 0xc0150000
wlan_xauth 480 0 - Live 0xc00dc000
wlan_wep 4000 0 - Live 0xc00cf000
wlan_tkip 9856 0 - Live 0xc00d8000
wlan_ccmp 5440 3 - Live 0xc00d5000
wlan_acl 1920 0 - Live 0xc00af000
ath_rate_minstrel 8352 1 - Live 0xc00d1000
ath_hal 136832 3 ath_ahb,ath_rate_minstrel, Live 0xc012d000 (P)
wlan_scan_sta 8768 1 - Live 0xc00c1000
wlan_scan_ap 6656 0 - Live 0xc00c5000
wlan 152464 10 ath_ahb,wlan_xauth,wlan_wep,wlan_tkip,wlan_ccmp,wlan_acl,ath_rate_minstrel,wlan_scan_sta,wlan_scan_ap,Live 0xc0106000
batgat 10944 1 - Live 0xc00c8000
ipt_iprange 672 0 - Live 0xc00bf000
ipt_TOS 832 0 - Live 0xc00bd000
ipt_TTL 928 0 - Live 0xc00bb000
xt_MARK 960 0 - Live 0xc00b9000
ipt_ECN 1472 0 - Live 0xc00b7000
xt_CLASSIFY 640 0 - Live 0xc00b5000
ipt_ttl 704 0 - Live 0xc00b3000
ipt_tos 544 0 - Live 0xc00b1000
ipt_time 1568 0 - Live 0xc009d000
xt_tcpmss 1088 0 - Live 0xc00ad000
xt_statistic 832 0 - Live 0xc00ab000
xt_mark 672 7 - Live 0xc00a9000
xt_mac 736 0 - Live 0xc00a7000
xt_length 736 0 - Live 0xc00a5000
ipt_ecn 1024 0 - Live 0xc00a3000
xt_DSCP 1056 0 - Live 0xc00a1000
xt_dscp 832 0 - Live 0xc009f000
imq 2096 0 - Live 0xc0097000
ipt_IMQ 672 2 - Live 0xc009b000
xt_string 896 0 - Live 0xc0099000
xt_layer7 9840 0 - Live 0xc008f000
ipt_ipp2p 6784 0 - Live 0xc0094000
ipt_LOG 4640 0 - Live 0xc0088000
xt_CHAOS 1792 0 - Live 0xc008d000
xt_DELUDE 2624 1 - Live 0xc008b000
xt_TARPIT 2816 1 - Live 0xc0084000
xt_quota 800 0 - Live 0xc0086000
xt_portscan 2016 0 - Live 0xc0066000
xt_pkttype 704 0 - Live 0xc0082000
xt_physdev 1488 0 - Live 0xc0080000
ipt_owner 800 0 - Live 0xc007e000
iptable_raw 832 0 - Live 0xc007c000
xt_NOTRACK 832 0 - Live 0xc0076000
xt_CONNMARK 1088 0 - Live 0xc0074000
ipt_recent 4992 0 - Live 0xc0079000
xt_helper 992 0 - Live 0xc0072000
xt_conntrack 1312 0 - Live 0xc0070000
xt_connmark 832 0 - Live 0xc006e000
xt_connbytes 1312 0 - Live 0xc0068000
tun 6592 0 - Live 0xc006b000


* Re: [B.A.T.M.A.N.] Batman gateway lock ups
  2008-09-08 21:45   ` Sven Eckelmann
@ 2008-09-09  8:03     ` Sven Eckelmann
  0 siblings, 0 replies; 9+ messages in thread
From: Sven Eckelmann @ 2008-09-09  8:03 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking


[-- Attachment #1.1: Type: text/plain, Size: 351 bytes --]

On Monday 08 September 2008 23:45:55 Sven Eckelmann wrote:
> I got the System.map just now, so we can convert the kernel oops into
> something more readable. It doesn't help that much now, but... just for
> the sake of completeness.
Sorry, I forgot the second oops, which has the interesting address of the 
paging failure.

Best regards
	Sven Eckelmann


[-- Attachment #1.2: informative_ooops2.txt --]
[-- Type: text/plain, Size: 4199 bytes --]

CPU 0 Unable to handle kernel paging request at virtual address 00200200, epc == c00c8aa4, ra == c00c8a90
Cpu 0
$ 0   : 00000000 10009c00 00100100 00100100
$ 4   : 00200200 00000001 00000000 00000000
$ 8   : 00000000 8071aa28 0000000b 127a3980
$12   : 0000000b ebc20000 0000045d 67350e80
$16   : 80ac1600 00000000 c00c9a28 00000064
$20   : c00d0000 00000000 0006ab6e 8071d93d
$24   : 8071d730 00008000
$28   : 8071c000 8071d890 00000000 c00c8a90
Hi    : 00000140
Lo    : 68fdd3c0
epc   : c00c8aa4     Tainted: P
Cause : 3080000c
        8071d9a0 000005dc 8071d93d 00000054 000210d2 05a82b6e 00000000 00000000
        00020000 c00505f1 8026dd80 8071db60 000210d2 00000000 00000000 801ca5e8
        00000000 8071a9f8 8007d41c 8071d8fc 8071d8fc 8071d8c0 00000010 8071d8b0
        00000001 00000000 00000000 00004040 8071d8d0 00000010 8071d8b8 00000001
Call Trace:[<801ca5e8>][<8007d41c>][<801c018c>][<801ba0d0>][<801bbf74>][<8020f0d4>][<8020f95c>][<8020f95c>][<c015b2f4>][<c0161e80>][<8008bfe0>][<8008dfac>][<801ca110>][<800431e8>][<800437a4>][<c0106840>][<80050000>][<c01549f0>][<c015f7b0>][<c015f7f0>][<c0161e80>][<80079f5c>][<8006f8d0>][<8008bfe0>][<8006b778>][<8006b1e0>][<8006b2c4>][<c015f694>][<800437a4>][<80279960>][<8005fa64>][<800ba4d4>][<8005e8d4>][<8005cd38>][<8005d578>][<800b6d24>][<800b6d1c>][<802276d8>][<8022656c>][<8006704c>][<80067044>][<800691d4>][<80072f64>][<80072e54>][<80069290>][<80073a00>][<8005e5f0>][<80046aa0>][<8005e5f0>][<8005cd04>][<8005e8d4>][<8005e0e4>][<8005cd38>][<8005d578>][<8007d0b0>][<8022656c>][<c00c8650>][<8007d108>][<8007d0e8>][<80045698>][<80045688>]
Code: 3c020010  34420100  8e110008 <ac830000> ae020000  3c020020  34420200  ac640004  16200011


>>???; c00c8aa4 <END_OF_CODE+3fe1d474/????>   <=====

Trace; 801ca5e8 <ip_local_deliver_finish+0/2c0>
Trace; 8007d41c <autoremove_wake_function+0/44>
Trace; 801c018c <udp_packet+f0/114>
Trace; 801ba0d0 <nf_conntrack_find_get+c8/dc>
Trace; 801bbf74 <nf_conntrack_in+4ac/6f8>
Trace; 8020f0d4 <ipt_do_table+50c/588>
Trace; 8020f95c <nf_nat_fn+20c/244>
Trace; 8020f95c <nf_nat_fn+20c/244>
Trace; c015b2f4 <END_OF_CODE+3feafcc4/????>
Trace; c0161e80 <END_OF_CODE+3feb6850/????>
Trace; 8008bfe0 <handle_IRQ_event+64/d4>
Trace; 8008dfac <handle_level_irq+c0/114>
Trace; 801ca110 <ip_rcv_finish+0/4d8>
Trace; 800431e8 <ar5315_irq_dispatch+26c/2a4>
Trace; 800437a4 <ret_from_irq+0/4>
Trace; c0106840 <END_OF_CODE+3fe5b210/????>
Trace; 80050000 <blast_icache64_page_indexed+0/e4>
Trace; c01549f0 <END_OF_CODE+3fea93c0/????>
Trace; c015f7b0 <END_OF_CODE+3feb4180/????>
Trace; c015f7f0 <END_OF_CODE+3feb41c0/????>
Trace; c0161e80 <END_OF_CODE+3feb6850/????>
Trace; 80079f5c <rcu_process_callbacks+1c/38>
Trace; 8006f8d0 <run_timer_softirq+20/1fc>
Trace; 8008bfe0 <handle_IRQ_event+64/d4>
Trace; 8006b778 <tasklet_action+118/198>
Trace; 8006b1e0 <__do_softirq+78/100>
Trace; 8006b2c4 <do_softirq+5c/94>
Trace; c015f694 <END_OF_CODE+3feb4064/????>
Trace; 800437a4 <ret_from_irq+0/4>
Trace; 80279960 <cpu_probe+584/994>
Trace; 8005fa64 <__wake_up_sync+3c/74>
Trace; 800ba4d4 <__fput+188/1cc>
Trace; 8005e8d4 <dequeue_entity+98/d8>
Trace; 8005cd38 <dequeue_task+1c/30>
Trace; 8005d578 <pick_next_task_fair+38/78>
Trace; 800b6d24 <filp_close+74/90>
Trace; 800b6d1c <filp_close+6c/90>
Trace; 802276d8 <cond_resched+44/5c>
Trace; 8022656c <schedule+1e0/7d4>
Trace; 8006704c <put_files_struct+188/208>
Trace; 80067044 <put_files_struct+180/208>
Trace; 800691d4 <do_exit+960/96c>
Trace; 80072f64 <dequeue_signal+13c/17c>
Trace; 80072e54 <dequeue_signal+2c/17c>
Trace; 80069290 <sys_exit_group+0/c>
Trace; 80073a00 <get_signal_to_deliver+444/498>
Trace; 8005e5f0 <enqueue_entity+2fc/33c>
Trace; 80046aa0 <do_notify_resume+64/3ec>
Trace; 8005e5f0 <enqueue_entity+2fc/33c>
Trace; 8005cd04 <enqueue_task+1c/34>
Trace; 8005e8d4 <dequeue_entity+98/d8>
Trace; 8005e0e4 <try_to_wake_up+84/d8>
Trace; 8005cd38 <dequeue_task+1c/30>
Trace; 8005d578 <pick_next_task_fair+38/78>
Trace; 8007d0b0 <kthread+0/b0>
Trace; 8022656c <schedule+1e0/7d4>
Trace; c00c8650 <END_OF_CODE+3fe1d020/????>
Trace; 8007d108 <kthread+58/b0>
Trace; 8007d0e8 <kthread+38/b0>
Trace; 80045698 <kernel_thread_helper+10/18>
Trace; 80045688 <kernel_thread_helper+0/18>



* Re: [B.A.T.M.A.N.] Batman gateway lock ups
  2008-09-08 21:18 ` Sven Eckelmann
  2008-09-08 21:45   ` Sven Eckelmann
@ 2008-09-09 11:26   ` Simon Wunderlich
  2008-09-09 22:45     ` Sven Eckelmann
  1 sibling, 1 reply; 9+ messages in thread
From: Simon Wunderlich @ 2008-09-09 11:26 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

[-- Attachment #1: Type: text/plain, Size: 2826 bytes --]

Hey Sven,

thanks for your analysis!!

On Mon, Sep 08, 2008 at 11:18:42PM +0200, Sven Eckelmann wrote:
> Ok, I got the /proc/modules file now. The current situation is as follows:
> it crashes inside the batgat module at position 0x00000aa4
> 
>     a60:	3c020000 	lui	v0,0x0
>      a64:	8c500024 	lw	s0,36(v0)
>      a68:	24420024 	addiu	v0,v0,36
>      a6c:	12020014 	beq	s0,v0,ac0 <cleanup_module+0x610>
>      a70:	3c040000 	lui	a0,0x0
>      a74:	3c050000 	lui	a1,0x0
>      a78:	3c020000 	lui	v0,0x0
>      a7c:	24840000 	addiu	a0,a0,0
>      a80:	24a50088 	addiu	a1,a1,136
>      a84:	24420000 	addiu	v0,v0,0
>      a88:	0040f809 	jalr	v0
>      a8c:	24060283 	li	a2,643
>      a90:	8e040004 	lw	a0,4(s0)
>      a94:	8e030000 	lw	v1,0(s0)
>      a98:	3c020010 	lui	v0,0x10
>      a9c:	34420100 	ori	v0,v0,0x100
>      aa0:	8e110008 	lw	s1,8(s0)
>      aa4:	ac830000 	sw	v1,0(a0)
>      aa8:	ae020000 	sw	v0,0(s0)
>      aac:	3c020020 	lui	v0,0x20
>      ab0:	34420200 	ori	v0,v0,0x200
>      ab4:	ac640004 	sw	a0,4(v1)
> 
> This is part of the compiled version of packet_recv_thread. Due to the 
> optimizations done I cannot say where exactly the problem lies.
> 
> I think the code of get_ip_addr() got inlined in packet_recv_thread and we 
> need to search for the crash inside of it at list_del(&entry->list);
> I would also say that the real crash is inside __list_del, where prev and 
> next are set. To check it, look at LIST_POISON1 and LIST_POISON2 inside 
> poison.h of the current linux kernel. You will notice that the values are 
> 0x00100100 and 0x00200200 == address of the failed paging request. The list 
> poisoning is done in list_del after calling __list_del (it is the sequence 
> lui, ori, sw in the asm snippet). So could it be that we have a poisoned 
> entry inside the list?
> This could for example happen when we get scheduled (please notice that the 
> optimizer reordered many instructions) while another part of the program is 
> deleting entries. I haven't checked the rest of the code to see if that 
> really could happen, but that is my current idea.

Mhm, as far as i looked into the issue, there are the following 
points where free_client_list is accessed:

init_module() - INIT_LIST_HEAD()
* called on startup

get_ip_addr() - list_del():
* "secured" with a hash_lock spinlock

cleanup_module() - list_del():
* only called when unloading the module

batgat_ioctl() - list_del()
* from IOCREMDEV. This is called when batman shuts down.

packet_recv_thread - list_add():
* also secured in a hash_lock spinlock.

So it seems there should be no concurrency without user interaction 
(module or batman shutdown).
But i don't have a good idea yet where the problem comes from  ... :/

best regards,
	Simon


* Re: [B.A.T.M.A.N.] Batman gateway lock ups
  2008-09-09 11:26   ` Simon Wunderlich
@ 2008-09-09 22:45     ` Sven Eckelmann
  2008-09-10  9:50       ` Sven Eckelmann
  0 siblings, 1 reply; 9+ messages in thread
From: Sven Eckelmann @ 2008-09-09 22:45 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

[-- Attachment #1: Type: text/plain, Size: 4523 bytes --]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tuesday 09 September 2008 13:26:47 Simon Wunderlich wrote:
> Hey Sven,
>
> thanks for your analysis!!
>
> [...]
>
> Mhm, as far as i looked into the issue, there are the following
> points where free_client_list is accessed:
> [...]
> So it seems there should be no concurrency without user interaction
> (module or batman shutdown).
> But i don't have a good idea yet where the problem comes from  ... :/
Yes, the idea of the race condition was stupid. So what is the real problem? 
Maybe a compiler bug? Let's check the assembler stuff:

The stuff around list_del in get_ip_addr:
     a90: lw	a0,4(s0) /* a0 gets our prev pointer */
     a94: lw	v1,0(s0) /* v1 gets our next pointer */
     a98: lui	v0,0x10  /* load 0x100100 in v0 */
     a9c: ori	v0,v0,0x100
     aa0: lw	s1,8(s0) /* load pointer to gw_client in s1 */
     aa4: sw	v1,0(a0) /* store our next pointer in the next pointer of prev
                          ****crash**** because 0x200200 or 0x0 was in our
                          prev pointer - why do we have a poisoned prev
                          pointer when we are probably the first entry of the
                          list -> is the list not correctly initialised or
                          aren't we added correctly? */
     aa8: sw	v0,0(s0) /* store poison in next pointer */
     aac: lui	v0,0x20  /* load 0x200200 in v0 */
     ab0: ori	v0,v0,0x200
     ab4: sw	a0,4(v1) /* store our prev pointer in the prev pointer of
                          next */


The initialisation of the list is done in init_module. prev and next should be 
set to the list address. So let's search for it:

    /* "zero" means here the position where our module was loaded to. */
    1374: lui	a1,0x0    /* set a1 to "zero" */
    1378: addiu	v1,a1,36 /* 36 is the position of the structure
                              free_client_list, so we set v1 to it */
    137c: lui	a0,0x0    /* set a0 to "zero" */
    1380: sw	v0,28(a0) /* store pointer to wp_hash */
    1384: sw	v1,36(a1) /* store free_client_list.next as free_client_list */
    1388: sw	v1,4(v1)  /* store free_client_list.prev as free_client_list */


So this looks good too. So when we have an entry, we must have added it 
somewhere. Let's take a look at packet_recv_thread again, where the list_add 
is:

    /* v0 and v1 hold a pointer to the newly allocated struct */
    /* a3 holds "zero" - like t0 */
    9a8: sw	s1,8(v0) /* store pointer to client_data in new data buffer */
    9ac: lw	v0,36(a3) /* v0 gets free_client_list.next -> let's call it
                           next_element */
    /* shouldn't there be another instruction between the load and the use
       of the register? - like a nop */
    9b0: sw	v0,0(v1) /* store next_element in tmp_entry.list.next */
    9b4: sw	v1,4(v0) /* store pointer to tmp_entry in next_element.prev */
    9b8: sw	v1,36(a3) /* store pointer to tmp_entry in
                           free_client_list.next */
    9bc: j	9d4 <cleanup_module+0x524>
    9c0: sw	t0,4(v1) /* saves free_client_list in tmp_entry.prev << this
                          sits in the branch delay slot of the jump; if it
                          were not executed we would get wrong data here,
                          but because we are using mips it must be
                          executed */
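
For comparison, the same four pointer updates can be written as plain C in 
the order the listing performs them. This is a userspace sketch, not the 
kernel's list.h; list_add_in_listing_order() and check_single_add() are 
names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};

/* Matches the init_module stores at 1384/1388: both links point at head. */
void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert new_entry after head, store by store as in the listing. */
void list_add_in_listing_order(struct list_head *new_entry,
			       struct list_head *head)
{
	struct list_head *next_element;

	assert(new_entry != NULL && head != NULL);
	next_element = head->next;        /* 9ac: lw v0,36(a3) */
	new_entry->next = next_element;   /* 9b0: sw v0,0(v1)  */
	next_element->prev = new_entry;   /* 9b4: sw v1,4(v0)  */
	head->next = new_entry;           /* 9b8: sw v1,36(a3) */
	new_entry->prev = head;           /* 9c0: sw t0,4(v1), delay slot */
}

/* Returns 1 if one add to an empty list leaves all four links consistent. */
int check_single_add(void)
{
	struct list_head head, entry;

	init_list_head(&head);
	list_add_in_listing_order(&entry, &head);
	return head.next == &entry && head.prev == &entry &&
	       entry.next == &head && entry.prev == &head;
}
```

Executed in this order on an empty list, every link ends up consistent, 
which agrees with the reading that the list_add sequence itself is fine.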

So I cannot see anything special - only these two instructions. Maybe we 
should create a version with debug output after list_add that prints the 
next and prev pointers of tmp_entry and free_client_list, and does the same 
for entry before calling list_del in get_ip_addr. This should be compiled 
for the nightwing and sent to Outback Dingo so he can test it and send us 
the kernel log after a crash.
The problem is that I added some printks, checked the resulting assembler 
output and noticed that they changed the generated code (so the interesting 
parts aren't there anymore).
So if it runs without problems and prints some "use free client from list" 
in the kernel log, then we have probably fixed it just by reordering some 
instructions. If not... we should try to find it by using the extra 
debugging output.
If it runs without problems, please remove the printks around the list_add, 
and if it still runs and prints some "use free client from list"... do it 
the other way around (keep the printks around list_add and remove the one 
before list_del).
...just my ideas to find the problem.
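
Besides printing the pointers, the debug build could also classify an entry 
before list_del touches it. The following is a userspace sketch of such a 
check, assuming the poison values 0x00100100 and 0x00200200 from poison.h; 
check_before_del(), enum list_state and demo_state() are hypothetical and 
not part of batgat or the attached patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};

#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

enum list_state {
	LIST_OK,            /* both neighbours point back at the entry */
	LIST_NULL_LINK,     /* next or prev is NULL (fault address 0)  */
	LIST_POISONED,      /* entry was already deleted once          */
	LIST_BROKEN_LINKS   /* neighbours do not point back at entry   */
};

/* Classify an entry without dereferencing NULL or poison pointers. */
enum list_state check_before_del(const struct list_head *entry)
{
	assert(entry != NULL);
	if (entry->next == NULL || entry->prev == NULL)
		return LIST_NULL_LINK;
	if (entry->next == LIST_POISON1 || entry->prev == LIST_POISON2)
		return LIST_POISONED;
	if (entry->prev->next != entry || entry->next->prev != entry)
		return LIST_BROKEN_LINKS;
	return LIST_OK;
}

/* Scenario 0: linked entry, 1: already deleted, 2: zeroed links. */
enum list_state demo_state(int scenario)
{
	struct list_head head, entry;

	head.next = &entry;
	head.prev = &entry;
	entry.next = &head;
	entry.prev = &head;
	if (scenario == 1) {
		entry.next = LIST_POISON1;
		entry.prev = LIST_POISON2;
	} else if (scenario == 2) {
		entry.next = NULL;
		entry.prev = NULL;
	}
	return check_before_del(&entry);
}
```

The two non-OK cases exercised by demo_state() correspond to the two fault 
addresses seen in the oopses, 00000000 and 00200200.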

Best regards
	Sven Eckelmann

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkjG/HsACgkQqQGwKVlMoDv6fACg90wX35fyHR13/Dh/nBvrKM4C
euwAn03zpb+HqWccdjcf7Z7SotWd+1s0
=pwTz
-----END PGP SIGNATURE-----

[-- Attachment #2: batgat_test.patch --]
[-- Type: text/x-patch, Size: 1204 bytes --]

--- a/batman/linux/modules/gateway.c
+++ b/batman/linux/modules/gateway.c
@@ -385,7 +385,11 @@ static int packet_recv_thread(void *data)
 						tmp_entry = kmalloc(sizeof(struct free_client_data), GFP_KERNEL);
 						if(tmp_entry != NULL) {
 							tmp_entry->gw_client = client_data;
+							printk("list_add_b; tmp_entry pointers (%p, %p)\n", tmp_entry->list.prev, tmp_entry->list.next);
+							printk("list_add_b; free_client_list pointers (%p, %p)\n", free_client_list.prev, free_client_list.next);
 							list_add(&tmp_entry->list,&free_client_list);
+							printk("list_add_a; tmp_entry pointers (%p, %p)\n", tmp_entry->list.prev, tmp_entry->list.next);
+							printk("list_add_a; free_client_list pointers (%p, %p)\n", free_client_list.prev, free_client_list.next);
 						} else
 							DBG("can't add free gw_client to free list");
 
@@ -642,6 +646,7 @@ static struct gw_client *get_ip_addr(struct sockaddr_in *client_addr)
 	list_for_each_entry_safe(entry, next, &free_client_list, list) {
 		DBG("use free client from list");
 		gw_client = entry->gw_client;
+		printk("free client; entry pointers (%p, %p)\n", entry->list.prev, entry->list.next);
 		list_del(&entry->list);
 		break;
 	}


* Re: [B.A.T.M.A.N.] Batman gateway lock ups
  2008-09-09 22:45     ` Sven Eckelmann
@ 2008-09-10  9:50       ` Sven Eckelmann
  0 siblings, 0 replies; 9+ messages in thread
From: Sven Eckelmann @ 2008-09-10  9:50 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

[-- Attachment #1: Type: text/plain, Size: 1289 bytes --]

On Wednesday 10 September 2008 00:45:07 Sven Eckelmann wrote:
> [...]
>     /* v0 and v1 hold a pointer to the newly allocated struct */
>     /* a3 holds "zero" - like t0 */
>     9a8: sw	s1,8(v0) /* store pointer to client_data in new data buffer */
>     9ac: lw	v0,36(a3) /* v0 gets free_client_list.next -> let's call it
>                            next_element */
>     /* shouldn't there be another instruction between the load and the use
>        of the register? - like a nop */
>     9b0: sw	v0,0(v1) /* store next_element in tmp_entry.list.next */
>     9b4: sw	v1,4(v0) /* store pointer to tmp_entry in next_element.prev */
>     9b8: sw	v1,36(a3) /* store pointer to tmp_entry in
>                            free_client_list.next */
>     9bc: j	9d4 <cleanup_module+0x524>
>     9c0: sw	t0,4(v1) /* saves free_client_list in tmp_entry.prev << this
>                           sits in the branch delay slot of the jump; if it
>                           were not executed we would get wrong data here,
>                           but because we are using mips it must be
>                           executed */
> [...]
Ok, I searched the official mips32 instruction set reference, and both 
questionable instructions are strictly defined, without any "unpredictable" 
or "undefined" remarks. So they should be working fine.
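
As a cross-check of the disassembly, the faulting word <ac830000> from the 
Code: line of the oops can be decoded by hand. The sketch below splits a 
MIPS I-type word into its fields; struct mips_itype and decode_itype() are 
names made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define MIPS_OP_LW 0x23u  /* load word  */
#define MIPS_OP_SW 0x2bu  /* store word */

/* Fields of a MIPS I-type instruction: op rt, imm(rs). */
struct mips_itype {
	unsigned opcode;  /* bits 31..26 */
	unsigned rs;      /* bits 25..21, base register   */
	unsigned rt;      /* bits 20..16, target register */
	int16_t imm;      /* bits 15..0, signed offset    */
};

/* Split a 32-bit instruction word into its I-type fields. */
struct mips_itype decode_itype(uint32_t word)
{
	struct mips_itype d;

	d.opcode = (word >> 26) & 0x3fu;
	d.rs = (word >> 21) & 0x1fu;
	d.rt = (word >> 16) & 0x1fu;
	d.imm = (int16_t)(uint16_t)(word & 0xffffu);
	assert(d.rs < 32 && d.rt < 32);
	return d;
}
```

For 0xac830000 this gives opcode 0x2b (sw), rs = 4 (a0), rt = 3 (v1) and 
offset 0, i.e. "sw v1,0(a0)", the instruction at aa4. For 0x8e040004 it 
gives opcode 0x23 (lw), rs = 16 (s0), rt = 4 (a0) and offset 4, i.e. 
"lw a0,4(s0)" at a90.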

Best regards
	Sven Eckelmann


* Re: [B.A.T.M.A.N.] Batman gateway lock ups
  2008-09-05 15:01 [B.A.T.M.A.N.] Batman gateway lock ups Outback Dingo
  2008-09-08  7:08 ` Sven Eckelmann
  2008-09-08 21:18 ` Sven Eckelmann
@ 2009-07-21 20:48 ` Simon Wunderlich
  2 siblings, 0 replies; 9+ messages in thread
From: Simon Wunderlich @ 2009-07-21 20:48 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

[-- Attachment #1: Type: text/plain, Size: 3067 bytes --]

Hello Dingo,

i know it is a long time ago, but can you please verify whether the problem
described in Ticket 121 [1] still affects you? Or is it already solved? The
entries are 10 months old and the issue is probably already fixed.

Thank you very much,
	Simon

[1] https://www.open-mesh.net/ticket/121

On Fri, Sep 05, 2008 at 10:01:24PM +0700, Outback Dingo wrote:
>  see pastebin
> 
> http://www.pastebin.ca/1194874
> 
> pertinent info
> dmesg | grep 'batgat loaded'
> batgat: [init_module:96] batgat loaded  rv1025
> uname -a
> Linux nightwing 2.6.23.16 #16 Tue Apr 22 20:00:17 ART 2008 mips unknown
> root@nightwing:~# batmand -v
> WARNING: You are using the unstable batman branch. If you are interested in
> *using* batman get the latest stable release !
> B.A.T.M.A.N. 0.3-beta (compatibility version 5)
> lsmod
> [...]

