* [B.A.T.M.A.N.] Batman gateway lock ups
From: Outback Dingo @ 2008-09-05 15:01 UTC
To: b.a.t.m.a.n
[-- Attachment #1: Type: text/plain, Size: 2297 bytes --]
see pastebin
http://www.pastebin.ca/1194874
pertinent info
dmesg | grep 'batgat loaded'
batgat: [init_module:96] batgat loaded rv1025
uname -a
Linux nightwing 2.6.23.16 #16 Tue Apr 22 20:00:17 ART 2008 mips unknown
root@nightwing:~# batmand -v
WARNING: You are using the unstable batman branch. If you are interested in
*using* batman get the latest stable release !
B.A.T.M.A.N. 0.3-beta (compatibility version 5)
lsmod
Module Size Used by Tainted: P
sch_htb 14048 2
ath_ahb 103616 0
wlan_xauth 480 0
wlan_wep 4000 0
wlan_tkip 9856 0
wlan_ccmp 5440 2
wlan_acl 1920 0
ath_rate_minstrel 8352 1
ath_hal 136832 3 ath_ahb,ath_rate_minstrel
wlan_scan_sta 8768 1
wlan_scan_ap 6656 0
wlan 152464 10 ath_ahb,wlan_xauth,wlan_wep,wlan_tkip,wlan_ccmp,wlan_acl,ath_rate_minstrel,wlan_scan_sta,wlan_scan_ap
batgat 10944 1
ipt_iprange 672 0
ipt_TOS 832 0
ipt_TTL 928 0
xt_MARK 960 3
ipt_ECN 1472 0
xt_CLASSIFY 640 0
ipt_ttl 704 0
ipt_tos 544 0
ipt_time 1568 0
xt_tcpmss 1088 0
xt_statistic 832 0
xt_mark 672 7
xt_mac 736 3
xt_length 736 0
ipt_ecn 1024 0
xt_DSCP 1056 0
xt_dscp 832 0
imq 2096 0
ipt_IMQ 672 2
xt_string 896 0
xt_layer7 9840 0
ipt_ipp2p 6784 0
ipt_LOG 4640 0
xt_CHAOS 1792 0
xt_DELUDE 2624 1
xt_TARPIT 2816 1
xt_quota 800 0
xt_portscan 2016 0
xt_pkttype 704 0
xt_physdev 1488 0
ipt_owner 800 0
iptable_raw 832 0
xt_NOTRACK 832 0
xt_CONNMARK 1088 0
ipt_recent 4992 0
xt_helper 992 0
xt_conntrack 1312 0
xt_connmark 832 0
xt_connbytes 1312 0
tun 6592 0
* Re: [B.A.T.M.A.N.] Batman gateway lock ups
From: Sven Eckelmann @ 2008-09-08 7:08 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking
[-- Attachment #1: Type: text/plain, Size: 1742 bytes --]
On Friday 05 September 2008 17:01:24 Outback Dingo wrote:
> [...]
> http://www.pastebin.ca/1194874
>
> pertinent info
> dmesg | grep 'batgat loaded'
> batgat: [init_module:96] batgat loaded rv1025
> uname -a
> Linux nightwing 2.6.23.16 #16 Tue Apr 22 20:00:17 ART 2008 mips unknown
> root@nightwing:~# batmand -v
> WARNING: You are using the unstable batman branch. If you are interested in
> *using* batman get the latest stable release !
> B.A.T.M.A.N. 0.3-beta (compatibility version 5)
> [..]
Thanks for your report. I looked at the call stack, but it is really hard to
guess the functions when you don't have the symbol table. Most of the
functions look kernel thread/scheduling related. Only one function,
<c00c8650>, is from a module. Can you please send the /proc/modules file so we
can find out which module it belongs to and check whether this could be batgat
related? Further investigation could be done by checking the symbols of
batgat.ko with nm, but this would not help very much because the module is
stripped. Maybe someone else has a good idea.
Now to something you said on the nightwing mailing list:
> i think the version of batgat being used has a bug as confirmed by someone in
> the batman irc room, note only gateways have this issue, clients do not
> crash at all.
If you mean me (Lazhur on irc) then I have to say that I am neither a developer
nor a spokesman of b.a.t.m.a.n., and I never confirmed that it is batgat
related - only that something crashed/locked up.
I cannot find any crash related bug fixes in the batgat trunk directory - so it
doesn't seem to be a known problem (if it is a batgat problem at all).
Best regards
Sven Eckelmann
* Re: [B.A.T.M.A.N.] Batman gateway lock ups
From: Sven Eckelmann @ 2008-09-08 21:18 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking
[-- Attachment #1: Type: text/plain, Size: 2135 bytes --]
Ok, I got the /proc/modules file now. The current situation is as follows: it
crashes inside the batgat module at position 0x00000aa4:
a60: 3c020000 lui v0,0x0
a64: 8c500024 lw s0,36(v0)
a68: 24420024 addiu v0,v0,36
a6c: 12020014 beq s0,v0,ac0 <cleanup_module+0x610>
a70: 3c040000 lui a0,0x0
a74: 3c050000 lui a1,0x0
a78: 3c020000 lui v0,0x0
a7c: 24840000 addiu a0,a0,0
a80: 24a50088 addiu a1,a1,136
a84: 24420000 addiu v0,v0,0
a88: 0040f809 jalr v0
a8c: 24060283 li a2,643
a90: 8e040004 lw a0,4(s0)
a94: 8e030000 lw v1,0(s0)
a98: 3c020010 lui v0,0x10
a9c: 34420100 ori v0,v0,0x100
aa0: 8e110008 lw s1,8(s0)
aa4: ac830000 sw v1,0(a0)
aa8: ae020000 sw v0,0(s0)
aac: 3c020020 lui v0,0x20
ab0: 34420200 ori v0,v0,0x200
ab4: ac640004 sw a0,4(v1)
This is part of the compiled version of packet_recv_thread. Due to the
optimizations done, I cannot say where exactly the problem lies.
I think the code of get_ip_addr() got inlined into packet_recv_thread and we
need to search for the crash inside of it at list_del(&entry->list);
I would also say that the real crash is inside __list_del, where prev and
next are set. To check it, look at LIST_POISON1 and LIST_POISON2 inside
poison.h of the current Linux kernel. You will notice that the values are
0x00100100 and 0x00200200 == address of the failed paging request. The list
poisoning is done in list_del after calling __list_del (it is the
sequence lui, ori, sw in the asm snippet). So could it be that we have a
poisoned entry inside the list?
This could for example happen when we get scheduled (please notice that the
optimizer reordered many instructions) while another part of the program is
deleting entries. I haven't checked the rest of the code to see whether that
really could happen, but that is my current idea.
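To make the poisoning argument concrete, here is a minimal user-space sketch
in C. It is not the kernel's list.h: the list helpers are re-created by hand,
the two poison values are the ones quoted above, and list_entry_is_poisoned()
and checked_list_del() are hypothetical names used only for this illustration.
It shows that an entry which has already been through list_del() carries both
poison values, so deleting it a second time would write through
entry->prev == 0x00200200.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space copy of the kernel's doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};

/* Poison values as quoted in the mail (include/linux/poison.h). */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

static inline void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry directly after head. */
static inline void list_add(struct list_head *entry, struct list_head *head)
{
	assert(head->next != NULL && head->prev != NULL); /* head must be initialised */
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Hypothetical helper: has this entry already been deleted once? */
static inline int list_entry_is_poisoned(const struct list_head *entry)
{
	return entry->next == LIST_POISON1 || entry->prev == LIST_POISON2;
}

/*
 * Same steps as list_del(), but refuses to touch a poisoned entry.
 * Returns 0 on success and -1 if the entry was already deleted.
 */
static inline int checked_list_del(struct list_head *entry)
{
	if (list_entry_is_poisoned(entry))
		return -1;
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next; /* the store that faults in the oops */
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
	return 0;
}
```

With one entry added to an initialised head, the first checked_list_del()
succeeds and leaves the entry poisoned; the second call returns -1 where the
unchecked kernel version would fault on the poisoned prev pointer.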
So for better readability the callstack:
- packet_recv_thread
- get_ip_addr from gateway.c:401
- list_del from gateway.c:645
- __list_del
Best regards
Sven Eckelmann
* Re: [B.A.T.M.A.N.] Batman gateway lock ups
From: Sven Eckelmann @ 2008-09-08 21:45 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking
[-- Attachment #1.1: Type: text/plain, Size: 502 bytes --]
On Monday 08 September 2008 23:18:42 Sven Eckelmann wrote:
> Ok, I got the /proc/modules file now. The current situation is as follows: it
> crashes inside the batgat module at position 0x00000aa4:
> [..]
I got the System.map just now, so we can convert the kernel oops into
something more readable. It doesn't help that much right now, but... just for
the sake of completeness.
END_OF_CODE+3fe1d020 is inside batgat.ko when we look up c00c8650 in
/proc/modules.
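The lookup itself is simple range arithmetic. As a rough sketch in C (the
struct and the function find_module() are made up for this illustration, and
only a few rows of the attached proc_modules file are copied in), an address
belongs to a module when it lies between the module's load address and load
address plus size:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of /proc/modules: name, size in bytes, load address. */
struct module_range {
	const char *name;
	uint32_t size;
	uint32_t base;
};

/* A few rows copied from the attached proc_modules file. */
static const struct module_range modules[] = {
	{ "wlan_scan_sta",  8768, 0xc00c1000 },
	{ "wlan_scan_ap",   6656, 0xc00c5000 },
	{ "batgat",        10944, 0xc00c8000 },
	{ "wlan_wep",       4000, 0xc00cf000 },
};

/*
 * Return the module whose [base, base + size) range contains addr and
 * store the offset into the module in *offset; return NULL if no listed
 * module matches (for example for a core kernel address).
 */
static inline const struct module_range *find_module(uint32_t addr,
						     uint32_t *offset)
{
	size_t i;

	assert(offset != NULL);
	for (i = 0; i < sizeof(modules) / sizeof(modules[0]); i++) {
		const struct module_range *m = &modules[i];

		if (addr >= m->base && addr - m->base < m->size) {
			*offset = addr - m->base;
			return m;
		}
	}
	return NULL;
}
```

With these rows, c00c8650 resolves to batgat at offset 0x650, and the epc of
the oops, c00c8aa4, resolves to batgat at offset 0xaa4, which is the position
quoted above. A core kernel address such as 8007d41c matches no module.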
Best regards
Sven Eckelmann
[-- Attachment #1.2: informative_ooops.txt --]
[-- Type: text/plain, Size: 1672 bytes --]
CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc == c00c8aa4, ra == c00c8a90
Cpu 0
$ 0 : 00000000 10009c00 00100100 802bb600
$ 4 : 00000000 00000001 00000000 00000000
$ 8 : 00000000 806e7a28 00000007 a106af00
$12 : 00000007 270e0000 000006d0 7d96d080
$16 : 80a23500 00000000 c00c9a28 00000064
$20 : c00d0000 00000000 000b0f4f 8071193d
$24 : 80711730 00008000
$28 : 80710000 80711890 00000000 c00c8a90
Hi : 00000140
Lo : 68fdd3c0
epc : c00c8aa4 Tainted: P
Cause : 3080000c
807119a0 000005dc 8071193d 000004c0 000210d2 05a82b1a 00000000 00000000
00020000 185352d0 2ac668cc 2ac668d4 000210d2 00000000 2ac668dc 2ac668e4
00000000 806e79f8 8007d41c 807118fc 807118fc 807118c0 00000010 807118b0
00000001 00000000 00000000 00004040 807118d0 00000010 807118b8 00000001
Call Trace:[<8007d41c>][<8005e5f0>][<8005cd04>][<8005e8d4>][<8005e0e4>][<8005cd38>][<8005d578>][<8007d0b0>][<8022656c>][<c00c8650>][<8007d108>][<8007d0e8>][<80045698>][<80045688>]
Code: 3c020010 34420100 8e110008 <ac830000> ae020000 3c020020 34420200 ac640004 16200011
Trace; 8007d41c <autoremove_wake_function+0/44>
Trace; 8005e5f0 <enqueue_entity+2fc/33c>
Trace; 8005cd04 <enqueue_task+1c/34>
Trace; 8005e8d4 <dequeue_entity+98/d8>
Trace; 8005e0e4 <try_to_wake_up+84/d8>
Trace; 8005cd38 <dequeue_task+1c/30>
Trace; 8005d578 <pick_next_task_fair+38/78>
Trace; 8007d0b0 <kthread+0/b0>
Trace; 8022656c <schedule+1e0/7d4>
Trace; c00c8650 <END_OF_CODE+3fe1d020/????>
Trace; 8007d108 <kthread+58/b0>
Trace; 8007d0e8 <kthread+38/b0>
Trace; 80045698 <kernel_thread_helper+10/18>
Trace; 80045688 <kernel_thread_helper+0/18>
[-- Attachment #1.3: proc_modules --]
[-- Type: text/plain, Size: 1954 bytes --]
sch_htb 14048 2 - Live 0xc00e5000
ath_ahb 103616 0 - Live 0xc0150000
wlan_xauth 480 0 - Live 0xc00dc000
wlan_wep 4000 0 - Live 0xc00cf000
wlan_tkip 9856 0 - Live 0xc00d8000
wlan_ccmp 5440 3 - Live 0xc00d5000
wlan_acl 1920 0 - Live 0xc00af000
ath_rate_minstrel 8352 1 - Live 0xc00d1000
ath_hal 136832 3 ath_ahb,ath_rate_minstrel, Live 0xc012d000 (P)
wlan_scan_sta 8768 1 - Live 0xc00c1000
wlan_scan_ap 6656 0 - Live 0xc00c5000
wlan 152464 10 ath_ahb,wlan_xauth,wlan_wep,wlan_tkip,wlan_ccmp,wlan_acl,ath_rate_minstrel,wlan_scan_sta,wlan_scan_ap,Live 0xc0106000
batgat 10944 1 - Live 0xc00c8000
ipt_iprange 672 0 - Live 0xc00bf000
ipt_TOS 832 0 - Live 0xc00bd000
ipt_TTL 928 0 - Live 0xc00bb000
xt_MARK 960 0 - Live 0xc00b9000
ipt_ECN 1472 0 - Live 0xc00b7000
xt_CLASSIFY 640 0 - Live 0xc00b5000
ipt_ttl 704 0 - Live 0xc00b3000
ipt_tos 544 0 - Live 0xc00b1000
ipt_time 1568 0 - Live 0xc009d000
xt_tcpmss 1088 0 - Live 0xc00ad000
xt_statistic 832 0 - Live 0xc00ab000
xt_mark 672 7 - Live 0xc00a9000
xt_mac 736 0 - Live 0xc00a7000
xt_length 736 0 - Live 0xc00a5000
ipt_ecn 1024 0 - Live 0xc00a3000
xt_DSCP 1056 0 - Live 0xc00a1000
xt_dscp 832 0 - Live 0xc009f000
imq 2096 0 - Live 0xc0097000
ipt_IMQ 672 2 - Live 0xc009b000
xt_string 896 0 - Live 0xc0099000
xt_layer7 9840 0 - Live 0xc008f000
ipt_ipp2p 6784 0 - Live 0xc0094000
ipt_LOG 4640 0 - Live 0xc0088000
xt_CHAOS 1792 0 - Live 0xc008d000
xt_DELUDE 2624 1 - Live 0xc008b000
xt_TARPIT 2816 1 - Live 0xc0084000
xt_quota 800 0 - Live 0xc0086000
xt_portscan 2016 0 - Live 0xc0066000
xt_pkttype 704 0 - Live 0xc0082000
xt_physdev 1488 0 - Live 0xc0080000
ipt_owner 800 0 - Live 0xc007e000
iptable_raw 832 0 - Live 0xc007c000
xt_NOTRACK 832 0 - Live 0xc0076000
xt_CONNMARK 1088 0 - Live 0xc0074000
ipt_recent 4992 0 - Live 0xc0079000
xt_helper 992 0 - Live 0xc0072000
xt_conntrack 1312 0 - Live 0xc0070000
xt_connmark 832 0 - Live 0xc006e000
xt_connbytes 1312 0 - Live 0xc0068000
tun 6592 0 - Live 0xc006b000
* Re: [B.A.T.M.A.N.] Batman gateway lock ups
From: Sven Eckelmann @ 2008-09-09 8:03 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking
[-- Attachment #1.1: Type: text/plain, Size: 351 bytes --]
On Monday 08 September 2008 23:45:55 Sven Eckelmann wrote:
> I got the System.map right now. So we can convert the kernel oops into
> something more readable. It doesn't help that much now, but... just for
> sake of completeness
Sorry, I forgot the second oops with the interesting address of the paging
failure.
Best regards
Sven Eckelmann
[-- Attachment #1.2: informative_ooops2.txt --]
[-- Type: text/plain, Size: 4199 bytes --]
CPU 0 Unable to handle kernel paging request at virtual address 00200200, epc == c00c8aa4, ra == c00c8a90
Cpu 0
$ 0 : 00000000 10009c00 00100100 00100100
$ 4 : 00200200 00000001 00000000 00000000
$ 8 : 00000000 8071aa28 0000000b 127a3980
$12 : 0000000b ebc20000 0000045d 67350e80
$16 : 80ac1600 00000000 c00c9a28 00000064
$20 : c00d0000 00000000 0006ab6e 8071d93d
$24 : 8071d730 00008000
$28 : 8071c000 8071d890 00000000 c00c8a90
Hi : 00000140
Lo : 68fdd3c0
epc : c00c8aa4 Tainted: P
Cause : 3080000c
8071d9a0 000005dc 8071d93d 00000054 000210d2 05a82b6e 00000000 00000000
00020000 c00505f1 8026dd80 8071db60 000210d2 00000000 00000000 801ca5e8
00000000 8071a9f8 8007d41c 8071d8fc 8071d8fc 8071d8c0 00000010 8071d8b0
00000001 00000000 00000000 00004040 8071d8d0 00000010 8071d8b8 00000001
Call Trace:[<801ca5e8>][<8007d41c>][<801c018c>][<801ba0d0>][<801bbf74>][<8020f0d4>][<8020f95c>][<8020f95c>][<c015b2f4>][<c0161e80>][<8008bfe0>][<8008dfac>][<801ca110>][<800431e8>][<800437a4>][<c0106840>][<80050000>][<c01549f0>][<c015f7b0>][<c015f7f0>][<c0161e80>][<80079f5c>][<8006f8d0>][<8008bfe0>][<8006b778>][<8006b1e0>][<8006b2c4>][<c015f694>][<800437a4>][<80279960>][<8005fa64>][<800ba4d4>][<8005e8d4>][<8005cd38>][<8005d578>][<800b6d24>][<800b6d1c>][<802276d8>][<8022656c>][<8006704c>][<80067044>][<800691d4>][<80072f64>][<80072e54>][<80069290>][<80073a00>][<8005e5f0>][<80046aa0>][<8005e5f0>][<8005cd04>][<8005e8d4>][<8005e0e4>][<8005cd38>][<8005d578>][<8007d0b0>][<8022656c>][<c00c8650>][<8007d108>][<8007d0e8>][<80045698>][<80045688>]
Code: 3c020010 34420100 8e110008 <ac830000> ae020000 3c020020 34420200 ac640004 16200011
>>???; c00c8aa4 <END_OF_CODE+3fe1d474/????> <=====
Trace; 801ca5e8 <ip_local_deliver_finish+0/2c0>
Trace; 8007d41c <autoremove_wake_function+0/44>
Trace; 801c018c <udp_packet+f0/114>
Trace; 801ba0d0 <nf_conntrack_find_get+c8/dc>
Trace; 801bbf74 <nf_conntrack_in+4ac/6f8>
Trace; 8020f0d4 <ipt_do_table+50c/588>
Trace; 8020f95c <nf_nat_fn+20c/244>
Trace; 8020f95c <nf_nat_fn+20c/244>
Trace; c015b2f4 <END_OF_CODE+3feafcc4/????>
Trace; c0161e80 <END_OF_CODE+3feb6850/????>
Trace; 8008bfe0 <handle_IRQ_event+64/d4>
Trace; 8008dfac <handle_level_irq+c0/114>
Trace; 801ca110 <ip_rcv_finish+0/4d8>
Trace; 800431e8 <ar5315_irq_dispatch+26c/2a4>
Trace; 800437a4 <ret_from_irq+0/4>
Trace; c0106840 <END_OF_CODE+3fe5b210/????>
Trace; 80050000 <blast_icache64_page_indexed+0/e4>
Trace; c01549f0 <END_OF_CODE+3fea93c0/????>
Trace; c015f7b0 <END_OF_CODE+3feb4180/????>
Trace; c015f7f0 <END_OF_CODE+3feb41c0/????>
Trace; c0161e80 <END_OF_CODE+3feb6850/????>
Trace; 80079f5c <rcu_process_callbacks+1c/38>
Trace; 8006f8d0 <run_timer_softirq+20/1fc>
Trace; 8008bfe0 <handle_IRQ_event+64/d4>
Trace; 8006b778 <tasklet_action+118/198>
Trace; 8006b1e0 <__do_softirq+78/100>
Trace; 8006b2c4 <do_softirq+5c/94>
Trace; c015f694 <END_OF_CODE+3feb4064/????>
Trace; 800437a4 <ret_from_irq+0/4>
Trace; 80279960 <cpu_probe+584/994>
Trace; 8005fa64 <__wake_up_sync+3c/74>
Trace; 800ba4d4 <__fput+188/1cc>
Trace; 8005e8d4 <dequeue_entity+98/d8>
Trace; 8005cd38 <dequeue_task+1c/30>
Trace; 8005d578 <pick_next_task_fair+38/78>
Trace; 800b6d24 <filp_close+74/90>
Trace; 800b6d1c <filp_close+6c/90>
Trace; 802276d8 <cond_resched+44/5c>
Trace; 8022656c <schedule+1e0/7d4>
Trace; 8006704c <put_files_struct+188/208>
Trace; 80067044 <put_files_struct+180/208>
Trace; 800691d4 <do_exit+960/96c>
Trace; 80072f64 <dequeue_signal+13c/17c>
Trace; 80072e54 <dequeue_signal+2c/17c>
Trace; 80069290 <sys_exit_group+0/c>
Trace; 80073a00 <get_signal_to_deliver+444/498>
Trace; 8005e5f0 <enqueue_entity+2fc/33c>
Trace; 80046aa0 <do_notify_resume+64/3ec>
Trace; 8005e5f0 <enqueue_entity+2fc/33c>
Trace; 8005cd04 <enqueue_task+1c/34>
Trace; 8005e8d4 <dequeue_entity+98/d8>
Trace; 8005e0e4 <try_to_wake_up+84/d8>
Trace; 8005cd38 <dequeue_task+1c/30>
Trace; 8005d578 <pick_next_task_fair+38/78>
Trace; 8007d0b0 <kthread+0/b0>
Trace; 8022656c <schedule+1e0/7d4>
Trace; c00c8650 <END_OF_CODE+3fe1d020/????>
Trace; 8007d108 <kthread+58/b0>
Trace; 8007d0e8 <kthread+38/b0>
Trace; 80045698 <kernel_thread_helper+10/18>
Trace; 80045688 <kernel_thread_helper+0/18>
* Re: [B.A.T.M.A.N.] Batman gateway lock ups
From: Simon Wunderlich @ 2008-09-09 11:26 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking
[-- Attachment #1: Type: text/plain, Size: 2826 bytes --]
Hey Sven,
thanks for your analysis!!
On Mon, Sep 08, 2008 at 11:18:42PM +0200, Sven Eckelmann wrote:
> Ok, I got the /proc/modules file now. The current situation is as follows: it
> crashes inside the batgat module at position 0x00000aa4:
>
> a60: 3c020000 lui v0,0x0
> a64: 8c500024 lw s0,36(v0)
> a68: 24420024 addiu v0,v0,36
> a6c: 12020014 beq s0,v0,ac0 <cleanup_module+0x610>
> a70: 3c040000 lui a0,0x0
> a74: 3c050000 lui a1,0x0
> a78: 3c020000 lui v0,0x0
> a7c: 24840000 addiu a0,a0,0
> a80: 24a50088 addiu a1,a1,136
> a84: 24420000 addiu v0,v0,0
> a88: 0040f809 jalr v0
> a8c: 24060283 li a2,643
> a90: 8e040004 lw a0,4(s0)
> a94: 8e030000 lw v1,0(s0)
> a98: 3c020010 lui v0,0x10
> a9c: 34420100 ori v0,v0,0x100
> aa0: 8e110008 lw s1,8(s0)
> aa4: ac830000 sw v1,0(a0)
> aa8: ae020000 sw v0,0(s0)
> aac: 3c020020 lui v0,0x20
> ab0: 34420200 ori v0,v0,0x200
> ab4: ac640004 sw a0,4(v1)
>
> This is part of the compiled version of packet_recv_thread. Due to the
> optimizations done, I cannot say where exactly the problem lies.
>
> I think the code of get_ip_addr() got inlined into packet_recv_thread and we
> need to search for the crash inside of it at list_del(&entry->list);
> I would also say that the real crash is inside __list_del, where prev and
> next are set. To check it, look at LIST_POISON1 and LIST_POISON2 inside
> poison.h of the current Linux kernel. You will notice that the values are
> 0x00100100 and 0x00200200 == address of the failed paging request. The list
> poisoning is done in list_del after calling __list_del (it is the
> sequence lui, ori, sw in the asm snippet). So could it be that we have a
> poisoned entry inside the list?
> This could for example happen when we get scheduled (please notice that the
> optimizer reordered many instructions) while another part of the program is
> deleting entries. I haven't checked the rest of the code to see whether that
> really could happen, but that is my current idea.
Mhm, as far as i looked into the issue, there are the following
points where free_client_list is accessed:
init_module() - INIT_LIST_HEAD()
* called on startup
get_ip_addr() - list_del():
* "secured" with a hash_lock spinlock
cleanup_module() - list_del():
* only called when unloading the module
batgat_ioctl() - list_del()
* from IOCREMDEV. This is called when batman shuts down.
packet_recv_thread - list_add():
* also secured in a hash_lock spinlock.
So it seems there should be no concurrency without user interaction
(module or batman shutdown).
But i don't have a good idea yet where the problem comes from ... :/
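As an illustration of the locking pattern listed above, here is a small
user-space sketch in C. It is only an analogy: a C11 atomic_flag stands in for
the kernel spinlock, the names hash_lock and free_client_list are borrowed
from the discussion, and free_list_put() and free_list_take() are hypothetical
helpers, not functions from gateway.c. The point it shows is that the list
stays consistent only if every add and every delete runs under the same lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Stand-in for the kernel spinlock named hash_lock in the discussion. */
static atomic_flag hash_lock = ATOMIC_FLAG_INIT;

/* Empty list: head points at itself in both directions. */
static struct list_head free_client_list = {
	&free_client_list, &free_client_list
};

static inline void hash_lock_acquire(void)
{
	/* Spin until the flag was clear before we set it. */
	while (atomic_flag_test_and_set_explicit(&hash_lock,
						 memory_order_acquire))
		;
}

static inline void hash_lock_release(void)
{
	atomic_flag_clear_explicit(&hash_lock, memory_order_release);
}

/* Add an entry at the front of the list while holding the lock. */
static inline void free_list_put(struct list_head *entry)
{
	assert(entry != NULL);
	hash_lock_acquire();
	entry->next = free_client_list.next;
	entry->prev = &free_client_list;
	free_client_list.next->prev = entry;
	free_client_list.next = entry;
	hash_lock_release();
}

/* Remove and return the first entry, or NULL if the list is empty. */
static inline struct list_head *free_list_take(void)
{
	struct list_head *entry = NULL;

	hash_lock_acquire();
	if (free_client_list.next != &free_client_list) {
		entry = free_client_list.next;
		entry->next->prev = entry->prev;
		entry->prev->next = entry->next;
		entry->next = NULL;
		entry->prev = NULL;
	}
	hash_lock_release();
	return entry;
}
```

Any path that touches the list without taking the lock (in the summary above
that would be the module unload and ioctl paths) falls outside this
protection, which is why those paths are singled out as needing user
interaction.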
best regards,
Simon
* Re: [B.A.T.M.A.N.] Batman gateway lock ups
From: Sven Eckelmann @ 2008-09-09 22:45 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking
[-- Attachment #1: Type: text/plain, Size: 4523 bytes --]
On Tuesday 09 September 2008 13:26:47 Simon Wunderlich wrote:
> Hey Sven,
>
> thanks for your analysis!!
>
> [...]
>
> Mhm, as far as i looked into the issue, there are the following
> points where free_client_list is accessed:
> [...]
> So it seems there should be no concurrency without user interaction
> (module or batman shutdown).
> But i don't have a good idea yet where the problem comes from ... :/
Yes, the idea of the race condition was stupid. So what is the real problem?
Maybe a compiler bug? Let's check the assembler stuff:
The stuff around list_del in get_ip_addr:
a90: lw a0,4(s0) /* a0 gets our prev pointer */
a94: lw v1,0(s0) /* v1 gets our next pointer */
a98: lui v0,0x10 /* load 0x100100 in v0 */
a9c: ori v0,v0,0x100
aa0: lw s1,8(s0) /* load pointer to gw_client in s1 */
aa4: sw v1,0(a0) /* store our next pointer in the next pointer of prev
****crash**** because 0x200200 or 0x0 was in our
prev pointer - why do we have a poisoned prev
pointer when we are probably the first entry of the
list -> is the list not correctly initialised or
aren't we added correctly? */
aa8: sw v0,0(s0) /* store poison in next pointer */
aac: lui v0,0x20 /* load 0x200200 in v0 */
ab0: ori v0,v0,0x200
ab4: sw a0,4(v1) /* store our prev pointer in the prev pointer of next
*/
The initialisation of the list is done in init_module. prev and next should be
set to the list address. So let's search for it:
/* "zero" means here the position where our module was loaded to. */
1374: lui a1,0x0 /* set a1 to "zero" */
1378: addiu v1,a1,36 /* 36 is the position of the structure
free_client_list, so we set v1 to it */
137c: lui a0,0x0 /* set a0 to "zero" */
1380: sw v0,28(a0) /* store pointer to wp_hash */
1384: sw v1,36(a1) /* store free_client_list.next as free_client_list */
1388: sw v1,4(v1) /* store free_client_list.prev as free_client_list */
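For comparison, the two stores at 1384 and 1388 correspond to what
INIT_LIST_HEAD does in C. The following is a hand-written user-space sketch,
not the kernel header; init_list_head() and list_is_empty() are illustrative
names. An initialised head points at itself in both directions, and a list is
empty exactly when head->next == head, which appears to be the comparison
"beq s0,v0" at a6c in the first listing.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Make head an empty list: both pointers refer back to head itself. */
static inline void init_list_head(struct list_head *head)
{
	assert(head != NULL);
	head->next = head; /* compare: sw v1,36(a1) */
	head->prev = head; /* compare: sw v1,4(v1)  */
}

/* An empty list is one whose head still points at itself. */
static inline int list_is_empty(const struct list_head *head)
{
	return head->next == head;
}
```

On a freshly initialised free_client_list the emptiness test is true, so the
loop around list_del would not be entered at all until something has been
added.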
So this looks good too. So when we have an entry, we must have added it
somewhere. Let's take a look at packet_recv_thread again, where the list_add is:
/* v0 and v1 hold a pointer to the newly allocated struct */
/* a3 holds "zero" - like t0 */
9a8: sw s1,8(v0) /* store pointer to client_data in new data buffer */
9ac: lw v0,36(a3) /* v0 gets free_client_list.next -> let's call it
next_element */
/* shouldn't there be another instruction between the load and the usage of
the register? - like a nop */
9b0: sw v0,0(v1) /* store next_element in tmp_entry.list.next */
9b4: sw v1,4(v0) /* store pointer to tmp_entry in next_element.prev */
9b8: sw v1,36(a3) /* store pointer to tmp_entry in free_client_list.next
*/
9bc: j 9d4 <cleanup_module+0x524>
9c0: sw t0,4(v1) /* saves free_client_list in tmp_entry.prev << this sits in
the delay slot of the jump; if it were not executed
we would get wrong data here, but because we are
using mips it must be executed */
So I cannot see anything special - only these two instructions. Maybe we
should create a version with debug output after list_add that prints the next
and prev pointers of tmp_entry and free_client_list, and the same for entry
before calling list_del in get_ip_addr. This should be compiled for the
nightwing and sent to Outback Dingo so he can test it and send us the kernel
log after a crash.
The problem is that I added some printks, checked the resulting assembler
output and noticed that this changed the generated code (so the interesting
parts aren't there anymore).
So if it runs without problems and prints some "use free client from list"
in the kernel log, then we have probably fixed it by reordering some
instructions. If not... we should try to find it by using the extra debugging
output.
If it runs without problems, please remove the printks around the list_add,
and if it still runs and prints some "use free client from list" now... do it
the other way around (keep the printks around list_add and remove the one
before list_del).
...just my ideas to find the problem.
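Besides printing the pointers, the same information can be turned into a
yes/no check. The following user-space sketch in C shows the idea; the helper
names list_entry_is_linked(), list_add_front() and list_del_checked() are
invented for this example and are not part of batgat or the kernel. An entry
is considered consistent when both of its neighbours point back at it. Kernels
built with CONFIG_DEBUG_LIST, where available, perform comparable checks
inside the list helpers themselves.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/*
 * 1 if both neighbours point back at entry, 0 otherwise. This only
 * catches NULL and mismatched pointers; a wild pointer would still
 * fault when it is dereferenced.
 */
static inline int list_entry_is_linked(const struct list_head *entry)
{
	return entry->next != NULL && entry->prev != NULL &&
	       entry->next->prev == entry && entry->prev->next == entry;
}

/* Same four stores as list_add(), with a sanity check on the head. */
static inline void list_add_front(struct list_head *entry,
				  struct list_head *head)
{
	assert(list_entry_is_linked(head));
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink entry only if it is consistent; 0 on success, -1 if refused. */
static inline int list_del_checked(struct list_head *entry)
{
	if (!list_entry_is_linked(entry))
		return -1;
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = NULL;
	entry->prev = NULL;
	return 0;
}
```

In a debug build, a refused deletion would be the place to log the entry and
its neighbours instead of letting the store fault.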
Best regards
Sven Eckelmann
[-- Attachment #2: batgat_test.patch --]
[-- Type: text/x-patch, Size: 1204 bytes --]
--- a/batman/linux/modules/gateway.c
+++ b/batman/linux/modules/gateway.c
@@ -385,7 +385,11 @@ static int packet_recv_thread(void *data)
 	tmp_entry = kmalloc(sizeof(struct free_client_data), GFP_KERNEL);
 	if(tmp_entry != NULL) {
 		tmp_entry->gw_client = client_data;
+		printk("list_add_b; tmp_entry pointers (%p, %p)\n", tmp_entry->list.prev, tmp_entry->list.next);
+		printk("list_add_b; free_client_list pointers (%p, %p)\n", free_client_list.prev, free_client_list.next);
 		list_add(&tmp_entry->list,&free_client_list);
+		printk("list_add_a; tmp_entry pointers (%p, %p)\n", tmp_entry->list.prev, tmp_entry->list.next);
+		printk("list_add_a; free_client_list pointers (%p, %p)\n", free_client_list.prev, free_client_list.next);
 	} else
 		DBG("can't add free gw_client to free list");
@@ -642,6 +646,7 @@ static struct gw_client *get_ip_addr(struct sockaddr_in *client_addr)
 	list_for_each_entry_safe(entry, next, &free_client_list, list) {
 		DBG("use free client from list");
 		gw_client = entry->gw_client;
+		printk("free client; entry pointers (%p, %p)\n", entry->list.prev, entry->list.next);
 		list_del(&entry->list);
 		break;
 	}
* Re: [B.A.T.M.A.N.] Batman gateway lock ups
From: Sven Eckelmann @ 2008-09-10 9:50 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking
[-- Attachment #1: Type: text/plain, Size: 1289 bytes --]
On Wednesday 10 September 2008 00:45:07 Sven Eckelmann wrote:
> [...]
> /* v0 and v1 hold a pointer to the newly allocated struct */
> /* a3 holds "zero" - like t0 */
> 9a8: sw s1,8(v0) /* store pointer to client_data in new data buffer */
> 9ac: lw v0,36(a3) /* v0 gets free_client_list.next -> let's call it
> next_element */
> /* shouldn't there be another instruction between the load and the usage of
> the register? - like a nop */
> 9b0: sw v0,0(v1) /* store next_element in tmp_entry.list.next */
> 9b4: sw v1,4(v0) /* store pointer to tmp_entry in next_element.prev */
> 9b8: sw v1,36(a3) /* store pointer to tmp_entry in free_client_list.next
> */
> 9bc: j 9d4 <cleanup_module+0x524>
> 9c0: sw t0,4(v1) /* saves free_client_list in tmp_entry.prev << this sits in
> the delay slot of the jump; if it were not executed
> we would get wrong data here, but because we are
> using mips it must be executed */
> [...]
Ok, I searched the official MIPS32 instruction set reference, and both
questionable instructions are strictly defined, without any "unpredictable"
or "undefined" remarks. So they should work fine.
Best regards
Sven Eckelmann
* Re: [B.A.T.M.A.N.] Batman gateway lock ups
From: Simon Wunderlich @ 2009-07-21 20:48 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking
[-- Attachment #1: Type: text/plain, Size: 3067 bytes --]
Hello Dingo,
I know this was a long time ago, but can you please verify whether the problem
described in Ticket 121 [1] still affects you? Or is it already solved? The
entries are 10 months old and the issue is probably already fixed.
Thank you very much,
Simon
[1] https://www.open-mesh.net/ticket/121
On Fri, Sep 05, 2008 at 10:01:24PM +0700, Outback Dingo wrote:
> see pastebin
>
> http://www.pastebin.ca/1194874
>
> pertinent info
> dmesg | grep 'batgat loaded'
> batgat: [init_module:96] batgat loaded rv1025
> uname -a
> Linux nightwing 2.6.23.16 #16 Tue Apr 22 20:00:17 ART 2008 mips unknown
> root@nightwing:~# batmand -v
> WARNING: You are using the unstable batman branch. If you are interested in
> *using* batman get the latest stable release !
> B.A.T.M.A.N. 0.3-beta (compatibility version 5)
> lsmod
> [...]
Thread overview: 9+ messages
2008-09-05 15:01 [B.A.T.M.A.N.] Batman gateway lock ups Outback Dingo
2008-09-08 7:08 ` Sven Eckelmann
2008-09-08 21:18 ` Sven Eckelmann
2008-09-08 21:45 ` Sven Eckelmann
2008-09-09 8:03 ` Sven Eckelmann
2008-09-09 11:26 ` Simon Wunderlich
2008-09-09 22:45 ` Sven Eckelmann
2008-09-10 9:50 ` Sven Eckelmann
2009-07-21 20:48 ` Simon Wunderlich