* [B.A.T.M.A.N.] mad mad batman ...
From: Nicolás Echániz @ 2013-10-13 21:34 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking

While I'm still in Europe I've observed that the network in Quintana
started performing very poorly today. It was working perfectly fine
until yesterday.

The logs on every router have started showing entries like these:

Oct 13 18:09:43 frigorifico kern.warn kernel: [12018.150000] br-lan: received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.040000] br-lan: received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.040000] br-lan: received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.550000] br-lan: received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.550000] br-lan: received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.570000] br-lan: received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.580000] br-lan: received packet on bat0 with own address as source address
Oct 13 18:09:46 frigorifico kern.warn kernel: [12021.040000] br-lan: received packet on bat0 with own address as source address

As you can see, there are many per second.

I've pasted a bit of the output of "batctl ll batman; batctl log" here:

http://pastebin.com/HX1LBPK4

...it's only showing the "originator packet from myself" lines and one
line before. (The sample is less than 5 seconds of logs.)

Every node I checked is showing the same.

Last time this happened it was due to a router that had been affected by
a nearby lightning bolt. The switch went crazy.
It took a while to detect it, and the network was 15 nodes big. Now it's
40 and we are quite far away :)

If anyone has an idea of how to better test where the problem
originates, I'll be glad to hear it. Also, if any batman devel wishes to
log in to the net to check first hand, just let me know.

Cheers!
Nico

PS: batman version is 2012.4
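To put a number on "many per second", the log lines quoted above can be tallied per host and per second. The following is only a minimal sketch: it assumes the syslog layout shown in this message, and the function name `warnings_per_second` and the `sample` list are illustrative, not part of batctl or batman-adv.

```python
import re
from collections import Counter

# The bridge warning exactly as it appears in the logs quoted above.
WARNING = "received packet on bat0 with own address as source address"
# Assumed syslog prefix: "Oct 13 18:09:45 frigorifico kern.warn kernel: ..."
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s")


def warnings_per_second(lines):
    """Count own-address warnings per (host, timestamp) pair."""
    counts = Counter()
    for line in lines:
        if WARNING not in line:
            continue
        match = LINE_RE.match(line)
        if match:
            timestamp, host = match.groups()
            counts[(host, timestamp)] += 1
    return counts


# The eight entries from the message, rebuilt from (time, kernel uptime) pairs.
entries = [
    ("18:09:43", "12018.150000"), ("18:09:45", "12020.040000"),
    ("18:09:45", "12020.040000"), ("18:09:45", "12020.550000"),
    ("18:09:45", "12020.550000"), ("18:09:45", "12020.570000"),
    ("18:09:45", "12020.580000"), ("18:09:46", "12021.040000"),
]
sample = [
    f"Oct 13 {time} frigorifico kern.warn kernel: [{uptime}] br-lan: {WARNING}"
    for time, uptime in entries
]

if __name__ == "__main__":
    for (host, timestamp), count in sorted(warnings_per_second(sample).items()):
        print(host, timestamp, count)
```

Run against the logs of several routers, a tally like this could show whether the warning rate is similar everywhere or clearly higher near one part of the mesh, which might narrow down where to look first.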
* Re: [B.A.T.M.A.N.] mad mad batman ...
From: Nicolás Echániz @ 2013-11-12 21:45 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking

Back in Quintana... this problem is still showing on every node. The
network is unstable, so it's difficult to debug. If anyone has a
clue as to where to look for the origin, I'll be glad to read your
thoughts.

cheers,
Nico
* Re: [B.A.T.M.A.N.] mad mad batman ...
From: Antonio Quartulli @ 2013-11-12 21:54 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking

Hi Nico,

I have no real clue, but is it possible that there is a loop somewhere?
I imagine you have checked already, but I can't come up with anything
more useful at the moment.

Cheers,

--
Antonio Quartulli
* Re: [B.A.T.M.A.N.] mad mad batman ...
From: Nicolás Echániz @ 2013-11-13 0:56 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking

On 12/11/13 18:54, Antonio Quartulli wrote:
> Hi Nico,
>
> I have no real clue, but is it possible that there is a loop
> somewhere? I imagine you have checked already..but I can't come
> with something more useful at the moment..

Ok, I've spent the afternoon turning off semi-inaccessible nodes one by
one until I found the one causing the problem. It's installed on a
public lighting post, so it may take a while to take it down for
inspection.

I don't know if you guys remember that I brought a crazy node
(nicknamed Jocker) to the battlemesh, which had started misbehaving
after a lightning bolt hit nearby. The symptom was the same one I
observe now: every node in the net would start repeatedly showing the
message "received packet on bat0 with own address as source address".

I was in Europe when this second node started behaving like this, so I
still don't know much about the moment it started.

Do you think this matter could be addressed at the batman level somehow?
In a 50 node network this is already quite difficult to diagnose. I
can't imagine how a bigger network, where no single person has remote
access to every node, would coordinate to isolate the problematic
router...

If you are interested in looking at this first hand, we can try to set
up an isolated test-bed with IPv6 connectivity for you to log in and
play around.

Am I the only one who has bumped into this (twice)?

cheers.
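The message above describes switching nodes off one by one. As an illustration only, and not what was actually done in Quintana, the same search could be run as a bisection, which needs far fewer rounds on a large network. This sketch assumes that exactly one node is at fault, that groups of nodes can be switched off together without the mesh partitioning in a way that hides the symptom, and that the symptom check is reliable. The names `isolate_faulty_node` and `symptom_persists_without` are hypothetical.

```python
def isolate_faulty_node(nodes, symptom_persists_without):
    """Narrow a list of nodes down to the single one causing the symptom.

    symptom_persists_without(group) is a hypothetical callback: it switches
    off every node in `group`, checks the remaining nodes for the warning,
    switches the group back on, and returns True if the warning was still
    being logged while the group was off.
    """
    candidates = list(nodes)
    if not candidates:
        raise ValueError("need at least one candidate node")
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if symptom_persists_without(half):
            # The culprit was still running, so it is in the other half.
            candidates = candidates[len(half):]
        else:
            # The symptom stopped, so the culprit is in the half we disabled.
            candidates = half
    return candidates[0]


if __name__ == "__main__":
    # Simulated 50 node network with one made-up faulty node.
    network = [f"node{i:02d}" for i in range(50)]
    faulty = "node37"
    rounds = []

    def probe(disabled):
        rounds.append(len(disabled))
        return faulty not in disabled

    print(isolate_faulty_node(network, probe), "found in", len(rounds), "rounds")
```

In a real mesh the halves would have to be chosen with the topology in mind, since switching off a relay can cut off nodes behind it and make the check ambiguous.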
* Re: [B.A.T.M.A.N.] mad mad batman ...
From: Bastian Bittorf @ 2013-11-13 8:04 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking

* Nicolás Echániz <nicoechaniz@altermundi.net> [13.11.2013 08:59]:
> Am I the only one who has bumped into this (twice)?

I have also seen a lot of these messages with an indoor mesh,
so no lightning involved 8-) but with v2013.04 this is gone.
(same network).

bye, bastian
* Re: [B.A.T.M.A.N.] mad mad batman ...
From: Antonio Quartulli @ 2013-11-13 8:01 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking

On Wed, Nov 13, 2013 at 09:04:05AM +0100, Bastian Bittorf wrote:
> * Nicolás Echániz <nicoechaniz@altermundi.net> [13.11.2013 08:59]:
> > Am I the only one who has bumped into this (twice)?
>
> I have also seen a lot of these messages with an indoor mesh,
> so no lightning involved 8-) but with v2013.04 this is gone.
> (same network).

This message is the symptom of a loop. The causes can be gazillions.

Cheers,

--
Antonio Quartulli
* Re: [B.A.T.M.A.N.] mad mad batman ...
From: Nicolás Echániz @ 2013-11-26 3:56 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking

On 13/11/13 05:01, Antonio Quartulli wrote:
> On Wed, Nov 13, 2013 at 09:04:05AM +0100, Bastian Bittorf wrote:
>> * Nicolás Echániz <nicoechaniz@altermundi.net> [13.11.2013
>> 08:59]:
>>> Am I the only one who has bumped into this (twice)?
>>
>> I have also seen a lot of these messages with an indoor mesh, so
>> no lightning involved 8-) but with v2013.04 this is gone. (same
>> network).
>
> this message is the symptom of a loop. The causes can be
> gazillions.

Well... it took about a week to finally find the node creating this
problem. As before, it was failing hardware that caused the issue.

When this happens, every node in the net repeatedly shows that
message. I believe it is not the same as just any "loop symptom"... At
least I've never seen this happen on every node when it was caused by
something else.

I really would like to find out more about how this condition comes
about and how to diagnose and prevent it. The whole batman-adv cloud
dies when this happens and it's a pain in the ass to "debug".

All the failing routers are WR842ND. There are many more of the same
model working just fine.

I now have three routers which produce this symptom, so if anyone who
can understand the problem better is willing to test, I can set up a
dedicated mini-test-bed.

Cheers,
NicoEchániz
* Re: [B.A.T.M.A.N.] mad mad batman ...
From: Linus Lüssing @ 2013-11-26 4:48 UTC
To: The list for a Better Approach To Mobile Ad-hoc Networking

Hi Nico,

On Tue, Nov 26, 2013 at 12:56:29AM -0300, Nicolás Echániz wrote:
> Well... it took about a week to finally find the node creating this
> problem. As before, it's failing hardware that caused the issue.

Interesting. Could you be more specific about how the hardware
fails? Does it reboot frequently? Does it send broken OGM packets?
Could you make a checksum of the flashed squashfs? Does it differ
from the one you built?

> When this happens every node in the net is repeatedly showing that
> message. It is not the same with any "loop symptom" I believe... At
> least I've never seen this happen on every node being caused by
> something else.
>
> I really would like to find out more about how this condition comes to
> happen and how to diagnose and prevent it. The whole batman-adv cloud
> dies when this happens and it's a pain in the ass to "debug".
>
> All the failing routers are WR842ND. There are many more of the same
> model working just fine.

We are also using quite a lot of 842NDs, 841NDs and 3600NDs, as
well as some 741NDs, 1043NDs and 4300NDs. We've never had the issue
of one broken node taking down the whole network yet, not in
Hamburg, Kiel or Lübeck.

It would be interesting to figure out the differences between our
setups. Maybe I missed it so far: did you say you were using bridge
loop avoidance (we don't)? We are using batman-adv 2013.1.0 mostly,
with a few still on 2012.4.0 and some on 2013.4.0.

Cheers, Linus
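The checksum comparison suggested above could look roughly like the following. This is a sketch under assumptions: the two file names in the usage example are made up, and how the squashfs is read back from a router's flash is not covered in the thread. A dump of a whole flash partition may also contain padding after the end of the image, so a mismatch on a raw dump would not by itself prove corruption.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def images_match(built_path, dumped_path):
    """True if the image that was built and the one read back are identical."""
    return sha256_of(built_path) == sha256_of(dumped_path)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names):
    #   python compare_images.py built-root.squashfs dumped-root.squashfs
    built, dumped = sys.argv[1], sys.argv[2]
    print("match" if images_match(built, dumped) else "DIFFERENT")
```

Doing the same comparison on one of the routers that works fine would give a baseline for what a healthy result looks like.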
Thread overview: 8+ messages
2013-10-13 21:34 [B.A.T.M.A.N.] mad mad batman Nicolás Echániz
2013-11-12 21:45 ` Nicolás Echániz
2013-11-12 21:54   ` Antonio Quartulli
2013-11-13  0:56     ` Nicolás Echániz
2013-11-13  8:04       ` Bastian Bittorf
2013-11-13  8:01         ` Antonio Quartulli
2013-11-26  3:56           ` Nicolás Echániz
2013-11-26  4:48             ` Linus Lüssing