public inbox for b.a.t.m.a.n@lists.open-mesh.org
* [B.A.T.M.A.N.] mad mad batman ...
From: Nicolás Echániz @ 2013-10-13 21:34 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

While I'm still in Europe I've observed that the network in Quintana has
started performing very poorly today. It was working perfectly fine
until yesterday.

The logs on every router have started showing entries like these:

Oct 13 18:09:43 frigorifico kern.warn kernel: [12018.150000] br-lan:
received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.040000] br-lan:
received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.040000] br-lan:
received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.550000] br-lan:
received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.550000] br-lan:
received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.570000] br-lan:
received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.580000] br-lan:
received packet on bat0 with own address as source address
Oct 13 18:09:46 frigorifico kern.warn kernel: [12021.040000] br-lan:
received packet on bat0 with own address as source address

As you can see there are many per second.
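
To put a number on "many per second", the entries above can be tallied by timestamp: of the eight shown, six carry the timestamp 18:09:45. What follows is only a minimal sketch in Python and was not part of the original report. It assumes the syslog layout shown above, where the timestamp starts the entry and the warning text may be wrapped onto the following line. The name warnings_per_second is made up for illustration.

```python
import re
from collections import Counter

# Syslog timestamp at the start of an entry, e.g. "Oct 13 18:09:45".
TIMESTAMP = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")
# Warning text exactly as it appears in the log excerpt above.
WARNING = "received packet on bat0 with own address as source address"


def warnings_per_second(log_text):
    """Count the bridge warnings per wall-clock second.

    Handles both unwrapped entries and entries wrapped over two lines
    (timestamp on the first line, warning text on the second).
    """
    counts = Counter()
    current = None  # timestamp of the entry we are currently inside
    for line in log_text.splitlines():
        match = TIMESTAMP.match(line)
        if match:
            current = match.group(1)
        if WARNING in line and current is not None:
            counts[current] += 1
    return counts
```

Calling warnings_per_second() on the saved log text of each router and comparing the busiest seconds would at least show whether all nodes see the same bursts at the same time.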


I've pasted a bit of batctl ll batman; batctl log here:

http://pastebin.com/HX1LBPK4


...it's only showing the "originator packet from myself" lines and one
line before. (the sample is less than 5 secs of logs)

Every node I checked is showing the same.


Last time this happened it was due to a router that had been affected by
a nearby lightning bolt. The switch went crazy.
It took a while to detect it and the network was 15 nodes big. Now it's
40 and we are quite far away :)


If anyone has an idea of how to better test where the problem
originates, I'll be glad to hear it. Also, if any batman developer wishes
to log in to the net to check first hand, just let me know.


Cheers!
Nico


PS: batman version is 2012.4

* Re: [B.A.T.M.A.N.] mad mad batman ...
From: Nicolás Echániz @ 2013-11-12 21:45 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking


Back in Quintana... this problem is still showing on every node. The
network is unstable, so it's difficult to debug. If anyone has a clue
as to where to look for the origin, I'll be glad to read your thoughts.

cheers,
Nico


* Re: [B.A.T.M.A.N.] mad mad batman ...
From: Antonio Quartulli @ 2013-11-12 21:54 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

Hi Nico,

I have no real clue, but is it possible that there is a loop somewhere? I
imagine you have already checked, but I can't come up with anything more
useful at the moment.

Cheers,

-- 
Antonio Quartulli

* Re: [B.A.T.M.A.N.] mad mad batman ...
From: Nicolás Echániz @ 2013-11-13  0:56 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

On 12/11/13 18:54, Antonio Quartulli wrote:
> Hi Nico,
> 
> I have no real clue, but is it possible that there is a loop
> somewhere? I imagine you have checked already..but I can't come
> with something more useful at the moment..

OK, I've spent the afternoon turning off semi-inaccessible nodes one
by one until I found the one causing the problem.

It's installed on a public lighting post, so it may take a while to
take it down for inspection.

I don't know if you guys remember that I brought a crazy node
(nicknamed Jocker) to the battlemesh, one that started misbehaving
after a lightning bolt hit nearby. The symptom was the same one I
observe now: every node in the net would start repeatedly showing the
message:
"received packet on bat0 with own address as source address".

I was in Europe when this second node started behaving like this, so
I still don't know much about the moment it started.

Do you think this matter could be addressed at batman level somehow?
In a 50 node network this is already quite difficult to diagnose. I
can't imagine how a bigger network, where no single person has remote
access to every node, would coordinate to isolate the problematic router...

If you are interested in looking at this first hand we can try to set
up an isolated test-bed with IPv6 connectivity for you to log in and
play around.

Am I the only one who has bumped into this (twice)?

cheers.

* Re: [B.A.T.M.A.N.] mad mad batman ...
From: Antonio Quartulli @ 2013-11-13  8:01 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

On Wed, Nov 13, 2013 at 09:04:05AM +0100, Bastian Bittorf wrote:
> * Nicolás Echániz <nicoechaniz@altermundi.net> [13.11.2013 08:59]:
> > Am I the only one who has bumped into this (twice)?
> 
> I have also seen a lot of these messages with an indoor mesh,
> so no lightning involved 8-) but with v2013.04 this is gone.
> (same network).

This message is the symptom of a loop. There can be gazillions of causes.

Cheers,

-- 
Antonio Quartulli

* Re: [B.A.T.M.A.N.] mad mad batman ...
From: Bastian Bittorf @ 2013-11-13  8:04 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

* Nicolás Echániz <nicoechaniz@altermundi.net> [13.11.2013 08:59]:
> Am I the only one who has bumped into this (twice)?

I have also seen a lot of these messages with an indoor mesh,
so no lightning involved 8-) but with v2013.04 this is gone.
(same network).

bye, bastian

* Re: [B.A.T.M.A.N.] mad mad batman ...
From: Nicolás Echániz @ 2013-11-26  3:56 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

On 13/11/13 05:01, Antonio Quartulli wrote:
> On Wed, Nov 13, 2013 at 09:04:05AM +0100, Bastian Bittorf wrote:
>> * Nicolás Echániz <nicoechaniz@altermundi.net> [13.11.2013
>> 08:59]:
>>> Am I the only one who has bumped into this (twice)?
>> 
>> I have also seen a lot of these messages with an indoor mesh, so
>> no lightning involved 8-) but with v2013.04 this is gone. (same
>> network).
> 
> this message is the symptom of a loop. The causes can be
> gazillions.

Well... it took about a week to finally find the node creating this
problem. As before, it's failing hardware that caused the issue.

When this happens, every node in the net repeatedly shows that
message. I believe this is not the same as just any "loop symptom"...
At least I've never seen something else cause this on every node.

I would really like to find out more about how this condition comes
about and how to diagnose and prevent it. The whole batman-adv cloud
dies when this happens and it's a pain in the ass to "debug".

All the failing routers are WR842ND. There are many more of the same
model working just fine.


I now have three routers which produce this symptom, so if anyone who
can understand the problem better is willing to test, I can set up a
dedicated mini-test-bed.


Cheers,
NicoEchániz

* Re: [B.A.T.M.A.N.] mad mad batman ...
From: Linus Lüssing @ 2013-11-26  4:48 UTC
  To: The list for a Better Approach To Mobile Ad-hoc Networking

Hi Nico,

On Tue, Nov 26, 2013 at 12:56:29AM -0300, Nicolás Echániz wrote:
> On 13/11/13 05:01, Antonio Quartulli wrote:
> > On Wed, Nov 13, 2013 at 09:04:05AM +0100, Bastian Bittorf wrote:
> >> * Nicolás Echániz <nicoechaniz@altermundi.net> [13.11.2013
> >> 08:59]:
> >>> Am I the only one who has bumped into this (twice)?
> >> 
> >> I have also seen a lot of these messages with an indoor mesh, so
> >> no lightning involved 8-) but with v2013.04 this is gone. (same
> >> network).
> > 
> > this message is the symptom of a loop. The causes can be
> > gazillions.
> 
> Well... it took about a week to finally find the node creating this
> problem. As before, it's failing hardware that caused the issue.

Interesting. Could you be more specific about the way in which the
hardware fails? Does it reboot frequently? Does it send broken OGM packets?

Could you compute a checksum of the flashed squashfs? Does it differ
from the one you've built?
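
A rough sketch of such a comparison in Python, offered only as one possible approach. It assumes the flash partition holding the squashfs has already been copied off the router into a file (the hypothetical dumped_path), that the image you built is available locally (built_path), and that the dump starts with the image but may be longer because of padding or data stored after it. For that reason only the first N bytes of the dump are hashed, where N is the size of the built image. Both function names are made up for illustration.

```python
import hashlib
import os


def sha256_prefix(path, length=None, chunk_size=65536):
    """Return the SHA-256 hex digest of the first `length` bytes of `path`.

    With length=None the whole file is hashed.
    """
    digest = hashlib.sha256()
    remaining = length
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            to_read = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = handle.read(to_read)
            if not chunk:  # end of file reached
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


def images_match(built_path, dumped_path):
    """Compare the built image with the same-sized prefix of the flash dump."""
    size = os.path.getsize(built_path)
    return sha256_prefix(built_path) == sha256_prefix(dumped_path, size)
```

If images_match() returns False for a failing router but True for a healthy one of the same model, that would point towards corrupted flash rather than a configuration difference.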

> 
> When this happens every node in the net is repeatedly showing that
> message. It is not the same with any "loop symptom" I believe... At
> least I've never seen this happen on every node being caused by
> something else.
> 
> I really would like to find out more about how this condition comes to
> happen and how to diagnose and prevent it. The whole batman-adv cloud
> dies when this happens and it's a pain in the ass to "debug".
> 
> All the failing routers are WR842ND. There are many more of the same
> model working just fine.

We are also using quite a lot of 842NDs, 841NDs and 3600NDs, as well
as some 741ND, 1043ND and 4300NDs. We've never had the issue of
one broken node taking down the whole network yet, not in Hamburg,
Kiel or Lübeck.

Would be interesting to figure out the differences between our
setups. Maybe I missed it so far, did you say you were using
bridge loop avoidance (we don't)? We are using batman-adv 2013.1.0
mostly with a few still on 2012.4.0 and some on 2013.4.0.

> 
> 
> I now have three routers which produce this symptom, so if anyone who
> can understand the problem better is willing to test, I can set up a
> dedicated mini-test-bed.
> 
> 
> Cheers,
> NicoEchániz
> 
> 
> 

Cheers, Linus
