public inbox for b.a.t.m.a.n@lists.open-mesh.org
 help / color / mirror / Atom feed
From: Simon Wunderlich <sw@simonwunderlich.de>
To: srn@coolheads.com
Cc: b.a.t.m.a.n@lists.open-mesh.org
Subject: Re: can't find logged information
Date: Wed, 19 Feb 2020 10:50:35 +0100	[thread overview]
Message-ID: <2230258.gT3E0N9OL2@prime> (raw)
In-Reply-To: <a687012a-8857-22f3-0623-d82bad19e7bc@coolheads.com>

[-- Attachment #1: Type: text/plain, Size: 16961 bytes --]

Hi Steve,

On Tuesday, February 11, 2020 8:03:32 PM CET Steve Newcomb wrote:
> First a report about trace-cmd.  It's not working right, I think, in the
> bleeding-edge ("trunk") distribution of OpenWRT that I'm using.
> 
> Then, in case you are inclined to respond, below are some
> observations/questions about the loss-of-contact I'm experiencing in my
> meshes.
> 
> ---------------------------------------------
> 
>   I tried setting 'batctl ll bla'.  (I suspect something to do with
> bridge loop avoidance may be what's killing our meshes.)  After that,
> the cat /sys/kernel/debug/batman_adv/bat0/log method works pretty well,
> but the trace-cmd method produces little or nothing.
> 
> root@rpc152:~# trace-cmd stream -e batadv:batadv_dbg
> trace-cmd: Invalid argument
>    [cfg80211:cfg80211_tx_mlme_mgmt] function __le16_to_cpup not defined
> trace-cmd: Invalid argument
>    [cfg80211:cfg80211_rx_unprot_mlme_mgmt] function __le16_to_cpup not
> defined
>    [cfg80211:cfg80211_rx_mlme_mgmt] function __le16_to_cpup not defined
> Hit Ctrl^C to stop recording
> 
> [the above lines appear to be coming from stderr]
> 
> [... here I wait a long time, and there's no output, so eventually I
> grow impatient and I interrupt ...]
> 
> ^C^Croot@rpc152:~#
> 
> ----------------------------------------------
> 
> Sometimes after interrupting I get a few lines of bridge-loop-avoidance
> log entries; here is such a case:
> 
> @rpc152:/root# trace-cmd stream -e batadv:batadv_dbg
> trace-cmd: Invalid argument
>    [cfg80211:cfg80211_tx_mlme_mgmt] function __le16_to_cpup not defined
> trace-cmd: Invalid argument
>    [cfg80211:cfg80211_rx_unprot_mlme_mgmt] function __le16_to_cpup not
> defined
>    [cfg80211:cfg80211_rx_mlme_mgmt] function __le16_to_cpup not defined
> Hit Ctrl^C to stop recording
>    C-c C-c          <idle>-0     [000] 81408.634044:
> batadv_dbg:           batman_adv bat0 batadv_bla_process_claim():
> received a claim frame from another group. From: 0c:80:63:e1:59:40 on vi\
> d 1 ...(hw_src 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
>             <...>-4595  [000] 81409.423996: batadv_dbg:          
> batman_adv bat0 batadv_bla_purge_claims(): timed out.
>             <...>-4595  [000] 81409.424016: batadv_dbg:          
> batman_adv bat0 batadv_bla_purge_claims(): 5c:1d:d9:54:09:8f, vid 32769
>             <...>-4595  [000] 81409.424034: batadv_dbg:          
> batman_adv bat0 batadv_bla_send_claim(): UNCLAIM 5c:1d:d9:54:09:8f on vid 1
>             <...>-4595  [000] 81409.424053: batadv_dbg:          
> batman_adv bat0 batadv_handle_unclaim(): UNCLAIM 5c:1d:d9:54:09:8f on
> vid 1 (sent by 50:c7:bf:69:bb:17)...
>             <...>-4595  [000] 81409.424065: batadv_dbg:          
> batman_adv bat0 batadv_bla_del_claim(): 5c:1d:d9:54:09:8f, vid 1
>             <...>-4595  [000] 81409.424092: batadv_dbg:          
> batman_adv bat0 batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17
> on vid 1
>            <idle>-0     [000] 81418.874031: batadv_dbg:          
> batman_adv bat0 batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(h\
> w_src 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
>             <...>-4595  [000] 81419.663601: batadv_dbg:          
> batman_adv bat0 batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17
> on vid 1
> @rpc152:/root#
> 
> ------------------------------------------------
> 
> With the cat /sys/kernel/debug/batman_adv/bat0/log method of
> log-reading, there's generally oodles of output.  At first it comes all
> in a rush, and then it continues more slowly but still pretty steadily:
> 
> @rpc152:/root# cat /sys/kernel/debug/batman_adv/bat0/log
> [  81068650] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81069440] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81078890] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81079680] batadv_bla_purge_claims(): timed out.
> [  81079680] batadv_bla_purge_claims(): 64:c7:53:e6:f8:ae, vid 32769
> [  81079680] batadv_bla_send_claim(): UNCLAIM 64:c7:53:e6:f8:ae on vid 1
> [  81079680] batadv_handle_unclaim(): UNCLAIM 64:c7:53:e6:f8:ae on vid 1
> (sent by 50:c7:bf:69:bb:17)...
> [  81079680] batadv_bla_del_claim(): 64:c7:53:e6:f8:ae, vid 1
> [  81079680] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81079680] Send loopdetect frame for vid 32769
> [  81079680] batadv_bla_send_claim(): LOOPDETECT of ba:be:22:e0:3f:ec to
> ff:ff:ff:ff:ff:ff on vid 1
> [  81089130] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81089920] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81099370] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81100160] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81109610] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81110400] batadv_bla_purge_claims(): timed out.
> [  81110400] batadv_bla_purge_claims(): 5c:1d:d9:54:09:8f, vid 32769
> [  81110400] batadv_bla_send_claim(): UNCLAIM 5c:1d:d9:54:09:8f on vid 1
> [  81110400] batadv_handle_unclaim(): UNCLAIM 5c:1d:d9:54:09:8f on vid 1
> (sent by 50:c7:bf:69:bb:17)...
> [  81110400] batadv_bla_del_claim(): 5c:1d:d9:54:09:8f, vid 1
> [  81110400] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81119850] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81120640] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81130090] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81130880] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81140330] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81141120] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81141120] Send loopdetect frame for vid 32769
> [  81141120] batadv_bla_send_claim(): LOOPDETECT of ba:be:e5:06:19:6e to
> ff:ff:ff:ff:ff:ff on vid 1
> [  81150100] batadv_bla_rx(): Unclaimed MAC 5c:1d:d9:54:09:8f found.
> Claim it. Local: no
> [  81150100] batadv_bla_add_claim(): adding new entry 5c:1d:d9:54:09:8f,
> vid 1 to hash ...
> [  81150100] batadv_bla_send_claim(): CLAIM 5c:1d:d9:54:09:8f on vid 1
> [  81150570] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81151360] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81160810] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81161600] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81171050] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81171840] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81181290] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81182080] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81191530] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81192320] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81201770] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81202560] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81202560] Send loopdetect frame for vid 32769
> [  81202560] batadv_bla_send_claim(): LOOPDETECT of ba:be:bb:25:c7:93 to
> ff:ff:ff:ff:ff:ff on vid 1
> [  81212010] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81212800] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81222250] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81223040] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81232490] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81233280] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> [  81242730] batadv_bla_process_claim(): received a claim frame from
> another group. From: 0c:80:63:e1:59:40 on vid 1 ...(hw_src
> 43:05:43:05:00:00, hw_dst ff:43:05:02:c0:74)
> [  81243520] batadv_bla_send_claim(): ANNOUNCE of 50:c7:bf:69:bb:17 on vid 1
> 
> -------------------------------------------------------------------------
> 
> ...so if I were you I wouldn't hurry to remove
> /sys/kernel/debug/batman_adv/bat0/log.  The trace-cmd stuff doesn't
> appear to work quite correctly yet.


OK, thank you for reporting this!

> 
> -----------------------------------------------------------------------
> 
> Now my first question is: what does it mean to have a
> 
> "LOOPDETECT of ba:be:aa:c0:d6:bf to ff:ff:ff:ff:ff:ff"
> 
> ?

This means that your node has sent a LOOPDETECT packet. Those are sent out on 
the LAN to check if there is a loop somewhere which BLA couldn't resolve or 
where a misconfiguration happened. There are some scenarios which are not 
covered by BLA:

https://www.open-mesh.org/projects/batman-adv/wiki/Bridge-loop-avoidance-II#Limitations

> 
> -----------------------------------------------------------------------
> 
> When nodes become unreachable, they do so only partially.  Consider this
> weirdness I encountered two days ago:  Given nodes a, b, c, d, from the
> perspective of a, d has disappeared; in other words, "a# batctl ping d"
> doesn't work.  But I ssh'd from a to b, then from b to c, then from c to
> d, all successfully.  And "a# batctl ping d" still wasn't working, even
> though I was talking to d through that chain of ssh pipes.  Any ideas on
> what that might mean?  (When I reboot a -- the gateway -- everything
> always works again, usually for many hours, but never as long as a whole
> day.)

Hmm, that's strange indeed. Did you have good connection between all those 
devices? There is a certain "horizon", e.g. if you have many weak links in a 
daisy chain the the OGMs are dropped before they are reaching the end of the 
path.

Did you see node D in the originator table of node A?

> 
> Assuming some mischief is being done in bridge-loop-avoidance-land, and
> bearing in mind that both meshes are physically laid out as simple
> linear daisy-chains, is there a way to force each node to ignore all
> nodes other than its immediate lefthand and righthand nodes?  If that's
> possible, it might allow an interesting experiment.

We don't have an option to ignore nodes from a user perspective.

> 
> ----------------------------------------------------------------------
> 
> Do I have a problem because the two meshes, and everything connected to
> them, all share the same LAN?  I note "received a claim frame from
> another group" in the above log excerpt.  (I don't know what that means,
> but I'm guessing that the two meshes are getting each other's
> maintenance traffic.)  Should the two meshes be separate subnets?

It's possible and perfectly fine if you have two meshes connected to the same 
LAN like this:

https://www.open-mesh.org/projects/batman-adv/wiki/Bridge-loop-avoidance-Testcases#Two-meshes-connected-by-one-LAN

Just make sure that the meshes are properly disconnected and not rejoin from 
time to time (e.g. by having different SSIDs)

> 
> In a similar vein: Should each node be running its own subnet?
> 
> ----------------------------------------------------------------------
> 
> Should I try changing all nodes over to BATMAN_V, rebooting them all,
> and hoping they re-establish contact?  (It would be massively
> inconvenient to have to reset them all physically.)

No, BATMAN V will not magically fix this.

> 
> ----------------------------------------------------------------------
> 
> Should I try turning off bridge loop avoidance?

bridge loop avoidance should be on as soon as you have any two nodes connected 
to the same LAN and mesh at one time.
> 
> ---------------------------------------------------------------------
> 
> Should I try turning on ap_isolation?

I don't see why you need ap_isolation. This is for external stations which are 
connected to APs of the same mesh and shouldn't talk to each other.

> 
> ---------------------------------------------------------------------
> 
> Any other ideas?

I think we should work on your a - b - c - d chain and find out why a can't 
talk to d. That seems like the most obvious symptom.

Cheers,
      Simon


> 
> On 2/7/20 12:07 PM, Simon Wunderlich wrote:
> > On Friday, February 7, 2020 5:59:27 PM CET Steve Newcomb wrote:
> >> On 2/7/20 10:52 AM, Simon Wunderlich wrote:
> >>> Oops, you are right, we have actually removed that command in 2019.2.
> >>> You
> >>> can use one of the two following commands:
> >>> 
> >>> cat /sys/kernel/debug/batman_adv/bat0/log
> >> 
> >> The cat method works.  Thanks.  Wow, with all the debug info selected,
> >> it's a torrent of stuff.  (Which I expected.)
> > 
> > Awesome
> > 
> >>> (will be removed in the future when debugfs support is dropped
> >>> 
> >>>    trace-cmd stream -e batadv:batadv_dbg
> >> 
> >> Uh, *which* of the two above commands will be dropped?  Should I add
> >> trace-cmd to my menuconfig to retain access to the logs?  I haven't
> >> added it yet.
> > 
> > the /sys/kernel/debug/batman_adv/bat0/log file will be removed in the
> > future. So yes, add the trace-cmd
> > 
> >>   > By "works" you mean you get useful outputs where the timeout is not
> >> 
> >> increasing or similar? can you still "batctl ping" to one of your
> >> neighbors?
> >> 
> >> You are far ahead of where I am.  I meant only that I was getting what
> >> appears to be useful information, not that I have interpreted it.  It
> >> will take me a while to figure out how to read it.  I would be grateful
> >> for any hints on that, and I'm already very grateful for the help you
> >> have provided.
> >> 
> >> I haven't been using "batctl ping" to cue a reboot; I've been using
> >> "ping".  I'm not sure what difference it might make to use "batctl
> >> ping"; it seems to me that if I can't "ping" a node, after 3 or 4 trials
> >> over a period of 15-20 seconds, I'm no longer in touch with it in any
> >> usable way.  When that happens, as long as I don't reboot the gateway,
> >> everything just stays offline.  So it's better to reboot.
> >> 
> >> By the way, at least one user is using the mesh for landline telephone
> >> (Google Voice) service, so I'd really like to stabilize it if I can.
> > 
> > So in iw station dump and and batctl neighbor table, there is a "last
> > seen"
> > field or "inactive time". Usually those fields should be < 5 seconds or
> > so, if they rise then it means this station has been lost somehow.
> > 
> > The difference between batctl ping and a regular ping is that the batctl
> > ping is actually implemented INSIDE of batman-adv, so it will be used to
> > ping a certain mesh participant. Therefore batctl ping relies on less
> > "moving parts" such as correctly configured IPs, MAC address translation,
> > etc. If batctl ping works but regular ping doesn't then we know something
> > on the Ethernet transport part is off. The other way around (batctl ping
> > doesn't work but regular ping does) is pretty much impossible unless the
> > user does something wrong. If batctl ping to a neighbor doesn't work,
> > it's likelythe Wifi layer or something else is exhibiting problems or
> > batman-adv is not set up correctly.
> > 
> > Cheers,
> > 
> >         Simon


[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2020-02-19  9:50 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <24aa48bb-8f52-78be-28df-98ec2e2a054b@coolheads.com>
2020-02-07 15:18 ` can't find logged information Steve Newcomb
2020-02-07 15:52   ` Simon Wunderlich
2020-02-07 16:59     ` Steve Newcomb
2020-02-07 17:07       ` Simon Wunderlich
2020-02-11 19:03         ` Steve Newcomb
2020-02-19  9:50           ` Simon Wunderlich [this message]
2020-02-19 15:00             ` Steve Newcomb
2020-02-07 14:13 Steve Newcomb
2020-02-07 14:51 ` Simon Wunderlich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2230258.gT3E0N9OL2@prime \
    --to=sw@simonwunderlich.de \
    --cc=b.a.t.m.a.n@lists.open-mesh.org \
    --cc=srn@coolheads.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox