public inbox for b.a.t.m.a.n@lists.open-mesh.org
 help / color / mirror / Atom feed
From: Sven Eckelmann <sven@narfation.org>
To: "Marek Lindner" <marek.lindner@mailbox.org>,
	"Simon Wunderlich" <sw@simonwunderlich.de>,
	"Antonio Quartulli" <antonio@mandelbit.com>,
	"David S. Miller" <davem@davemloft.net>,
	"Eric Dumazet" <edumazet@google.com>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Paolo Abeni" <pabeni@redhat.com>,
	"Simon Horman" <horms@kernel.org>,
	b.a.t.m.a.n@lists.open-mesh.org,
	"Network Development" <netdev@vger.kernel.org>,
	"Linus Lüssing" <linus.luessing@c0d3.blue>,
	"Tetsuo Handa" <penguin-kernel@i-love.sakura.ne.jp>
Subject: Re: unregister_netdevice: waiting for batadv_slave_0 to become free. Usage count = 2
Date: Sat, 27 Sep 2025 19:21:51 +0200	[thread overview]
Message-ID: <2754825.KlZ2vcFHjT@sven-desktop> (raw)
In-Reply-To: <1731a084-79fb-4bc6-9e0b-9b17f3345c4b@I-love.SAKURA.ne.jp>

[-- Attachment #1: Type: text/plain, Size: 6928 bytes --]

On Tuesday, 23 September 2025 16:45:48 CEST Tetsuo Handa wrote:
> On 2025/09/22 23:09, Tetsuo Handa wrote:
> > I suspect that batadv_hard_if_event_meshif() has something to do upon
> > NETDEV_UNREGISTER event because batadv_hard_if_event_meshif() receives
> > NETDEV_POST_INIT / NETDEV_REGISTER / NETDEV_UNREGISTER / NETDEV_PRE_UNINIT
> > events when this reproducer is executed, but I don't know what to do...
> 
> With a change show bottom, the reproducer no longer triggers this problem.
> But is this change correct?

Thanks for the writeup. Unfortunately, I tried to read this multiple times but 
you don't make it easy to figure out what the problem is.

Please don't take the following paragraphs the wrong way - I was just not able 
to really figure out what you meant. I have therefore just added my thoughts 
on each paragraph.

At the end of this mail, I tried to give a shorter summary about the interface 
states and what I think is the actual problem.

> Commit 9e6b5648bbc4 ("batman-adv: Fix duplicated OGMs on NETDEV_UP")
> replaced batadv_iv_iface_activate() (which is called via iface.activate()
>  from batadv_hardif_activate_interface()) with batadv_iv_iface_enabled()
> (which is called via iface.enabled() from batadv_hardif_enable_interface()).
> But that commit missed that batadv_hardif_activate_interface() is called from
> both batadv_hardif_enable_interface() and batadv_hard_if_event().

Why did it miss it? This is the plan in the mentioned commit.

What is the relevant effect which is creating the problem here?


> Since batadv_iv_ogm_schedule_buff() updates if_status to BATADV_IF_ACTIVE
> only when if_status was BATADV_IF_TO_BE_ACTIVATED, we need to call
> batadv_iv_ogm_schedule_buff() from batadv_iv_ogm_schedule() from
> batadv_iv_iface_enabled() via iface.enabled() with
> if_status == BATADV_IF_TO_BE_ACTIVATED if we want iface.enabled() from
> batadv_hardif_enable_interface() to update if_status to BATADV_IF_ACTIVE.

This basically says: Because of A <- B <- C, we need to have C to B to get A. 
But not really what effect this would have.

> But when IFF_UP is not set upon creation, batadv_hardif_enable_interface()
> does not call batadv_hardif_activate_interface(), which means that
> if_status remains BATADV_IF_INACTIVE despite
> batadv_iv_ogm_schedule_buff() is called via iface.enabled().

It must stay inactive when it is down. But the periodic OGM scheduling is 
still needed to have the OGM queued for active interfaces.


> And when IFF_UP is set after creation, batadv_hard_if_event() calls
> batadv_hardif_activate_interface(). But despite "Interface activated: %s\n"
> message being printed, if_status remains BATADV_IF_TO_BE_ACTIVATED because
> iface.activate == NULL due to above-mentioned commit.

This is not necessarily true. It will simply be switched to BATADV_IF_ACTIVATE 
with the next OGM because batadv_iv_ogm_schedule is rescheduled all the time.

> Since we need to call iface.enabled() instead of iface.activate() so that
> batadv_iv_ogm_schedule_buff() will update if_status to BATADV_IF_ACTIVE,
> move iface.enabled() from batadv_hardif_enable_interface() to
> batadv_hardif_activate_interface().

Why do we need to switch to this state in the first place? I didn't find this 
explanation here.

If I would add your change, we would suddenly have multiple scheduled OGMs 
again, right? Because it basically an oddly written revert of 
commit 9e6b5648bbc4 ("batman-adv: Fix duplicated OGMs on NETDEV_UP").


If this would actually the problem, why would the BATADV_CMD_GET_NEIGHBORS 
request be required? I would have expected that following would show the same 
problem when the switch of the state is the problem:

    rmmod batman-adv || true
    modprobe batman-adv
 
    ip link add batadv_slave_0 type veth peer name veth0_to_batadv
    
    ip link add batadv0 type batadv
    batctl meshif batadv0 it 1000000
    ip link set down master batadv0 dev batadv_slave_0
    ip link set up address 00a:a:aa:aa:aa:aa dev batadv_slave_0
    ip link del dev batadv_slave_0


Let me rephrase what happens with the state of an interface. Just so we are on 
the same page:

* an interface detected by batman-adv is starting with the state 
  BATADV_IF_NOT_IN_USE
* an active interface has the state BATADV_IF_ACTIVE
* when an interface is added to a batman-adv meshif, it is first set to the 
  state BATADV_IF_TO_BE_ACTIVATED
  - this is done by batadv_hardif_activate_interface()
  - there are some dependencies to the UP state of the netdev - see below
* the transition from BATADV_IF_TO_BE_ACTIVATED to BATADV_IF_ACTIVE is 
  algorithm specific
  - IV: scheduling of the OGM buffer
  - V: "directly" after BATADV_IF_TO_BE_ACTIVATED is set


The transition from state BATADV_IF_NOT_IN_USE to BATADV_IF_TO_BE_ACTIVATED 
only happens when the interface is actually UP when .ndo_add_slave is called. 
Otherwise, batman-adv postpones the call to batadv_hardif_activate_interface() 
until the NETDEV_UP event is received.

As mentioned earlier the B.A.T.M.A.N. IV algorithm implementation is (in your 
reproducer) responsible for switching the interface state from 
BATADV_IF_TO_BE_ACTIVATED to BATADV_IF_ACTIVE. And it will only do 
this when the interface was actually in the state BATADV_IF_TO_BE_ACTIVATED. 
There is a comment above the statement which explains the details.

And exactly this "delay" makes it more likely that operations are started 
while an interface is in the transition state of BATADV_IF_TO_BE_ACTIVATED. 
Simply because the periodic OGM scheduling isn't triggered immediately - it 
continues with its normal periodicity.
 

The question would now be: what is the actual problem? For example, with 
following change, the problem also seems to be gone:

diff --git i/net/batman-adv/originator.c w/net/batman-adv/originator.c
index c84420cb..f82dce20 100644
--- i/net/batman-adv/originator.c
+++ w/net/batman-adv/originator.c
@@ -763,7 +763,7 @@ int batadv_hardif_neigh_dump(struct sk_buff *msg, struct netlink_callback *cb)
 	bat_priv = netdev_priv(mesh_iface);
 
 	primary_if = batadv_primary_if_get_selected(bat_priv);
-	if (!primary_if || primary_if->if_status != BATADV_IF_ACTIVE) {
+	if (!primary_if) {
 		ret = -ENOENT;
 		goto out_put_mesh_iface;
 	}


And now we are most likely on the right path... primary_if is holding a 
reference to an hardif and therefore also a reference to the netdev. And the 
error handling is only taking care of releasing the reference to the meshif 
but not the primary_if.

I will later send a fix for this with you and 
syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com as Reported-by. Is this 
okay for you?



As said, earlier - I am really thankful that you worked on this. So please 
don't interpret this as a harsh criticism. I just had really big problems to 
figure out what you wanted to say in these paragraphs. I usually prefer 
something which is easier to consume.

Regards,
	Sven

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

  parent reply	other threads:[~2025-09-27 17:22 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-22 14:09 unregister_netdevice: waiting for batadv_slave_0 to become free. Usage count = 2 Tetsuo Handa
2025-09-23 14:45 ` Tetsuo Handa
2025-09-23 15:01   ` Sven Eckelmann
2025-09-27 17:21   ` Sven Eckelmann [this message]
2025-09-28  1:06     ` Tetsuo Handa
2025-09-28  7:50       ` Sven Eckelmann

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2754825.KlZ2vcFHjT@sven-desktop \
    --to=sven@narfation.org \
    --cc=antonio@mandelbit.com \
    --cc=b.a.t.m.a.n@lists.open-mesh.org \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=kuba@kernel.org \
    --cc=linus.luessing@c0d3.blue \
    --cc=marek.lindner@mailbox.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=penguin-kernel@i-love.sakura.ne.jp \
    --cc=sw@simonwunderlich.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox