xen-devel.lists.xenproject.org archive mirror
* race condition when re-connecting vif after backend died
@ 2025-10-08 11:22 Marek Marczykowski-Górecki
  2025-10-08 12:32 ` Jürgen Groß
  0 siblings, 1 reply; 4+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-10-08 11:22 UTC
  To: xen-devel; +Cc: Juergen Gross


Hi,

I have the following scenario:
1. Start backend domain (call it netvm1)
2. Start frontend domain (call it vm1), with
vif=['backend=netvm1,mac=00:16:3e:5e:6c:00,script=vif-route-qubes,ip=10.138.17.244']
3. Pause vm1 (not strictly required, but makes reproducing much easier)
4. Crash/shutdown/destroy netvm1, then start another backend domain (call it
   netvm2)
5. In quick succession:
   5.1. unpause vm1
   5.2. detach (or actually cleanup) vif from vm1 (connected to now dead
        netvm1)
   5.3. attach similar vif with backend=netvm2
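
The sequence above can be written down as a list of commands. This is only a
sketch: it assumes the plain `xl` toolstack CLI is used, while the actual
reproduction may go through other tooling, and whether `xl network-detach`
takes the same cleanup path when the backend is already dead is an assumption.
The config file name `netvm2.cfg` and device id `0` are hypothetical. Nothing
is executed; the function only returns argv lists.

```python
def repro_commands(frontend="vm1", old_backend="netvm1", new_backend="netvm2",
                   new_backend_cfg="netvm2.cfg", devid=0):
    """Return the xl invocations for the scenario as argv lists (not executed).

    new_backend_cfg and devid are hypothetical placeholders.
    """
    # Same vif parameters as the original one, but pointing at the new backend.
    vif = ["backend=" + new_backend, "mac=00:16:3e:5e:6c:00",
           "script=vif-route-qubes", "ip=10.138.17.244"]
    return [
        ["xl", "pause", frontend],             # makes the race easier to hit
        ["xl", "destroy", old_backend],        # old backend dies
        ["xl", "create", new_backend_cfg],     # new backend comes up
        ["xl", "unpause", frontend],           # 5.1
        ["xl", "network-detach", frontend, str(devid)],   # 5.2
        ["xl", "network-attach", frontend] + vif,          # 5.3
    ]
```

Each list could be passed to `subprocess.run()`; the last three commands need
to run back to back, without manual delays, for the race to show up.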

Sometimes it ends up with eth0 being present in vm1, but its xenstore
state key is still XenbusStateInitializing. And the backend state is at
XenbusStateInitWait.
In step 5.2, libxl normally waits for the backend to transition to
XenbusStateClosed, and IIUC the backend in turn waits for the frontend to do
the same. But when the backend is gone, libxl seems to simply remove the
frontend xenstore entries without any coordination with the frontend
domain itself.
What I suspect happens is that the xenstore events generated at 5.2 are
handled by the frontend's kernel only after 5.3. At that point, the
frontend sees a device that was in XenbusStateConnected transitioning to
XenbusStateInitializing (the frontend doesn't really expect somebody else
to change its state key) and (I guess) doesn't notice that the device
vanished for a moment (xenbus_dev_changed() doesn't hit the !exists path).
I haven't verified it, but I guess it also doesn't notice the backend path
change, so it's still watching the old one (gone at this point).
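
The suspected ordering can be illustrated with a toy model. This is a sketch
under stated assumptions, not kernel code: xenstore is a plain dict, a watch
event carries only a path (the handler looks at the store when it runs, not at
the moment the event was queued), and the backend paths with domids 1, 2 and 3
are made up. `ToyXenstore`, `ToyFrontend` and `run()` are hypothetical names.

```python
from collections import deque


class ToyXenstore:
    """Dict-backed stand-in for xenstore; queued watch events hold only a path."""

    def __init__(self):
        self.nodes = {}
        self.events = deque()

    def write(self, path, value):
        self.nodes[path] = value
        self.events.append(path)

    def remove(self, prefix):
        # Delete the whole subtree and queue a single event for it.
        for path in [p for p in self.nodes if p.startswith(prefix)]:
            del self.nodes[path]
        self.events.append(prefix)


class ToyFrontend:
    def __init__(self, store, dev):
        self.store = store
        self.dev = dev
        self.backend = store.nodes[dev + "/backend"]  # cached at probe time
        self.removals_seen = 0

    def handle_pending_events(self):
        while self.store.events:
            path = self.store.events.popleft()
            if not path.startswith(self.dev):
                continue
            # Existence is checked when the event is handled, not when queued.
            exists = any(p.startswith(self.dev) for p in self.store.nodes)
            if not exists:
                self.removals_seen += 1


def run(handle_between_detach_and_attach):
    xs = ToyXenstore()
    dev = "device/vif/0"
    xs.write(dev + "/backend", "/local/domain/1/backend/vif/3/0")  # old backend
    xs.write(dev + "/state", "Connected")
    fe = ToyFrontend(xs, dev)
    fe.handle_pending_events()

    xs.remove(dev)                                  # 5.2: toolstack cleanup
    if handle_between_detach_and_attach:
        fe.handle_pending_events()
    xs.write(dev + "/backend", "/local/domain/2/backend/vif/3/0")  # 5.3
    xs.write(dev + "/state", "Initializing")
    fe.handle_pending_events()
    return fe.removals_seen, fe.backend, xs.nodes[dev + "/backend"]
```

With `run(True)` the frontend observes the removal once. With `run(False)` it
never does, and its cached backend path differs from the one now in the store,
which matches the stuck state described above.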

If my diagnosis is correct, what should be the solution here? Add
handling for XenbusStateUnknown in xen-netfront.c:netback_changed()? If
so, it should probably carefully clean up the old device while not
touching xenstore entries (which already belong to the new instance) and
then re-initialize the device (a xennet_connect() call?).
Or maybe it should be done in a generic way in xenbus_probe.c, in
xenbus_dev_changed()? I'm not sure how exactly; maybe by checking whether the
backend path (or just backend-id?) changed, and then calling
device_unregister() (again, being careful not to change xenstore,
especially not to set XenbusStateClosed) followed by xenbus_probe_node()?
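
The generic variant could look roughly like the decision function below. It is
a sketch of the idea only, written in Python rather than kernel C: the store
is a plain dict, `decide_on_change()` and the returned action names are
hypothetical, and it is an assumption that comparing the cached backend path
with the current one is enough to detect a replaced device.

```python
def decide_on_change(cached_backend, store, dev):
    """Pick an action for one change event on a frontend device node.

    cached_backend: backend path remembered when the device was probed,
                    or None if no device is currently registered.
    store:          dict standing in for xenstore.
    dev:            frontend device path, e.g. "device/vif/0".
    """
    current = store.get(dev + "/backend")
    if current is None:
        # Node is gone: the ordinary removal case.
        return "unregister"
    if cached_backend is None:
        # Node exists but no device is registered yet: ordinary probe.
        return "probe"
    if current != cached_backend:
        # Device was replaced while we were not looking: drop the old one
        # without writing to xenstore, then probe the new one.
        return "unregister_without_xenstore_writes_then_probe"
    return "nothing"
```

For the scenario in this report, the cached path would still point at netvm1
while the store already names netvm2, so the third branch is the one taken.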

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab



Thread overview: 4+ messages
2025-10-08 11:22 race condition when re-connecting vif after backend died Marek Marczykowski-Górecki
2025-10-08 12:32 ` Jürgen Groß
2025-10-08 14:04   ` Marek Marczykowski-Górecki
2025-11-02  3:19     ` Marek Marczykowski-Górecki
