public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
To: linux-kernel@vger.kernel.org
Cc: Juergen Gross <jgross@suse.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>,
	Peng Jiang <jiang.peng9@zte.com.cn>,
	Qiu-ji Chen <chenqiuji666@gmail.com>,
	Jason Andryuk <jason.andryuk@amd.com>,
	"moderated list:XEN HYPERVISOR INTERFACE"
	<xen-devel@lists.xenproject.org>
Subject: Re: [PATCH] xen/xenbus: better handle backend crash
Date: Thu, 13 Nov 2025 19:43:07 +0100	[thread overview]
Message-ID: <aRYmuwW5rShXqMj1@mail-itl> (raw)
In-Reply-To: <20251102032105.772670-1-marmarek@invisiblethingslab.com>

[-- Attachment #1: Type: text/plain, Size: 5284 bytes --]

Ping?

On Sun, Nov 02, 2025 at 04:20:12AM +0100, Marek Marczykowski-Górecki wrote:
> When the backend domain crashes, coordinated device cleanup is not
> possible (as it involves waiting for the backend state change). In that
> case, toolstack forcefully removes frontend xenstore entries.
> xenbus_dev_changed() handles this case, and triggers device cleanup.
> It's possible that toolstack manages to connect new device in that
> place, before xenbus_dev_changed() notices the old one is missing. If
> that happens, new one won't be probed and will forever remain in
> XenbusStateInitialising.
> 
> Fix this by checking backend-id and if it changes, consider it
> unplug+plug operation. It's important that cleanup on such unplug
> doesn't modify xenstore entries (especially the "state" key) as it
> belong to the new device to be probed - changing it would derail
> establishing connection to the new backend (most likely, closing the
> device before it was even connected). Handle this case by setting new
> xenbus_device->vanished flag to true, and check it before changing state
> entry.
> 
> And even if xenbus_dev_changed() correctly detects the device was
> forcefully removed, the cleanup handling is still racy. Since this whole
> handling doesn't happend in a single xenstore transaction, it's possible
> that toolstack might put a new device there already. Avoid re-creating
> the state key (which in the case of loosing the race would actually
> close newly attached device).
> 
> The problem does not apply to frontend domain crash, as this case
> involves coordinated cleanup.
> 
> Problem originally reported at
> https://lore.kernel.org/xen-devel/aOZvivyZ9YhVWDLN@mail-itl/T/#t,
> including reproduction steps.
> 
> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> ---
> I considered re-using one of existing fields instead of a new
> xenbus_device->vanished, but I wasn't sure if that would work better.
> Setting xenbus_device->nodename to NULL would prevent few other places
> using it (including some log messages). Setting xenbus_device->otherend
> might have less unintentional impact, but logically it doesn't feel
> correct.
> 
> With this patch applied, I cannot reproduce the issue anymore - neither
> with the simplified reproducer script, nor with the full test suite.
> ---
>  drivers/xen/xenbus/xenbus_client.c |  2 ++
>  drivers/xen/xenbus/xenbus_probe.c  | 25 +++++++++++++++++++++++++
>  include/xen/xenbus.h               |  1 +
>  3 files changed, 28 insertions(+)
> 
> diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
> index e73ec225d4a61..ce2f49d9aa4ad 100644
> --- a/drivers/xen/xenbus/xenbus_client.c
> +++ b/drivers/xen/xenbus/xenbus_client.c
> @@ -275,6 +275,8 @@ __xenbus_switch_state(struct xenbus_device *dev,
>   */
>  int xenbus_switch_state(struct xenbus_device *dev, enum xenbus_state state)
>  {
> +	if (dev->vanished)
> +		return 0;
>  	return __xenbus_switch_state(dev, state, 0);
>  }
>  
> diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c
> index 86fe6e7790566..3c3e56b544976 100644
> --- a/drivers/xen/xenbus/xenbus_probe.c
> +++ b/drivers/xen/xenbus/xenbus_probe.c
> @@ -444,6 +444,9 @@ static void xenbus_cleanup_devices(const char *path, struct bus_type *bus)
>  		info.dev = NULL;
>  		bus_for_each_dev(bus, NULL, &info, cleanup_dev);
>  		if (info.dev) {
> +			dev_warn(&info.dev->dev,
> +			         "device forcefully removed from xenstore\n");
> +			info.dev->vanished = true;
>  			device_unregister(&info.dev->dev);
>  			put_device(&info.dev->dev);
>  		}
> @@ -659,6 +662,28 @@ void xenbus_dev_changed(const char *node, struct xen_bus_type *bus)
>  		return;
>  
>  	dev = xenbus_device_find(root, &bus->bus);
> +	/* Backend domain crash results in not coordinated frontend removal,
> +	 * without going through XenbusStateClosing. Check if the device
> +	 * wasn't replaced to point at another backend in the meantime.
> +	 */
> +	if (dev && !strncmp(node, "device/", sizeof("device/")-1)) {
> +		int backend_id;
> +		int err = xenbus_gather(XBT_NIL, root,
> +				        "backend-id", "%i", &backend_id,
> +					NULL);
> +		if (!err && backend_id != dev->otherend_id) {
> +			/* It isn't the same device, assume the old one
> +			 * vanished and new one needs to be probed.
> +			 */
> +			dev_warn(&dev->dev,
> +				 "backend-id mismatch (%d != %d), reconnecting\n",
> +				 backend_id, dev->otherend_id);
> +			dev->vanished = true;
> +			device_unregister(&dev->dev);
> +			put_device(&dev->dev);
> +			dev = NULL;
> +		}
> +	}
>  	if (!dev)
>  		xenbus_probe_node(bus, type, root);
>  	else
> diff --git a/include/xen/xenbus.h b/include/xen/xenbus.h
> index 7dab04cf4a36c..43a5335f1d5a3 100644
> --- a/include/xen/xenbus.h
> +++ b/include/xen/xenbus.h
> @@ -87,6 +87,7 @@ struct xenbus_device {
>  	struct completion down;
>  	struct work_struct work;
>  	struct semaphore reclaim_sem;
> +	bool vanished;
>  
>  	/* Event channel based statistics and settings. */
>  	atomic_t event_channels;
> -- 
> 2.51.0
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  reply	other threads:[~2025-11-13 18:43 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-02  3:20 [PATCH] xen/xenbus: better handle backend crash Marek Marczykowski-Górecki
2025-11-13 18:43 ` Marek Marczykowski-Górecki [this message]
2025-11-17 11:06 ` Jürgen Groß
2026-01-26  7:08   ` Jürgen Groß
2026-01-29  7:02     ` Jürgen Groß
2026-02-06 16:57       ` Marek Marczykowski-Górecki
2026-02-09  7:30         ` Jürgen Groß
2026-02-09  9:02         ` Jürgen Groß
2026-02-09 20:46           ` Marek Marczykowski-Górecki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aRYmuwW5rShXqMj1@mail-itl \
    --to=marmarek@invisiblethingslab.com \
    --cc=chenqiuji666@gmail.com \
    --cc=jason.andryuk@amd.com \
    --cc=jgross@suse.com \
    --cc=jiang.peng9@zte.com.cn \
    --cc=linux-kernel@vger.kernel.org \
    --cc=oleksandr_tyshchenko@epam.com \
    --cc=sstabellini@kernel.org \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox