Linux virtualization list
 help / color / mirror / Atom feed
* [PATCH v2 1/4] vhost/vsock: split out vhost_vsock_drop_backends helper
From: Andrey Drobyshev @ 2026-06-22 17:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, ptikhomirov,
	den, andrey.drobyshev
In-Reply-To: <20260622175808.508084-1-andrey.drobyshev@virtuozzo.com>

From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

Split the actual backend dropping part from vhost_vsock_stop.  We're
going to need it for the VHOST_RESET_OWNER implementation in the
following patch, when vsock->dev.mutex is already taken and owner is
checked.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
---
 drivers/vhost/vsock.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 9aaab6bb8061..b12221ce6faf 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -664,9 +664,24 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
 	return ret;
 }
 
-static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
+static void vhost_vsock_drop_backends(struct vhost_vsock *vsock)
 {
+	struct vhost_virtqueue *vq;
 	size_t i;
+
+	lockdep_assert_held(&vsock->dev.mutex);
+
+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
+		vq = &vsock->vqs[i];
+
+		mutex_lock(&vq->mutex);
+		vhost_vq_set_backend(vq, NULL);
+		mutex_unlock(&vq->mutex);
+	}
+}
+
+static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
+{
 	int ret = 0;
 
 	mutex_lock(&vsock->dev.mutex);
@@ -677,14 +692,7 @@ static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
 			goto err;
 	}
 
-	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
-		struct vhost_virtqueue *vq = &vsock->vqs[i];
-
-		mutex_lock(&vq->mutex);
-		vhost_vq_set_backend(vq, NULL);
-		mutex_unlock(&vq->mutex);
-	}
-
+	vhost_vsock_drop_backends(vsock);
 err:
 	mutex_unlock(&vsock->dev.mutex);
 	return ret;
-- 
2.47.1


^ permalink raw reply related

* [PATCH v2 0/4] vhost/vsock: add support for VHOST_RESET_OWNER and CPR migration
From: Andrey Drobyshev @ 2026-06-22 17:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, ptikhomirov,
	den, andrey.drobyshev

v1 -> v2:

  * Patch 2 (suppress EHOSTUNREACH): replace 'cpr_paused' + backend check
    with a single 'started' latch;
  * Patch 3 (re-scan TX virtqueue): reword commit message;
  * Patch 4 (VHOST_RESET_OWNER):
      - fix a vhost_worker use-after-free / stuck VHOST_WORK_QUEUED stall
        against the lockless send path;
      - drop the no-op vsock_for_each_connected_socket() iteration;
  * Shuffle the patches, keep RESET_OWNER implementation last to preserve
    bisectability;
  * Reword the cover letter.

v1: https://lore.kernel.org/virtualization/20260612165718.433546-1-andrey.drobyshev@virtuozzo.com

Host<-->guest connections via AF_VSOCK sockets aren't supposed to
outlive VM migration, since VM is moving to another host.  However
there's a special case, which is QEMU live-update, or CPR
(checkpoint-restore) migration.  In this case, VM remains on the same
host, and we'd like such connections to persist.

For this to work, we need to be able to transfer device ownership from
source QEMU to dest QEMU.  Namely, source needs to reset ownership by
issuing VHOST_RESET_OWNER ioctl, and then target has to claim it by
calling VHOST_SET_OWNER.

Since VHOST_RESET_OWNER isn't yet implemented for vhost-vsock, let's add
such implementation.  Patch 1 is a preliminary helper.  Patches 2 and 3
fix the pre-existing issues which do manifest during CPR / RESET_OWNER.
Patch 4 is the ioctl's implementation itself - we keep it last to
preserve bisectability.

There's a complementary series for QEMU [0] adding support of vhost-vsock
devices during CPR migration.

I've tested this (patched QEMU + patched kernel) approximately as follows:

  * Run listener in the guest:
  socat -u VSOCK-LISTEN:9999 - >/tmp/recv.bin

  * Run data transfer from host to guest:
  socat -u FILE:/root/bigfile.bin VSOCK-CONNECT:CID:9999

  * Perform CPR migration during transfer (either cpr-exec or cpr-transfer)
  * Check that file hash sum matches

[0] https://lore.kernel.org/qemu-devel/20260619105514.128812-1-andrey.drobyshev@virtuozzo.com

Andrey Drobyshev (2):
  vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
  vhost/vsock: re-scan TX virtqueue on device start

Pavel Tikhomirov (2):
  vhost/vsock: split out vhost_vsock_drop_backends helper
  vhost/vsock: add VHOST_RESET_OWNER ioctl

 drivers/vhost/vsock.c | 96 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 76 insertions(+), 20 deletions(-)

-- 
2.47.1


^ permalink raw reply

* [PATCH] crypto: virtio - fix missing le64_to_cpu() conversions
From: Ben Dooks @ 2026-06-22 15:03 UTC (permalink / raw)
  To: linux-crypto, virtualization, Gonglei
  Cc: linux-kernel, Michael S. Tsirkin, Jason Wang, Ben Dooks

There are two cases of sending a __le64 type to a print function
so fix this by adding le64_to_cpu() which fixes the following
(prototype) sparse warnings:

drivers/crypto/virtio/virtio_crypto_skcipher_algs.c:234:17: warning: incorrect type in argument 3 (different base types)
drivers/crypto/virtio/virtio_crypto_skcipher_algs.c:234:17:    expected unsigned long long
drivers/crypto/virtio/virtio_crypto_skcipher_algs.c:234:17:    got restricted __le64 [usertype] session_id
drivers/crypto/virtio/virtio_crypto_akcipher_algs.c:196:17: warning: incorrect type in argument 3 (different base types)
drivers/crypto/virtio/virtio_crypto_akcipher_algs.c:196:17:    expected unsigned long long
drivers/crypto/virtio/virtio_crypto_akcipher_algs.c:196:17:    got restricted __le64 [usertype] session_id

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
---
 drivers/crypto/virtio/virtio_crypto_akcipher_algs.c | 3 ++-
 drivers/crypto/virtio/virtio_crypto_skcipher_algs.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_akcipher_algs.c b/drivers/crypto/virtio/virtio_crypto_akcipher_algs.c
index d8d452cac391..404e33b16db6 100644
--- a/drivers/crypto/virtio/virtio_crypto_akcipher_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_akcipher_algs.c
@@ -194,7 +194,8 @@ static int virtio_crypto_alg_akcipher_close_session(struct virtio_crypto_akciphe
 
 	if (ctrl_status->status != VIRTIO_CRYPTO_OK) {
 		pr_err("virtio_crypto: Close session failed status: %u, session_id: 0x%llx\n",
-			ctrl_status->status, destroy_session->session_id);
+			ctrl_status->status,
+		       le64_to_cpu(destroy_session->session_id));
 		err = -EINVAL;
 		goto out;
 	}
diff --git a/drivers/crypto/virtio/virtio_crypto_skcipher_algs.c b/drivers/crypto/virtio/virtio_crypto_skcipher_algs.c
index e82fc16cab25..3ca441ae2759 100644
--- a/drivers/crypto/virtio/virtio_crypto_skcipher_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_skcipher_algs.c
@@ -232,7 +232,8 @@ static int virtio_crypto_alg_skcipher_close_session(
 
 	if (ctrl_status->status != VIRTIO_CRYPTO_OK) {
 		pr_err("virtio_crypto: Close session failed status: %u, session_id: 0x%llx\n",
-			ctrl_status->status, destroy_session->session_id);
+			ctrl_status->status,
+		       le64_to_cpu(destroy_session->session_id));
 
 		err = -EINVAL;
 		goto out;
-- 
2.37.2.352.g3c44437643


^ permalink raw reply related

* Re: [PATCH 1/2] virtio_balloon: factor out virtballoon_quiesce()
From: Michael S. Tsirkin @ 2026-06-22 14:59 UTC (permalink / raw)
  To: David Hildenbrand (Arm); +Cc: Denis V. Lunev, virtualization, linux-kernel
In-Reply-To: <79aa1649-acbb-4898-b533-4c991a6ca19b@kernel.org>

On Mon, Jun 22, 2026 at 04:46:48PM +0200, David Hildenbrand (Arm) wrote:
> On 6/22/26 15:37, Denis V. Lunev wrote:
> > virtballoon_remove() stops all of the balloon's asynchronous work (the
> > free page reporting worker, the inflate/deflate and stats workers, the
> > OOM notifier and the free page shrinker) before tearing the device
> > down. A following change needs the same teardown from a .shutdown
> > handler, so move it into a virtballoon_quiesce() helper.
> > 
> > No functional change.
> > 
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > ---
> >  drivers/virtio/virtio_balloon.c | 27 ++++++++++++++++++++-------
> >  1 file changed, 20 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > index 088b3a0e6ce6..5b02d9191ac6 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -1098,26 +1098,39 @@ static void remove_common(struct virtio_balloon *vb)
> >  	vb->vdev->config->del_vqs(vb->vdev);
> >  }
> >  
> > -static void virtballoon_remove(struct virtio_device *vdev)
> > +/*
> > + * Stop all asynchronous balloon work. The device must still be alive so that
> > + * in-flight requests can drain via the host before it is reset or freed.
> > + */
> > +static void virtballoon_quiesce(struct virtio_balloon *vb)
> >  {
> > -	struct virtio_balloon *vb = vdev->priv;
> > +	struct virtio_device *vdev = vb->vdev;
> >  
> > -	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> > +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_REPORTING))
> >  		page_reporting_unregister(&vb->pr_dev_info);
> 
> Most functions here grab mutexes and can sleep. I assume that's fine in shutdown
> context.
> 
> Nothing jumped at me
> 
> 
> Sashiko asks whether ->remove (hotunplug/driver-unloading) can follow a
> ->shutdown callback. I don't know, seems questionable if this could happen and
> should probably be handled by the core?
> 
> If it's possible you'd have to remember that stuff was already quiesce'ed. Maybe
> simply by checking early in virtballoon_quiesce() whether vb->stop_update ==
> true already.
> 
> -- 
> Cheers,
> 
> David

I don't think it can happen.

-- 
MST


^ permalink raw reply

* Re: [PATCH 2/2] virtio_balloon: quiesce balloon work before device shutdown
From: Michael S. Tsirkin @ 2026-06-22 14:58 UTC (permalink / raw)
  To: David Hildenbrand (Arm); +Cc: Denis V. Lunev, virtualization, linux-kernel
In-Reply-To: <8b83f251-3a3e-4fc9-8ea9-8d101fb92919@kernel.org>

On Mon, Jun 22, 2026 at 04:38:54PM +0200, David Hildenbrand (Arm) wrote:
> On 6/22/26 15:37, Denis V. Lunev wrote:
> > Commit 8bd2fa086a04 ("virtio: break and reset virtio devices on
> > device_shutdown()") added a generic virtio bus .shutdown handler that
> > breaks and resets every virtio device during device_shutdown(), i.e. on
> > reboot and kexec.
> > 
> > virtio_balloon provides no .shutdown of its own, so that generic path
> > runs while the balloon's asynchronous work is still armed. Once the
> > device has been broken, virtqueue_add_inbuf() in
> > virtballoon_free_page_report() returns -EIO and trips its
> > WARN_ON_ONCE(). On a kernel booted with panic_on_warn that turns an
> > ordinary reboot, for example a kexec based upgrade, into a fatal panic
> > in the middle of device_shutdown(), so the machine never reaches the
> > new kernel.
> > 
> > Relaxing that single WARN_ON_ONCE() would only hide the symptom: the
> > inflate/deflate and OOM paths do not warn, they call
> > wait_event(vb->acked, ...) and would instead block forever on a broken
> > queue that can no longer complete. The device has to be quiesced, not
> > just kept quiet.
> 
> Ah, so
> 
> 	/* We should always be able to add one buffer to an empty queue. */
> 	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> 
> is not actually correct.

Yes - we can't if the device is completely gone)

> Yeah, quiescing sounds cleaner, although I am thinking whether we should also
> warn if virtqueue_add_outbuf() fails, similar to what we do in
> virtballoon_free_page_report().
> 
> > 
> > Add a .shutdown handler that quiesces the balloon via the shared
> > virtballoon_quiesce() helper while the device is still alive, and only
> > then breaks and resets it. The break and reset are repeated here rather
> > than reused from virtio_dev_shutdown(): drv->shutdown replaces the
> > generic handler rather than augmenting it, so that drivers such as
> > virtio-gpu can opt out of the reset. Unlike virtballoon_remove() the
> > balloon workqueue is not destroyed, as shutdown does not free the
> > device and cancel_work_sync() together with stop_update already prevent
> > any further work from being queued.
> > 
> > Fixes: 8bd2fa086a04 ("virtio: break and reset virtio devices on device_shutdown()")
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > ---
> >  drivers/virtio/virtio_balloon.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > index 5b02d9191ac6..e35ada767b4b 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -1137,6 +1137,15 @@ static void virtballoon_remove(struct virtio_device *vdev)
> >  	kfree(vb);
> >  }
> >  
> > +static void virtballoon_shutdown(struct virtio_device *vdev)
> > +{
> > +	virtballoon_quiesce(vdev->priv);
> > +
> > +	virtio_break_device(vdev);
> > +	virtio_synchronize_cbs(vdev);
> > +	vdev->config->reset(vdev);
> 
> I guess it would be good if we wouldn't have to copy what the default handler
> does, but could instead just have it in a reusable core function?
> 
> > +}
> > +
> 
> -- 
> Cheers,
> 
> David


^ permalink raw reply

* Re: [PATCH 1/2] virtio_balloon: factor out virtballoon_quiesce()
From: David Hildenbrand (Arm) @ 2026-06-22 14:46 UTC (permalink / raw)
  To: Denis V. Lunev, mst; +Cc: virtualization, linux-kernel
In-Reply-To: <20260622133715.3707707-2-den@openvz.org>

On 6/22/26 15:37, Denis V. Lunev wrote:
> virtballoon_remove() stops all of the balloon's asynchronous work (the
> free page reporting worker, the inflate/deflate and stats workers, the
> OOM notifier and the free page shrinker) before tearing the device
> down. A following change needs the same teardown from a .shutdown
> handler, so move it into a virtballoon_quiesce() helper.
> 
> No functional change.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> ---
>  drivers/virtio/virtio_balloon.c | 27 ++++++++++++++++++++-------
>  1 file changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 088b3a0e6ce6..5b02d9191ac6 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -1098,26 +1098,39 @@ static void remove_common(struct virtio_balloon *vb)
>  	vb->vdev->config->del_vqs(vb->vdev);
>  }
>  
> -static void virtballoon_remove(struct virtio_device *vdev)
> +/*
> + * Stop all asynchronous balloon work. The device must still be alive so that
> + * in-flight requests can drain via the host before it is reset or freed.
> + */
> +static void virtballoon_quiesce(struct virtio_balloon *vb)
>  {
> -	struct virtio_balloon *vb = vdev->priv;
> +	struct virtio_device *vdev = vb->vdev;
>  
> -	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_REPORTING))
>  		page_reporting_unregister(&vb->pr_dev_info);

Most functions here grab mutexes and can sleep. I assume that's fine in shutdown
context.

Nothing jumped at me


Sashiko asks whether ->remove (hotunplug/driver-unloading) can follow a
->shutdown callback. I don't know, seems questionable if this could happen and
should probably be handled by the core?

If it's possible you'd have to remember that stuff was already quiesce'ed. Maybe
simply by checking early in virtballoon_quiesce() whether vb->stop_update ==
true already.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH 2/2] virtio_balloon: quiesce balloon work before device shutdown
From: David Hildenbrand (Arm) @ 2026-06-22 14:38 UTC (permalink / raw)
  To: Denis V. Lunev, mst; +Cc: virtualization, linux-kernel
In-Reply-To: <20260622133715.3707707-3-den@openvz.org>

On 6/22/26 15:37, Denis V. Lunev wrote:
> Commit 8bd2fa086a04 ("virtio: break and reset virtio devices on
> device_shutdown()") added a generic virtio bus .shutdown handler that
> breaks and resets every virtio device during device_shutdown(), i.e. on
> reboot and kexec.
> 
> virtio_balloon provides no .shutdown of its own, so that generic path
> runs while the balloon's asynchronous work is still armed. Once the
> device has been broken, virtqueue_add_inbuf() in
> virtballoon_free_page_report() returns -EIO and trips its
> WARN_ON_ONCE(). On a kernel booted with panic_on_warn that turns an
> ordinary reboot, for example a kexec based upgrade, into a fatal panic
> in the middle of device_shutdown(), so the machine never reaches the
> new kernel.
> 
> Relaxing that single WARN_ON_ONCE() would only hide the symptom: the
> inflate/deflate and OOM paths do not warn, they call
> wait_event(vb->acked, ...) and would instead block forever on a broken
> queue that can no longer complete. The device has to be quiesced, not
> just kept quiet.

Ah, so

	/* We should always be able to add one buffer to an empty queue. */
	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);

is not actually correct.

Yeah, quiescing sounds cleaner, although I am thinking whether we should also
warn if virtqueue_add_outbuf() fails, similar to what we do in
virtballoon_free_page_report().

> 
> Add a .shutdown handler that quiesces the balloon via the shared
> virtballoon_quiesce() helper while the device is still alive, and only
> then breaks and resets it. The break and reset are repeated here rather
> than reused from virtio_dev_shutdown(): drv->shutdown replaces the
> generic handler rather than augmenting it, so that drivers such as
> virtio-gpu can opt out of the reset. Unlike virtballoon_remove() the
> balloon workqueue is not destroyed, as shutdown does not free the
> device and cancel_work_sync() together with stop_update already prevent
> any further work from being queued.
> 
> Fixes: 8bd2fa086a04 ("virtio: break and reset virtio devices on device_shutdown()")
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> ---
>  drivers/virtio/virtio_balloon.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 5b02d9191ac6..e35ada767b4b 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -1137,6 +1137,15 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	kfree(vb);
>  }
>  
> +static void virtballoon_shutdown(struct virtio_device *vdev)
> +{
> +	virtballoon_quiesce(vdev->priv);
> +
> +	virtio_break_device(vdev);
> +	virtio_synchronize_cbs(vdev);
> +	vdev->config->reset(vdev);

I guess it would be good if we wouldn't have to copy what the default handler
does, but could instead just have it in a reusable core function?

> +}
> +

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH 0/2] virtio_balloon: quiesce balloon work on device shutdown
From: Denis V. Lunev @ 2026-06-22 14:33 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Denis V. Lunev, mst; +Cc: virtualization, linux-kernel
In-Reply-To: <de197e27-e56e-48e1-ac2d-e76ba0c3e6b2@kernel.org>

On 6/22/26 16:29, David Hildenbrand (Arm) wrote:
> This email originated from an IP that might not be authorized by the domain it was sent from.
> Do not click links or open attachments unless it is an email you expected to receive.
> On 6/22/26 15:37, Denis V. Lunev wrote:
>> Since commit 8bd2fa086a04 ("virtio: break and reset virtio devices on
>> device_shutdown()") the virtio bus breaks and resets every virtio device
>> during device_shutdown(), i.e. on reboot and kexec. virtio_balloon has no
>> .shutdown of its own, so that generic path runs while the balloon's
>> asynchronous work is still armed: the free page reporting worker, the
>> inflate/deflate and stats workers, the OOM notifier and the free page
>> shrinker.
>>
>> Once the device has been broken, virtqueue_add_inbuf() in
>> virtballoon_free_page_report() returns -EIO and trips its WARN_ON_ONCE().
>> On a kernel booted with panic_on_warn that turns an ordinary reboot into a
>> fatal panic in the middle of device_shutdown(), so the machine never
>> reaches the new kernel. The inflate/deflate and OOM paths do not warn but
>> are no better off: they call wait_event(vb->acked, ...) and would block
>> forever on a queue that can no longer complete.
>>
>> This was hit in the field as an intermittent failure of a virtualization
>> cluster upgrade: guest storage nodes were rebooted via kexec into the new
>> kernel, and the ones whose free page reporting happened to run during
>> device_shutdown() panicked (the guests run with panic_on_warn) and never
>> came back, stalling the rolling upgrade. The crash dump showed the WARN at
>> virtio_balloon.c:216 in a page_reporting kworker, with all the balloon
>> virtqueues already broken.
>>
>> Patch 1 factors the teardown out of virtballoon_remove() into a
>> virtballoon_quiesce() helper (no functional change). Patch 2 adds a
>> virtio_balloon .shutdown handler that quiesces via that helper while the
>> device is still alive, then breaks and resets it the way the generic
>> virtio_dev_shutdown() would.
>>
>> Relaxing the single WARN_ON_ONCE() instead was considered and rejected: it
>> would silence the panic but leave the inflate/deflate and OOM paths
>> hanging on the broken device. The device has to be quiesced, not just kept
>> quiet.
> Do you have a link to that discussion you could add?
>
>
that was internal discussion for that fix. We have faced this
during tests, made the fix and validated it before sending.

Thus - sorry, this is my mistake.

Den

^ permalink raw reply

* Re: [PATCH 0/2] virtio_balloon: quiesce balloon work on device shutdown
From: David Hildenbrand (Arm) @ 2026-06-22 14:29 UTC (permalink / raw)
  To: Denis V. Lunev, mst; +Cc: virtualization, linux-kernel
In-Reply-To: <20260622133715.3707707-1-den@openvz.org>

On 6/22/26 15:37, Denis V. Lunev wrote:
> Since commit 8bd2fa086a04 ("virtio: break and reset virtio devices on
> device_shutdown()") the virtio bus breaks and resets every virtio device
> during device_shutdown(), i.e. on reboot and kexec. virtio_balloon has no
> .shutdown of its own, so that generic path runs while the balloon's
> asynchronous work is still armed: the free page reporting worker, the
> inflate/deflate and stats workers, the OOM notifier and the free page
> shrinker.
> 
> Once the device has been broken, virtqueue_add_inbuf() in
> virtballoon_free_page_report() returns -EIO and trips its WARN_ON_ONCE().
> On a kernel booted with panic_on_warn that turns an ordinary reboot into a
> fatal panic in the middle of device_shutdown(), so the machine never
> reaches the new kernel. The inflate/deflate and OOM paths do not warn but
> are no better off: they call wait_event(vb->acked, ...) and would block
> forever on a queue that can no longer complete.
> 
> This was hit in the field as an intermittent failure of a virtualization
> cluster upgrade: guest storage nodes were rebooted via kexec into the new
> kernel, and the ones whose free page reporting happened to run during
> device_shutdown() panicked (the guests run with panic_on_warn) and never
> came back, stalling the rolling upgrade. The crash dump showed the WARN at
> virtio_balloon.c:216 in a page_reporting kworker, with all the balloon
> virtqueues already broken.
> 
> Patch 1 factors the teardown out of virtballoon_remove() into a
> virtballoon_quiesce() helper (no functional change). Patch 2 adds a
> virtio_balloon .shutdown handler that quiesces via that helper while the
> device is still alive, then breaks and resets it the way the generic
> virtio_dev_shutdown() would.
> 
> Relaxing the single WARN_ON_ONCE() instead was considered and rejected: it
> would silence the panic but leave the inflate/deflate and OOM paths
> hanging on the broken device. The device has to be quiesced, not just kept
> quiet.

Do you have a link to that discussion you could add?


-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: David Hildenbrand (Arm) @ 2026-06-22 14:25 UTC (permalink / raw)
  To: Gregory Price
  Cc: Zi Yan, Michael S. Tsirkin, Matthew Wilcox, Lorenzo Stoakes,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aidCGkafm9yUDTuf@gourry-fedora-PF4VCD3F>

On 6/9/26 00:28, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 11:51:47PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/8/26 23:16, Zi Yan wrote:
>>
>> There was Willy's comment in RFC v3 [1], which had 19 patches. Unfortunately, he
>> no longer followed up to my initial push back and Michael's question later [2].
>>
>> That would have probably been the right time to wait for more discussion.
>>
>> RFC v4 had 22 patches with little replies.
>> v5 had 28 patches with little replies.
>> v6 had 30 patches with no replies.
>> v7 had 31 patches with little replies.
>> v8 had 37 patches with no replies.
>>
>> [1] https://lore.kernel.org/lkml/aeu5P1bZW3yEH54t@casper.infradead.org/
>> [2] https://lore.kernel.org/lkml/20260426165330-mutt-send-email-mst@kernel.org/
>>
> 
> Hm, rewinding on this back to v3 here:

I'm late ...

> https://lore.kernel.org/lkml/016cc5e5-044c-46c6-a668-200f90a64d85@kernel.org/
> 
> You said:
> 
>   ```
>   Exactly, that's why I am saying that vma_alloc_folio() is the only
>   external interface people should be using with a user address.
>   ```
> 
> Going through the list of folio_zero_user references:
> 
> Called unconditionally if a folio is acquired:
>    fs/hugetlbfs/inode.c:   folio_zero_user(folio, addr);
>    mm/hugetlb.c:           folio_zero_user(folio, vmf->real_address);
>    mm/memfd.c:             folio_zero_user(folio, 0);
> 
> Called when user_alloc_needs_zeroing() and charging passes:
>    mm/huge_memory.c:       folio_zero_user(folio, addr);
>    mm/memory.c:            folio_zero_user(folio, vmf->address);
> 
> No one outside mm/ should know about this interface at all.

Exactly.

> Arguably none of these should know about this interface either.
> 
> The appropriate place for this logic appears to be:
>     vma_alloc_folio
>     alloc_hugetlb_folio
>     alloc_hugetlb_folio_reserve

Yes. And the hugetlb stuff should just be left alone for now. We don't encourage
new hugetlb features, and the same should apply to new optimizations that are
not straight forward.

> 
> The reason to sink it into the post_alloc_hook is to let the buddy
> decide whether the page actually needs to be zeroed (like the virtio
> situation) based on PG_zeroed or whatever.

It's a bit related to your private node work .... if we have an interface that
consumes an alloc_context, we could just forward the address easily.

Now, there are different ways of doing it, but having folio allocation sometimes
use GFP_ZERO and sometimes not is just nasty. It's all a big hack.

> 
> It seems like at a minimum moving the logic all the way into
> post_alloc_hook lets us actually delete folio_zero_user() as a published
> interface and move it entirely within page_alloc.c.
> 
> The catch is user_alloc_needs_zeroing() coming along with it.

Yes. And once there is only one user remaining, we could inline it and get rid
of this horrible helper :)

-- 
Cheers,

David

^ permalink raw reply

* [PATCH 1/2] virtio_balloon: factor out virtballoon_quiesce()
From: Denis V. Lunev @ 2026-06-22 13:37 UTC (permalink / raw)
  To: mst, david; +Cc: virtualization, linux-kernel, Denis V. Lunev
In-Reply-To: <20260622133715.3707707-1-den@openvz.org>

virtballoon_remove() stops all of the balloon's asynchronous work (the
free page reporting worker, the inflate/deflate and stats workers, the
OOM notifier and the free page shrinker) before tearing the device
down. A following change needs the same teardown from a .shutdown
handler, so move it into a virtballoon_quiesce() helper.

No functional change.

Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 drivers/virtio/virtio_balloon.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 088b3a0e6ce6..5b02d9191ac6 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -1098,26 +1098,39 @@ static void remove_common(struct virtio_balloon *vb)
 	vb->vdev->config->del_vqs(vb->vdev);
 }
 
-static void virtballoon_remove(struct virtio_device *vdev)
+/*
+ * Stop all asynchronous balloon work. The device must still be alive so that
+ * in-flight requests can drain via the host before it is reset or freed.
+ */
+static void virtballoon_quiesce(struct virtio_balloon *vb)
 {
-	struct virtio_balloon *vb = vdev->priv;
+	struct virtio_device *vdev = vb->vdev;
 
-	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_REPORTING))
 		page_reporting_unregister(&vb->pr_dev_info);
-	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 		unregister_oom_notifier(&vb->oom_nb);
-	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
 		virtio_balloon_unregister_shrinker(vb);
+
 	spin_lock_irq(&vb->stop_update_lock);
 	vb->stop_update = true;
 	spin_unlock_irq(&vb->stop_update_lock);
 	cancel_work_sync(&vb->update_balloon_size_work);
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
-	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
 		cancel_work_sync(&vb->report_free_page_work);
+}
+
+static void virtballoon_remove(struct virtio_device *vdev)
+{
+	struct virtio_balloon *vb = vdev->priv;
+
+	virtballoon_quiesce(vb);
+
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
 		destroy_workqueue(vb->balloon_wq);
-	}
 
 	remove_common(vb);
 	mutex_destroy(&vb->balloon_lock);
-- 
2.53.0


^ permalink raw reply related

* [PATCH 0/2] virtio_balloon: quiesce balloon work on device shutdown
From: Denis V. Lunev @ 2026-06-22 13:37 UTC (permalink / raw)
  To: mst, david; +Cc: virtualization, linux-kernel, Denis V. Lunev

Since commit 8bd2fa086a04 ("virtio: break and reset virtio devices on
device_shutdown()") the virtio bus breaks and resets every virtio device
during device_shutdown(), i.e. on reboot and kexec. virtio_balloon has no
.shutdown of its own, so that generic path runs while the balloon's
asynchronous work is still armed: the free page reporting worker, the
inflate/deflate and stats workers, the OOM notifier and the free page
shrinker.

Once the device has been broken, virtqueue_add_inbuf() in
virtballoon_free_page_report() returns -EIO and trips its WARN_ON_ONCE().
On a kernel booted with panic_on_warn that turns an ordinary reboot into a
fatal panic in the middle of device_shutdown(), so the machine never
reaches the new kernel. The inflate/deflate and OOM paths do not warn but
are no better off: they call wait_event(vb->acked, ...) and would block
forever on a queue that can no longer complete.

This was hit in the field as an intermittent failure of a virtualization
cluster upgrade: guest storage nodes were rebooted via kexec into the new
kernel, and the ones whose free page reporting happened to run during
device_shutdown() panicked (the guests run with panic_on_warn) and never
came back, stalling the rolling upgrade. The crash dump showed the WARN at
virtio_balloon.c:216 in a page_reporting kworker, with all the balloon
virtqueues already broken.

Patch 1 factors the teardown out of virtballoon_remove() into a
virtballoon_quiesce() helper (no functional change). Patch 2 adds a
virtio_balloon .shutdown handler that quiesces via that helper while the
device is still alive, then breaks and resets it the way the generic
virtio_dev_shutdown() would.

Relaxing the single WARN_ON_ONCE() instead was considered and rejected: it
would silence the panic but leave the inflate/deflate and OOM paths
hanging on the broken device. The device has to be quiesced, not just kept
quiet.

Validated by churning balloon inflate/deflate from the host while
kexec-rebooting the guest in a loop under panic_on_warn: the unpatched
module reproduces the WARN within a couple of cycles, while the patched
module survives many consecutive kexec cycles cleanly (12/12 in the final
run, 0 WARNs). checkpatch is clean on both patches.

Denis V. Lunev (2):
  virtio_balloon: factor out virtballoon_quiesce()
  virtio_balloon: quiesce balloon work before device shutdown

 drivers/virtio/virtio_balloon.c | 37 ++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

-- 
2.53.0


^ permalink raw reply

* [PATCH 2/2] virtio_balloon: quiesce balloon work before device shutdown
From: Denis V. Lunev @ 2026-06-22 13:37 UTC (permalink / raw)
  To: mst, david; +Cc: virtualization, linux-kernel, Denis V. Lunev
In-Reply-To: <20260622133715.3707707-1-den@openvz.org>

Commit 8bd2fa086a04 ("virtio: break and reset virtio devices on
device_shutdown()") added a generic virtio bus .shutdown handler that
breaks and resets every virtio device during device_shutdown(), i.e. on
reboot and kexec.

virtio_balloon provides no .shutdown of its own, so that generic path
runs while the balloon's asynchronous work is still armed. Once the
device has been broken, virtqueue_add_inbuf() in
virtballoon_free_page_report() returns -EIO and trips its
WARN_ON_ONCE(). On a kernel booted with panic_on_warn that turns an
ordinary reboot, for example a kexec based upgrade, into a fatal panic
in the middle of device_shutdown(), so the machine never reaches the
new kernel.

Relaxing that single WARN_ON_ONCE() would only hide the symptom: the
inflate/deflate and OOM paths do not warn, they call
wait_event(vb->acked, ...) and would instead block forever on a broken
queue that can no longer complete. The device has to be quiesced, not
just kept quiet.

Add a .shutdown handler that quiesces the balloon via the shared
virtballoon_quiesce() helper while the device is still alive, and only
then breaks and resets it. The break and reset are repeated here rather
than reused from virtio_dev_shutdown(): drv->shutdown replaces the
generic handler rather than augmenting it, so that drivers such as
virtio-gpu can opt out of the reset. Unlike virtballoon_remove() the
balloon workqueue is not destroyed, as shutdown does not free the
device and cancel_work_sync() together with stop_update already prevent
any further work from being queued.

Fixes: 8bd2fa086a04 ("virtio: break and reset virtio devices on device_shutdown()")
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 drivers/virtio/virtio_balloon.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 5b02d9191ac6..e35ada767b4b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -1137,6 +1137,15 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	kfree(vb);
 }
 
+static void virtballoon_shutdown(struct virtio_device *vdev)
+{
+	virtballoon_quiesce(vdev->priv);
+
+	virtio_break_device(vdev);
+	virtio_synchronize_cbs(vdev);
+	vdev->config->reset(vdev);
+}
+
 #ifdef CONFIG_PM_SLEEP
 static int virtballoon_freeze(struct virtio_device *vdev)
 {
@@ -1202,6 +1211,7 @@ static struct virtio_driver virtio_balloon_driver = {
 	.validate =	virtballoon_validate,
 	.probe =	virtballoon_probe,
 	.remove =	virtballoon_remove,
+	.shutdown =	virtballoon_shutdown,
 	.config_changed = virtballoon_changed,
 #ifdef CONFIG_PM_SLEEP
 	.freeze	=	virtballoon_freeze,
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH net-next v3] virtio-net: xsk: support tx wake up
From: Michael S. Tsirkin @ 2026-06-22 13:24 UTC (permalink / raw)
  To: Menglong Dong
  Cc: Menglong Dong, xuanzhuo, eperezma, jasowang, andrew+netdev, davem,
	edumazet, kuba, pabeni, netdev, virtualization, linux-kernel
In-Reply-To: <BMgCMK8nRYC94QK7N1SXpQ@linux.dev>

On Mon, Jun 22, 2026 at 08:27:12PM +0800, Menglong Dong wrote:
> On 2026/6/22 06:31 Michael S. Tsirkin <mst@redhat.com> write:
> > On Tue, Jun 16, 2026 at 07:59:12PM +0800, Menglong Dong wrote:
> [...]
> > >  
> > > +	vring_size = virtqueue_get_vring_size(sq->vq);
> > > +	need_wakeup = xsk_uses_need_wakeup(pool);
> > > +
> > > +	if (need_wakeup && vring_size == sq->vq->num_free)
> > > +		xsk_set_tx_need_wakeup(pool);
> > > +
> > 
> > why are we doing this here?
> > the check after virtnet_xsk_xmit_batch not enough?
> > I vaguely think it's some kind of race we are closing?
> > Pls add a comment to explain.
> 
> Hi, Michael. Thanks for your review.
> 
> Yeah, it's for a race condition between user space and kernel
> space. I added a comment in V2, which is too confusing, and
> I removed it 😢. I'll make it more clear and add it in the V4. The
> origin comment is:
> 
>  * If the sq->vq is empty, and the tx ring is empty, and the user
>  * submit an entry to the tx ring after virtnet_xsk_xmit_batch() and
>  * before xsk_set_tx_need_wakeup(), we will lose the chance to wake
>  * up the tx napi, so we have to set the need_wakeup flag here.
> 
> And the logic is like this:
> 
> Kernel: tx NAPI is waked up from skb_xmit_done() ->
> Kernel: sq->vq and xsk->tx_ring are both empty ->
> Kernel: call virtnet_xsk_xmit_batch()
> 
>     User: submit a entry to the xsk->tx_ring
>     User: check the wakeup flag
>     User: wakeup flag is not set, skip send()
> 
> Kernel: call xsk_set_tx_need_wakeup(), because sq->vq is empty
> 
> If we don't send more data, the data in the xsk->tx_ring will
> not be sent forever.

I'm not 100% sure I understand, but when someone fixes cross-CPU races
with no synchronization or CPU memory barriers just with extra checks,
this always gives me pause.

AI helped write this for me, for example:
  1. Kernel: xsk_set_tx_need_wakeup stores NEED_WAKEUP (sits in store buffer)
  2. Kernel: xsk_tx_peek_release_desc_batch - load, sees empty (reordered before the store is globally visible)
  3. Kernel: peek finds nothing, returns 0
  4. Userspace: stores entry + producer
  5. Userspace: loads flags - doesn't see NEED_WAKEUP yet (still in kernel's store buffer)
  6. Userspeace: skips send()
  7. Kernel: NEED_WAKEUP store finally becomes visible - too late

Seems legit?



> > 
> > >  	sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
> > >  
> > > +	if (need_wakeup) {
> > > +		if (vring_size == sq->vq->num_free)
> > > +			/* we can't wake up by ourself, and it should be done
> > > +			 * by the user.
> > > +			 */
> > > +			xsk_set_tx_need_wakeup(pool);
> > > +		else
> > > +			/* we can wake up from skb_xmit_done() */
> > > +			xsk_clear_tx_need_wakeup(pool);
> > 
> > But what if we don't have get tx napi so no wakeup in skb_xmit_done?
> 
> Sorry that I'm not sure what "get tx napi" means here ;(
> 
> There are entry in sq->vq, so skb_xmit_done() will be called after
> the entries in the ring is consumed by the HOST, right?
> Then, the corresponding sq->napi will be scheduled, as we ensure
> that tx napi is always enabled, which means napi->weight is not
> zero, in this commit:
> 1df5116a41a8 ("virtio_net: xsk: prevent disable tx napi")

Oh I forgot we did that. But can xsk bind when tx napi has already
been disabled previously?


> Right?
> 
> Thanks!
> Menglong Dong
> 
> > 
> > 
> > > +	}
> > > +
> > >  	if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
> > >  		check_sq_full_and_disable(vi, vi->dev, sq);
> > >  
> > > @@ -1470,9 +1488,6 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
> > >  	u64_stats_add(&sq->stats.xdp_tx,  sent);
> > >  	u64_stats_update_end(&sq->stats.syncp);
> > >  
> > > -	if (xsk_uses_need_wakeup(pool))
> > > -		xsk_set_tx_need_wakeup(pool);
> > > -
> > >  	return sent;
> > >  }
> > >  
> > > -- 
> > > 2.54.0
> > 
> > 
> > 
> 
> 
> 


^ permalink raw reply

* Re: [PATCH net-next v3] virtio-net: xsk: support tx wake up
From: Menglong Dong @ 2026-06-22 12:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: xuanzhuo, Menglong Dong, eperezma, mst, jasowang, andrew+netdev,
	davem, edumazet, pabeni, netdev, virtualization, linux-kernel
In-Reply-To: <20260621150610.0ad5d02e@kernel.org>

On 2026/6/22 06:06 Jakub Kicinski <kuba@kernel.org> write:
> On Tue, 16 Jun 2026 19:59:12 +0800 Menglong Dong wrote:
> > For now, XDP_RING_NEED_WAKEUP is not supported properly by the virtio-net
> > in the tx path for example: we set xsk_set_tx_need_wakeup() in
> > virtnet_xsk_xmit(), but we didn't call xsk_clear_tx_need_wakeup()
> > anywhere, which means the user will call send() for every packet.
> > 
> > We call xsk_set_tx_need_wakeup() after virtnet_xsk_xmit_batch() if sq->vq
> > is empty, as we can't be wakeup by the skb_xmit_done() in this case.
> > Otherwise, we will clear the wakeup flag.
> > 
> > Race condition is considered for tx path.
> 
> Seems to follow what mlx5 does so presumably this is fine but IDK if

Yeah, I followed the logic of mlx5. It's amazing that you found it :)

> there's anything virtio-specific that we need to be worried about.
> 
> Xuan Zhuo, please TAL?
> -- 
> mping: VIRTIO NET DRIVER
> 
> 





^ permalink raw reply

* Re: [patch V2 18/25] timekeeping: Prepare for cross timestamps on arbitrary clock IDs
From: David Woodhouse @ 2026-06-22 12:34 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Miroslav Lichvar, John Stultz, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, thomas.weissschuh, Arthur Kiyanovski,
	Rodolfo Giometti, Vincent Donnefort, Marc Zyngier, Oliver Upton,
	kvmarm, Oliver Upton, Richard Cochran, netdev, Takashi Iwai,
	Miri Korenblit, Johannes Berg, Jacob Keller, Tony Nguyen,
	Saeed Mahameed, Peter Hilber, Michael S. Tsirkin, virtualization,
	linux-wireless, linux-sound, Vadim Fedorenko
In-Reply-To: <87se6eltod.ffs@fw13>

[-- Attachment #1: Type: text/plain, Size: 861 bytes --]

On Mon, 2026-06-22 at 13:07 +0200, Thomas Gleixner wrote:
> On Mon, Jun 22 2026 at 09:55, David Woodhouse wrote:
> > We ended up with ktime_get_snapshot_id() also supporting CLOCK_BOOTTIME
> > and CLOCK_MONOTONIC_RAW, but not get_device_system_crosststamp().
> > Should we make that consistent?
> 
> Maybe. The BOOTTIME support is only there for that ARM64 hyper trace muck,
> but has no other relevance.
> 
> MONORAW is there for the PTP EXTENDED IOCTL, but with PRECISE the
> snapshot already contains the raw value and you'd have to prevent the
> historical adjustment part for RAW. So I don't see the actual value, but
> I don't have a strong opinion either.

Yeah, I'm not sure I see the need for it; it's just the consistency
thing that slightly bothered me once I had them both in my sights doing
the snapshot_ntp_error() thing in both.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v3] virtio-net: xsk: support tx wake up
From: Menglong Dong @ 2026-06-22 12:28 UTC (permalink / raw)
  To: Menglong Dong, Xuan Zhuo
  Cc: mst, jasowang, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, virtualization, linux-kernel, eperezma
In-Reply-To: <1782096043.3540094-1-xuanzhuo@linux.alibaba.com>

On 2026/6/22 10:40 Xuan Zhuo <xuanzhuo@linux.alibaba.com> write:
> On Tue, 16 Jun 2026 19:59:12 +0800, Menglong Dong <menglong8.dong@gmail.com> wrote:
> > For now, XDP_RING_NEED_WAKEUP is not supported properly by the virtio-net
> > in the tx path for example: we set xsk_set_tx_need_wakeup() in
> > virtnet_xsk_xmit(), but we didn't call xsk_clear_tx_need_wakeup()
> > anywhere, which means the user will call send() for every packet.
> >
> > We call xsk_set_tx_need_wakeup() after virtnet_xsk_xmit_batch() if sq->vq
> > is empty, as we can't be wakeup by the skb_xmit_done() in this case.
> > Otherwise, we will clear the wakeup flag.
> >
> > Race condition is considered for tx path.
> >
> > Fixes: 89f86675cb03 ("virtio_net: xsk: tx: support xmit xsk buffer")
> 
> This is not a bug, so we do not need this.
> And you post this to net-next.

Okay, I'll remove this tag in the V4.

> 
> 
> > Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
> > ---
> > v3:
[...]
> > +
> > +	if (need_wakeup && vring_size == sq->vq->num_free)
> > +		xsk_set_tx_need_wakeup(pool);
> 
> You need to comment this.

Ack!

> 
> 
> > +
[...]
> > +
> >  	if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
> >  		check_sq_full_and_disable(vi, vi->dev, sq);
> 
> 
> After fixed above comments, you can add:
> 
> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>

OK! Thanks for the review :)

> 
> Thanks.
> 
> 
> >
> > @@ -1470,9 +1488,6 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
> >  	u64_stats_add(&sq->stats.xdp_tx,  sent);
> >  	u64_stats_update_end(&sq->stats.syncp);
> >
> > -	if (xsk_uses_need_wakeup(pool))
> > -		xsk_set_tx_need_wakeup(pool);
> > -
> >  	return sent;
> >  }
> >
> > --
> > 2.54.0
> >
> 
> 





^ permalink raw reply

* Re: [PATCH net-next v3] virtio-net: xsk: support tx wake up
From: Menglong Dong @ 2026-06-22 12:27 UTC (permalink / raw)
  To: Menglong Dong, Michael S. Tsirkin
  Cc: xuanzhuo, eperezma, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, netdev, virtualization, linux-kernel
In-Reply-To: <20260621182119-mutt-send-email-mst@kernel.org>

On 2026/6/22 06:31 Michael S. Tsirkin <mst@redhat.com> write:
> On Tue, Jun 16, 2026 at 07:59:12PM +0800, Menglong Dong wrote:
[...]
> >  
> > +	vring_size = virtqueue_get_vring_size(sq->vq);
> > +	need_wakeup = xsk_uses_need_wakeup(pool);
> > +
> > +	if (need_wakeup && vring_size == sq->vq->num_free)
> > +		xsk_set_tx_need_wakeup(pool);
> > +
> 
> why are we doing this here?
> the check after virtnet_xsk_xmit_batch not enough?
> I vaguely think it's some kind of race we are closing?
> Pls add a comment to explain.

Hi, Michael. Thanks for your review.

Yeah, it's for a race condition between user space and kernel
space. I added a comment in V2, which is too confusing, and
I removed it 😢. I'll make it more clear and add it in the V4. The
origin comment is:

 * If the sq->vq is empty, and the tx ring is empty, and the user
 * submit an entry to the tx ring after virtnet_xsk_xmit_batch() and
 * before xsk_set_tx_need_wakeup(), we will lose the chance to wake
 * up the tx napi, so we have to set the need_wakeup flag here.

And the logic is like this:

Kernel: tx NAPI is waked up from skb_xmit_done() ->
Kernel: sq->vq and xsk->tx_ring are both empty ->
Kernel: call virtnet_xsk_xmit_batch()

    User: submit a entry to the xsk->tx_ring
    User: check the wakeup flag
    User: wakeup flag is not set, skip send()

Kernel: call xsk_set_tx_need_wakeup(), because sq->vq is empty

If we don't send more data, the data in the xsk->tx_ring will
not be sent forever.

> 
> >  	sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
> >  
> > +	if (need_wakeup) {
> > +		if (vring_size == sq->vq->num_free)
> > +			/* we can't wake up by ourself, and it should be done
> > +			 * by the user.
> > +			 */
> > +			xsk_set_tx_need_wakeup(pool);
> > +		else
> > +			/* we can wake up from skb_xmit_done() */
> > +			xsk_clear_tx_need_wakeup(pool);
> 
> But what if we don't have get tx napi so no wakeup in skb_xmit_done?

Sorry that I'm not sure what "get tx napi" means here ;(

There are entry in sq->vq, so skb_xmit_done() will be called after
the entries in the ring is consumed by the HOST, right?
Then, the corresponding sq->napi will be scheduled, as we ensure
that tx napi is always enabled, which means napi->weight is not
zero, in this commit:
1df5116a41a8 ("virtio_net: xsk: prevent disable tx napi")

Right?

Thanks!
Menglong Dong

> 
> 
> > +	}
> > +
> >  	if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
> >  		check_sq_full_and_disable(vi, vi->dev, sq);
> >  
> > @@ -1470,9 +1488,6 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
> >  	u64_stats_add(&sq->stats.xdp_tx,  sent);
> >  	u64_stats_update_end(&sq->stats.syncp);
> >  
> > -	if (xsk_uses_need_wakeup(pool))
> > -		xsk_set_tx_need_wakeup(pool);
> > -
> >  	return sent;
> >  }
> >  
> > -- 
> > 2.54.0
> 
> 
> 





^ permalink raw reply

* Re: [PATCH 0/2] tools: Fix tools/virtio test build
From: Michael S. Tsirkin @ 2026-06-22 11:54 UTC (permalink / raw)
  To: Yichong Chen
  Cc: akpm, eperezma, jasowang, linux-kernel, ljs, rppt, virtualization,
	xuanzhuo
In-Reply-To: <A1DCE9FEF645A3F9+20260622070113.1207518-1-chenyichong@uniontech.com>

On Mon, Jun 22, 2026 at 03:01:13PM +0800, Yichong Chen wrote:
> Hi Michael,
> 
> I checked the history again. The tree was based on v7.1, but I had
> unrelated local commits on top when I generated the series.
> 
> I rechecked this on a clean v7.1 tree:
> 
>   8cd9520d35a6 ("Linux 7.1")
> 
> The build failure is still reproducible with:
> 
>   make -C tools/virtio test
> 
> The first failure is:
> 
>   include/linux/virtio.h:10:10: fatal error:
>   linux/mod_devicetable.h: No such file or directory
> 
> The two patches also apply cleanly on top of v7.1 and make
> the tools/virtio test target build virtio_test, vringh_test and
> vhost_net_test.
> 
> I can resend a v2 based on v7.1 with a base-commit if you prefer.
> 
> Thanks,
> Yichong

Hello, thanks for the report!
the commit to try is 8cb2c9285e4ce9154f45fb15633ebd45dfd8d9cf - that is
not in 7.1.


^ permalink raw reply

* Re: [PATCH v3 0/7] Prepare mutable list iterators to cache cursor state
From: David Hildenbrand (Arm) @ 2026-06-22 11:27 UTC (permalink / raw)
  To: Alexei Starovoitov, Kaitao Cheng
  Cc: Andrew Morton, Jens Axboe, Tejun Heo, Alexander Viro,
	Christian Brauner, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
	Paul E. McKenney, Shakeel Butt, Christian König,
	David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, LKML,
	open list:CONTROL GROUP (CGROUP), linux-ntfs-dev, Linux-Fsdevel,
	io-uring, audit, bpf, Network Development, dri-devel,
	linux-perf-use., linux-trace-kernel, kexec, live-patching,
	linux-modules, Linux Crypto Mailing List, Linux Power Management,
	rcu, sched-ext, linux-mm, virtualization, damon,
	clang-built-linux, chengkaitao
In-Reply-To: <CAADnVQJmPWFT01b7DuLdtafv=8FyB84GYHNZ8zSTck+9Aw0JpA@mail.gmail.com>

On 6/22/26 07:28, Alexei Starovoitov wrote:
> On Sun, Jun 21, 2026 at 9:06 PM Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>
>> From: chengkaitao <chengkaitao@kylinos.cn>
>>
>> The list_for_each*_safe() helpers are used when the loop body may remove
>> the current entry.  Their current interface, however, forces every caller
>> to define a temporary cursor outside the macro and pass it in, even when
>> the caller never uses that cursor directly.  For most call sites this
>> extra cursor is just boilerplate required by the macro implementation.
>>
>> This is awkward because the saved next pointer is an internal detail of
>> the iteration.  Callers that only remove or move the current entry do not
>> need to spell it out.
>>
>> The _safe() suffix has also caused confusion.  Christian Koenig pointed
>> out that the name is easy to read as a thread-safe variant, especially
>> for beginners, even though it only means that the iterator keeps enough
>> state to tolerate removal of the current entry.  He suggested _mutable()
>> as a clearer description of what the loop permits.
>>
>> Add *_mutable() iterator variants for list, hlist and llist.  The new
>> helpers are variadic and support both forms.  In the common case, the
>> caller omits the temporary cursor and the macro creates a unique internal
>> cursor with typeof(pos) and __UNIQUE_ID().  If a loop really needs an
>> explicit temporary cursor, the caller can still pass it and the helper
>> keeps the existing *_safe() behaviour.
>>
>> For example, a call site may use the shorter form:
>>
>>   list_for_each_entry_mutable(pos, head, member)
>>
>> or keep the explicit temporary cursor form:
>>
>>   list_for_each_entry_mutable(pos, tmp, head, member)
>>
>> The existing *_safe() helpers remain available for compatibility.  This
>> series only converts users in mm, block, kernel, init and io_uring.  If
>> this approach looks acceptable, the remaining users can be converted in
>> follow-up series.
>>
>> Changes in v3 (Christian König, Andy Shevchenko):
>> - Convert safe list walks to mutable iterators
>>
>> Changes in v2 (Muchun Song, Andy Shevchenko):
>> - Drop the list_for_each_entry_mutable*() helpers from v1 and make the
>>   cursor change directly in the existing list_for_each_entry*() helpers.
>> - Open-code special list walks that rely on updating the loop cursor in
>>   the body, preserving their existing traversal semantics.
>>
>> Link to v2:
>> https://lore.kernel.org/all/20260609061347.93688-1-kaitao.cheng@linux.dev/
>>
>> Link to v1:
>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
>>
>> Kaitao Cheng (7):
>>   list: Add mutable iterator variants
>>   llist: Add mutable iterator variants
>>   mm: Use mutable list iterators
>>   block: Use mutable list iterators
>>   kernel: Use mutable list iterators
>>   initramfs: Use mutable list iterator
>>   io_uring: Use mutable list iterators
>>
>>  block/bfq-iosched.c                 |  17 +-
>>  block/blk-cgroup.c                  |  12 +-
>>  block/blk-flush.c                   |   4 +-
>>  block/blk-iocost.c                  |  18 +-
>>  block/blk-mq.c                      |   8 +-
>>  block/blk-throttle.c                |   4 +-
>>  block/kyber-iosched.c               |   4 +-
>>  block/partitions/ldm.c              |   8 +-
>>  block/sed-opal.c                    |   4 +-
>>  include/linux/list.h                | 269 ++++++++++++++++++++++++----
>>  include/linux/llist.h               |  81 +++++++--
>>  init/initramfs.c                    |   5 +-
>>  io_uring/cancel.c                   |   6 +-
>>  io_uring/poll.c                     |   3 +-
>>  io_uring/rw.c                       |   4 +-
>>  io_uring/timeout.c                  |   8 +-
>>  io_uring/uring_cmd.c                |   3 +-
>>  kernel/audit_tree.c                 |   4 +-
>>  kernel/audit_watch.c                |  16 +-
>>  kernel/auditfilter.c                |   4 +-
>>  kernel/auditsc.c                    |   4 +-
>>  kernel/bpf/arena.c                  |  10 +-
>>  kernel/bpf/arraymap.c               |   8 +-
>>  kernel/bpf/bpf_local_storage.c      |   3 +-
>>  kernel/bpf/bpf_lru_list.c           |  25 ++-
>>  kernel/bpf/btf.c                    |  18 +-
>>  kernel/bpf/cgroup.c                 |   7 +-
>>  kernel/bpf/cpumap.c                 |   4 +-
>>  kernel/bpf/devmap.c                 |  10 +-
>>  kernel/bpf/helpers.c                |   8 +-
>>  kernel/bpf/local_storage.c          |   4 +-
>>  kernel/bpf/memalloc.c               |  16 +-
>>  kernel/bpf/offload.c                |   8 +-
>>  kernel/bpf/states.c                 |   4 +-
>>  kernel/bpf/stream.c                 |   4 +-
>>  kernel/bpf/verifier.c               |   6 +-
>>  kernel/cgroup/cgroup-v1.c           |   4 +-
>>  kernel/cgroup/cgroup.c              |  54 +++---
>>  kernel/cgroup/dmem.c                |  12 +-
>>  kernel/cgroup/rdma.c                |   8 +-
>>  kernel/events/core.c                |  44 +++--
>>  kernel/events/uprobes.c             |  12 +-
>>  kernel/exit.c                       |   8 +-
>>  kernel/fail_function.c              |   4 +-
>>  kernel/gcov/clang.c                 |   4 +-
>>  kernel/irq_work.c                   |   4 +-
>>  kernel/kexec_core.c                 |   4 +-
>>  kernel/kprobes.c                    |  16 +-
>>  kernel/livepatch/core.c             |   4 +-
>>  kernel/livepatch/core.h             |   4 +-
>>  kernel/liveupdate/kho_block.c       |   4 +-
>>  kernel/liveupdate/luo_flb.c         |   4 +-
>>  kernel/locking/rwsem.c              |   2 +-
>>  kernel/locking/test-ww_mutex.c      |   2 +-
>>  kernel/module/main.c                |  11 +-
>>  kernel/padata.c                     |   4 +-
>>  kernel/power/snapshot.c             |   8 +-
>>  kernel/power/wakelock.c             |   4 +-
>>  kernel/printk/printk.c              |  11 +-
>>  kernel/ptrace.c                     |   4 +-
>>  kernel/rcu/rcutorture.c             |   3 +-
>>  kernel/rcu/tasks.h                  |   9 +-
>>  kernel/rcu/tree.c                   |   6 +-
>>  kernel/resource.c                   |   4 +-
>>  kernel/sched/core.c                 |   4 +-
>>  kernel/sched/ext.c                  |  22 +--
>>  kernel/sched/fair.c                 |  28 +--
>>  kernel/sched/topology.c             |   4 +-
>>  kernel/sched/wait.c                 |   4 +-
>>  kernel/seccomp.c                    |   4 +-
>>  kernel/signal.c                     |  11 +-
>>  kernel/smp.c                        |   4 +-
>>  kernel/taskstats.c                  |   8 +-
>>  kernel/time/clockevents.c           |   6 +-
>>  kernel/time/clocksource.c           |   4 +-
>>  kernel/time/posix-cpu-timers.c      |   4 +-
>>  kernel/time/posix-timers.c          |   3 +-
>>  kernel/torture.c                    |   3 +-
>>  kernel/trace/bpf_trace.c            |   4 +-
>>  kernel/trace/ftrace.c               |  49 +++--
>>  kernel/trace/ring_buffer.c          |  25 ++-
>>  kernel/trace/trace.c                |  12 +-
>>  kernel/trace/trace_dynevent.c       |   6 +-
>>  kernel/trace/trace_dynevent.h       |   5 +-
>>  kernel/trace/trace_events.c         |  35 ++--
>>  kernel/trace/trace_events_filter.c  |   4 +-
>>  kernel/trace/trace_events_hist.c    |   8 +-
>>  kernel/trace/trace_events_trigger.c |  17 +-
>>  kernel/trace/trace_events_user.c    |  16 +-
>>  kernel/trace/trace_stat.c           |   4 +-
>>  kernel/user-return-notifier.c       |   3 +-
>>  kernel/workqueue.c                  |  16 +-
>>  mm/backing-dev.c                    |   8 +-
>>  mm/balloon.c                        |   8 +-
>>  mm/cma.c                            |   4 +-
>>  mm/compaction.c                     |   4 +-
>>  mm/damon/core.c                     |   4 +-
>>  mm/damon/sysfs-schemes.c            |   4 +-
>>  mm/dmapool.c                        |   4 +-
>>  mm/huge_memory.c                    |   8 +-
>>  mm/hugetlb.c                        |  56 +++---
>>  mm/hugetlb_vmemmap.c                |  16 +-
>>  mm/khugepaged.c                     |  14 +-
>>  mm/kmemleak.c                       |   7 +-
>>  mm/ksm.c                            |  25 +--
>>  mm/list_lru.c                       |   4 +-
>>  mm/memcontrol-v1.c                  |   8 +-
>>  mm/memory-failure.c                 |  12 +-
>>  mm/memory-tiers.c                   |   4 +-
>>  mm/migrate.c                        |  23 ++-
>>  mm/mmu_notifier.c                   |   9 +-
>>  mm/page_alloc.c                     |   8 +-
>>  mm/page_reporting.c                 |   2 +-
>>  mm/percpu.c                         |  11 +-
>>  mm/pgtable-generic.c                |   4 +-
>>  mm/rmap.c                           |  10 +-
>>  mm/shmem.c                          |   9 +-
>>  mm/slab_common.c                    |  14 +-
>>  mm/slub.c                           |  33 ++--
>>  mm/swapfile.c                       |   4 +-
>>  mm/userfaultfd.c                    |  12 +-
>>  mm/vmalloc.c                        |  24 +--
>>  mm/vmscan.c                         |   7 +-
>>  mm/zsmalloc.c                       |   4 +-
>>  124 files changed, 875 insertions(+), 681 deletions(-)
> 
> Not sure what you were thinking, but this diff stat
> is not landable.

Agreed. If we decide we want this, I guess we should target per-subsystem
conversions.

If this goes through the MM tree, I would even appreciate doing this on a per-MM
component granularity.

(unless we have some magic "Linus converts all of them" script, which I doubt we
will have)

Is there a way forward to replace list_for_each_*_safe entirely, possibly just
reusing the old name but simply the parameter?

-- 
Cheers,

David

^ permalink raw reply

* Re: [patch V2 18/25] timekeeping: Prepare for cross timestamps on arbitrary clock IDs
From: Thomas Gleixner @ 2026-06-22 11:07 UTC (permalink / raw)
  To: David Woodhouse, LKML
  Cc: Miroslav Lichvar, John Stultz, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, thomas.weissschuh, Arthur Kiyanovski,
	Rodolfo Giometti, Vincent Donnefort, Marc Zyngier, Oliver Upton,
	kvmarm, Oliver Upton, Richard Cochran, netdev, Takashi Iwai,
	Miri Korenblit, Johannes Berg, Jacob Keller, Tony Nguyen,
	Saeed Mahameed, Peter Hilber, Michael S. Tsirkin, virtualization,
	linux-wireless, linux-sound, Vadim Fedorenko
In-Reply-To: <b296182e2e2c1ed2fe1c4879fd6f12d67a7ad22f.camel@infradead.org>

On Mon, Jun 22 2026 at 09:55, David Woodhouse wrote:
> On Fri, 2026-05-29 at 22:01 +0200, Thomas Gleixner wrote:
>> From: Thomas Gleixner <tglx@kernel.org>
>> 
>> PTP device system crosstime stamps support only CLOCK_REALTIME, which is
>> meaningless for AUX clocks. The PTP core hands in the clock ID already, so
>> prepare the core code to honor it.
>> 
>>  - Add a new sys_systime field to struct system_device_crosststamp which
>>    aliases the sys_realtime field. Once all users are converted
>>    sys_realtime can be removed.
>> 
>>  - Prepare get_device_system_crosststamp() and the related code for it by
>>    switching to sys_systime and providing the initial changes to utilize
>>    different time keepers.
>> 
>> No functional change intended.
>
> We ended up with ktime_get_snapshot_id() also supporting CLOCK_BOOTTIME
> and CLOCK_MONOTONIC_RAW, but not get_device_system_crosststamp().
> Should we make that consistent?

Maybe. The BOOTTIME support is only there for that ARM64 hyper trace muck,
but has no other relevance.

MONORAW is there for the PTP EXTENDED IOCTL, but with PRECISE the
snapshot already contains the raw value and you'd have to prevent the
historical adjustment part for RAW. So I don't see the actual value, but
I don't have a strong opinion either.

Thanks,

        tglx




^ permalink raw reply

* Re: [PATCH v3 0/7] Prepare mutable list iterators to cache cursor state
From: Andy Shevchenko @ 2026-06-22 10:46 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Alexei Starovoitov, Andrew Morton, David Hildenbrand, Jens Axboe,
	Tejun Heo, Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Paul E. McKenney, Shakeel Butt, Christian König,
	David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, LKML,
	open list:CONTROL GROUP (CGROUP), linux-ntfs-dev, Linux-Fsdevel,
	io-uring, audit, bpf, Network Development, dri-devel,
	linux-perf-use., linux-trace-kernel, kexec, live-patching,
	linux-modules, Linux Crypto Mailing List, Linux Power Management,
	rcu, sched-ext, linux-mm, virtualization, damon,
	clang-built-linux, chengkaitao, Muchun Song
In-Reply-To: <8c8f1849-86d3-4c69-be27-30bbdffdf616@linux.dev>

On Mon, Jun 22, 2026 at 02:15:01PM +0800, Kaitao Cheng wrote:
> 在 2026/6/22 13:28, Alexei Starovoitov 写道:
> > On Sun, Jun 21, 2026 at 9:06 PM Kaitao Cheng <kaitao.cheng@linux.dev> wrote:

...

> >>  block/bfq-iosched.c                 |  17 +-
> >>  block/blk-cgroup.c                  |  12 +-
> >>  block/blk-flush.c                   |   4 +-
> >>  block/blk-iocost.c                  |  18 +-
> >>  block/blk-mq.c                      |   8 +-
> >>  block/blk-throttle.c                |   4 +-
> >>  block/kyber-iosched.c               |   4 +-
> >>  block/partitions/ldm.c              |   8 +-
> >>  block/sed-opal.c                    |   4 +-
> >>  include/linux/list.h                | 269 ++++++++++++++++++++++++----
> >>  include/linux/llist.h               |  81 +++++++--
> >>  init/initramfs.c                    |   5 +-
> >>  io_uring/cancel.c                   |   6 +-
> >>  io_uring/poll.c                     |   3 +-
> >>  io_uring/rw.c                       |   4 +-
> >>  io_uring/timeout.c                  |   8 +-
> >>  io_uring/uring_cmd.c                |   3 +-
> >>  kernel/audit_tree.c                 |   4 +-
> >>  kernel/audit_watch.c                |  16 +-
> >>  kernel/auditfilter.c                |   4 +-
> >>  kernel/auditsc.c                    |   4 +-
> >>  kernel/bpf/arena.c                  |  10 +-
> >>  kernel/bpf/arraymap.c               |   8 +-
> >>  kernel/bpf/bpf_local_storage.c      |   3 +-
> >>  kernel/bpf/bpf_lru_list.c           |  25 ++-
> >>  kernel/bpf/btf.c                    |  18 +-
> >>  kernel/bpf/cgroup.c                 |   7 +-
> >>  kernel/bpf/cpumap.c                 |   4 +-
> >>  kernel/bpf/devmap.c                 |  10 +-
> >>  kernel/bpf/helpers.c                |   8 +-
> >>  kernel/bpf/local_storage.c          |   4 +-
> >>  kernel/bpf/memalloc.c               |  16 +-
> >>  kernel/bpf/offload.c                |   8 +-
> >>  kernel/bpf/states.c                 |   4 +-
> >>  kernel/bpf/stream.c                 |   4 +-
> >>  kernel/bpf/verifier.c               |   6 +-
> >>  kernel/cgroup/cgroup-v1.c           |   4 +-
> >>  kernel/cgroup/cgroup.c              |  54 +++---
> >>  kernel/cgroup/dmem.c                |  12 +-
> >>  kernel/cgroup/rdma.c                |   8 +-
> >>  kernel/events/core.c                |  44 +++--
> >>  kernel/events/uprobes.c             |  12 +-
> >>  kernel/exit.c                       |   8 +-
> >>  kernel/fail_function.c              |   4 +-
> >>  kernel/gcov/clang.c                 |   4 +-
> >>  kernel/irq_work.c                   |   4 +-
> >>  kernel/kexec_core.c                 |   4 +-
> >>  kernel/kprobes.c                    |  16 +-
> >>  kernel/livepatch/core.c             |   4 +-
> >>  kernel/livepatch/core.h             |   4 +-
> >>  kernel/liveupdate/kho_block.c       |   4 +-
> >>  kernel/liveupdate/luo_flb.c         |   4 +-
> >>  kernel/locking/rwsem.c              |   2 +-
> >>  kernel/locking/test-ww_mutex.c      |   2 +-
> >>  kernel/module/main.c                |  11 +-
> >>  kernel/padata.c                     |   4 +-
> >>  kernel/power/snapshot.c             |   8 +-
> >>  kernel/power/wakelock.c             |   4 +-
> >>  kernel/printk/printk.c              |  11 +-
> >>  kernel/ptrace.c                     |   4 +-
> >>  kernel/rcu/rcutorture.c             |   3 +-
> >>  kernel/rcu/tasks.h                  |   9 +-
> >>  kernel/rcu/tree.c                   |   6 +-
> >>  kernel/resource.c                   |   4 +-
> >>  kernel/sched/core.c                 |   4 +-
> >>  kernel/sched/ext.c                  |  22 +--
> >>  kernel/sched/fair.c                 |  28 +--
> >>  kernel/sched/topology.c             |   4 +-
> >>  kernel/sched/wait.c                 |   4 +-
> >>  kernel/seccomp.c                    |   4 +-
> >>  kernel/signal.c                     |  11 +-
> >>  kernel/smp.c                        |   4 +-
> >>  kernel/taskstats.c                  |   8 +-
> >>  kernel/time/clockevents.c           |   6 +-
> >>  kernel/time/clocksource.c           |   4 +-
> >>  kernel/time/posix-cpu-timers.c      |   4 +-
> >>  kernel/time/posix-timers.c          |   3 +-
> >>  kernel/torture.c                    |   3 +-
> >>  kernel/trace/bpf_trace.c            |   4 +-
> >>  kernel/trace/ftrace.c               |  49 +++--
> >>  kernel/trace/ring_buffer.c          |  25 ++-
> >>  kernel/trace/trace.c                |  12 +-
> >>  kernel/trace/trace_dynevent.c       |   6 +-
> >>  kernel/trace/trace_dynevent.h       |   5 +-
> >>  kernel/trace/trace_events.c         |  35 ++--
> >>  kernel/trace/trace_events_filter.c  |   4 +-
> >>  kernel/trace/trace_events_hist.c    |   8 +-
> >>  kernel/trace/trace_events_trigger.c |  17 +-
> >>  kernel/trace/trace_events_user.c    |  16 +-
> >>  kernel/trace/trace_stat.c           |   4 +-
> >>  kernel/user-return-notifier.c       |   3 +-
> >>  kernel/workqueue.c                  |  16 +-
> >>  mm/backing-dev.c                    |   8 +-
> >>  mm/balloon.c                        |   8 +-
> >>  mm/cma.c                            |   4 +-
> >>  mm/compaction.c                     |   4 +-
> >>  mm/damon/core.c                     |   4 +-
> >>  mm/damon/sysfs-schemes.c            |   4 +-
> >>  mm/dmapool.c                        |   4 +-
> >>  mm/huge_memory.c                    |   8 +-
> >>  mm/hugetlb.c                        |  56 +++---
> >>  mm/hugetlb_vmemmap.c                |  16 +-
> >>  mm/khugepaged.c                     |  14 +-
> >>  mm/kmemleak.c                       |   7 +-
> >>  mm/ksm.c                            |  25 +--
> >>  mm/list_lru.c                       |   4 +-
> >>  mm/memcontrol-v1.c                  |   8 +-
> >>  mm/memory-failure.c                 |  12 +-
> >>  mm/memory-tiers.c                   |   4 +-
> >>  mm/migrate.c                        |  23 ++-
> >>  mm/mmu_notifier.c                   |   9 +-
> >>  mm/page_alloc.c                     |   8 +-
> >>  mm/page_reporting.c                 |   2 +-
> >>  mm/percpu.c                         |  11 +-
> >>  mm/pgtable-generic.c                |   4 +-
> >>  mm/rmap.c                           |  10 +-
> >>  mm/shmem.c                          |   9 +-
> >>  mm/slab_common.c                    |  14 +-
> >>  mm/slub.c                           |  33 ++--
> >>  mm/swapfile.c                       |   4 +-
> >>  mm/userfaultfd.c                    |  12 +-
> >>  mm/vmalloc.c                        |  24 +--
> >>  mm/vmscan.c                         |   7 +-
> >>  mm/zsmalloc.c                       |   4 +-
> >>  124 files changed, 875 insertions(+), 681 deletions(-)
> > 
> > Not sure what you were thinking, but this diff stat
> > is not landable.
> 
> [PATCH v3 1/7] and [PATCH v3 2/7] contain the main logic and can
> be merged directly. They are also compatible with the old API.
> [PATCH v3 3/7] through [PATCH v3 7/7] are just simple interface
> replacements and do not change any functional logic. They can be
> left unmerged for now; individual modules can pick them up later
> if needed.
> 
> In v2, Andy Shevchenko mentioned: "If it's done by Linus himself
> during the day when he prepares -rc1, it's fine."

Yes, but you need to get his blessing first to go with this.
Have you communicated with him on this?

> Even so, the
> changes in this patch series are indeed quite large and touch
> almost every subsystem. I have only converted part of them for
> now, so I wanted to send this out first and see what people think.

That's why it's better to provide a script to convert (e.g., coccinelle)
instead of tons of patches.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* Re: [patch V2 18/25] timekeeping: Prepare for cross timestamps on arbitrary clock IDs
From: David Woodhouse @ 2026-06-22  8:55 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Miroslav Lichvar, John Stultz, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, thomas.weissschuh, Arthur Kiyanovski,
	Rodolfo Giometti, Vincent Donnefort, Marc Zyngier, Oliver Upton,
	kvmarm, Oliver Upton, Richard Cochran, netdev, Takashi Iwai,
	Miri Korenblit, Johannes Berg, Jacob Keller, Tony Nguyen,
	Saeed Mahameed, Peter Hilber, Michael S. Tsirkin, virtualization,
	linux-wireless, linux-sound, Vadim Fedorenko
In-Reply-To: <20260529195557.846634842@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 890 bytes --]

On Fri, 2026-05-29 at 22:01 +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@kernel.org>
> 
> PTP device system crosstime stamps support only CLOCK_REALTIME, which is
> meaningless for AUX clocks. The PTP core hands in the clock ID already, so
> prepare the core code to honor it.
> 
>  - Add a new sys_systime field to struct system_device_crosststamp which
>    aliases the sys_realtime field. Once all users are converted
>    sys_realtime can be removed.
> 
>  - Prepare get_device_system_crosststamp() and the related code for it by
>    switching to sys_systime and providing the initial changes to utilize
>    different time keepers.
> 
> No functional change intended.

We ended up with ktime_get_snapshot_id() also supporting CLOCK_BOOTTIME
and CLOCK_MONOTONIC_RAW, but not get_device_system_crosststamp().
Should we make that consistent?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Christian König @ 2026-06-22  8:51 UTC (permalink / raw)
  To: Kaitao Cheng, Andrew Morton, David Hildenbrand, Jens Axboe,
	Tejun Heo, Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt
  Cc: David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, linux-kernel, cgroups,
	linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf, netdev,
	dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, Kaitao Cheng
In-Reply-To: <20260622040533.29824-2-kaitao.cheng@linux.dev>

On 6/22/26 06:05, Kaitao Cheng wrote:
> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> The list_for_each*_safe() helpers are used when the loop body may
> remove the current entry.  Their API exposes the temporary cursor at
> every call site, even though most users only need it for the iterator
> implementation and never reference it in the loop body.
> 
> Add *_mutable() variants for list and hlist iteration.  The new helpers
> support both forms: callers may keep passing an explicit temporary cursor
> when they need to inspect or reset it, or omit it and let the helper use
> a unique internal cursor.

That sounds like a bad idea to me. The macro should really be doing one job and that as best as it can.

> This makes call sites that only mutate the list through the current entry
> less noisy, while keeping the existing *_safe() helpers available for
> compatibility.

This can be perfectly used for code that which really needs the separate variable for the next entry.

Regards,
Christian.


> 
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>  1 file changed, 231 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/list.h b/include/linux/list.h
> index 09d979976b3b..1081def7cea9 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -7,6 +7,7 @@
>  #include <linux/stddef.h>
>  #include <linux/poison.h>
>  #include <linux/const.h>
> +#include <linux/args.h>
>  
>  #include <asm/barrier.h>
>  
> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>  #define list_for_each_prev(pos, head) \
>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>  
> -/**
> - * list_for_each_safe - iterate over a list safe against removal of list entry
> - * @pos:	the &struct list_head to use as a loop cursor.
> - * @n:		another &struct list_head to use as temporary storage
> - * @head:	the head for your list.
> +/*
> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>   */
>  #define list_for_each_safe(pos, n, head) \
>  	for (pos = (head)->next, n = pos->next; \
>  	     !list_is_head(pos, (head)); \
>  	     pos = n, n = pos->next)
>  
> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\
> +	     !list_is_head(pos, (head));				\
> +	     pos = tmp, tmp = pos->next)
> +
> +#define __list_for_each_mutable1(pos, head)				\
> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
> +
> +#define __list_for_each_mutable2(pos, next, head)			\
> +	list_for_each_safe(pos, next, head)
> +
>  /**
> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
> + * list_for_each_mutable - iterate over a list safe against entry removal
>   * @pos:	the &struct list_head to use as a loop cursor.
> - * @n:		another &struct list_head to use as temporary storage
> - * @head:	the head for your list.
> + * @...:	either (head) or (next, head)
> + *
> + * next:	another &struct list_head to use as optional temporary storage.
> + *		The temporary cursor is internal unless explicitly supplied by
> + *		the caller.
> + * head:	the head for your list.
> + */
> +#define list_for_each_mutable(pos, ...)					\
> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
> +		(pos, __VA_ARGS__)
> +
> +/*
> + * list_for_each_prev_safe is an old interface, use list_for_each_prev_mutable instead.
>   */
>  #define list_for_each_prev_safe(pos, n, head) \
>  	for (pos = (head)->prev, n = pos->prev; \
>  	     !list_is_head(pos, (head)); \
>  	     pos = n, n = pos->prev)
>  
> +#define __list_for_each_prev_mutable_internal(pos, tmp, head)		\
> +	for (typeof(pos) tmp = (pos = (head)->prev)->prev;		\
> +	     !list_is_head(pos, (head));				\
> +	     pos = tmp, tmp = pos->prev)
> +
> +#define __list_for_each_prev_mutable1(pos, head)			\
> +	__list_for_each_prev_mutable_internal(pos, __UNIQUE_ID(prev), head)
> +
> +#define __list_for_each_prev_mutable2(pos, prev, head)			\
> +	list_for_each_prev_safe(pos, prev, head)
> +
> +/**
> + * list_for_each_prev_mutable - iterate over a list backwards safe against entry removal
> + * @pos:	the &struct list_head to use as a loop cursor.
> + * @...:	either (head) or (prev, head)
> + *
> + * prev:	another &struct list_head to use as optional temporary storage.
> + *		The temporary cursor is internal unless explicitly supplied by
> + *		the caller.
> + * head:	the head for your list.
> + */
> +#define list_for_each_prev_mutable(pos, ...)				\
> +	CONCATENATE(__list_for_each_prev_mutable, COUNT_ARGS(__VA_ARGS__)) \
> +		(pos, __VA_ARGS__)
> +
>  /**
>   * list_count_nodes - count nodes in the list
>   * @head:	the head for your list.
> @@ -895,12 +940,8 @@ static inline size_t list_count_nodes(struct list_head *head)
>  	for (; !list_entry_is_head(pos, head, member);			\
>  	     pos = list_prev_entry(pos, member))
>  
> -/**
> - * list_for_each_entry_safe - iterate over list of given type safe against removal of list entry
> - * @pos:	the type * to use as a loop cursor.
> - * @n:		another type * to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the list_head within the struct.
> +/*
> + * list_for_each_entry_safe is an old interface, use list_for_each_entry_mutable instead.
>   */
>  #define list_for_each_entry_safe(pos, n, head, member)			\
>  	for (pos = list_first_entry(head, typeof(*pos), member),	\
> @@ -908,15 +949,36 @@ static inline size_t list_count_nodes(struct list_head *head)
>  	     !list_entry_is_head(pos, head, member); 			\
>  	     pos = n, n = list_next_entry(n, member))
>  
> +#define __list_for_each_entry_mutable_internal(pos, tmp, head, member)	\
> +	for (typeof(pos) tmp = list_next_entry(pos =			\
> +		list_first_entry(head, typeof(*pos), member), member);	\
> +	     !list_entry_is_head(pos, head, member);			\
> +	     pos = tmp, tmp = list_next_entry(tmp, member))
> +
> +#define __list_for_each_entry_mutable2(pos, head, member)		\
> +	__list_for_each_entry_mutable_internal(pos, __UNIQUE_ID(next), head, member)
> +
> +#define __list_for_each_entry_mutable3(pos, next, head, member)		\
> +	list_for_each_entry_safe(pos, next, head, member)
> +
>  /**
> - * list_for_each_entry_safe_continue - continue list iteration safe against removal
> + * list_for_each_entry_mutable - iterate over a list safe against entry removal
>   * @pos:	the type * to use as a loop cursor.
> - * @n:		another type * to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the list_head within the struct.
> + * @...:	either (head, member) or (next, head, member)
>   *
> - * Iterate over list of given type, continuing after current point,
> - * safe against removal of list entry.
> + * next:	another type * to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your list.
> + * member:	the name of the list_head within the struct.
> + */
> +#define list_for_each_entry_mutable(pos, ...)				\
> +	CONCATENATE(__list_for_each_entry_mutable, COUNT_ARGS(__VA_ARGS__)) \
> +		(pos, __VA_ARGS__)
> +
> +/*
> + * list_for_each_entry_safe_continue is an old interface,
> + * use list_for_each_entry_mutable_continue instead.
>   */
>  #define list_for_each_entry_safe_continue(pos, n, head, member) 		\
>  	for (pos = list_next_entry(pos, member), 				\
> @@ -924,30 +986,79 @@ static inline size_t list_count_nodes(struct list_head *head)
>  	     !list_entry_is_head(pos, head, member);				\
>  	     pos = n, n = list_next_entry(n, member))
>  
> +#define __list_for_each_entry_mutable_continue_internal(pos, tmp, head, member) \
> +	for (typeof(pos) tmp = list_next_entry(pos =			\
> +		list_next_entry(pos, member), member);			\
> +	     !list_entry_is_head(pos, head, member);			\
> +	     pos = tmp, tmp = list_next_entry(tmp, member))
> +
> +#define __list_for_each_entry_mutable_continue2(pos, head, member)	\
> +	__list_for_each_entry_mutable_continue_internal(pos,		\
> +		__UNIQUE_ID(next), head, member)
> +
> +#define __list_for_each_entry_mutable_continue3(pos, next, head, member) \
> +	list_for_each_entry_safe_continue(pos, next, head, member)
> +
>  /**
> - * list_for_each_entry_safe_from - iterate over list from current point safe against removal
> + * list_for_each_entry_mutable_continue - continue list iteration safe against removal
>   * @pos:	the type * to use as a loop cursor.
> - * @n:		another type * to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the list_head within the struct.
> + * @...:	either (head, member) or (next, head, member)
>   *
> - * Iterate over list of given type from current point, safe against
> - * removal of list entry.
> + * next:	another type * to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your list.
> + * member:	the name of the list_head within the struct.
> + *
> + * Iterate over list of given type, continuing after current point,
> + * safe against removal of list entry.
> + */
> +#define list_for_each_entry_mutable_continue(pos, ...)			\
> +	CONCATENATE(__list_for_each_entry_mutable_continue,		\
> +		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
> +
> +/*
> + * list_for_each_entry_safe_from is an old interface,
> + * use list_for_each_entry_mutable_from instead.
>   */
>  #define list_for_each_entry_safe_from(pos, n, head, member) 			\
>  	for (n = list_next_entry(pos, member);					\
>  	     !list_entry_is_head(pos, head, member);				\
>  	     pos = n, n = list_next_entry(n, member))
>  
> +#define __list_for_each_entry_mutable_from_internal(pos, tmp, head, member) \
> +	for (typeof(pos) tmp = list_next_entry(pos, member);		\
> +	     !list_entry_is_head(pos, head, member);			\
> +	     pos = tmp, tmp = list_next_entry(tmp, member))
> +
> +#define __list_for_each_entry_mutable_from2(pos, head, member)		\
> +	__list_for_each_entry_mutable_from_internal(pos,		\
> +		__UNIQUE_ID(next), head, member)
> +
> +#define __list_for_each_entry_mutable_from3(pos, next, head, member)	\
> +	list_for_each_entry_safe_from(pos, next, head, member)
> +
>  /**
> - * list_for_each_entry_safe_reverse - iterate backwards over list safe against removal
> + * list_for_each_entry_mutable_from - iterate over list from current point safe against removal
>   * @pos:	the type * to use as a loop cursor.
> - * @n:		another type * to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the list_head within the struct.
> + * @...:	either (head, member) or (next, head, member)
>   *
> - * Iterate backwards over list of given type, safe against removal
> - * of list entry.
> + * next:	another type * to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your list.
> + * member:	the name of the list_head within the struct.
> + *
> + * Iterate over list of given type from current point, safe against
> + * removal of list entry.
> + */
> +#define list_for_each_entry_mutable_from(pos, ...)			\
> +	CONCATENATE(__list_for_each_entry_mutable_from,			\
> +		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
> +
> +/*
> + * list_for_each_entry_safe_reverse is an old interface,
> + * use list_for_each_entry_mutable_reverse instead.
>   */
>  #define list_for_each_entry_safe_reverse(pos, n, head, member)		\
>  	for (pos = list_last_entry(head, typeof(*pos), member),		\
> @@ -955,6 +1066,37 @@ static inline size_t list_count_nodes(struct list_head *head)
>  	     !list_entry_is_head(pos, head, member); 			\
>  	     pos = n, n = list_prev_entry(n, member))
>  
> +#define __list_for_each_entry_mutable_reverse_internal(pos, tmp, head, member) \
> +	for (typeof(pos) tmp = list_prev_entry(pos =			\
> +		list_last_entry(head, typeof(*pos), member), member);	\
> +	     !list_entry_is_head(pos, head, member);			\
> +	     pos = tmp, tmp = list_prev_entry(tmp, member))
> +
> +#define __list_for_each_entry_mutable_reverse2(pos, head, member)	\
> +	__list_for_each_entry_mutable_reverse_internal(pos,		\
> +		__UNIQUE_ID(prev), head, member)
> +
> +#define __list_for_each_entry_mutable_reverse3(pos, prev, head, member)	\
> +	list_for_each_entry_safe_reverse(pos, prev, head, member)
> +
> +/**
> + * list_for_each_entry_mutable_reverse - iterate backwards over list safe against removal
> + * @pos:	the type * to use as a loop cursor.
> + * @...:	either (head, member) or (prev, head, member)
> + *
> + * prev:	another type * to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your list.
> + * member:	the name of the list_head within the struct.
> + *
> + * Iterate backwards over list of given type, safe against removal
> + * of list entry.
> + */
> +#define list_for_each_entry_mutable_reverse(pos, ...)			\
> +	CONCATENATE(__list_for_each_entry_mutable_reverse,		\
> +		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
> +
>  /**
>   * list_safe_reset_next - reset a stale list_for_each_entry_safe loop
>   * @pos:	the loop cursor used in the list_for_each_entry_safe loop
> @@ -1189,6 +1331,31 @@ static inline void hlist_splice_init(struct hlist_head *from,
>  	for (pos = (head)->first; pos && ({ n = pos->next; 1; }); \
>  	     pos = n)
>  
> +#define __hlist_for_each_mutable_internal(pos, tmp, head)		\
> +	for (typeof(pos) tmp = (pos = (head)->first) ? pos->next : NULL; \
> +	     pos;							\
> +	     pos = tmp, tmp = pos ? pos->next : NULL)
> +
> +#define __hlist_for_each_mutable1(pos, head)				\
> +	__hlist_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
> +
> +#define __hlist_for_each_mutable2(pos, next, head)			\
> +	hlist_for_each_safe(pos, next, head)
> +
> +/**
> + * hlist_for_each_mutable - iterate over a hlist safe against entry removal
> + * @pos:	the &struct hlist_node to use as a loop cursor.
> + * @...:	either (head) or (next, head)
> + *
> + * next:	another &struct hlist_node to use as optional temporary storage.
> + *		The temporary cursor is internal unless explicitly supplied by
> + *		the caller.
> + * head:	the head for your hlist.
> + */
> +#define hlist_for_each_mutable(pos, ...)				\
> +	CONCATENATE(__hlist_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
> +		(pos, __VA_ARGS__)
> +
>  #define hlist_entry_safe(ptr, type, member) \
>  	({ typeof(ptr) ____ptr = (ptr); \
>  	   ____ptr ? hlist_entry(____ptr, type, member) : NULL; \
> @@ -1224,18 +1391,44 @@ static inline void hlist_splice_init(struct hlist_head *from,
>  	for (; pos;							\
>  	     pos = hlist_entry_safe((pos)->member.next, typeof(*(pos)), member))
>  
> -/**
> - * hlist_for_each_entry_safe - iterate over list of given type safe against removal of list entry
> - * @pos:	the type * to use as a loop cursor.
> - * @n:		a &struct hlist_node to use as temporary storage
> - * @head:	the head for your list.
> - * @member:	the name of the hlist_node within the struct.
> +/*
> + * hlist_for_each_entry_safe is an old interface, use hlist_for_each_entry_mutable instead.
>   */
>  #define hlist_for_each_entry_safe(pos, n, head, member) 		\
>  	for (pos = hlist_entry_safe((head)->first, typeof(*pos), member);\
>  	     pos && ({ n = pos->member.next; 1; });			\
>  	     pos = hlist_entry_safe(n, typeof(*pos), member))
>  
> +#define __hlist_for_each_entry_mutable_internal(pos, tmp, head, member)	\
> +	for (struct hlist_node *tmp = (pos =				\
> +		hlist_entry_safe((head)->first, typeof(*pos), member)) ? \
> +		pos->member.next : NULL;				\
> +	     pos;							\
> +	     pos = hlist_entry_safe((tmp), typeof(*pos), member),	\
> +		tmp = pos ? pos->member.next : NULL)
> +
> +#define __hlist_for_each_entry_mutable2(pos, head, member)		\
> +	__hlist_for_each_entry_mutable_internal(pos,			\
> +		__UNIQUE_ID(next), head, member)
> +
> +#define __hlist_for_each_entry_mutable3(pos, next, head, member)	\
> +	hlist_for_each_entry_safe(pos, next, head, member)
> +
> +/**
> + * hlist_for_each_entry_mutable - iterate over hlist safe against entry removal
> + * @pos:	the type * to use as a loop cursor.
> + * @...:	either (head, member) or (next, head, member)
> + *
> + * next:	a &struct hlist_node to use as optional temporary storage. The
> + *		temporary cursor is internal unless explicitly supplied by the
> + *		caller.
> + * head:	the head for your hlist.
> + * member:	the name of the hlist_node within the struct.
> + */
> +#define hlist_for_each_entry_mutable(pos, ...)				\
> +	CONCATENATE(__hlist_for_each_entry_mutable,			\
> +		COUNT_ARGS(__VA_ARGS__))(pos, __VA_ARGS__)
> +
>  /**
>   * hlist_count_nodes - count nodes in the hlist
>   * @head:	the head for your hlist.


^ permalink raw reply

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: David Laight @ 2026-06-22  8:42 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König, David Howells, Simona Vetter, Randy Dunlap,
	Luca Ceresoli, Philipp Stanner, linux-block, linux-kernel,
	cgroups, linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf,
	netdev, dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, Kaitao Cheng
In-Reply-To: <20260622040533.29824-2-kaitao.cheng@linux.dev>

On Mon, 22 Jun 2026 12:05:31 +0800
Kaitao Cheng <kaitao.cheng@linux.dev> wrote:

> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> The list_for_each*_safe() helpers are used when the loop body may
> remove the current entry.  Their API exposes the temporary cursor at
> every call site, even though most users only need it for the iterator
> implementation and never reference it in the loop body.
> 
> Add *_mutable() variants for list and hlist iteration.  The new helpers
> support both forms: callers may keep passing an explicit temporary cursor
> when they need to inspect or reset it, or omit it and let the helper use
> a unique internal cursor.

I'm not really sure 'mutable' means anything either.
It is possible to make it valid for the loop body (or even other threads)
to delete arbitrary list items - but that needs significant extra overheads.

It might be worth doing something that doesn't need the extra variable,
but there is little point doing all the churn just to rename things.

> 
> This makes call sites that only mutate the list through the current entry
> less noisy, while keeping the existing *_safe() helpers available for
> compatibility.
> 
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>  1 file changed, 231 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/list.h b/include/linux/list.h
> index 09d979976b3b..1081def7cea9 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -7,6 +7,7 @@
>  #include <linux/stddef.h>
>  #include <linux/poison.h>
>  #include <linux/const.h>
> +#include <linux/args.h>
>  
>  #include <asm/barrier.h>
>  
> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>  #define list_for_each_prev(pos, head) \
>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>  
> -/**
> - * list_for_each_safe - iterate over a list safe against removal of list entry
> - * @pos:	the &struct list_head to use as a loop cursor.
> - * @n:		another &struct list_head to use as temporary storage
> - * @head:	the head for your list.
> +/*
> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>   */
>  #define list_for_each_safe(pos, n, head) \
>  	for (pos = (head)->next, n = pos->next; \
>  	     !list_is_head(pos, (head)); \
>  	     pos = n, n = pos->next)
>  
> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\

Use auto

> +	     !list_is_head(pos, (head));				\
> +	     pos = tmp, tmp = pos->next)
> +
> +#define __list_for_each_mutable1(pos, head)				\
> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
> +
> +#define __list_for_each_mutable2(pos, next, head)			\
> +	list_for_each_safe(pos, next, head)
> +
>  /**
> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
> + * list_for_each_mutable - iterate over a list safe against entry removal
>   * @pos:	the &struct list_head to use as a loop cursor.
> - * @n:		another &struct list_head to use as temporary storage
> - * @head:	the head for your list.
> + * @...:	either (head) or (next, head)
> + *
> + * next:	another &struct list_head to use as optional temporary storage.
> + *		The temporary cursor is internal unless explicitly supplied by
> + *		the caller.
> + * head:	the head for your list.
> + */
> +#define list_for_each_mutable(pos, ...)					\
> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
> +		(pos, __VA_ARGS__)

The variable argument count logic really just slows down compilation.
Maybe there aren't enough copies of this code to make that significant.
But just because you can do it doesn't mean it is a gooD idea.
I'm also not sure it really adds anything to the readability.

And, it you are going to make the middle argument optional there is
no need to change the macro name.

	David



^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox