Linux virtualization list
* [PATCH v3] hwrng: virtio: clamp device-reported used.len at copy_data()
From: Michael Bommarito @ 2026-05-31 14:22 UTC (permalink / raw)
  To: Olivia Mackall, Herbert Xu, linux-crypto
  Cc: Michael S . Tsirkin, Jason Wang, Kees Cook, Christian Borntraeger,
	virtualization, linux-kernel

random_recv_done() stores the device-reported used.len directly into
vi->data_avail.  copy_data() then indexes vi->data[] using
vi->data_idx (advanced by previous copy_data() calls) and issues a
memcpy() without re-validating either value against the size of
the posted buffer, sizeof(vi->data) (SMP_CACHE_BYTES bytes,
typically 32 or 64).

A malicious or buggy virtio-rng backend can set used.len beyond
sizeof(vi->data), steering the memcpy() past the end of the inline
array into adjacent kmalloc-1k slab bytes.  hwrng_fillfn() mixes
those bytes into the guest RNG, and guest root can also observe
them directly via /dev/hwrng.

Concrete impact is inside the guest:

 - Memory-safety / hardening: any virtio-rng backend that
   over-reports used.len causes the driver to read past vi->data
   into unrelated slab contents.  hwrng_fillfn() is a kernel thread
   that runs as soon as the device is probed; no guest userspace
   interaction is required to trigger the first out-of-bounds read.

 - Cross-boundary leak (confidential-compute threat model): a
   malicious hypervisor cooperating with a malicious or compromised
   guest root userspace can use /dev/hwrng as a leak channel for
   guest-kernel heap data.  The host sets a large used.len, guest
   root reads /dev/hwrng, and the returned bytes contain guest
   kernel slab contents that were adjacent to vi->data.  In
   practice, confidential-compute guests (SEV-SNP, TDX) usually
   disable virtio-rng entirely, so this path is narrow, but the
   fix is still worth carrying because the underlying
   memory-safety bug contaminates the guest RNG on any host.

KASAN confirms the OOB on a 7.1-rc4 guest whose virtio-rng backend
has been patched to report used.len = 0x10000:

  BUG: KASAN: slab-out-of-bounds in virtio_read+0x394/0x5d0
  Read of size 64 at addr ffff88800ae0ba20 by task hwrng/52
  Call Trace:
   __asan_memcpy+0x23/0x60
   virtio_read+0x394/0x5d0
   hwrng_fillfn+0xb2/0x470
   kthread+0x2cc/0x3a0
  Allocated by task 1:
   probe_common+0xa5/0x660
   virtio_dev_probe+0x549/0xbc0
  The buggy address belongs to the object at ffff88800ae0b800
   which belongs to the cache kmalloc-1k of size 1024
  The buggy address is located 0 bytes to the right of
   allocated 544-byte region [ffff88800ae0b800, ffff88800ae0ba20)

Same class of bug as commit c04db81cd028 ("net/9p: Fix buffer
overflow in USB transport layer"), which hardened
usb9pfs_rx_complete() against unchecked device-reported length in
the USB 9p transport.

With the clamp at point of use and array_index_nospec() in place,
the same harness boots cleanly: copy_data() returns zero for the
bogus report, the device-supplied bytes after data_idx are
discarded, and the driver issues a fresh request.

Fixes: f7f510ec1957 ("virtio: An entropy device, as suggested by hpa.")
Cc: stable@vger.kernel.org
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Assisted-by: Claude:claude-opus-4-8
---
Changes in v3:
- No functional change from v2.  Reposting the v2 clamp after the v2
  thread went quiet on linux-crypto.  Michael S. Tsirkin reconfirmed
  off-list that clamping the device-reported used.len at
  sizeof(vi->data) addresses his earlier concern, so this resends that
  fix unchanged.
- Rebased onto v7.1-rc4.  copy_data() is unchanged since 2023, so the
  clamp applies as-is, and the KASAN reproduction above was re-run on
  v7.1-rc4 (stock splats, patched boots clean).

Changes in v2 (Michael S. Tsirkin review):
- move the bound check from random_recv_done() into copy_data(), so the
  clamp sits immediately next to the memcpy() it protects.
- clamp to sizeof(vi->data) rather than substituting len = 0, so a
  previously-working but buggy device that occasionally over-reports
  used.len does not start returning zero-length reads.
- add array_index_nospec() on vi->data_idx to defeat a speculative
  out-of-bounds read given the malicious-backend threat model.
- expand the commit message with the /dev/hwrng observation path and
  the hypervisor plus guest-root cooperation scenario.

v1: https://lore.kernel.org/all/20260418000020.1847122-1-michael.bommarito@gmail.com/
v2: https://lore.kernel.org/all/20260418150613.3522589-1-michael.bommarito@gmail.com/

 drivers/char/hw_random/virtio-rng.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)
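
For readers following the clamp logic, here is a minimal userspace
sketch of the patched copy_data() flow. It is a model under stated
assumptions, not the driver code: struct rng_model, min_uint(),
model_copy_data() and MODEL_DATA_BYTES are hypothetical names, the
buffer size is fixed at 64 as a stand-in for SMP_CACHE_BYTES,
request_entropy() is reduced to a comment, and array_index_nospec()
is omitted because it has no userspace equivalent.

```c
#include <assert.h>
#include <string.h>

#define MODEL_DATA_BYTES 64	/* assumed stand-in for SMP_CACHE_BYTES */

struct rng_model {
	unsigned int data_avail;	/* untrusted: copied from used.len */
	unsigned int data_idx;		/* bytes already handed out */
	unsigned char data[MODEL_DATA_BYTES];
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Returns the number of bytes copied; 0 means "ask the device again". */
static unsigned int model_copy_data(struct rng_model *m, void *buf,
				    unsigned int size)
{
	/* Never trust more than the buffer that was actually posted. */
	unsigned int avail = min_uint(m->data_avail, sizeof(m->data));

	if (m->data_idx >= avail) {
		/* The driver would call request_entropy() at this point. */
		m->data_avail = 0;
		return 0;
	}
	size = min_uint(size, avail - m->data_idx);
	assert(m->data_idx + size <= sizeof(m->data));
	memcpy(buf, m->data + m->data_idx, size);
	m->data_idx += size;
	m->data_avail -= size;
	return size;
}
```

With data_avail set to 0x10000 and a 256-byte destination, the first
call in this model copies at most 64 bytes and the next call returns
0, so no read crosses the end of data[].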

diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
index 0ce02d7e5048e..5e83ffa105e41 100644
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -7,6 +7,7 @@
 #include <asm/barrier.h>
 #include <linux/err.h>
 #include <linux/hw_random.h>
+#include <linux/nospec.h>
 #include <linux/scatterlist.h>
 #include <linux/spinlock.h>
 #include <linux/virtio.h>
@@ -69,8 +70,26 @@ static void request_entropy(struct virtrng_info *vi)
 static unsigned int copy_data(struct virtrng_info *vi, void *buf,
 			      unsigned int size)
 {
-	size = min_t(unsigned int, size, vi->data_avail);
-	memcpy(buf, vi->data + vi->data_idx, size);
+	unsigned int idx, avail;
+
+	/*
+	 * vi->data_avail was set from the device-reported used.len and
+	 * vi->data_idx was advanced by previous copy_data() calls.  A
+	 * malicious or buggy virtio-rng backend can drive either past
+	 * sizeof(vi->data).  Clamp at point of use and harden the index
+	 * with array_index_nospec() so the memcpy() below cannot be
+	 * steered into adjacent slab memory, including under
+	 * speculation.
+	 */
+	avail = min_t(unsigned int, vi->data_avail, sizeof(vi->data));
+	if (vi->data_idx >= avail) {
+		vi->data_avail = 0;
+		request_entropy(vi);
+		return 0;
+	}
+	size = min_t(unsigned int, size, avail - vi->data_idx);
+	idx = array_index_nospec(vi->data_idx, sizeof(vi->data));
+	memcpy(buf, vi->data + idx, size);
 	vi->data_idx += size;
 	vi->data_avail -= size;
 	if (vi->data_avail == 0)

base-commit: a1f173eb51db0dc78536334729ef832c62d6c65a
-- 
2.53.0



* [PATCH] drm/qxl: fix local_gobj leak in qxl_gem_object_create_with_handle
From: Hongtao Lee @ 2026-06-01  6:49 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Gerd Hoffmann, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, virtualization,
	spice-devel, dri-devel, linux-kernel, Hongtao Lee

If drm_gem_handle_create() fails, the reference held on local_gobj is
never dropped, so local_gobj leaks.

Fixes: f64122c1f6ad ("drm: add new QXL driver. (v1.4)")
Signed-off-by: Hongtao Lee <lihongtao@kylinos.cn>
---
 drivers/gpu/drm/qxl/qxl_gem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
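
The reference-count imbalance described above can be sketched in
plain C. This is an illustrative model only: struct gobj_model,
gobj_put() and model_create_with_handle() are hypothetical names,
handle_ret stands in for the return value of drm_gem_handle_create(),
and the success path is simplified to "the handle takes its own
reference".

```c
#include <assert.h>

struct gobj_model {
	int refcount;
};

static void gobj_put(struct gobj_model *obj)
{
	assert(obj->refcount > 0);
	obj->refcount--;
}

/*
 * handle_ret models the result of drm_gem_handle_create().
 * Returns 0 on success or handle_ret on failure.
 */
static int model_create_with_handle(struct gobj_model *obj, int handle_ret,
				    int apply_fix)
{
	obj->refcount = 1;	/* reference taken when the object is created */

	if (handle_ret) {
		if (apply_fix)
			gobj_put(obj);	/* the put added by the patch */
		return handle_ret;
	}
	obj->refcount++;	/* on success the handle holds its own reference */
	return 0;
}
```

In this model, a failing call without the fix leaves refcount at 1
with no owner left to drop it; with the fix it returns to 0.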

diff --git a/drivers/gpu/drm/qxl/qxl_gem.c b/drivers/gpu/drm/qxl/qxl_gem.c
index 4939b57a2a48..2bdff4cee952 100644
--- a/drivers/gpu/drm/qxl/qxl_gem.c
+++ b/drivers/gpu/drm/qxl/qxl_gem.c
@@ -99,8 +99,10 @@ int qxl_gem_object_create_with_handle(struct qxl_device *qdev,
 	if (r)
 		return -ENOMEM;
 	r = drm_gem_handle_create(file_priv, local_gobj, handle);
-	if (r)
+	if (r) {
+		drm_gem_object_put(local_gobj);
 		return r;
+	}
 
 	if (gobj)
 		*gobj = local_gobj;
-- 
2.25.1



* Re: [PATCH v9 05/37] mm: hugetlb: remove dead alloc_hugetlb_folio stub
From: Dev Jain @ 2026-06-01  6:55 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <8e2d9431b0e324100d550c765024c883560a4367.1780067977.git.mst@redhat.com>



On 29/05/26 8:52 pm, Michael S. Tsirkin wrote:
> Remove the !CONFIG_HUGETLB_PAGE stub for alloc_hugetlb_folio().
> 
> The stub is dead code: all callers are in mm/hugetlb.c
> (CONFIG_HUGETLB_PAGE) or fs/hugetlbfs/inode.c (CONFIG_HUGETLBFS),
> and CONFIG_HUGETLB_PAGE is def_bool HUGETLBFS with nothing
> selecting it independently.
> 
> The stub is also broken: it returns NULL, but all callers check
> IS_ERR(folio), so a NULL return would not be caught and would
> crash on the subsequent folio dereference.
> 
> Remove it now since follow-up patches change the signature of
> alloc_hugetlb_folio and would otherwise need to update the
> broken stub too.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Assisted-by: Claude:claude-opus-4-6
> ---

Reviewed-by: Dev Jain <dev.jain@arm.com>


>  include/linux/hugetlb.h | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 93418625d3c5..f016bc2e8936 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1129,13 +1129,6 @@ static inline void wait_for_freed_hugetlb_folios(void)
>  {
>  }
>  
> -static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> -					   unsigned long addr,
> -					   bool cow_from_owner)
> -{
> -	return NULL;
> -}
> -
>  static inline struct folio *
>  alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
>  			    nodemask_t *nmask, gfp_t gfp_mask)



* Re: [PATCH v9 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Miaohe Lin @ 2026-06-01  7:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi,
	linux-kernel
In-Reply-To: <2c527d20c99cdfe64a77bcf8da75f742d4c6991e.1780067977.git.mst@redhat.com>

On 2026/5/29 23:22, Michael S. Tsirkin wrote:
> TestSetPageHWPoison() is called without zone->lock, so its atomic
> update to page->flags can race with non-atomic flag operations
> that run under zone->lock in the buddy allocator.
> 
> In particular, __free_pages_prepare() does:
> 
>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> 
> This non-atomic read-modify-write, while correctly excluding
> __PG_HWPOISON from the mask, can still lose a concurrent
> TestSetPageHWPoison if the read happens before the poison bit
> is set and the write happens after.  Follow-up patches in this
> series add similar non-atomic flag operations as well.
> 
> Fix by acquiring zone->lock around TestSetPageHWPoison and
> around ClearPageHWPoison in the retry path.  This
> serializes with all buddy flag manipulation.  The cost is
> negligible: one lock/unlock in an extremely rare path
> (hardware memory errors).
> 
> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> in this file operate on pages already removed from the buddy
> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> need zone->lock protection.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> ---
>  mm/memory-failure.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index ee42d4361309..d106f2c135c7 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2348,6 +2348,8 @@ int memory_failure(unsigned long pfn, int flags)
>  	unsigned long page_flags;
>  	bool retry = true;
>  	int hugetlb = 0;
> +	struct zone *zone;
> +	unsigned long mf_flags;
>  
>  	if (!sysctl_memory_failure_recovery)
>  		panic("Memory failure on page %lx", pfn);
> @@ -2390,7 +2392,10 @@ int memory_failure(unsigned long pfn, int flags)
>  	if (hugetlb)
>  		goto unlock_mutex;
>  
> +	zone = page_zone(p);
> +	spin_lock_irqsave(&zone->lock, mf_flags);

Would it be better to add a comment here explaining why zone->lock is needed?

>  	if (TestSetPageHWPoison(p)) {
> +		spin_unlock_irqrestore(&zone->lock, mf_flags);
>  		res = -EHWPOISON;
>  		if (flags & MF_ACTION_REQUIRED)
>  			res = kill_accessing_process(current, pfn, flags);
> @@ -2399,6 +2404,7 @@ int memory_failure(unsigned long pfn, int flags)
>  		action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
>  		goto unlock_mutex;
>  	}
> +	spin_unlock_irqrestore(&zone->lock, mf_flags);
>  
>  	/*
>  	 * We need/can do nothing about count=0 pages.
> @@ -2420,7 +2426,9 @@ int memory_failure(unsigned long pfn, int flags)
>  			} else {
>  				/* We lost the race, try again */
>  				if (retry) {
> +					spin_lock_irqsave(&zone->lock, mf_flags);
>  					ClearPageHWPoison(p);
> +					spin_unlock_irqrestore(&zone->lock, mf_flags);

Ditto.

Acked-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks.
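
For reference, the lost update described in the quoted commit message
can be replayed deterministically in userspace. This is a sketch under
assumptions, not kernel code: the two flag masks are made-up values,
and the interleaving is written out sequentially instead of using real
threads.

```c
#include <assert.h>

#define MODEL_CHECK_AT_PREP	0x0fUL	/* hypothetical bits cleared at prep */
#define MODEL_HWPOISON		0x10UL	/* hypothetical poison bit, not in the mask */

/*
 * Unserialized case: the non-atomic read-modify-write reads flags,
 * the poison bit is set concurrently, then the stale value is stored.
 */
static unsigned long model_lost_update(unsigned long flags)
{
	unsigned long snapshot = flags;			/* RMW: read */

	flags |= MODEL_HWPOISON;			/* concurrent atomic set */
	flags = snapshot & ~MODEL_CHECK_AT_PREP;	/* RMW: stale write */
	return flags;
}

/* Serialized case: each operation completes before the other starts. */
static unsigned long model_serialized(unsigned long flags)
{
	flags &= ~MODEL_CHECK_AT_PREP;	/* whole RMW under the lock */
	flags |= MODEL_HWPOISON;	/* poison set under the same lock */
	return flags;
}
```

The first function drops the poison bit even though the mask excludes
it; the second keeps it, which is the effect of taking zone->lock
around both operations.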


* [PATCH v3] drm/virtio: abort virtqueue wait on device removal to avoid hung task
From: Ryosuke Yasuoka @ 2026-06-01  7:53 UTC (permalink / raw)
  To: David Airlie, Gerd Hoffmann, Dmitry Osipenko, Gurchetan Singh,
	Chia-I Wu, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter
  Cc: dri-devel, virtualization, linux-kernel, Ryosuke Yasuoka

virtio_gpu_queue_ctrl_sgs() and virtio_gpu_queue_cursor() use
wait_event() without any abort condition when waiting for virtqueue
space. If the host device stops processing commands, these waits block
indefinitely inside a drm_dev_enter/exit() critical section. Since
drm_dev_unplug(), which is called on the device removal and system
shutdown paths, blocks in synchronize_srcu() until all critical
sections complete, device removal and system shutdown hang as well.

Add a vqs_released flag to virtio_gpu_device and include it in the
wait_event() condition. Set the flag and wake up both queues in a new
virtio_gpu_release_vqs() helper, called before drm_dev_unplug() in both
virtio_gpu_remove() and virtio_gpu_shutdown(). When the flag is set, the
wait returns immediately and the command is aborted, following the same
cleanup path as drm_dev_enter() failure.

Signed-off-by: Ryosuke Yasuoka <ryasuoka@redhat.com>
---
Changes in v3:
- Remove the Reported-by and Closes tags from the commit message because
  they are not related to this fix.
- Link to v2: https://lore.kernel.org/r/20260521-virtio-gpu_wait_event-v2-1-5796b3a71d03@redhat.com

Changes in v2:
- Update the commit message.
- Replace wait_event_timeout() with wait_event() using a compound
  condition that includes a new vqs_released flag.
- Add virtio_gpu_release_vqs() helper to set the flag and wake up
  both queues, called before drm_dev_unplug() in remove and shutdown
  paths.
- Remove the hardcoded 5-second timeout. Recovery is now driven by
  the driver flag instead of an arbitrary timeout value.
---
 drivers/gpu/drm/virtio/virtgpu_drv.c | 15 +++++++++++++++
 drivers/gpu/drm/virtio/virtgpu_drv.h |  1 +
 drivers/gpu/drm/virtio/virtgpu_vq.c  | 23 +++++++++++++++++++++--
 3 files changed, 37 insertions(+), 2 deletions(-)
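
The wait-and-abort logic can be summarized with a small userspace
model. It is only a sketch: struct vq_model, model_wait_condition()
and model_after_wait() are hypothetical names, and the sleeping and
wake-up performed by wait_event() and wake_up_all() are not modelled,
only the condition and the decision taken once the wait returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct vq_model {
	unsigned int num_free;	/* free descriptors in the virtqueue */
	bool released;		/* models vgdev->vqs_released */
};

/* The compound condition handed to wait_event() in the patch. */
static bool model_wait_condition(const struct vq_model *q, unsigned int need)
{
	return q->num_free >= need || q->released;
}

/*
 * What the caller does after the wait returns: abort with -ENODEV if
 * the queues were released, otherwise return 0 to retry queueing.
 */
static int model_after_wait(const struct vq_model *q, unsigned int need)
{
	/* wait_event() only returns once the condition holds. */
	assert(model_wait_condition(q, need));
	return q->released ? -ENODEV : 0;
}
```

Note that the released flag is checked first after the wait, so a
waiter aborts even if descriptors happen to be free at that moment.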

diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.c b/drivers/gpu/drm/virtio/virtgpu_drv.c
index a5ce96fb8a1d..e4fe5e0780f9 100644
--- a/drivers/gpu/drm/virtio/virtgpu_drv.c
+++ b/drivers/gpu/drm/virtio/virtgpu_drv.c
@@ -119,10 +119,24 @@ static int virtio_gpu_probe(struct virtio_device *vdev)
 	return ret;
 }
 
+/*
+ * Release pending virtqueue waits so the drm_dev_enter/exit() critical
+ * sections complete before drm_dev_unplug() blocks on synchronize_srcu().
+ */
+static void virtio_gpu_release_vqs(struct drm_device *dev)
+{
+	struct virtio_gpu_device *vgdev = dev->dev_private;
+
+	vgdev->vqs_released = true;
+	wake_up_all(&vgdev->ctrlq.ack_queue);
+	wake_up_all(&vgdev->cursorq.ack_queue);
+}
+
 static void virtio_gpu_remove(struct virtio_device *vdev)
 {
 	struct drm_device *dev = vdev->priv;
 
+	virtio_gpu_release_vqs(dev);
 	drm_dev_unplug(dev);
 	drm_atomic_helper_shutdown(dev);
 	virtio_gpu_deinit(dev);
@@ -133,6 +147,7 @@ static void virtio_gpu_shutdown(struct virtio_device *vdev)
 {
 	struct drm_device *dev = vdev->priv;
 
+	virtio_gpu_release_vqs(dev);
 	/* stop talking to the device */
 	drm_dev_unplug(dev);
 }
diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.h b/drivers/gpu/drm/virtio/virtgpu_drv.h
index 2f3531950aa4..5f7bb6cc6ba7 100644
--- a/drivers/gpu/drm/virtio/virtgpu_drv.h
+++ b/drivers/gpu/drm/virtio/virtgpu_drv.h
@@ -235,6 +235,7 @@ struct virtio_gpu_device {
 
 	struct virtio_gpu_queue ctrlq;
 	struct virtio_gpu_queue cursorq;
+	bool vqs_released;
 	struct kmem_cache *vbufs;
 
 	atomic_t pending_commands;
diff --git a/drivers/gpu/drm/virtio/virtgpu_vq.c b/drivers/gpu/drm/virtio/virtgpu_vq.c
index 67865810a2e7..8057a9b7356d 100644
--- a/drivers/gpu/drm/virtio/virtgpu_vq.c
+++ b/drivers/gpu/drm/virtio/virtgpu_vq.c
@@ -396,7 +396,19 @@ static int virtio_gpu_queue_ctrl_sgs(struct virtio_gpu_device *vgdev,
 	if (vq->num_free < elemcnt) {
 		spin_unlock(&vgdev->ctrlq.qlock);
 		virtio_gpu_notify(vgdev);
-		wait_event(vgdev->ctrlq.ack_queue, vq->num_free >= elemcnt);
+		wait_event(vgdev->ctrlq.ack_queue,
+			   vq->num_free >= elemcnt || vgdev->vqs_released);
+		/*
+		 * Set by virtio_gpu_release_vqs() to unblock
+		 * synchronize_srcu() wait in drm_dev_unplug().
+		 */
+		if (vgdev->vqs_released) {
+			if (fence && vbuf->objs)
+				virtio_gpu_array_unlock_resv(vbuf->objs);
+			free_vbuf(vgdev, vbuf);
+			drm_dev_exit(idx);
+			return -ENODEV;
+		}
 		goto again;
 	}
 
@@ -566,7 +578,14 @@ static void virtio_gpu_queue_cursor(struct virtio_gpu_device *vgdev,
 	ret = virtqueue_add_sgs(vq, sgs, outcnt, 0, vbuf, GFP_ATOMIC);
 	if (ret == -ENOSPC) {
 		spin_unlock(&vgdev->cursorq.qlock);
-		wait_event(vgdev->cursorq.ack_queue, vq->num_free >= outcnt);
+		wait_event(vgdev->cursorq.ack_queue,
+			   vq->num_free >= outcnt || vgdev->vqs_released);
+		/* See comment in virtio_gpu_queue_ctrl_sgs(). */
+		if (vgdev->vqs_released) {
+			free_vbuf(vgdev, vbuf);
+			drm_dev_exit(idx);
+			return;
+		}
 		spin_lock(&vgdev->cursorq.qlock);
 		goto retry;
 	} else {

---
base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
change-id: 20260518-virtio-gpu_wait_event-5aa060754f12

Best regards,
-- 
Ryosuke Yasuoka <ryasuoka@redhat.com>



* Re: [PATCH] vsock/vmci: fix sk_ack_backlog leak on failed handshake
From: Paolo Abeni @ 2026-06-01  9:25 UTC (permalink / raw)
  To: Raf Dickson, netdev, virtualization, linux-kernel
  Cc: sgarzare, stefanha, bryan-bt.tan, vishnu.dasa,
	bcm-kernel-feedback-list, stable
In-Reply-To: <20260526104356.469928-1-rafdog35@gmail.com>

On 5/26/26 12:43 PM, Raf Dickson wrote:
> When vmci_transport_recv_connecting_server() returns an error,
> vmci_transport_recv_listen() calls vsock_remove_pending() but never
> calls sk_acceptq_removed(). This leaves sk_ack_backlog incremented
> permanently.
> 
> Repeated handshake failures (malformed packets, queue pair alloc
> failure, event subscribe failure) cause sk_ack_backlog to climb
> toward sk_max_ack_backlog. Once it reaches the limit the listener
> permanently refuses all new connections with -ECONNREFUSED, a
> silent denial of service requiring a process restart to recover.
> 
> The two existing sk_acceptq_removed() calls in af_vsock.c do not
> cover this path: line 764 checks vsock_is_pending() which returns
> false after vsock_remove_pending(), and line 1889 is only reached
> on successful accept().
> 
> Fix by balancing sk_acceptq_added() with sk_acceptq_removed() on
> the error path.
> 
> Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
> Cc: stable@vger.kernel.org
> Signed-off-by: Raf Dickson <rafdog35@gmail.com>

Waiting for Stefano's feedback - should be back in a couple of days.

> ---
>  net/vmw_vsock/vmci_transport.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> index d2579380f5..88ccc55455 100644
> --- a/net/vmw_vsock/vmci_transport.c
> +++ b/net/vmw_vsock/vmci_transport.c
> @@ -980,8 +980,10 @@ static int vmci_transport_recv_listen(struct sock *sk,
>  			err = -EINVAL;
>  		}
>  
> -		if (err < 0)
> +		if (err < 0) {
>  			vsock_remove_pending(sk, pending);
> +			sk_acceptq_removed(sk);

I'm wondering if sk_acceptq_removed() should be bundled into
vsock_remove_pending()? (Even if that change would probably be net-next
material.)

/P



> +		}
>  
>  		release_sock(pending);
>  		vmci_transport_release_pending(pending);



* Re: [PATCH] vsock/vmci: fix sk_ack_backlog leak on failed handshake
From: Raf Dickson @ 2026-06-01  9:56 UTC (permalink / raw)
  To: pabeni
  Cc: sgarzare, netdev, virtualization, linux-kernel, stefanha,
	bryan-bt.tan, vishnu.dasa, bcm-kernel-feedback-list, stable
In-Reply-To: <97069506-352b-4152-a57b-5a974320529d@redhat.com>

On Mon, Jun 1, 2026 at 9:26 AM Paolo Abeni wrote:
> I'm wondering if sk_acceptq_removed() should be bundled into
> vsock_remove_pending()? (Even if that change would probably be
> net-next material.)

Agreed, that would prevent this class of bug entirely. Happy to prepare
a follow-up patch for net-next once this fix lands, if that would be
useful.

Raf
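
For context, the backlog accounting discussed in this thread can be
modelled in a few lines of userspace C. This is a sketch under
assumptions: struct listener_model and model_failed_handshake() are
hypothetical names, the refusal check is simplified to a plain ">="
comparison, and successful handshakes and accept() are not modelled.

```c
#include <assert.h>
#include <stdbool.h>

struct listener_model {
	unsigned int ack_backlog;	/* models sk_ack_backlog */
	unsigned int max_ack_backlog;	/* models sk_max_ack_backlog */
};

/*
 * One incoming connection request whose handshake then fails.
 * Returns false if the listener refused it because the backlog is full.
 */
static bool model_failed_handshake(struct listener_model *l, bool balanced)
{
	if (l->ack_backlog >= l->max_ack_backlog)
		return false;		/* -ECONNREFUSED in the real code */

	l->ack_backlog++;		/* models sk_acceptq_added() */
	/* ... handshake fails and the pending socket is removed ... */
	if (balanced) {
		assert(l->ack_backlog > 0);
		l->ack_backlog--;	/* models sk_acceptq_removed() */
	}
	return true;
}
```

With a limit of 2, the unbalanced variant refuses the third request
after two failed handshakes, while the balanced variant keeps
admitting requests indefinitely.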


* Re: [PATCH v4 01/10] drm/damage-helper: Do not alter damage clips on modeset, but ignore them
From: Javier Martinez Canillas @ 2026-06-01 10:16 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklinux,
	zack.rusin, bcm-kernel-feedback-list
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, Thomas Zimmermann, stable
In-Reply-To: <20260530185716.65688-2-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

Hello Thomas,

> User space supplies rectangles for damage clipping in a plane property.
> For full mode sets, drivers still require a full plane update. In this
> case, leave the information as-is and set the ignore_damage_clips flag
> instead. The damage iterator will later ignore any damage information.
>
> Also fixes a bug where ignore_damage_clips was not cleared across plane-
> state duplications.
>
> Leaving the damage information as-is might be helpful to drivers that
> benefit from this information even on full modesets (e.g., for cache
> management). It will also help with consolidating the damage-handling
> logic.
>
> Also add a new unit test that evaluates the ignore_damage_clips flag. It
> sets two damage clips plus the flag and tests if the reported damage
> covers the entire framebuffer.
>
> v4:
> - slightly reword the commit description
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Fixes: 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers to ignore damage clips")
> Acked-by: Zack Rusin <zack.rusin@broadcom.com>
> Cc: dri-devel@lists.freedesktop.org
> Cc: <stable@vger.kernel.org> # v6.10+
> ---
>  drivers/gpu/drm/drm_atomic_state_helper.c     |  1 +
>  drivers/gpu/drm/drm_damage_helper.c           |  6 ++--
>  .../gpu/drm/tests/drm_damage_helper_test.c    | 28 +++++++++++++++++++
>  3 files changed, 31 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_atomic_state_helper.c b/drivers/gpu/drm/drm_atomic_state_helper.c
> index cc70508d4fdb..84d5231ccac1 100644
> --- a/drivers/gpu/drm/drm_atomic_state_helper.c
> +++ b/drivers/gpu/drm/drm_atomic_state_helper.c
> @@ -359,6 +359,7 @@ void __drm_atomic_helper_plane_duplicate_state(struct drm_plane *plane,
>  	state->fence = NULL;
>  	state->commit = NULL;
>  	state->fb_damage_clips = NULL;
> +	state->ignore_damage_clips = false;
>  	state->color_mgmt_changed = false;
>  }

I would split this out as a separate patch, since it is the bug you are
fixing for commit 35ed38d58257 ("drm: Allow drivers to indicate the damage
helpers to ignore damage clips").

>  EXPORT_SYMBOL(__drm_atomic_helper_plane_duplicate_state);
> diff --git a/drivers/gpu/drm/drm_damage_helper.c b/drivers/gpu/drm/drm_damage_helper.c
> index 74a7f4252ecf..945fac8dc27b 100644
> --- a/drivers/gpu/drm/drm_damage_helper.c
> +++ b/drivers/gpu/drm/drm_damage_helper.c
> @@ -78,10 +78,8 @@ void drm_atomic_helper_check_plane_damage(struct drm_atomic_commit *state,
>  		if (WARN_ON(!crtc_state))
>  			return;
>  
> -		if (drm_atomic_crtc_needs_modeset(crtc_state)) {
> -			drm_property_blob_put(plane_state->fb_damage_clips);
> -			plane_state->fb_damage_clips = NULL;
> -		}
> +		if (drm_atomic_crtc_needs_modeset(crtc_state))
> +			plane_state->ignore_damage_clips = true;
>  	}
>  }

This makes sense to me as well and I agree that re-using the flag for this
is better than making plane_state->fb_damage_clips == NULL the condition.

As mentioned though, I would make it a separate patch. Both changes look
good to me:

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
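
As an illustration of the behaviour discussed above, here is a small
userspace model of how a merged damage rectangle could fall back to
the full framebuffer when the flag is set. It is a sketch, not the
DRM helper: struct rect_model, struct plane_state_model and
model_merged_damage() are hypothetical names, and the merge is a
plain bounding box over the clips.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct rect_model {
	int x1, y1, x2, y2;
};

struct plane_state_model {
	struct rect_model fb;		/* full framebuffer extent */
	const struct rect_model *clips;	/* user-supplied damage clips */
	size_t num_clips;
	bool ignore_damage_clips;
};

/* Bounding box of all clips, or the full framebuffer if clips are ignored. */
static struct rect_model model_merged_damage(const struct plane_state_model *s)
{
	struct rect_model r;
	size_t i;

	if (s->ignore_damage_clips || s->num_clips == 0)
		return s->fb;

	assert(s->clips);
	r = s->clips[0];
	for (i = 1; i < s->num_clips; i++) {
		const struct rect_model *c = &s->clips[i];

		if (c->x1 < r.x1)
			r.x1 = c->x1;
		if (c->y1 < r.y1)
			r.y1 = c->y1;
		if (c->x2 > r.x2)
			r.x2 = c->x2;
		if (c->y2 > r.y2)
			r.y2 = c->y2;
	}
	return r;
}
```

The clips themselves stay attached to the state in both cases; only
the flag decides whether they are consulted.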

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



* Re: [PATCH v4 02/10] drm/atomic-helpers: Evaluate plane damage after atomic_check
From: Javier Martinez Canillas @ 2026-06-01 10:19 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklinux,
	zack.rusin, bcm-kernel-feedback-list
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, Thomas Zimmermann
In-Reply-To: <20260530185716.65688-3-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> Each plane's and CRTC's atomic_check might trigger a full modeset. As
> this affects the plane's damage handling, evaluate damage clips after
> running the atomic_check helpers.
>
> Examples can be found in a number of drivers, such as ast, gud, ingenic,
> mgag200 or vmwgfx, which all set mode_changed in the CRTC state to true.
> Ingenic even re-evaluates damage information in its plane's atomic_check.
> Doing this after the atomic_check helpers ran benefits all drivers.
>
> There's already a damage evaluation before the calls to atomic_check.
> With a few fixes to drivers, this can be removed.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Acked-by: Zack Rusin <zack.rusin@broadcom.com>
> ---
>  drivers/gpu/drm/drm_atomic_helper.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
> index 51f39edc31ed..4c37299e8ccb 100644
> --- a/drivers/gpu/drm/drm_atomic_helper.c
> +++ b/drivers/gpu/drm/drm_atomic_helper.c
> @@ -1065,6 +1065,10 @@ drm_atomic_helper_check_planes(struct drm_device *dev,
>  		}
>  	}
>  
> +	for_each_oldnew_plane_in_state(state, plane, old_plane_state, new_plane_state, i) {
> +		drm_atomic_helper_check_plane_damage(state, new_plane_state);
> +	}
> +

I wonder if it's worth mentioning this in the drm_atomic_helper_check_planes()
function's kernel-doc comment. But regardless, the change makes sense to me:

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



* Re: [PATCH v4 03/10] drm/ingenic: Remove calls to drm_atomic_helper_check_plane_damage()
From: Javier Martinez Canillas @ 2026-06-01 10:20 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklinux,
	zack.rusin, bcm-kernel-feedback-list
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, Thomas Zimmermann
In-Reply-To: <20260530185716.65688-4-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> Atomic helpers call drm_atomic_helper_check_plane_damage() after the
> atomic_check anyway. See drm_atomic_helper_check_planes(). Remove the calls
> from the planes' atomic_check.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Acked-by: Zack Rusin <zack.rusin@broadcom.com>
> ---

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



* Re: [PATCH v4 04/10] drm/appletbdrm: Allocate request/response buffers in begin_fb_access
From: Javier Martinez Canillas @ 2026-06-01 10:21 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklinux,
	zack.rusin, bcm-kernel-feedback-list
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, Thomas Zimmermann
In-Reply-To: <20260530185716.65688-5-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> In atomic_check, damage handling is not fully evaluated. Another
> atomic_check helper could trigger a full modeset and thus invalidate
> damage clips.
>
> Allocation of the request/response buffers in appletbdrm depends on
> correct damage information. Otherwise it might allocate incorrectly
> sized buffers. Allocate the buffers in the driver's begin_fb_access
> helper. It runs early during the commit when damage clipping has been
> fully evaluated.
>
> v2:
> - allocate before drm_gem_begin_shadow_fb_access() to avoid leak on error

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



* Re: [PATCH v4 05/10] drm/atomic_helper: Do not evaluate plane damage before atomic_check
From: Javier Martinez Canillas @ 2026-06-01 10:22 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklinux,
	zack.rusin, bcm-kernel-feedback-list
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, Thomas Zimmermann
In-Reply-To: <20260530185716.65688-6-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> Remove the call to drm_atomic_helper_check_plane_damage() from before
> calling the atomic_check helpers. The call no longer has any purpose,
> as the actual evaluation happens after running atomic_check.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Acked-by: Zack Rusin <zack.rusin@broadcom.com>
> ---

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



* Re: [PATCH v4 06/10] drm/damage-helper: Test src coord in drm_atomic_helper_check_plane_damage()
From: Javier Martinez Canillas @ 2026-06-01 10:27 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklinux,
	zack.rusin, bcm-kernel-feedback-list
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, Thomas Zimmermann
In-Reply-To: <20260530185716.65688-7-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> Planes require a full update if the source coordinates change across
> atomic commits. Evaluate this during the atomic-check and set the flag
> ignore_damage_clips in the plane state, if so. Remove the check from
> drm_atomic_helper_damage_iter_init().
>
> This will help with removing the old state from the atomic-commit phase
> and simplify atomic_update helpers a bit.
>
> Several unit tests check against the change of the src coordinate. Drop
> them as they no longer serve a purpose. If the src coordinate changes
> across commits, atomic helpers will set the plane state's
> ignore_damage_clips flag, for which a separate unit test exists.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Acked-by: Zack Rusin <zack.rusin@broadcom.com>
> ---

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



* Re: [PATCH v4 07/10] drm/damage-helper: Remove old state from drm_atomic_helper_damage_iter_init()
From: Javier Martinez Canillas @ 2026-06-01 10:28 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklinux,
	zack.rusin, bcm-kernel-feedback-list
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, Thomas Zimmermann
In-Reply-To: <20260530185716.65688-8-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> Nothing in drm_atomic_helper_damage_iter_init() requires the old
> plane state. Remove the parameter and mass-convert callers.
>
> Most callers now no longer require the old plane state in their plane's
> atomic_update helper. Remove it as well.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Acked-by: Zack Rusin <zack.rusin@broadcom.com>
> ---

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



* Re: [PATCH v4 08/10] drm/damage-helper: Remove old state from drm_atomic_helper_damage_merged()
From: Javier Martinez Canillas @ 2026-06-01 10:29 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklinux,
	zack.rusin, bcm-kernel-feedback-list
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, Thomas Zimmermann
In-Reply-To: <20260530185716.65688-9-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> Nothing in drm_atomic_helper_damage_merged() requires the old
> plane state. Remove the parameter and mass-convert callers.
>
> Most callers now no longer require the old plane state in their plane's
> atomic_update helper. Remove it as well.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Acked-by: Zack Rusin <zack.rusin@broadcom.com>
> ---

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



* Re: [PATCH v4 09/10] drm/damage-helper: Rename state parameters in damage helpers
From: Javier Martinez Canillas @ 2026-06-01 10:29 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklinux,
	zack.rusin, bcm-kernel-feedback-list
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, Thomas Zimmermann
In-Reply-To: <20260530185716.65688-10-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> Rename some of the state parameters of the damage-helper functions to
> align them with each other and other helpers. No functional changes.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Acked-by: Zack Rusin <zack.rusin@broadcom.com>
> ---

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



* Re: [PATCH v4 10/10] drm/vmwgfx: Remove unused field struct vmwgfx_du_update_plane.old_state
From: Javier Martinez Canillas @ 2026-06-01 10:30 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklinux,
	zack.rusin, bcm-kernel-feedback-list
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, Thomas Zimmermann
In-Reply-To: <20260530185716.65688-11-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> Plane updates no longer require the old plane state. Remove the field
> from struct vmwgfx_du_update_plane and fix all callers.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Reviewed-by: Zack Rusin <zack.rusin@broadcom.com>
> ---

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



* [PATCH net] vhost/net: complete zerocopy ubufs only once
From: Qing Ming @ 2026-06-01 10:43 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: Eugenio Pérez, Shirley, David S. Miller, kvm, virtualization,
	netdev, linux-kernel, Qing Ming

vhost-net initializes one ubuf_info per outstanding zerocopy TX
descriptor and hands it to the backend socket.  The networking stack may
then clone a zerocopy skb before all skb references are released.  For
example, batman-adv fragmentation reaches skb_split(), which calls
skb_zerocopy_clone() and increments the same ubuf_info refcount.

vhost_zerocopy_complete() currently treats every ubuf callback as a
completed vhost descriptor.  It dereferences ubuf->ctx, writes the
descriptor completion state, and drops the vhost_net_ubuf_ref even when
the callback only releases a cloned skb reference.  A backend reset can
therefore wait for and free the vhost_net_ubuf_ref while another cloned
skb still carries the same ubuf_info.  A later completion then
dereferences the freed ubufs pointer.

KASAN reports the stale completion as:

  BUG: KASAN: slab-use-after-free in vhost_zerocopy_complete+0x1d7/0x1f0
  BUG: KASAN: slab-use-after-free in vhost_zerocopy_complete+0x101/0x1f0
  vhost_zerocopy_complete
  skb_copy_ubufs
  __dev_forward_skb2
  veth_xmit

The freed object was allocated from vhost_net_ioctl() while setting the
backend and freed through kfree_rcu()/kvfree_rcu_bulk after backend
removal, while delayed skb completion still reached
vhost_zerocopy_complete().

Honor the generic ubuf_info refcount before touching vhost state, and run
the vhost descriptor completion only for the final ubuf reference.  This
matches the msg_zerocopy_complete() ownership rule for cloned zerocopy
skbs.

Fixes: bab632d69ee4 ("vhost: vhost TX zero-copy support")
Signed-off-by: Qing Ming <a0yami@mailbox.org>
---
 drivers/vhost/net.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c6536cad9c4f..b9af63fb6306 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -390,13 +390,20 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
 static void vhost_zerocopy_complete(struct sk_buff *skb,
 				    struct ubuf_info *ubuf_base, bool success)
 {
-	struct ubuf_info_msgzc *ubuf = uarg_to_msgzc(ubuf_base);
-	struct vhost_net_ubuf_ref *ubufs = ubuf->ctx;
-	struct vhost_virtqueue *vq = ubufs->vq;
+	struct ubuf_info_msgzc *ubuf;
+	struct vhost_net_ubuf_ref *ubufs;
+	struct vhost_virtqueue *vq;
 	int cnt;
 
-	rcu_read_lock_bh();
+	/* Only the final cloned skb reference completes the vhost descriptor. */
+	if (!refcount_dec_and_test(&ubuf_base->refcnt))
+		return;
+
+	ubuf = uarg_to_msgzc(ubuf_base);
+	ubufs = ubuf->ctx;
+	vq = ubufs->vq;
 
+	rcu_read_lock_bh();
 	/* set len to mark this desc buffers done DMA */
 	vq->heads[ubuf->desc].len = success ?
 		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
-- 
2.53.0
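
A note on the ownership rule described in the commit message above: the
sketch below is a user-space model, not the driver code.  The names
model_ubuf, model_clone() and model_complete() are hypothetical.  C11
atomics stand in for the kernel's refcount_t, and the completions counter
stands in for the vhost descriptor bookkeeping.  It only illustrates that
a callback which drops a non-final reference must return before touching
any per-descriptor state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct ubuf_info: only the shared refcount. */
struct model_ubuf {
	atomic_int refcnt;	/* one count per skb carrying this ubuf */
	int completions;	/* how often the descriptor was completed */
};

/* Models skb_zerocopy_clone(): a second skb now carries the same ubuf. */
static void model_clone(struct model_ubuf *u)
{
	atomic_fetch_add(&u->refcnt, 1);
}

/*
 * Models the patched callback.  Returns true only when the descriptor
 * completion actually ran, i.e. when the final reference was dropped.
 */
static bool model_complete(struct model_ubuf *u)
{
	/* atomic_fetch_sub() returns the old value; 1 means last reference. */
	if (atomic_fetch_sub(&u->refcnt, 1) != 1)
		return false;

	u->completions++;
	return true;
}

/* One descriptor, one clone, two callbacks: expect a single completion. */
static int ubuf_model_demo(void)
{
	struct model_ubuf u;

	atomic_init(&u.refcnt, 1);
	u.completions = 0;

	model_clone(&u);		/* e.g. a fragmentation path split the skb */
	assert(!model_complete(&u));	/* clone released: nothing completed */
	assert(model_complete(&u));	/* final reference: completion runs */

	return u.completions;
}
```

Calling ubuf_model_demo() from a small test program shows the descriptor
completing once even though the callback fired twice, which is the
behaviour the refcount check is meant to guarantee.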



* Re: [patch V2 00/25] timekeeping/ptp: Expand snapshot functionality
From: Michael S. Tsirkin @ 2026-06-01 10:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, David Woodhouse, Miroslav Lichvar, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker,
	thomas.weissschuh, Arthur Kiyanovski, Rodolfo Giometti,
	Vincent Donnefort, Marc Zyngier, Oliver Upton, kvmarm,
	Oliver Upton, Richard Cochran, netdev, Takashi Iwai,
	Miri Korenblit, Johannes Berg, Jacob Keller, Tony Nguyen,
	Saeed Mahameed, Peter Hilber, virtualization, linux-wireless,
	linux-sound, David Woodhouse, Vadim Fedorenko
In-Reply-To: <20260529193435.921555544@kernel.org>

On Fri, May 29, 2026 at 09:59:47PM +0200, Thomas Gleixner wrote:
> This is an update to V1 which can be found here:
> 
>    https://lore.kernel.org/lkml/20260526165826.392227559@kernel.org
> 
> PTP wants to grow new snapshot functionality, which provides not only the
> captured CLOCK* values, but also the underlying clocksource counter value.
> 
>    https://lore.kernel.org/20260515164033.6403-1-akiyano@amazon.com
> 
> There was quite some discussion in seemingly related threads about how to
> capture these values and how to provide core infrastructure so that driver
> writers have something to work with:
> 
>    https://lore.kernel.org/20260514225842.110706-1-hramamurthy@google.com
>    https://lore.kernel.org/20260520135207.37826-1-dwmw2@infradead.org


virtio bits:

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> This series implements the timekeeping related mechanisms to:
> 
>      1) Capture CLOCK values along with the clocksource counter value for
>      	non-hardware based sampling
> 
>      2) Expand the hardware cross time stamp mechanism to hand back the
>      	clocksource counter value, which was captured by the device, along
>      	with the related CLOCK values
> 
>      3) Add AUX clock support to the hardware cross timestamping core
> 
>      4) Add support for derived clocksources to the snapshot mechanism (New
>      	in V2)
> 
> Changes vs. V1:
> 
>   - Fixed the ptp_ocp typo - 0-day, Jakub
> 
>   - Renamed the system_time_snapshot members sys and raw to systime and
>     monoraw to make them less ambiguous.
> 
>   - Fixed the error case return values of get_device_system_crosststamp()
> 
>   - Made ktime_snapshot_id() void as there is no point for the return
>     value, which is nowhere checked and cannot be propagated.
>     system_time_snapshot::valid has to be evaluated at the call sites
>     anyway. - Jacob
> 
>   - Picked up the first patch from David's follow-up series, which extends
>     the snapshot mechanism so that derived clocksources (like kvmclock and
>     Hyper-V scaled TSC) can return the actual underlying hardware counter
>     value (TSC for the two examples).
> 
>   - Collected Reviewed/Acked/Tested-by tags
> 
> Delta patch against v1 below.
> 
> The series is based on v7.1-rc2 and also available from git:
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git timekeeping-ptp-extend-v2
> 
> Thanks,
> 
> 	tglx
> ---
> diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c
> index 616062587510..b056c652ff01 100644
> --- a/arch/arm64/kvm/hyp_trace.c
> +++ b/arch/arm64/kvm/hyp_trace.c
> @@ -51,8 +51,8 @@ static void __hyp_clock_work(struct work_struct *work)
>  
>  	hyp_clock = container_of(dwork, struct hyp_trace_clock, work);
>  
> -	ktime_get_snapshot_id(&snap, CLOCK_BOOTTIME);
> -	boot = ktime_to_ns(snap.sys);
> +	ktime_get_snapshot_id(CLOCK_BOOTTIME, &snap);
> +	boot = ktime_to_ns(snap.systime);
>  
>  	delta_boot = boot - hyp_clock->boot;
>  	delta_cycles = snap.cycles - hyp_clock->cycles;
> @@ -120,7 +120,7 @@ static void hyp_trace_clock_enable(struct hyp_trace_clock *hyp_clock, bool enabl
>  
>  	ktime_get_snapshot_id(&snap, CLOCK_BOOTTIME);
>  
> -	hyp_clock->boot = ktime_to_ns(snap.sys);
> +	hyp_clock->boot = ktime_to_ns(snap.systime);
>  	hyp_clock->cycles = snap.cycles;
>  	hyp_clock->mult = 0;
>  
> diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
> index e60cc7ed3e70..b11b8821c9fb 100644
> --- a/arch/arm64/kvm/hypercalls.c
> +++ b/arch/arm64/kvm/hypercalls.c
> @@ -28,7 +28,7 @@ static void kvm_ptp_get_time(struct kvm_vcpu *vcpu, u64 *val)
>  	 * system time and counter value must captured at the same
>  	 * time to keep consistency and precision.
>  	 */
> -	ktime_get_snapshot_id(&systime_snapshot, CLOCK_REALTIME);
> +	ktime_get_snapshot_id(CLOCK_REALTIME, &systime_snapshot);
>  
>  	/*
>  	 * This is only valid if the current clocksource is the
> @@ -61,8 +61,8 @@ static void kvm_ptp_get_time(struct kvm_vcpu *vcpu, u64 *val)
>  	 * in the future (about 292 years from 1970, and at that stage
>  	 * nobody will give a damn about it).
>  	 */
> -	val[0] = upper_32_bits(systime_snapshot.sys);
> -	val[1] = lower_32_bits(systime_snapshot.sys);
> +	val[0] = upper_32_bits(systime_snapshot.systime);
> +	val[1] = lower_32_bits(systime_snapshot.systime);
>  	val[2] = upper_32_bits(cycles);
>  	val[3] = lower_32_bits(cycles);
>  }
> diff --git a/drivers/net/dsa/sja1105/sja1105_main.c b/drivers/net/dsa/sja1105/sja1105_main.c
> index 1d5fef4df560..2697073dbf90 100644
> --- a/drivers/net/dsa/sja1105/sja1105_main.c
> +++ b/drivers/net/dsa/sja1105/sja1105_main.c
> @@ -2310,10 +2310,10 @@ int sja1105_static_config_reload(struct sja1105_private *priv,
>  		goto out;
>  	}
>  
> -	t1 = ktime_to_ns(ptp_sts_before.pre_sts.sys);
> -	t2 = ktime_to_ns(ptp_sts_before.post_sts.sys);
> -	t3 = ktime_to_ns(ptp_sts_after.pre_sts.sys);
> -	t4 = ktime_to_ns(ptp_sts_after.post_sts.sys);
> +	t1 = ktime_to_ns(ptp_sts_before.pre_sts.systime);
> +	t2 = ktime_to_ns(ptp_sts_before.post_sts.systime);
> +	t3 = ktime_to_ns(ptp_sts_after.pre_sts.systime);
> +	t4 = ktime_to_ns(ptp_sts_after.post_sts.systime);
>  	/* Mid point, corresponds to pre-reset PTPCLKVAL */
>  	t12 = t1 + (t2 - t1) / 2;
>  	/* Mid point, corresponds to post-reset PTPCLKVAL, aka 0 */
> diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
> index 5023fc1587f9..f9e4ec6f7ebb 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> @@ -2117,7 +2117,7 @@ static int ice_capture_crosststamp(ktime_t *device,
>  	}
>  
>  	/* Snapshot system time for historic interpolation */
> -	ktime_get_snapshot_id(&ctx->snapshot, ctx->snapshot_clock_id);
> +	ktime_get_snapshot_id(ctx->snapshot_clock_id, &ctx->snapshot);
>  
>  	/* Program cmd to master timer */
>  	ice_ptp_src_cmd(hw, ICE_PTP_READ_TIME);
> diff --git a/drivers/net/ethernet/intel/igc/igc_ptp.c b/drivers/net/ethernet/intel/igc/igc_ptp.c
> index 9b8b4a04e32d..b40aba9ab685 100644
> --- a/drivers/net/ethernet/intel/igc/igc_ptp.c
> +++ b/drivers/net/ethernet/intel/igc/igc_ptp.c
> @@ -1049,7 +1049,7 @@ static int igc_phc_get_syncdevicetime(ktime_t *device,
>  	 */
>  	do {
>  		/* Get a snapshot of system clocks to use as historic value. */
> -		ktime_get_snapshot_id(&adapter->snapshot, adapter->snapshot_clock_id);
> +		ktime_get_snapshot_id(adapter->snapshot_clock_id, &adapter->snapshot);
>  
>  		igc_ptm_trigger(hw);
>  
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
> index beb80912b9d5..5df786133e4b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
> @@ -340,7 +340,7 @@ static int mlx5_ptp_getcrosststamp(struct ptp_clock_info *ptp,
>  		goto unlock;
>  	}
>  
> -	ktime_get_snapshot_id(&history_begin, cts->clock_id);
> +	ktime_get_snapshot_id(cts->clock_id, &history_begin);
>  
>  	err = get_device_system_crosststamp(mlx5_mtctr_syncdevicetime, mdev,
>  					    &history_begin, cts);
> @@ -366,7 +366,7 @@ static int mlx5_ptp_getcrosscycles(struct ptp_clock_info *ptp,
>  		goto unlock;
>  	}
>  
> -	ktime_get_snapshot_id(&history_begin, cts->clock_id);
> +	ktime_get_snapshot_id(cts->clock_id, &history_begin);
>  
>  	err = get_device_system_crosststamp(mlx5_mtctr_syncdevicecyclestime,
>  					    mdev, &history_begin, cts);
> diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
> index aed5c13cd1be..dc23cd708cfe 100644
> --- a/drivers/ptp/ptp_chardev.c
> +++ b/drivers/ptp/ptp_chardev.c
> @@ -392,11 +392,11 @@ static long ptp_sys_offset_extended(struct ptp_clock *ptp, void __user *arg,
>  		extoff->ts[i][1].sec = ts.tv_sec;
>  		extoff->ts[i][1].nsec = ts.tv_nsec;
>  
> -		ts = ktime_to_timespec64(sts.pre_sts.sys);
> +		ts = ktime_to_timespec64(sts.pre_sts.systime);
>  		extoff->ts[i][0].sec = ts.tv_sec;
>  		extoff->ts[i][0].nsec = ts.tv_nsec;
>  
> -		ts = ktime_to_timespec64(sts.post_sts.sys);
> +		ts = ktime_to_timespec64(sts.post_sts.systime);
>  		extoff->ts[i][2].sec = ts.tv_sec;
>  		extoff->ts[i][2].nsec = ts.tv_nsec;
>  	}
> diff --git a/drivers/ptp/ptp_ocp.c b/drivers/ptp/ptp_ocp.c
> index b7a23936a44d..28b0302c6250 100644
> --- a/drivers/ptp/ptp_ocp.c
> +++ b/drivers/ptp/ptp_ocp.c
> @@ -1492,7 +1492,7 @@ __ptp_ocp_gettime_locked(struct ptp_ocp *bp, struct timespec64 *ts,
>  	ptp_read_system_postts(sts);
>  
>  	if (sts && bp->ts_window_adjust)
> -		sts->post_ts.sys -= bp->ts_window_adjust;
> +		sts->post_sts.systime -= bp->ts_window_adjust;
>  
>  	time_ns = ioread32(&bp->reg->time_ns);
>  	time_sec = ioread32(&bp->reg->time_sec);
> @@ -4592,8 +4592,8 @@ ptp_ocp_summary_show(struct seq_file *s, void *data)
>  		struct timespec64 sys_ts;
>  		s64 pre_ns, post_ns, ns;
>  
> -		pre_ns = ktime_to_ns(sts.pre_sts.sys);
> -		post_ns = ktime_to_ns(sts.post_sts.sys);
> +		pre_ns = ktime_to_ns(sts.pre_sts.systime);
> +		post_ns = ktime_to_ns(sts.post_sts.systime);
>  		ns = (pre_ns + post_ns) / 2;
>  		ns += (s64)bp->utc_tai_offset * NSEC_PER_SEC;
>  		sys_ts = ns_to_timespec64(ns);
> diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
> index cb18c15a4697..d6a5a533164a 100644
> --- a/drivers/ptp/ptp_vmclock.c
> +++ b/drivers/ptp/ptp_vmclock.c
> @@ -263,7 +263,7 @@ static int ptp_vmclock_getcrosststamp(struct ptp_clock_info *ptp,
>  	if (ret == -ENODEV) {
>  		struct system_time_snapshot systime_snapshot;
>  
> -		ktime_get_snapshot_id(&systime_snapshot, CLOCK_REALTIME);
> +		ktime_get_snapshot_id(CLOCK_REALTIME, &systime_snapshot);
>  
>  		if (systime_snapshot.cs_id == CSID_X86_TSC ||
>  		    systime_snapshot.cs_id == CSID_X86_KVM_CLK) {
> diff --git a/drivers/virtio/virtio_rtc_ptp.c b/drivers/virtio/virtio_rtc_ptp.c
> index e15d00aeb01d..ff8d834493dc 100644
> --- a/drivers/virtio/virtio_rtc_ptp.c
> +++ b/drivers/virtio/virtio_rtc_ptp.c
> @@ -139,7 +139,7 @@ static int viortc_ptp_getcrosststamp(struct ptp_clock_info *ptp,
>  	if (ret)
>  		return ret;
>  
> -	ktime_get_snapshot_id(&history_begin, xtstamp->clock_id);
> +	ktime_get_snapshot_id(xtstamp->clock_id, &history_begin);
>  	if (history_begin.cs_id != cs_id)
>  		return -EOPNOTSUPP;
>  
> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index 7c38190b10bf..6d9ddf1587a2 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -31,6 +31,21 @@ struct module;
>  
>  #include <vdso/clocksource.h>
>  
> +/**
> + * struct clocksource_hw_snapshot - Snapshot for the underlying hardware counter of derived
> + *				    clocksources like kvmclock or Hyper-V scaled TSC
> + * @hw_cycles:		The hardware counter value
> + * @hw_csid:		Clocksource ID of the hardware counter
> + *
> + * Such clocksources must implement the read_snapshot() callback and fill in the
> + * hardware counter value, the clocksource ID of the hardware counter and derive
> + * the actual clocksource cycles from @hw_cycles to provide an atomic snapshot
> + */
> +struct clocksource_hw_snapshot {
> +	u64			hw_cycles;
> +	enum clocksource_ids	hw_csid;
> +};
> +
>  /**
>   * struct clocksource - hardware abstraction for a free running counter
>   *	Provides mostly state-free accessors to the underlying hardware.
> @@ -72,6 +87,14 @@ struct module;
>   * @flags:		Flags describing special properties
>   * @base:		Hardware abstraction for clock on which a clocksource
>   *			is based
> + * @read_snapshot:	Extended @read() function for clocksources such as
> + *			kvmclock or the Hyper-V scaled TSC where the actual
> + *			clocksource value for timekeeping is calculated from an
> + *			underlying hardware counter. Returns the timekeeping
> + *			relevant cycle value and stores the raw value of the
> + *			underlying counter from which it was calculated
> + *			including the clocksource ID of that counter in the
> + *			clocksource hardware snapshot.
>   * @enable:		Optional function to enable the clocksource
>   * @disable:		Optional function to disable the clocksource
>   * @suspend:		Optional suspend function for the clocksource
> @@ -113,6 +136,7 @@ struct clocksource {
>  	unsigned long		flags;
>  	struct clocksource_base *base;
>  
> +	u64			(*read_snapshot)(struct clocksource *cs, struct clocksource_hw_snapshot *chs);
>  	int			(*enable)(struct clocksource *cs);
>  	void			(*disable)(struct clocksource *cs);
>  	void			(*suspend)(struct clocksource *cs);
> diff --git a/include/linux/pps_kernel.h b/include/linux/pps_kernel.h
> index cd80f1cb96a9..9f088c9023b1 100644
> --- a/include/linux/pps_kernel.h
> +++ b/include/linux/pps_kernel.h
> @@ -102,9 +102,9 @@ static inline void pps_get_ts(struct pps_event_time *ts)
>  #ifdef CONFIG_NTP_PPS
>  	struct system_time_snapshot snap;
>  
> -	ktime_get_snapshot_id(&snap, CLOCK_REALTIME);
> -	ts->ts_real = ktime_to_timespec64(snap.sys);
> -	ts->ts_raw = ktime_to_timespec64(snap.raw);
> +	ktime_get_snapshot_id(CLOCK_REALTIME, &snap);
> +	ts->ts_real = ktime_to_timespec64(snap.systime);
> +	ts->ts_raw = ktime_to_timespec64(snap.monoraw);
>  #else
>  	ktime_get_real_ts64(&ts->ts_real);
>  #endif
> diff --git a/include/linux/ptp_clock_kernel.h b/include/linux/ptp_clock_kernel.h
> index df6c9aac458b..36a27a910595 100644
> --- a/include/linux/ptp_clock_kernel.h
> +++ b/include/linux/ptp_clock_kernel.h
> @@ -511,13 +511,13 @@ static inline ktime_t ptp_convert_timestamp(const ktime_t *hwtstamp,
>  static inline void ptp_read_system_prets(struct ptp_system_timestamp *sts)
>  {
>  	if (sts)
> -		ktime_get_snapshot_id(&sts->pre_sts, sts->clockid);
> +		ktime_get_snapshot_id(sts->clockid, &sts->pre_sts);
>  }
>  
>  static inline void ptp_read_system_postts(struct ptp_system_timestamp *sts)
>  {
>  	if (sts)
> -		ktime_get_snapshot_id(&sts->post_sts, sts->clockid);
> +		ktime_get_snapshot_id(sts->clockid, &sts->post_sts);
>  }
>  
>  #endif
> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> index f7945f1048fc..984a866d293b 100644
> --- a/include/linux/timekeeping.h
> +++ b/include/linux/timekeeping.h
> @@ -279,18 +279,24 @@ static inline bool ktime_get_aux_ts64(clockid_t id, struct timespec64 *kt) { ret
>   * struct system_time_snapshot - Simultaneous time capture of CLOCK_MONOTONIC_RAW,
>   *				 a selected CLOCK_* and the clocksource counter value
>   * @cycles:		Clocksource counter value to produce the system times
> - * @sys:		The system time of the selected CLOCK ID
> - * @raw:		Monotonic raw system time
> + * @hw_cycles:		For derived clocksources, the hardware counter value from
> + *			which @cycles was derived
> + * @systime:		The system time of the selected CLOCK ID
> + * @monoraw:		Monotonic raw system time
>   * @cs_id:		Clocksource ID
> + * @hw_csid:		Clocksource ID of the underlying hardware counter for derived
> + *			clocksources which implement the read_snapshot() callback.
>   * @clock_was_set_seq:	The sequence number of clock-was-set events
>   * @cs_was_changed_seq:	The sequence number of clocksource change events
>   * @valid:		True if the snapshot is valid
>   */
>  struct system_time_snapshot {
>  	u64			cycles;
> -	ktime_t			sys;
> -	ktime_t			raw;
> +	u64			hw_cycles;
> +	ktime_t			systime;
> +	ktime_t			monoraw;
>  	enum clocksource_ids	cs_id;
> +	enum clocksource_ids	hw_csid;
>  	unsigned int		clock_was_set_seq;
>  	u8			cs_was_changed_seq;
>  	u8			valid;
> @@ -348,8 +354,7 @@ extern int get_device_system_crosststamp(
>   * Simultaneously snapshot a given clock with MONOTONIC_RAW and the underlying
>   * clocksource counter value.
>   */
> -extern bool ktime_get_snapshot_id(struct system_time_snapshot *systime_snapshot,
> -				  clockid_t clock_id);
> +extern void ktime_get_snapshot_id(clockid_t clock_id, struct system_time_snapshot *systime_snapshot);
>  
>  /*
>   * Persistent clock related interfaces
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index c4fd7229b7da..0d5b67f609bb 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -320,6 +320,7 @@ static __always_inline u64 tk_clock_read(const struct tk_read_base *tkr)
>  
>  	return clock->read(clock);
>  }
> +
>  static inline void clocksource_disable_inline_read(void) { }
>  static inline void clocksource_enable_inline_read(void) { }
>  #endif
> @@ -1187,14 +1188,26 @@ noinstr time64_t __ktime_get_real_seconds(void)
>  	return tk->xtime_sec;
>  }
>  
> +static inline u64 tk_clock_read_snapshot(const struct tk_read_base *tkr,
> +					 struct clocksource_hw_snapshot *chs)
> +{
> +	struct clocksource *clock = READ_ONCE(tkr->clock);
> +
> +	if (unlikely(clock->read_snapshot))
> +		return clock->read_snapshot(clock, chs);
> +
> +	return clock->read(clock);
> +}
> +
> +
>  /**
>   * ktime_get_snapshot_id -  Simultaneously snapshot a given clock ID with
>   *			    CLOCK_MONOTONIC_RAW and the underlying
>   *			    clocksource counter value.
> - * @systime_snapshot:	Pointer to struct receiving the system time snapshot
>   * @clock_id:		The clock ID to snapshot
> + * @systime_snapshot:	Pointer to struct receiving the system time snapshot
>   */
> -bool ktime_get_snapshot_id(struct system_time_snapshot *systime_snapshot, clockid_t clock_id)
> +void ktime_get_snapshot_id(clockid_t clock_id, struct system_time_snapshot *systime_snapshot)
>  {
>  	ktime_t base_raw, base_sys, offs_sys, *offs, offs_zero = 0;
>  	u64 nsec_raw, nsec_sys, now;
> @@ -1206,7 +1219,7 @@ bool ktime_get_snapshot_id(struct system_time_snapshot *systime_snapshot, clocki
>  	systime_snapshot->valid = false;
>  
>  	if (WARN_ON_ONCE(timekeeping_suspended))
> -		return false;
> +		return;
>  
>  	switch (clock_id) {
>  	case CLOCK_REALTIME:
> @@ -1226,25 +1239,31 @@ bool ktime_get_snapshot_id(struct system_time_snapshot *systime_snapshot, clocki
>  	case CLOCK_AUX ... CLOCK_AUX_LAST:
>  		tkd = aux_get_tk_data(clock_id);
>  		if (!tkd)
> -			return false;
> +			return;
>  		offs = &tkd->timekeeper.offs_aux;
>  		break;
>  	default:
>  		WARN_ON_ONCE(1);
> -		return false;
> +		return;
>  	}
>  
>  	tk = &tkd->timekeeper;
>  
>  	do {
> +		struct clocksource_hw_snapshot chs = { };
> +
>  		seq = read_seqcount_begin(&tkd->seq);
>  
>  		/* Aux clocks can be invalid */
>  		if (!tk->clock_valid)
> -			return false;
> +			return;
>  
> -		now = tk_clock_read(&tk->tkr_mono);
> +		now = tk_clock_read_snapshot(&tk->tkr_mono, &chs);
>  		systime_snapshot->cs_id = tk->tkr_mono.clock->id;
> +
> +		systime_snapshot->hw_cycles = chs.hw_cycles;
> +		systime_snapshot->hw_csid = chs.hw_csid;
> +
>  		systime_snapshot->cs_was_changed_seq = tk->cs_was_changed_seq;
>  		systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
>  
> @@ -1257,18 +1276,17 @@ bool ktime_get_snapshot_id(struct system_time_snapshot *systime_snapshot, clocki
>  	} while (read_seqcount_retry(&tkd->seq, seq));
>  
>  	systime_snapshot->cycles = now;
> -	systime_snapshot->sys = ktime_add_ns(base_sys, offs_sys + nsec_sys);
> -	systime_snapshot->raw = ktime_add_ns(base_raw, nsec_raw);
> +	systime_snapshot->systime = ktime_add_ns(base_sys, offs_sys + nsec_sys);
> +	systime_snapshot->monoraw = ktime_add_ns(base_raw, nsec_raw);
>  
>  	/*
>  	 * Special case for PTP. Just transfer the raw time into sys,
> -	 * so the call sites can consistently use snap::sys.
> +	 * so the call sites can consistently use snap::systime.
>  	 */
>  	if (clock_id == CLOCK_MONOTONIC_RAW)
> -		systime_snapshot->sys = systime_snapshot->raw;
> +		systime_snapshot->systime = systime_snapshot->monoraw;
>  	/* Tell the consumer that this snapshot is valid */
>  	systime_snapshot->valid = true;
> -	return true;
>  }
>  EXPORT_SYMBOL_GPL(ktime_get_snapshot_id);
>  
> @@ -1330,7 +1348,7 @@ static int adjust_historical_crosststamp(struct system_time_snapshot *history,
>  	 * Scale the monotonic raw time delta by:
>  	 *	partial_history_cycles / total_history_cycles
>  	 */
> -	corr_raw = (u64)ktime_to_ns(ktime_sub(ts->sys_monoraw, history->raw));
> +	corr_raw = (u64)ktime_to_ns(ktime_sub(ts->sys_monoraw, history->monoraw));
>  	ret = scale64_check_overflow(partial_history_cycles,
>  				     total_history_cycles, &corr_raw);
>  	if (ret)
> @@ -1347,7 +1365,7 @@ static int adjust_historical_crosststamp(struct system_time_snapshot *history,
>  	if (discontinuity) {
>  		corr_sys = mul_u64_u32_div(corr_raw, tk->tkr_mono.mult, tk->tkr_raw.mult);
>  	} else {
> -		corr_sys = (u64)ktime_to_ns(ktime_sub(ts->sys_systime, history->sys));
> +		corr_sys = (u64)ktime_to_ns(ktime_sub(ts->sys_systime, history->systime));
>  		ret = scale64_check_overflow(partial_history_cycles, total_history_cycles,
>  					     &corr_sys);
>  		if (ret)
> @@ -1356,8 +1374,8 @@ static int adjust_historical_crosststamp(struct system_time_snapshot *history,
>  
>  	/* Fixup monotonic raw and system time time values */
>  	if (interp_forward) {
> -		ts->sys_monoraw = ktime_add_ns(history->raw, corr_raw);
> -		ts->sys_systime = ktime_add_ns(history->sys, corr_sys);
> +		ts->sys_monoraw = ktime_add_ns(history->monoraw, corr_raw);
> +		ts->sys_systime = ktime_add_ns(history->systime, corr_sys);
>  	} else {
>  		ts->sys_monoraw = ktime_sub_ns(ts->sys_monoraw, corr_raw);
>  		ts->sys_systime = ktime_sub_ns(ts->sys_systime, corr_sys);
> @@ -1521,12 +1539,12 @@ int get_device_system_crosststamp(int (*get_time_fn)
>  	case CLOCK_AUX ... CLOCK_AUX_LAST:
>  		tkd = aux_get_tk_data(xtstamp->clock_id);
>  		if (!tkd)
> -			return false;
> +			return -ENODEV;
>  		offs = &tkd->timekeeper.offs_aux;
>  		break;
>  	default:
>  		WARN_ON_ONCE(1);
> -		return false;
> +		return -ENODEV;
>  	}
>  
>  	tk = &tkd->timekeeper;
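
For readers following the quoted delta, the two V2 behaviours (the
optional read_snapshot() fallback and the void return with a valid flag)
can be modelled in user space roughly as below.  This is a sketch under
assumptions, not the kernel code: every model_* name, the fixed counter
values and the divide-by-two scaling are made up for illustration, and a
NULL clocksource merely stands in for the kernel's early-return paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the clocksource IDs. */
enum model_csid { MODEL_CSID_NONE = 0, MODEL_CSID_TSC, MODEL_CSID_KVMCLOCK };

struct model_hw_snapshot {
	uint64_t hw_cycles;		/* raw hardware counter value */
	enum model_csid hw_csid;	/* ID of that hardware counter */
};

struct model_clocksource {
	enum model_csid id;
	uint64_t (*read)(struct model_clocksource *cs);
	/* Optional: only derived clocksources provide it. */
	uint64_t (*read_snapshot)(struct model_clocksource *cs,
				  struct model_hw_snapshot *chs);
};

struct model_snapshot {
	uint64_t cycles, hw_cycles;
	enum model_csid cs_id, hw_csid;
	bool valid;
};

/* Prefer read_snapshot() when present, else fall back to read(). */
static uint64_t model_clock_read_snapshot(struct model_clocksource *cs,
					  struct model_hw_snapshot *chs)
{
	if (cs->read_snapshot)
		return cs->read_snapshot(cs, chs);
	return cs->read(cs);
}

/* Returns nothing: callers have to evaluate snap->valid themselves. */
static void model_get_snapshot(struct model_clocksource *cs,
			       struct model_snapshot *snap)
{
	struct model_hw_snapshot chs = { 0 };

	snap->valid = false;
	if (!cs)
		return;

	snap->cycles = model_clock_read_snapshot(cs, &chs);
	snap->cs_id = cs->id;
	snap->hw_cycles = chs.hw_cycles;	/* stays 0 without read_snapshot() */
	snap->hw_csid = chs.hw_csid;
	snap->valid = true;
}

/* A plain counter with a fixed illustrative value. */
static uint64_t model_plain_read(struct model_clocksource *cs)
{
	(void)cs;
	return 1000;
}

/* A derived counter: reports the raw value and a scaled one. */
static uint64_t model_derived_read_snapshot(struct model_clocksource *cs,
					    struct model_hw_snapshot *chs)
{
	(void)cs;
	chs->hw_cycles = 2000;
	chs->hw_csid = MODEL_CSID_TSC;
	return chs->hw_cycles / 2;
}

static uint64_t model_derived_read(struct model_clocksource *cs)
{
	struct model_hw_snapshot unused;

	return model_derived_read_snapshot(cs, &unused);
}
```

A caller would fill in a struct model_clocksource, call
model_get_snapshot() and check snap.valid before using any other field,
mirroring the remark in the changelog that system_time_snapshot::valid
has to be evaluated at the call sites anyway.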



* Re: [PATCH v9 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Michael S. Tsirkin @ 2026-06-01 12:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Jann Horn,
	Pedro Falcato, Harry Yoo, Hao Li
In-Reply-To: <a526a713dbba63c7a69df37b67ed1590f77fefd3.1780067977.git.mst@redhat.com>

On Fri, May 29, 2026 at 11:22:42AM -0400, Michael S. Tsirkin wrote:
> Thread a user virtual address from vma_alloc_folio() down through
> the page allocator to post_alloc_hook(). This is plumbing
> preparation for a subsequent patch that will use user_addr to
> call folio_zero_user() for cache-friendly zeroing of user pages.
> 
> The user_addr is stored in struct alloc_context and flows through:
>   vma_alloc_folio -> folio_alloc_mpol -> __alloc_pages_mpol ->
>   __alloc_frozen_pages -> get_page_from_freelist -> prep_new_page ->
>   post_alloc_hook
> 
> USER_ADDR_NONE ((unsigned long)-1) is used for non-user
> allocations, since address 0 is a valid userspace mapping.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh
> ---
>  include/linux/gfp.h |  2 +-
>  mm/compaction.c     |  5 ++---
>  mm/hugetlb.c        | 36 ++++++++++++++++++++----------------
>  mm/internal.h       | 22 +++++++++++++++++++---
>  mm/mempolicy.c      | 44 ++++++++++++++++++++++++++++++++------------
>  mm/mmap.c           |  6 ++++++
>  mm/page_alloc.c     | 44 +++++++++++++++++++++++++++++---------------
>  mm/slub.c           |  4 ++--
>  8 files changed, 111 insertions(+), 52 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 7ccbda35b9ad..ee35c5367abc 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -337,7 +337,7 @@ static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
>  static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
>  		struct mempolicy *mpol, pgoff_t ilx, int nid)
>  {
> -	return folio_alloc_noprof(gfp, order);
> +	return __folio_alloc_noprof(gfp, order, numa_node_id(), NULL);
>  }
>  #endif
>  
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 3648ce22c807..72684fe81e83 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -82,7 +82,7 @@ static inline bool is_via_compact_memory(int order) { return false; }
>  
>  static struct page *mark_allocated_noprof(struct page *page, unsigned int order, gfp_t gfp_flags)
>  {
> -	post_alloc_hook(page, order, __GFP_MOVABLE);
> +	post_alloc_hook(page, order, __GFP_MOVABLE, USER_ADDR_NONE);
>  	set_page_refcounted(page);
>  	return page;
>  }
> @@ -1849,8 +1849,7 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
>  		set_page_private(&freepage[size], start_order);
>  	}
>  	dst = (struct folio *)freepage;
> -
> -	post_alloc_hook(&dst->page, order, __GFP_MOVABLE);
> +	post_alloc_hook(&dst->page, order, __GFP_MOVABLE, USER_ADDR_NONE);
>  	set_page_refcounted(&dst->page);
>  	if (order)
>  		prep_compound_page(&dst->page, order);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f24bf49be047..a999f3ead852 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1806,7 +1806,8 @@ struct address_space *hugetlb_folio_mapping_lock_write(struct folio *folio)
>  }
>  
>  static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
> -		int nid, nodemask_t *nmask, nodemask_t *node_alloc_noretry)
> +		int nid, nodemask_t *nmask, nodemask_t *node_alloc_noretry,
> +		unsigned long addr)
>  {
>  	struct folio *folio;
>  	bool alloc_try_hard = true;
> @@ -1823,7 +1824,7 @@ static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
>  	if (alloc_try_hard)
>  		gfp_mask |= __GFP_RETRY_MAYFAIL;
>  
> -	folio = (struct folio *)__alloc_frozen_pages(gfp_mask, order, nid, nmask);
> +	folio = (struct folio *)__alloc_frozen_pages(gfp_mask, order, nid, nmask, addr);
>  
>  	/*
>  	 * If we did not specify __GFP_RETRY_MAYFAIL, but still got a
> @@ -1852,7 +1853,7 @@ static struct folio *alloc_buddy_frozen_folio(int order, gfp_t gfp_mask,
>  
>  static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
>  		gfp_t gfp_mask, int nid, nodemask_t *nmask,
> -		nodemask_t *node_alloc_noretry)
> +		nodemask_t *node_alloc_noretry, unsigned long addr)
>  {
>  	struct folio *folio;
>  	int order = huge_page_order(h);
> @@ -1864,7 +1865,7 @@ static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
>  		folio = alloc_gigantic_frozen_folio(order, gfp_mask, nid, nmask);
>  	else
>  		folio = alloc_buddy_frozen_folio(order, gfp_mask, nid, nmask,
> -						 node_alloc_noretry);
> +						 node_alloc_noretry, addr);
>  	if (folio)
>  		init_new_hugetlb_folio(folio);
>  	return folio;
> @@ -1878,11 +1879,12 @@ static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
>   * pages is zero, and the accounting must be done in the caller.
>   */
>  static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
> -		gfp_t gfp_mask, int nid, nodemask_t *nmask)
> +		gfp_t gfp_mask, int nid, nodemask_t *nmask,
> +		unsigned long addr)
>  {
>  	struct folio *folio;
>  
> -	folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL);
> +	folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL, addr);
>  	if (folio)
>  		hugetlb_vmemmap_optimize_folio(h, folio);
>  	return folio;
> @@ -1922,7 +1924,7 @@ static struct folio *alloc_pool_huge_folio(struct hstate *h,
>  		struct folio *folio;
>  
>  		folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
> -					nodes_allowed, node_alloc_noretry);
> +					nodes_allowed, node_alloc_noretry, USER_ADDR_NONE);
>  		if (folio)
>  			return folio;
>  	}
> @@ -2091,7 +2093,8 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn, unsigned long end_pfn)
>   * Allocates a fresh surplus page from the page allocator.
>   */
>  static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
> -				gfp_t gfp_mask,	int nid, nodemask_t *nmask)
> +				gfp_t gfp_mask,	int nid, nodemask_t *nmask,
> +				unsigned long addr)
>  {
>  	struct folio *folio = NULL;
>  
> @@ -2103,7 +2106,7 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
>  		goto out_unlock;
>  	spin_unlock_irq(&hugetlb_lock);
>  
> -	folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask);
> +	folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, addr);
>  	if (!folio)
>  		return NULL;
>  
> @@ -2146,7 +2149,7 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
>  	if (hstate_is_gigantic(h))
>  		return NULL;
>  
> -	folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask);
> +	folio = alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, USER_ADDR_NONE);
>  	if (!folio)
>  		return NULL;
>  
> @@ -2182,14 +2185,14 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
>  	if (mpol_is_preferred_many(mpol)) {
>  		gfp_t gfp = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
>  
> -		folio = alloc_surplus_hugetlb_folio(h, gfp, nid, nodemask);
> +		folio = alloc_surplus_hugetlb_folio(h, gfp, nid, nodemask, addr);
>  
>  		/* Fallback to all nodes if page==NULL */
>  		nodemask = NULL;
>  	}
>  
>  	if (!folio)
> -		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, nid, nodemask);
> +		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, nid, nodemask, addr);
>  	mpol_cond_put(mpol);
>  	return folio;
>  }
> @@ -2296,7 +2299,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  		 * down the road to pick the current node if that is the case.
>  		 */
>  		folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
> -						    NUMA_NO_NODE, &alloc_nodemask);
> +						    NUMA_NO_NODE, &alloc_nodemask,
> +						    USER_ADDR_NONE);
>  		if (!folio) {
>  			alloc_ok = false;
>  			break;
> @@ -2702,7 +2706,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
>  			spin_unlock_irq(&hugetlb_lock);
>  			gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>  			new_folio = alloc_fresh_hugetlb_folio(h, gfp_mask,
> -							      nid, NULL);
> +							      nid, NULL, USER_ADDR_NONE);
>  			if (!new_folio)
>  				return -ENOMEM;
>  			goto retry;
> @@ -3400,13 +3404,13 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
>  			gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>  
>  			folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid,
> -					&node_states[N_MEMORY], NULL);
> +					&node_states[N_MEMORY], NULL, USER_ADDR_NONE);
>  			if (!folio && !list_empty(&folio_list) &&
>  			    hugetlb_vmemmap_optimizable_size(h)) {
>  				prep_and_add_allocated_folios(h, &folio_list);
>  				INIT_LIST_HEAD(&folio_list);
>  				folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid,
> -						&node_states[N_MEMORY], NULL);
> +						&node_states[N_MEMORY], NULL, USER_ADDR_NONE);
>  			}
>  			if (!folio)
>  				break;
> diff --git a/mm/internal.h b/mm/internal.h
> index 5a2ddcf68e0b..389098200aa6 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -662,6 +662,16 @@ void calculate_min_free_kbytes(void);
>  int __meminit init_per_zone_wmark_min(void);
>  void page_alloc_sysctl_init(void);
>  
> +/*
> + * Sentinel for user_addr: indicates a non-user allocation.
> + * Cannot use 0 because address 0 is a valid userspace mapping.
> + * (unsigned long)-1 is safe because:
> + * 1. vm_end = addr + len <= TASK_SIZE, and vm_end is exclusive,
> + *    so -1 is never inside any VMA.
> + * 2. It will only be compared to page-aligned addresses.
> + */
> +#define USER_ADDR_NONE	((unsigned long)-1)
> +
>  /*
>   * Structure for holding the mostly immutable allocation parameters passed
>   * between functions involved in allocations, including the alloc_pages*
> @@ -693,6 +703,7 @@ struct alloc_context {
>  	 */
>  	enum zone_type highest_zoneidx;
>  	bool spread_dirty_pages;
> +	unsigned long user_addr;
>  };
>  
>  /*
> @@ -916,24 +927,29 @@ static inline void init_compound_tail(struct page *tail,
>  	prep_compound_tail(tail, head, order);
>  }
>  
> -void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
> +void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> +		     unsigned long user_addr);
>  extern bool free_pages_prepare(struct page *page, unsigned int order);
>  
>  extern int user_min_free_kbytes;
>  
>  struct page *__alloc_frozen_pages_noprof(gfp_t, unsigned int order, int nid,
> -		nodemask_t *);
> +		nodemask_t *, unsigned long user_addr);
>  #define __alloc_frozen_pages(...) \
>  	alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
>  void free_frozen_pages(struct page *page, unsigned int order);
> +void free_frozen_pages_zeroed(struct page *page, unsigned int order);


sashiko pointed this one out:
https://sashiko.dev/#/patchset/cover.1780067977.git.mst%40redhat.com?part=7

this landed here by mistake during one of the rebases, harmless but
ideally belongs in patch 33.  Will move if there's a v10.

its other findings on this patch seem like false positives.

>  void free_unref_folios(struct folio_batch *fbatch);
>  
>  #ifdef CONFIG_NUMA
>  struct page *alloc_frozen_pages_noprof(gfp_t, unsigned int order);
> +struct folio *folio_alloc_mpol_user_noprof(gfp_t gfp, unsigned int order,
> +		struct mempolicy *pol, pgoff_t ilx, int nid,
> +		unsigned long user_addr);
>  #else
>  static inline struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order)
>  {
> -	return __alloc_frozen_pages_noprof(gfp, order, numa_node_id(), NULL);
> +	return __alloc_frozen_pages_noprof(gfp, order, numa_node_id(), NULL, USER_ADDR_NONE);
>  }
>  #endif
>  
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index a1707ad498a8..f573ff32e94d 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2413,7 +2413,8 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk,
>  }
>  
>  static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
> -						int nid, nodemask_t *nodemask)
> +						int nid, nodemask_t *nodemask,
> +						unsigned long user_addr)
>  {
>  	struct page *page;
>  	gfp_t preferred_gfp;
> @@ -2426,25 +2427,29 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
>  	 */
>  	preferred_gfp = gfp | __GFP_NOWARN;
>  	preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
> -	page = __alloc_frozen_pages_noprof(preferred_gfp, order, nid, nodemask);
> +	page = __alloc_frozen_pages_noprof(preferred_gfp, order, nid,
> +					   nodemask, user_addr);
>  	if (!page)
> -		page = __alloc_frozen_pages_noprof(gfp, order, nid, NULL);
> +		page = __alloc_frozen_pages_noprof(gfp, order, nid, NULL,
> +						   user_addr);
>  
>  	return page;
>  }
>  
>  /**
> - * alloc_pages_mpol - Allocate pages according to NUMA mempolicy.
> + * __alloc_pages_mpol - Allocate pages according to NUMA mempolicy.
>   * @gfp: GFP flags.
>   * @order: Order of the page allocation.
>   * @pol: Pointer to the NUMA mempolicy.
>   * @ilx: Index for interleave mempolicy (also distinguishes alloc_pages()).
>   * @nid: Preferred node (usually numa_node_id() but @mpol may override it).
> + * @user_addr: User fault address for cache-friendly zeroing, or USER_ADDR_NONE.
>   *
>   * Return: The page on success or NULL if allocation fails.
>   */
> -static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
> -		struct mempolicy *pol, pgoff_t ilx, int nid)
> +static struct page *__alloc_pages_mpol(gfp_t gfp, unsigned int order,
> +		struct mempolicy *pol, pgoff_t ilx, int nid,
> +		unsigned long user_addr)
>  {
>  	nodemask_t *nodemask;
>  	struct page *page;
> @@ -2452,7 +2457,8 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  	nodemask = policy_nodemask(gfp, pol, ilx, &nid);
>  
>  	if (pol->mode == MPOL_PREFERRED_MANY)
> -		return alloc_pages_preferred_many(gfp, order, nid, nodemask);
> +		return alloc_pages_preferred_many(gfp, order, nid, nodemask,
> +						 user_addr);
>  
>  	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
>  	    /* filter "hugepage" allocation, unless from alloc_pages() */
> @@ -2476,7 +2482,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  			 */
>  			page = __alloc_frozen_pages_noprof(
>  				gfp | __GFP_THISNODE | __GFP_NORETRY, order,
> -				nid, NULL);
> +				nid, NULL, user_addr);
>  			if (page || !(gfp & __GFP_DIRECT_RECLAIM))
>  				return page;
>  			/*
> @@ -2488,7 +2494,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  		}
>  	}
>  
> -	page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask);
> +	page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask, user_addr);
>  
>  	if (unlikely(pol->mode == MPOL_INTERLEAVE ||
>  		     pol->mode == MPOL_WEIGHTED_INTERLEAVE) && page) {
> @@ -2504,11 +2510,18 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  	return page;
>  }
>  
> -struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
> +static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  		struct mempolicy *pol, pgoff_t ilx, int nid)
>  {
> -	struct page *page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
> -			ilx, nid);
> +	return __alloc_pages_mpol(gfp, order, pol, ilx, nid, USER_ADDR_NONE);
> +}
> +
> +struct folio *folio_alloc_mpol_user_noprof(gfp_t gfp, unsigned int order,
> +		struct mempolicy *pol, pgoff_t ilx, int nid,
> +		unsigned long user_addr)
> +{
> +	struct page *page = __alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
> +			ilx, nid, user_addr);
>  	if (!page)
>  		return NULL;
>  
> @@ -2516,6 +2529,13 @@ struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
>  	return page_rmappable_folio(page);
>  }
>  
> +struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
> +		struct mempolicy *pol, pgoff_t ilx, int nid)
> +{
> +	return folio_alloc_mpol_user_noprof(gfp, order, pol, ilx, nid,
> +					    USER_ADDR_NONE);
> +}
> +
>  struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned order)
>  {
>  	struct mempolicy *pol = &default_policy;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5754d1c36462..73413cebc418 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -855,6 +855,12 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>  	if (IS_ERR_VALUE(addr))
>  		return addr;
>  
> +	/*
> +	 * The check below ensures vm_end = addr + len <= TASK_SIZE.
> +	 * Since (unsigned long)-1 (USER_ADDR_NONE) >= TASK_SIZE and
> +	 * vm_end is exclusive, USER_ADDR_NONE is thus never a valid
> +	 * userspace address.
> +	 */
>  	if (addr > TASK_SIZE - len)
>  		return -ENOMEM;
>  	if (offset_in_page(addr))
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0c4f4c678233..b96c9892f6c6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1819,7 +1819,7 @@ static inline bool should_skip_init(gfp_t flags)
>  }
>  
>  inline void post_alloc_hook(struct page *page, unsigned int order,
> -				gfp_t gfp_flags)
> +				gfp_t gfp_flags, unsigned long user_addr)
>  {
>  	bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
>  			!should_skip_init(gfp_flags);
> @@ -1874,9 +1874,10 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  }
>  
>  static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
> -							unsigned int alloc_flags)
> +							unsigned int alloc_flags,
> +							unsigned long user_addr)
>  {
> -	post_alloc_hook(page, order, gfp_flags);
> +	post_alloc_hook(page, order, gfp_flags, user_addr);
>  
>  	if (order && (gfp_flags & __GFP_COMP))
>  		prep_compound_page(page, order);
> @@ -3958,7 +3959,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  		page = rmqueue(zonelist_zone(ac->preferred_zoneref), zone, order,
>  				gfp_mask, alloc_flags, ac->migratetype);
>  		if (page) {
> -			prep_new_page(page, order, gfp_mask, alloc_flags);
> +			prep_new_page(page, order, gfp_mask, alloc_flags,
> +				      ac->user_addr);
>  
>  			return page;
>  		} else {
> @@ -4186,7 +4188,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  
>  	/* Prep a captured page if available */
>  	if (page)
> -		prep_new_page(page, order, gfp_mask, alloc_flags);
> +		prep_new_page(page, order, gfp_mask, alloc_flags,
> +			      ac->user_addr);
>  
>  	/* Try get a page from the freelist if available */
>  	if (!page)
> @@ -5063,7 +5066,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  	struct zoneref *z;
>  	struct per_cpu_pages *pcp;
>  	struct list_head *pcp_list;
> -	struct alloc_context ac;
> +	struct alloc_context ac = { .user_addr = USER_ADDR_NONE };
>  	gfp_t alloc_gfp;
>  	unsigned int alloc_flags = ALLOC_WMARK_LOW;
>  	int nr_populated = 0, nr_account = 0;
> @@ -5178,7 +5181,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  		}
>  		nr_account++;
>  
> -		prep_new_page(page, 0, gfp, 0);
> +		prep_new_page(page, 0, gfp, 0, USER_ADDR_NONE);
>  		set_page_refcounted(page);
>  		page_array[nr_populated++] = page;
>  	}
> @@ -5203,12 +5206,13 @@ EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
>   * This is the 'heart' of the zoned buddy allocator.
>   */
>  struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
> -		int preferred_nid, nodemask_t *nodemask)
> +		int preferred_nid, nodemask_t *nodemask,
> +		unsigned long user_addr)
>  {
>  	struct page *page;
>  	unsigned int alloc_flags = ALLOC_WMARK_LOW;
>  	gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
> -	struct alloc_context ac = { };
> +	struct alloc_context ac = { .user_addr = user_addr };
>  
>  	/*
>  	 * There are several places where we assume that the order value is sane
> @@ -5269,10 +5273,12 @@ EXPORT_SYMBOL(__alloc_frozen_pages_noprof);
>  
>  struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>  		int preferred_nid, nodemask_t *nodemask)
> +
>  {
>  	struct page *page;
>  
> -	page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
> +	page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid,
> +					   nodemask, USER_ADDR_NONE);
>  	if (page)
>  		set_page_refcounted(page);
>  	return page;
> @@ -5315,7 +5321,8 @@ struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
>  		gfp |= __GFP_NOWARN;
>  
>  	pol = get_vma_policy(vma, addr, order, &ilx);
> -	folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
> +	folio = folio_alloc_mpol_user_noprof(gfp, order, pol, ilx,
> +					     numa_node_id(), addr);
>  	mpol_cond_put(pol);
>  	return folio;
>  }
> @@ -5323,10 +5330,17 @@ struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
>  struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
>  		struct vm_area_struct *vma, unsigned long addr)
>  {
> +	struct page *page;
> +
>  	if (vma->vm_flags & VM_DROPPABLE)
>  		gfp |= __GFP_NOWARN;
>  
> -	return folio_alloc_noprof(gfp, order);
> +	page = __alloc_frozen_pages_noprof(gfp | __GFP_COMP, order,
> +					   numa_node_id(), NULL, addr);
> +	if (!page)
> +		return NULL;
> +	set_page_refcounted(page);
> +	return page_rmappable_folio(page);
>  }
>  #endif
>  EXPORT_SYMBOL(vma_alloc_folio_noprof);
> @@ -6907,7 +6921,7 @@ static void split_free_frozen_pages(struct list_head *list, gfp_t gfp_mask)
>  		list_for_each_entry_safe(page, next, &list[order], lru) {
>  			int i;
>  
> -			post_alloc_hook(page, order, gfp_mask);
> +			post_alloc_hook(page, order, gfp_mask, USER_ADDR_NONE);
>  			if (!order)
>  				continue;
>  
> @@ -7113,7 +7127,7 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
>  		struct page *head = pfn_to_page(start);
>  
>  		check_new_pages(head, order);
> -		prep_new_page(head, order, gfp_mask, 0);
> +		prep_new_page(head, order, gfp_mask, 0, USER_ADDR_NONE);
>  	} else {
>  		ret = -EINVAL;
>  		WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
> @@ -7778,7 +7792,7 @@ struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned
>  	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC | __GFP_COMP
>  			| gfp_flags;
>  	unsigned int alloc_flags = ALLOC_TRYLOCK;
> -	struct alloc_context ac = { };
> +	struct alloc_context ac = { .user_addr = USER_ADDR_NONE };
>  	struct page *page;
>  
>  	VM_WARN_ON_ONCE(gfp_flags & ~__GFP_ACCOUNT);
> diff --git a/mm/slub.c b/mm/slub.c
> index 0baa906f39ab..74dd2d96941b 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3275,7 +3275,7 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
>  	else if (node == NUMA_NO_NODE)
>  		page = alloc_frozen_pages(flags, order);
>  	else
> -		page = __alloc_frozen_pages(flags, order, node, NULL);
> +		page = __alloc_frozen_pages(flags, order, node, NULL, USER_ADDR_NONE);
>  
>  	if (!page)
>  		return NULL;
> @@ -5235,7 +5235,7 @@ static void *___kmalloc_large_node(size_t size, gfp_t flags, int node)
>  	if (node == NUMA_NO_NODE)
>  		page = alloc_frozen_pages_noprof(flags, order);
>  	else
> -		page = __alloc_frozen_pages_noprof(flags, order, node, NULL);
> +		page = __alloc_frozen_pages_noprof(flags, order, node, NULL, USER_ADDR_NONE);
>  
>  	if (page) {
>  		ptr = page_address(page);
> -- 
> MST
> 


^ permalink raw reply

* Re: [PATCH v9 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Michael S. Tsirkin @ 2026-06-01 12:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <cover.1780067977.git.mst@redhat.com>


I went over the Sashiko review of this:
https://sashiko.dev/#/patchset/cover.1780067977.git.mst%40redhat.com

It's mostly Sashiko apparently not seeing the whole picture
(e.g. it says "an optimization is missing" and it is actually
 added in follow-up patches).

Sometimes it is unhappy while there's an explanation right in the commit
log. Weird.


It's a pity there's no way to respond to the review, to make it pay more
attention and reconsider.


It did flag a minor rebase artifact (declaration landed in the wrong
patch during rebase). Mostly harmless but if there's a v10, I'll fix that.

But it also flagged 4 pre-existing bugs:

  - page_reporting_register before DRIVER_OK, theoretical race in virtballoon_probe ordering:
  https://sashiko.dev/#/patchset/cover.1780067977.git.mst%40redhat.com?part=4

  - spurious OOM on large-folio swapin race in do_swap_page:
  https://sashiko.dev/#/patchset/cover.1780067977.git.mst%40redhat.com?part=16

  - free_huge_folio called with refcount==1 on mem_cgroup_charge_hugetlb failure:
  https://sashiko.dev/#/patchset/cover.1780067977.git.mst%40redhat.com?part=20

  - double-decrement of resv_huge_pages in memfd_alloc_folio error path:
  https://sashiko.dev/#/patchset/cover.1780067977.git.mst%40redhat.com?part=21

All seem unrelated to the specific feature, so I'd rather put off fixing
these until the feature is merged.

-- 
MST


^ permalink raw reply

* Re: [PATCH v4 07/10] drm/damage-helper: Remove old state from drm_atomic_helper_damage_iter_init()
From: Hamza Mahfooz @ 2026-06-01 14:01 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: mripard, maarten.lankhorst, airlied, airlied, simona, admin,
	gargaditya08, paul, jani.nikula, mhklinux, zack.rusin,
	bcm-kernel-feedback-list, dri-devel, linux-hyperv, intel-gfx,
	intel-xe, linux-mips, virtualization
In-Reply-To: <20260530185716.65688-8-tzimmermann@suse.de>

On Sat, May 30, 2026 at 08:53:20PM +0200, Thomas Zimmermann wrote:
> Nothing in drm_atomic_helper_damage_iter_init() requires the old
> plane state. Remove the parameter and mass-convert callers.
> 
> Most callers now no longer require the old plane state in their plane's
> atomic_update helper. Remove it as well.
> 
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Acked-by: Zack Rusin <zack.rusin@broadcom.com>
> ---

Acked-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> # hyperv

^ permalink raw reply

* [PATCH 1/7] drm/vblank: timer: Return success status from get_vblank_timeout
From: Thomas Zimmermann @ 2026-06-01 14:08 UTC (permalink / raw)
  To: simona, michel.daenzer, louis.chauvet, ville.syrjala, jani.nikula,
	mhklkml, maarten.lankhorst, mripard, airlied
  Cc: dri-devel, amd-gfx, virtualization, Thomas Zimmermann
In-Reply-To: <20260601141922.91498-1-tzimmermann@suse.de>

Return true/false from drm_crtc_vblank_get_vblank_timeout(), depending
on the success of the calculation. Let the caller handle failure itself.

Until now the helper tried to return a vblank time even in the case of
an error. Letting the caller handle the failure is the preferred behavior.
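
To illustrate the calling convention described above, here is a minimal
userspace sketch, not kernel code. struct fake_timer,
get_vblank_timeout() and timestamp_or_fallback() are hypothetical names
made up for this example; they only model the idea that the helper
reports failure and the caller chooses its own fallback.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the vblank timer state (illustration only). */
struct fake_timer {
	bool enabled;
	int64_t expires_ns;	/* expiry time of the next vblank timeout */
	int64_t interval_ns;	/* duration of one frame */
};

/*
 * Models the new contract: write *vblank_time_ns and return true on
 * success; return false without touching the output on failure.
 */
static bool get_vblank_timeout(const struct fake_timer *t,
			       int64_t *vblank_time_ns)
{
	if (!t->enabled)
		return false;

	*vblank_time_ns = t->expires_ns - t->interval_ns;
	return true;
}

/* The caller, not the helper, decides what to do when the call fails. */
static int64_t timestamp_or_fallback(const struct fake_timer *t,
				     int64_t now_ns)
{
	int64_t ts;

	if (get_vblank_timeout(t, &ts))
		return ts;

	return now_ns;	/* fallback policy picked by this particular caller */
}
```

With the timer enabled, the caller gets the computed timestamp. With it
disabled, the helper no longer substitutes the current time on its own,
and the caller's fallback is used instead.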

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/drm_vblank.c        | 15 +++++++++------
 drivers/gpu/drm/drm_vblank_helper.c |  4 +---
 include/drm/drm_vblank.h            |  2 +-
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index f90fb2d13e42..96d70c3d4522 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -2287,18 +2287,19 @@ EXPORT_SYMBOL(drm_crtc_vblank_cancel_timer);
  * The helper drm_crtc_vblank_get_vblank_timeout() returns the next vblank
  * timestamp of the CRTC's vblank timer according to the timer's expiry
  * time.
+ *
+ * Returns:
+ * True on success, or false otherwise.
  */
-void drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_time)
+bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_time)
 {
 	struct drm_vblank_crtc *vblank = drm_crtc_vblank_crtc(crtc);
 	struct drm_vblank_crtc_timer *vtimer = &vblank->vblank_timer;
 	u64 cur_count;
 	ktime_t cur_time;
 
-	if (!READ_ONCE(vblank->enabled)) {
-		*vblank_time = ktime_get();
-		return;
-	}
+	if (!READ_ONCE(vblank->enabled))
+		return false;
 
 	/*
 	 * A concurrent vblank timeout could update the expires field before
@@ -2312,7 +2313,7 @@ void drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_t
 	} while (cur_count != drm_crtc_vblank_count_and_time(crtc, &cur_time));
 
 	if (drm_WARN_ON(crtc->dev, !ktime_compare(*vblank_time, cur_time)))
-		return; /* Already expired */
+		return false; /* Already expired */
 
 	/*
 	 * To prevent races we roll the hrtimer forward before we do any
@@ -2322,5 +2323,7 @@ void drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_t
 	 * correct the timestamp by one frame.
 	 */
 	*vblank_time = ktime_sub(*vblank_time, vtimer->interval);
+
+	return true;
 }
 EXPORT_SYMBOL(drm_crtc_vblank_get_vblank_timeout);
diff --git a/drivers/gpu/drm/drm_vblank_helper.c b/drivers/gpu/drm/drm_vblank_helper.c
index d3f8147ecdc1..aa8df047b2aa 100644
--- a/drivers/gpu/drm/drm_vblank_helper.c
+++ b/drivers/gpu/drm/drm_vblank_helper.c
@@ -169,8 +169,6 @@ bool drm_crtc_vblank_helper_get_vblank_timestamp_from_timer(struct drm_crtc *crt
 							    ktime_t *vblank_time,
 							    bool in_vblank_irq)
 {
-	drm_crtc_vblank_get_vblank_timeout(crtc, vblank_time);
-
-	return true;
+	return drm_crtc_vblank_get_vblank_timeout(crtc, vblank_time);
 }
 EXPORT_SYMBOL(drm_crtc_vblank_helper_get_vblank_timestamp_from_timer);
diff --git a/include/drm/drm_vblank.h b/include/drm/drm_vblank.h
index 2fcef9c0f5b1..1c06e4499dae 100644
--- a/include/drm/drm_vblank.h
+++ b/include/drm/drm_vblank.h
@@ -319,7 +319,7 @@ void drm_crtc_set_max_vblank_count(struct drm_crtc *crtc,
 
 int drm_crtc_vblank_start_timer(struct drm_crtc *crtc);
 void drm_crtc_vblank_cancel_timer(struct drm_crtc *crtc);
-void drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_time);
+bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_time);
 
 /*
  * Helpers for struct drm_crtc_funcs
-- 
2.54.0


^ permalink raw reply related

* [PATCH 0/7] drm/vblank: timer: Fix timestamps and improve reliability
From: Thomas Zimmermann @ 2026-06-01 14:08 UTC (permalink / raw)
  To: simona, michel.daenzer, louis.chauvet, ville.syrjala, jani.nikula,
	mhklkml, maarten.lankhorst, mripard, airlied
  Cc: dri-devel, amd-gfx, virtualization, Thomas Zimmermann

It appears that DRM's vblank timers have always been somewhat buggy
in their timestamp calculations. Fix the generic implementation and
improve reliability.

Patch 1 returns success/failure to the caller, as expected by DRM.

Patch 2 fixes the timestamp calculation to return the time of the first
visible scanline in the current vblank phase.

Patch 3 switches the hrtimer to absolute values since boot, so that it's
easier to work with and we can predict future timestamps reliably.

Patches 4 and 5 allow for estimating timestamps even with the timer
being disabled.

Patches 6 and 7 improve the retrieval of the next vblank timeout's
timestamp.

Tested under heavy CPU load on bochs and virtio. No errors or warnings
showed up.

There have been reports about vblank timeouts being handled so late
that they trigger DRM's internal error checks. Maybe these fixes and
adjustments can help to further avoid hiccups from delayed vblank
timers.

Thomas Zimmermann (7):
  drm/vblank: timer: Return success status from get_vblank_timeout
  drm/vblank: timer: Fix timestamp calculation
  drm/vblank: timer: Use absolute timer since boot
  drm/vblank: timer: Reorganize get_vblank_timeout
  drm/vblank: timer: Estimate vblank timeout if timer is disabled
  drm/vblank: timer: Verify that expiry time is in the future
  drm/vblank: timer: Avoid reading the vblank time unnecessarily

 drivers/gpu/drm/drm_vblank.c        | 105 ++++++++++++++++++++--------
 drivers/gpu/drm/drm_vblank_helper.c |   4 +-
 include/drm/drm_vblank.h            |   2 +-
 3 files changed, 79 insertions(+), 32 deletions(-)


base-commit: 4f554688dffcacf48630c14f9fb77a9f60394c1c
-- 
2.54.0


^ permalink raw reply

* [PATCH 2/7] drm/vblank: timer: Fix timestamp calculation
From: Thomas Zimmermann @ 2026-06-01 14:08 UTC (permalink / raw)
  To: simona, michel.daenzer, louis.chauvet, ville.syrjala, jani.nikula,
	mhklkml, maarten.lankhorst, mripard, airlied
  Cc: dri-devel, amd-gfx, virtualization, Thomas Zimmermann
In-Reply-To: <20260601141922.91498-1-tzimmermann@suse.de>

In drm_crtc_vblank_get_vblank_timeout(), return the timestamp of the
first visible scanline after the last vblank timeout. This is what the
caller expects.

A vblank phase starts with a vblank timeout. At this point the display
is blanked for several scanlines. Afterwards the display is unblanked
until the next vblank timeout occurs. The display content is only visible
during that second part.

The current implementation of drm_crtc_vblank_get_vblank_timeout()
returns the timestamp of the last vblank timeout that started the current
vblank phase. But the display only unblanks after 20 to 30 percent of
the overall frame duration. The returned timestamp is therefore too early.

The next vblank timeout is already known when calculating the returned
timestamp. Instead of subtracting the duration of a full frame from the
value, only subtract the duration of the active, visible part. The result
is the timestamp of the first visible scanline, as expected by the caller.
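
As a rough illustration of the arithmetic described above, here is a
minimal userspace sketch. The helper names are hypothetical and plain
64-bit division stands in for the kernel's div_s64(); it mirrors the
calculation in the patch but is not the kernel code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Duration of the active (visible) part of a frame: the frame duration
 * scaled by the ratio of visible scanlines to total scanlines.
 * Hypothetical helper; the patch computes this inline with div_s64().
 */
static int64_t active_duration_ns(int64_t framedur_ns, int vdisplay, int vtotal)
{
	return framedur_ns * vdisplay / vtotal;
}

/*
 * Timestamp of the first visible scanline, derived from the time of the
 * next vblank timeout. Returns false if vtotal is not usable, much like
 * the patch bails out when crtc_vtotal is zero.
 */
static bool first_visible_scanline_ns(int64_t next_timeout_ns,
				      int64_t framedur_ns,
				      int vdisplay, int vtotal,
				      int64_t *result_ns)
{
	if (vtotal <= 0)
		return false;

	*result_ns = next_timeout_ns -
		     active_duration_ns(framedur_ns, vdisplay, vtotal);
	return true;
}
```

For an assumed mode with 1080 visible scanlines out of 1125 total and a
frame duration of 16666667 ns, the active part lasts 16000000 ns. With
the next timeout at 100000000 ns, the helper yields 84000000 ns, whereas
subtracting a full frame would yield 83333333 ns.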

This bug was not introduced by the generic vblank timer. It appears that
the get_vblank_timeout logic has always been buggy since it was first
added in commit 3a0709928b17 ("drm/vkms: Add vblank events simulated by
hrtimers").

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/drm_vblank.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index 96d70c3d4522..d52df247d04e 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -2293,10 +2293,19 @@ EXPORT_SYMBOL(drm_crtc_vblank_cancel_timer);
  */
 bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_time)
 {
+	struct drm_device *dev = crtc->dev;
 	struct drm_vblank_crtc *vblank = drm_crtc_vblank_crtc(crtc);
 	struct drm_vblank_crtc_timer *vtimer = &vblank->vblank_timer;
+	const struct drm_display_mode *mode;
 	u64 cur_count;
 	ktime_t cur_time;
+	s64 framedur_ns;
+	s64 activedur_ns;
+
+	if (drm_drv_uses_atomic_modeset(dev))
+		mode = &vblank->hwmode;
+	else
+		mode = &crtc->hwmode;
 
 	if (!READ_ONCE(vblank->enabled))
 		return false;
@@ -2312,17 +2321,26 @@ bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_t
 		*vblank_time = READ_ONCE(vtimer->timer.node.expires);
 	} while (cur_count != drm_crtc_vblank_count_and_time(crtc, &cur_time));
 
-	if (drm_WARN_ON(crtc->dev, !ktime_compare(*vblank_time, cur_time)))
+	if (drm_WARN_ON(dev, !ktime_compare(*vblank_time, cur_time)))
 		return false; /* Already expired */
 
+	framedur_ns = vblank->framedur_ns;
+
 	/*
-	 * To prevent races we roll the hrtimer forward before we do any
-	 * interrupt processing - this is how real hw works (the interrupt
-	 * is only generated after all the vblank registers are updated)
-	 * and what the vblank core expects. Therefore we need to always
-	 * correct the timestamp by one frame.
+	 * To prevent races we rolled the hrtimer forward before we did any
+	 * timeout processing - this is how real hw works (the interrupt is
+	 * only generated after all the vblank registers are updated) and what
+	 * the vblank core expects.
+	 *
+	 * Therefore we always need to correct the timestamp. The returned
+	 * time should be the time of the first active scanline after the
+	 * previous vblank. Hence subtract the active phase's duration from
+	 * the next expiration time.
 	 */
-	*vblank_time = ktime_sub(*vblank_time, vtimer->interval);
+	if (drm_WARN_ON(dev, !mode->crtc_vtotal))
+		return false;
+	activedur_ns = div_s64(framedur_ns * mode->crtc_vdisplay, mode->crtc_vtotal);
+	*vblank_time = ktime_sub_ns(*vblank_time, activedur_ns);
 
 	return true;
 }
-- 
2.54.0

