Linux virtualization list


* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Jani Nikula @ 2026-06-25 11:00 UTC
  To: Kaitao Cheng, David Laight, Christian König,
	David Hildenbrand (Arm), Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Daniel Borkmann,
	Andrii Nakryiko, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
	Paul E. McKenney, Shakeel Butt, David Howells, Simona Vetter,
	Randy Dunlap, Luca Ceresoli, Philipp Stanner, linux-block,
	linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel, io-uring,
	audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng, Muchun Song
In-Reply-To: <0ed6b5c3-e955-46e2-9fc6-075a0dfd1c4f@linux.dev>

On Thu, 25 Jun 2026, Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> On 2026/6/24 22:23, David Laight wrote:
>> On Wed, 24 Jun 2026 15:23:47 +0200
>> Christian König <christian.koenig@amd.com> wrote:
>>> On 6/24/26 15:14, Kaitao Cheng wrote:
>>>> On 2026/6/22 16:42, David Laight wrote:
>>>>> On Mon, 22 Jun 2026 12:05:31 +0800
>>>>> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>>>>  
>>>>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>>
>>>>>> The list_for_each*_safe() helpers are used when the loop body may
>>>>>> remove the current entry.  Their API exposes the temporary cursor at
>>>>>> every call site, even though most users only need it for the iterator
>>>>>> implementation and never reference it in the loop body.
>>>>>>
>>>>>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>>>>>> support both forms: callers may keep passing an explicit temporary cursor
>>>>>> when they need to inspect or reset it, or omit it and let the helper use
>>>>>> a unique internal cursor.  
>>>>>
>>>>> I'm not really sure 'mutable' means anything either.
>>>>> It is possible to make it valid for the loop body (or even other threads)
>>>>> to delete arbitrary list items - but that needs significant extra overheads.
>>>>>
>>>>> It might be worth doing something that doesn't need the extra variable,
>>>>> but there is little point doing all the churn just to rename things.
>>>>>  
>>>>>>
>>>>>> This makes call sites that only mutate the list through the current entry
>>>>>> less noisy, while keeping the existing *_safe() helpers available for
>>>>>> compatibility.
>>>>>>
>>>>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>> ---
>>>>>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>>>>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/list.h b/include/linux/list.h
>>>>>> index 09d979976b3b..1081def7cea9 100644
>>>>>> --- a/include/linux/list.h
>>>>>> +++ b/include/linux/list.h
>>>>>> @@ -7,6 +7,7 @@
>>>>>>  #include <linux/stddef.h>
>>>>>>  #include <linux/poison.h>
>>>>>>  #include <linux/const.h>
>>>>>> +#include <linux/args.h>
>>>>>>  
>>>>>>  #include <asm/barrier.h>
>>>>>>  
>>>>>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>>>>>  #define list_for_each_prev(pos, head) \
>>>>>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>>>>>  
>>>>>> -/**
>>>>>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>>>>>> - * @pos:	the &struct list_head to use as a loop cursor.
>>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>>> - * @head:	the head for your list.
>>>>>> +/*
>>>>>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>>>>>   */
>>>>>>  #define list_for_each_safe(pos, n, head) \
>>>>>>  	for (pos = (head)->next, n = pos->next; \
>>>>>>  	     !list_is_head(pos, (head)); \
>>>>>>  	     pos = n, n = pos->next)
>>>>>>  
>>>>>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>>>>>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\  
>>>>>
>>>>> Use auto
>>>>>  
>>>>>> +	     !list_is_head(pos, (head));				\
>>>>>> +	     pos = tmp, tmp = pos->next)
>>>>>> +
>>>>>> +#define __list_for_each_mutable1(pos, head)				\
>>>>>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>>>>>> +
>>>>>> +#define __list_for_each_mutable2(pos, next, head)			\
>>>>>> +	list_for_each_safe(pos, next, head)
>>>>>> +
>>>>>>  /**
>>>>>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>>>>>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>>>>>   * @pos:	the &struct list_head to use as a loop cursor.
>>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>>> - * @head:	the head for your list.
>>>>>> + * @...:	either (head) or (next, head)
>>>>>> + *
>>>>>> + * next:	another &struct list_head to use as optional temporary storage.
>>>>>> + *		The temporary cursor is internal unless explicitly supplied by
>>>>>> + *		the caller.
>>>>>> + * head:	the head for your list.
>>>>>> + */
>>>>>> +#define list_for_each_mutable(pos, ...)					\
>>>>>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>>>>>> +		(pos, __VA_ARGS__)  
>>>>>
>>>>> The variable argument count logic really just slows down compilation.
>>>>> Maybe there aren't enough copies of this code to make that significant.
>>>>> But just because you can do it doesn't mean it is a good idea.
>>>>> I'm also not sure it really adds anything to the readability.
>>>>>
>>>>> And, if you are going to make the middle argument optional there is
>>>>> no need to change the macro name.  
>>>>
>>>> Christian König and Jani Nikula also disagree with the variadic-argument
>>>> implementation approach. If we abandon that method, it means we will
>>>> inevitably need to add some new macros. If mutable is not a good name,
>>>> suggestions for better alternatives would be welcome; coming up with a
>>>> suitable name is indeed rather tricky.  
>>>
>>> I don't think you need to add a new macro for the specific use case
>>> where people want to modify the next element of the iteration.
>>>
>>> If I remember your numbers correctly, that is really a corner case, and
>>> continuing to use the existing *_safe() macros for it sounds perfectly
>>> fine to me.
>> 
>> IIRC currently you have a choice of either:
>> 	define               Item that can't be deleted
>> 	list_for_each()	     The current item.
>> 	list_for_each_safe() The next item.
>> There is also likely to be code that updates the variables to allow
>> for other scenarios.
>> 
>> Note that if you increase a reference count and release a lock then list_for_each()
>> is likely safer than list_for_each_safe() :-)
>> 
>> list.h has 9 variants of the 'safe' loop.
>> The bloat of another 9 is getting excessive.
>> 
>> It has to be said that this is one of my least favourite types of list...
>
> Hi Christian König, David Laight, Jani Nikula, David Hildenbrand,
> Andy Shevchenko, Alexei Starovoitov
>
> For ease of discussion, I need to summarize the currently possible
> approaches and briefly describe their respective pros and cons,
> using the list_for_each_entry* interfaces as examples.
>
> 1. Add list_for_each_entry_mutable, while keeping list_for_each_entry
> and list_for_each_entry_safe unchanged. list_for_each_entry_mutable
> would be used specifically for safe deletion scenarios that do not
> need to expose the temporary cursor externally. The code can refer to
> the v1 version.
>
> Pros: Does not depend on immediate per-subsystem adaptation and can be
>       merged directly.
> Cons: Requires adding a whole set of mutable interfaces, which makes the
>       code somewhat redundant.

Seems fine, and the original _safe naming is ambiguous anyway.

> 2. Directly optimize away the temporary cursor in list_for_each_entry_safe
> and define it inside the loop instead, changing the interface from four
> arguments to three.
>
> Pros: Does not add redundant interfaces.
> Cons: (1) Users need to manually update special cases that use the
>       traversal variable of list_for_each_entry_safe, the new
>       list_for_each_entry_safe would no longer apply there and would
>       need to be open-coded.
>       (2) Because the macro arguments change, all list_for_each_entry_safe
>       callers would need to be modified and merged together, making it
>       difficult to merge such a large amount of code at once.

This won't fly because there are literally thousands of
list_for_each_entry_safe() users.

> 3. Use a variadic macro approach to optimize list_for_each_entry_safe,
> so that it supports both three and four arguments.
>
> Pros: (1) Does not add redundant interfaces.
>       (2) Does not depend on immediate per-subsystem adaptation and can
>       be merged directly.
> Cons: (1) Increases compile time.
>       (2) Makes the interface harder for users to use.

Basically I'm against any variadic macro tricks where the optional
argument is not the last argument. That's just way too surprising, and
goes against common practice in just about all other languages.

> 4. Optimize list_for_each_entry by defining the temporary cursor internally,
> making it compatible with the functionality of list_for_each_entry_safe.
> The code can refer to the v2 version.
>
> Pros: (1) Does not add redundant interfaces.
>       (2) The number of externally visible arguments of list_for_each_entry
>       remains unchanged, still three.
> Cons: (1) list_for_each_entry and list_for_each_entry_safe would be merged
>       into one, and list_for_each_entry_safe would gradually be deprecated.
>       (2) Users need to manually update special cases that use the traversal
>       variable of list_for_each_entry, the new list_for_each_entry would no
>       longer apply there and would need to be open-coded. There are 15 such
>       cases in total.

This sounds good to me, though I take it there's some code size increase
and/or performance penalty?

Maybe the 15 cases are questionable anyway?

> 5. Use a variadic macro approach to optimize list_for_each_entry, so that
> it supports both three and four arguments.
>
> Pros: (1) Does not add redundant interfaces.
>       (2) Does not depend on immediate per-subsystem adaptation and can be
>       merged directly.
> Cons: (1) Increases compile time.
>       (2) list_for_each_entry and list_for_each_entry_safe would be merged
>       into one, and list_for_each_entry_safe would gradually be deprecated.

Please don't do the macro tricks.

> 6. Make no changes, keep the current logic unchanged, and close the current
> email discussion.

I like hiding the temporary stuff when possible.
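
For readers following the thread, the idea of hiding the temporary cursor can be sketched in user space as below. This is only an illustration under stated assumptions: `list_for_each_hidden`, `struct item`, and `demo_remove_even` are hypothetical names, the list helpers are minimal stand-ins for the kernel's, and a real macro would need a unique identifier for the internal cursor rather than the fixed name used here.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = NULL;
}

/*
 * Hypothetical macro: the lookahead cursor is declared inside the for
 * statement, so the caller never sees it and may unlink the current entry.
 */
#define list_for_each_hidden(pos, head)					\
	for (struct list_head *next_ = ((pos) = (head)->next)->next;	\
	     (pos) != (head);						\
	     (pos) = next_, next_ = (pos)->next)

struct item {
	int value;
	struct list_head node;
};

/* Unlink the even-valued items while iterating; return how many remain. */
int demo_remove_even(void)
{
	struct item items[6];
	struct list_head head, *pos;
	int i, remaining = 0;

	list_init(&head);
	for (i = 0; i < 6; i++) {
		items[i].value = i;
		list_add_tail(&items[i].node, &head);
	}

	list_for_each_hidden(pos, &head) {
		/* Recover the containing item from its embedded node. */
		struct item *it = (struct item *)((char *)pos -
						  offsetof(struct item, node));

		if (it->value % 2 == 0)
			list_del(pos);
	}

	list_for_each_hidden(pos, &head)
		remaining++;

	return remaining;
}
```

The call site takes only the cursor and the head, which is the property options 1 and 4 are after. Code that needs to inspect or reset the lookahead cannot use this form.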


BR,
Jani.

-- 
Jani Nikula, Intel


* Re: [PATCH net v3 1/2] iov_iter: export iov_iter_restore
From: Christian Brauner @ 2026-06-25 10:43 UTC
  To: Octavian Purdila
  Cc: netdev, Alexander Viro, Andrew Morton, Arseniy Krasnov,
	David S. Miller, Eric Dumazet, Eugenio Pérez, Jakub Kicinski,
	Jason Wang, kvm, linux-block, linux-fsdevel, linux-kernel,
	Michael S. Tsirkin, Paolo Abeni, Simon Horman, Stefan Hajnoczi,
	Stefano Garzarella, virtualization, Xuan Zhuo, Jens Axboe
In-Reply-To: <20260622222757.2130402-2-tavip@google.com>

> Export iov_iter_restore so that it can be used by modules.
> 
> This is needed by the virtio vsock transport (which can be built as a
> module) to restore the msg_iter state when transmission fails.
> 
> Acked-by: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Octavian Purdila <tavip@google.com>
>
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 273919b16161..f5df63961fb2 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1491,6 +1491,7 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
>  		i->__iov -= state->nr_segs - i->nr_segs;
>  	i->nr_segs = state->nr_segs;
>  }
> +EXPORT_SYMBOL_GPL(iov_iter_restore);

At least only export it for the module that really needs it. For
example, see:

EXPORT_SYMBOL_FOR_MODULES(__kernel_write, "autofs4");

-- 
Christian Brauner <brauner@kernel.org>


* Re: [PATCH] vdpa_sim: fix cleanup after worker creation failure
From: Eugenio Perez Martin @ 2026-06-25  8:54 UTC
  To: Xiong Weimin; +Cc: virtualization, jasowang, mst, xuanzhuo
In-Reply-To: <20260625083017.497842-1-15927021679@163.com>

On Thu, Jun 25, 2026 at 10:37 AM Xiong Weimin <15927021679@163.com> wrote:
>
> From: Xiong Weimin <xiongweimin@kylinos.cn>
>
> vdpasim_create() leaves vdpasim->worker as an ERR_PTR when
> kthread_run_worker() fails. The error path then drops the device
> reference, which releases the partially initialized simulator.
>
> vdpasim_free() unconditionally passes the worker pointer to
> kthread_destroy_worker(), so the ERR_PTR is dereferenced and can
> trigger a general protection fault.
>
> Store the worker error, clear the pointer, and make the release path
> only clean up resources that were successfully initialized before
> the failure.
>
> Tested on openEuler VM with kernel 6.16.8: module build, reload,
> and vdpa dev add via vdpasim_net.
>
> Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
> ---
>  drivers/vdpa/vdpa_sim/vdpa_sim.c | 24 ++++++++++++++++--------
>  1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index 1111111..2222222 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -231,8 +231,11 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr,
>         kthread_init_work(&vdpasim->work, vdpasim_work_fn);
>         vdpasim->worker = kthread_run_worker(0, "vDPA sim worker: %s",
>                                                 dev_attr->name);
> -       if (IS_ERR(vdpasim->worker))
> +       if (IS_ERR(vdpasim->worker)) {
> +               ret = PTR_ERR(vdpasim->worker);
> +               vdpasim->worker = NULL;
>                 goto err_iommu;
> +       }
>
>         mutex_init(&vdpasim->mutex);
>         spin_lock_init(&vdpasim->iommu_lock);
> @@ -750,18 +753,24 @@ static void vdpasim_free(struct vdpa_device *vdpa)
>         struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
>         int i;
>
> -       kthread_cancel_work_sync(&vdpasim->work);
> -       kthread_destroy_worker(vdpasim->worker);
> +       if (vdpasim->worker) {
> +               kthread_cancel_work_sync(&vdpasim->work);
> +               kthread_destroy_worker(vdpasim->worker);
> +       }
>
> -       for (i = 0; i < vdpasim->dev_attr.nvqs; i++) {
> -               vringh_kiov_cleanup(&vdpasim->vqs[i].out_iov);
> -               vringh_kiov_cleanup(&vdpasim->vqs[i].in_iov);
> +       if (vdpasim->vqs) {
> +               for (i = 0; i < vdpasim->dev_attr.nvqs; i++) {
> +                       vringh_kiov_cleanup(&vdpasim->vqs[i].out_iov);
> +                       vringh_kiov_cleanup(&vdpasim->vqs[i].in_iov);
> +               }
>         }
>
>         vdpasim->dev_attr.free(vdpasim);
>
> -       for (i = 0; i < vdpasim->dev_attr.nas; i++)
> -               vhost_iotlb_reset(&vdpasim->iommu[i]);
> +       if (vdpasim->iommu && vdpasim->iommu_pt) {
> +               for (i = 0; i < vdpasim->dev_attr.nas; i++)
> +                       vhost_iotlb_reset(&vdpasim->iommu[i]);
> +       }
>         kfree(vdpasim->iommu);
>         kfree(vdpasim->iommu_pt);
>         kfree(vdpasim->vqs);

Isn't this the exact same fix as v1 of
https://lore.kernel.org/lkml/20260620100959.2070316-1-slf@hdu.edu.cn/
?
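
For context, the failure mode described in the quoted commit message can be sketched in user space. This is a minimal sketch, not the driver code: `err_ptr`, `ptr_err`, and `is_err` are stand-ins that mimic the kernel's ERR_PTR helpers, and `struct sim`, `sim_create`, `sim_free_touches_worker`, and `demo_failed_create` are hypothetical names. It shows why an error-encoded pointer passes a plain NULL check unless the creation path clears it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space stand-ins for the kernel's ERR_PTR helpers. */
static void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static bool is_err(const void *ptr)
{
	/* Error codes occupy the top MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct sim {
	void *worker;	/* NULL means "never created" */
};

/* Hypothetical creation step: record the error, then clear the pointer. */
long sim_create(struct sim *s, void *worker_or_err)
{
	s->worker = worker_or_err;
	if (is_err(s->worker)) {
		long ret = ptr_err(s->worker);

		s->worker = NULL;
		return ret;
	}
	return 0;
}

/* Release path: reports whether it would touch the worker. */
bool sim_free_touches_worker(const struct sim *s)
{
	return s->worker != NULL;
}

/* Simulate a failed worker creation and return the recorded error. */
long demo_failed_create(void)
{
	struct sim s;
	long ret = sim_create(&s, err_ptr(-ENOMEM));

	/* The pointer was cleared, so the release path skips the worker. */
	assert(!sim_free_touches_worker(&s));
	return ret;
}
```

Without the `s->worker = NULL` assignment, `sim_free_touches_worker()` would return true for the error value, which corresponds to the dereference the patch description mentions.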



* [PATCH] vdpa_sim: fix cleanup after worker creation failure
From: Xiong Weimin @ 2026-06-25  8:30 UTC
  To: virtualization; +Cc: jasowang, mst, xuanzhuo, eperezma

From: Xiong Weimin <xiongweimin@kylinos.cn>

vdpasim_create() leaves vdpasim->worker as an ERR_PTR when
kthread_run_worker() fails. The error path then drops the device
reference, which releases the partially initialized simulator.

vdpasim_free() unconditionally passes the worker pointer to
kthread_destroy_worker(), so the ERR_PTR is dereferenced and can
trigger a general protection fault.

Store the worker error, clear the pointer, and make the release path
only clean up resources that were successfully initialized before
the failure.

Tested on openEuler VM with kernel 6.16.8: module build, reload,
and vdpa dev add via vdpasim_net.

Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 1111111..2222222 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -231,8 +231,11 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr,
 	kthread_init_work(&vdpasim->work, vdpasim_work_fn);
 	vdpasim->worker = kthread_run_worker(0, "vDPA sim worker: %s",
 						dev_attr->name);
-	if (IS_ERR(vdpasim->worker))
+	if (IS_ERR(vdpasim->worker)) {
+		ret = PTR_ERR(vdpasim->worker);
+		vdpasim->worker = NULL;
 		goto err_iommu;
+	}
 
 	mutex_init(&vdpasim->mutex);
 	spin_lock_init(&vdpasim->iommu_lock);
@@ -750,18 +753,24 @@ static void vdpasim_free(struct vdpa_device *vdpa)
 	struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
 	int i;
 
-	kthread_cancel_work_sync(&vdpasim->work);
-	kthread_destroy_worker(vdpasim->worker);
+	if (vdpasim->worker) {
+		kthread_cancel_work_sync(&vdpasim->work);
+		kthread_destroy_worker(vdpasim->worker);
+	}
 
-	for (i = 0; i < vdpasim->dev_attr.nvqs; i++) {
-		vringh_kiov_cleanup(&vdpasim->vqs[i].out_iov);
-		vringh_kiov_cleanup(&vdpasim->vqs[i].in_iov);
+	if (vdpasim->vqs) {
+		for (i = 0; i < vdpasim->dev_attr.nvqs; i++) {
+			vringh_kiov_cleanup(&vdpasim->vqs[i].out_iov);
+			vringh_kiov_cleanup(&vdpasim->vqs[i].in_iov);
+		}
 	}
 
 	vdpasim->dev_attr.free(vdpasim);
 
-	for (i = 0; i < vdpasim->dev_attr.nas; i++)
-		vhost_iotlb_reset(&vdpasim->iommu[i]);
+	if (vdpasim->iommu && vdpasim->iommu_pt) {
+		for (i = 0; i < vdpasim->dev_attr.nas; i++)
+			vhost_iotlb_reset(&vdpasim->iommu[i]);
+	}
 	kfree(vdpasim->iommu);
 	kfree(vdpasim->iommu_pt);
 	kfree(vdpasim->vqs);
-- 
2.39.0



* RE: [RFCv2 PATCH 2/6] efi/unaccepted: Set unaccepted bits for all hotplug memory
From: Duan, Zhenzhong @ 2026-06-25  6:38 UTC
  To: Kiryl Shutsemau
  Cc: marcandre.lureau@redhat.com, david@kernel.org, Edgecombe, Rick P,
	prsampat@amd.com, pbonzini@redhat.com, mst@redhat.com,
	peterx@redhat.com, Qiang, Chenyi, Reshetova, Elena,
	michael.roth@amd.com, ackerleytng@google.com,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	virtualization@lists.linux.dev, x86@kernel.org, Xu, Yilun,
	Li, Xiaoyao, Peng, Chao P
In-Reply-To: <ajvNXyYwb7FXAJhP@thinkstation>



>-----Original Message-----
>From: Kiryl Shutsemau <kas@kernel.org>
>Subject: Re: [RFCv2 PATCH 2/6] efi/unaccepted: Set unaccepted bits for all hotplug memory
>
>On Tue, Jun 23, 2026 at 06:17:33AM -0400, Zhenzhong Duan wrote:
>> In coco guests, hotpluggable memory ranges are initially unaccepted.
>> While a previous change expanded the unaccepted memory bitmap boundaries
>> to include these hotplug spaces, the actual bits inside the bitmap are
>> not yet marked as unaccepted.
>>
>> Walk SRAT a second time after the bitmap is allocated and set the bits
>> corresponding to hotpluggable ranges.
>>
>> This ensures the bitmap state accurately reflects all static and hotplug
>> memory ranges before booting the kernel.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  .../firmware/efi/libstub/unaccepted_memory.c   | 18 ++++++++++++++++++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
>> index bfbb78bd7b8a..01bed8e751ca 100644
>> --- a/drivers/firmware/efi/libstub/unaccepted_memory.c
>> +++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
>> @@ -92,6 +92,23 @@ static void update_mem_boundaries(struct acpi_srat_mem_affinity *mem, struct sra
>>  		*(ctx->mem_end) = range_end;
>>  }
>>
>> +static void mark_hotplug_memory_unaccepted(struct acpi_srat_mem_affinity *mem,
>> +					   struct srat_parse_ctx *ctx)
>> +{
>> +	u64 unit_size = unaccepted_table->unit_size;
>> +	u64 start, end;
>> +
>> +	start = round_up(mem->base_address, unit_size);
>> +	end = round_down(mem->base_address + mem->length, unit_size);
>
>We can get here with start > end if the SRAT range is smaller than unit_size.

I will add a check to ignore ranges smaller than unit_size:

+       if (start >= end)
+               return;
+
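
The arithmetic behind that check can be worked through in user space. This is a minimal sketch, assuming a power-of-two unit size; `round_up_pow2`, `round_down_pow2`, and `usable_unit_range` are hypothetical helper names, not the kernel's macros, and the 2 MiB unit in the usage note is an example value only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Power-of-two rounding; unit must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t unit)
{
	return (x + unit - 1) & ~(unit - 1);
}

static uint64_t round_down_pow2(uint64_t x, uint64_t unit)
{
	return x & ~(unit - 1);
}

/*
 * Shrink [base, base + length) to whole units.  Returns false when no
 * complete unit fits, which is the case the added check skips.
 */
bool usable_unit_range(uint64_t base, uint64_t length, uint64_t unit,
		       uint64_t *start, uint64_t *end)
{
	*start = round_up_pow2(base, unit);
	*end = round_down_pow2(base + length, unit);
	return *start < *end;
}
```

With a 2 MiB unit, a 4 KiB range at 0x100800 rounds to start = 0x200000 and end = 0, so start is greater than end and the range is skipped. A 4 MiB range at 0x100000 shrinks to [0x200000, 0x400000).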

Thanks
Zhenzhong


* RE: [RFCv2 PATCH 5/6] mm/memory_hotplug: Support ACPI hotplug/unplug for coco guest
From: Duan, Zhenzhong @ 2026-06-25  5:56 UTC
  To: Kiryl Shutsemau
  Cc: marcandre.lureau@redhat.com, david@kernel.org, Edgecombe, Rick P,
	prsampat@amd.com, pbonzini@redhat.com, mst@redhat.com,
	peterx@redhat.com, Qiang, Chenyi, Reshetova, Elena,
	michael.roth@amd.com, ackerleytng@google.com,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	virtualization@lists.linux.dev, x86@kernel.org, Xu, Yilun,
	Li, Xiaoyao, Peng, Chao P
In-Reply-To: <ajvOP9NwqNstgNvl@thinkstation>



>-----Original Message-----
>From: Kiryl Shutsemau <kas@kernel.org>
>Subject: Re: [RFCv2 PATCH 5/6] mm/memory_hotplug: Support ACPI hotplug/unplug for coco guest
>
>On Tue, Jun 23, 2026 at 06:17:36AM -0400, Zhenzhong Duan wrote:
>> +	spin_lock_irqsave(&unaccepted_memory_lock, flags);
>> +	for (; range_start < bitmap_size; range_start = range_end) {
>> +		unsigned long phys_start, phys_end;
>> +		unsigned long unaccepted_one, plugged_zero;
>> +
>> +		range_start = find_next_andnot_bit(plugged_bitmap, unaccepted->bitmap,
>> +						   bitmap_size, range_start);
>> +
>> +		if (range_start >= bitmap_size)
>> +			break;
>> +
>> +		unaccepted_one = find_next_bit(unaccepted->bitmap, bitmap_size, range_start);
>> +		plugged_zero = find_next_zero_bit(plugged_bitmap, bitmap_size, range_start);
>> +		range_end = min(unaccepted_one, plugged_zero);
>> +
>> +		phys_start = range_start * unit_size + unaccepted->phys_base;
>> +		phys_end = range_end * unit_size + unaccepted->phys_base;
>> +
>> +		arch_unaccept_memory(phys_start, phys_end);
>> +		bitmap_set(unaccepted->bitmap, range_start, range_end - range_start);
>> +	}
>> +	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
>
>An accept TDCALL under the spin lock will kill scalability.

OK, I can drop the lock during arch_unaccept_memory() and avoid the race
by checking the accepting_list, just as accept_memory() does.

I initially wrapped this in the spinlock because TDG.MEM.PAGE.RELEASE
is a quick local TDX module call to transition pages back to PENDING state,
without the heavy VMM trapping/faulting overhead associated with
memory acceptance paths.
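
Independent of the locking question, the range walk in the quoted loop can be sketched in user space. This is an illustration under simplifying assumptions: both bitmaps fit in one 64-bit word, `NBITS` is an arbitrary size, and `next_set` and `mark_plugged_unaccepted` are hypothetical names standing in for the kernel's bitmap helpers. It finds each run of units that are plugged but not yet marked unaccepted, then marks the run.

```c
#include <assert.h>
#include <stdint.h>

#define NBITS 8u	/* illustrative bitmap size */

/* First bit >= start that is set in word, or NBITS if none. */
static unsigned int next_set(uint64_t word, unsigned int start)
{
	for (unsigned int i = start; i < NBITS; i++)
		if (word & (UINT64_C(1) << i))
			return i;
	return NBITS;
}

/*
 * Walk runs of bits set in plugged but clear in *unaccepted, mark
 * them in *unaccepted, and return the number of runs found.
 */
unsigned int mark_plugged_unaccepted(uint64_t plugged, uint64_t *unaccepted)
{
	unsigned int start = 0, end = 0, runs = 0;

	for (; start < NBITS; start = end) {
		unsigned int unaccepted_one, plugged_zero;

		/* plugged & ~unaccepted: plugged units not yet marked */
		start = next_set(plugged & ~*unaccepted, start);
		if (start >= NBITS)
			break;

		/* The run ends at the next marked unit or unplugged unit. */
		unaccepted_one = next_set(*unaccepted, start);
		plugged_zero = next_set(~plugged, start);
		end = unaccepted_one < plugged_zero ? unaccepted_one : plugged_zero;

		/* The quoted code calls arch_unaccept_memory() for this run. */
		for (unsigned int i = start; i < end; i++)
			*unaccepted |= UINT64_C(1) << i;
		runs++;
	}
	return runs;
}
```

With units 2 to 5 plugged and unit 4 already marked, the walk yields two runs, [2, 4) and [5, 6), and leaves units 2 to 5 all marked.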

Thanks
Zhenzhong


* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Kaitao Cheng @ 2026-06-25  3:01 UTC
  To: David Laight, Christian König, Jani Nikula,
	David Hildenbrand (Arm), Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Daniel Borkmann,
	Andrii Nakryiko, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
	Paul E. McKenney, Shakeel Butt, David Howells, Simona Vetter,
	Randy Dunlap, Luca Ceresoli, Philipp Stanner, linux-block,
	linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel, io-uring,
	audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng, Muchun Song
In-Reply-To: <20260624152324.3def88ce@pumpkin>

On 2026/6/24 22:23, David Laight wrote:
> On Wed, 24 Jun 2026 15:23:47 +0200
> Christian König <christian.koenig@amd.com> wrote:
>> On 6/24/26 15:14, Kaitao Cheng wrote:
>>> On 2026/6/22 16:42, David Laight wrote:
>>>> On Mon, 22 Jun 2026 12:05:31 +0800
>>>> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>>>  
>>>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>
>>>>> The list_for_each*_safe() helpers are used when the loop body may
>>>>> remove the current entry.  Their API exposes the temporary cursor at
>>>>> every call site, even though most users only need it for the iterator
>>>>> implementation and never reference it in the loop body.
>>>>>
>>>>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>>>>> support both forms: callers may keep passing an explicit temporary cursor
>>>>> when they need to inspect or reset it, or omit it and let the helper use
>>>>> a unique internal cursor.  
>>>>
>>>> I'm not really sure 'mutable' means anything either.
>>>> It is possible to make it valid for the loop body (or even other threads)
>>>> to delete arbitrary list items - but that needs significant extra overheads.
>>>>
>>>> It might be worth doing something that doesn't need the extra variable,
>>>> but there is little point doing all the churn just to rename things.
>>>>  
>>>>>
>>>>> This makes call sites that only mutate the list through the current entry
>>>>> less noisy, while keeping the existing *_safe() helpers available for
>>>>> compatibility.
>>>>>
>>>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>> ---
>>>>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>>>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/list.h b/include/linux/list.h
>>>>> index 09d979976b3b..1081def7cea9 100644
>>>>> --- a/include/linux/list.h
>>>>> +++ b/include/linux/list.h
>>>>> @@ -7,6 +7,7 @@
>>>>>  #include <linux/stddef.h>
>>>>>  #include <linux/poison.h>
>>>>>  #include <linux/const.h>
>>>>> +#include <linux/args.h>
>>>>>  
>>>>>  #include <asm/barrier.h>
>>>>>  
>>>>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>>>>  #define list_for_each_prev(pos, head) \
>>>>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>>>>  
>>>>> -/**
>>>>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>>>>> - * @pos:	the &struct list_head to use as a loop cursor.
>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>> - * @head:	the head for your list.
>>>>> +/*
>>>>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>>>>   */
>>>>>  #define list_for_each_safe(pos, n, head) \
>>>>>  	for (pos = (head)->next, n = pos->next; \
>>>>>  	     !list_is_head(pos, (head)); \
>>>>>  	     pos = n, n = pos->next)
>>>>>  
>>>>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>>>>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\  
>>>>
>>>> Use auto
>>>>  
>>>>> +	     !list_is_head(pos, (head));				\
>>>>> +	     pos = tmp, tmp = pos->next)
>>>>> +
>>>>> +#define __list_for_each_mutable1(pos, head)				\
>>>>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>>>>> +
>>>>> +#define __list_for_each_mutable2(pos, next, head)			\
>>>>> +	list_for_each_safe(pos, next, head)
>>>>> +
>>>>>  /**
>>>>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>>>>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>>>>   * @pos:	the &struct list_head to use as a loop cursor.
>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>> - * @head:	the head for your list.
>>>>> + * @...:	either (head) or (next, head)
>>>>> + *
>>>>> + * next:	another &struct list_head to use as optional temporary storage.
>>>>> + *		The temporary cursor is internal unless explicitly supplied by
>>>>> + *		the caller.
>>>>> + * head:	the head for your list.
>>>>> + */
>>>>> +#define list_for_each_mutable(pos, ...)					\
>>>>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>>>>> +		(pos, __VA_ARGS__)  
>>>>
>>>> The variable argument count logic really just slows down compilation.
>>>> Maybe there aren't enough copies of this code to make that significant.
>>>> But just because you can do it doesn't mean it is a good idea.
>>>> I'm also not sure it really adds anything to the readability.
>>>>
>>>> And, if you are going to make the middle argument optional there is
>>>> no need to change the macro name.  
>>>
>>> Christian König and Jani Nikula also disagree with the variadic-argument
>>> implementation approach. If we abandon that method, it means we will
>>> inevitably need to add some new macros. If mutable is not a good name,
>>> suggestions for better alternatives would be welcome; coming up with a
>>> suitable name is indeed rather tricky.  
>>
>> I don't think you need to add a new macro for the specific use case
>> where people want to modify the next element of the iteration.
>>
>> If I remember your numbers correctly, that is really a corner case, and
>> continuing to use the existing *_safe() macros for it sounds perfectly
>> fine to me.
> 
> IIRC currently you have a choice of either:
> 	define               Item that can't be deleted
> 	list_for_each()	     The current item.
> 	list_for_each_safe() The next item.
> There is also likely to be code that updates the variables to allow
> for other scenarios.
> 
> Note that if you increase a reference count and release a lock then list_for_each()
> is likely safer than list_for_each_safe() :-)
> 
> list.h has 9 variants of the 'safe' loop.
> The bloat of another 9 is getting excessive.
> 
> It has to be said that this is one of my least favourite types of list...

Hi Christian König, David Laight, Jani Nikula, David Hildenbrand,
Andy Shevchenko, Alexei Starovoitov

For ease of discussion, I need to summarize the currently possible
approaches and briefly describe their respective pros and cons,
using the list_for_each_entry* interfaces as examples.

1. Add list_for_each_entry_mutable, while keeping list_for_each_entry
and list_for_each_entry_safe unchanged. list_for_each_entry_mutable
would be used specifically for safe deletion scenarios that do not
need to expose the temporary cursor externally. The code can refer to
the v1 version.

Pros: Does not depend on immediate per-subsystem adaptation and can be
      merged directly.
Cons: Requires adding a whole set of mutable interfaces, which makes the
      code somewhat redundant.

2. Directly optimize away the temporary cursor in list_for_each_entry_safe
and define it inside the loop instead, changing the interface from four
arguments to three.

Pros: Does not add redundant interfaces.
Cons: (1) Users need to manually update special cases that use the
      traversal variable of list_for_each_entry_safe, the new
      list_for_each_entry_safe would no longer apply there and would
      need to be open-coded.
      (2) Because the macro arguments change, all list_for_each_entry_safe
      callers would need to be modified and merged together, making it
      difficult to merge such a large amount of code at once.

3. Use a variadic macro approach to optimize list_for_each_entry_safe,
so that it supports both three and four arguments.

Pros: (1) Does not add redundant interfaces.
      (2) Does not depend on immediate per-subsystem adaptation and can
      be merged directly.
Cons: (1) Increases compile time.
      (2) Makes the interface harder for users to use.

4. Optimize list_for_each_entry by defining the temporary cursor internally,
making it compatible with the functionality of list_for_each_entry_safe.
The code can refer to the v2 version.

Pros: (1) Does not add redundant interfaces.
      (2) The number of externally visible arguments of list_for_each_entry
      remains unchanged, still three.
Cons: (1) list_for_each_entry and list_for_each_entry_safe would be merged
      into one, and list_for_each_entry_safe would gradually be deprecated.
      (2) Users need to manually update the special cases that use the
      traversal variable of list_for_each_entry. The new
      list_for_each_entry would no longer apply there, so those cases
      would need to be open-coded. There are 15 such cases in total.

5. Use a variadic macro approach to optimize list_for_each_entry, so that
it supports both three and four arguments.

Pros: (1) Does not add redundant interfaces.
      (2) Does not depend on immediate per-subsystem adaptation and can be
      merged directly.
Cons: (1) Increases compile time.
      (2) list_for_each_entry and list_for_each_entry_safe would be merged
      into one, and list_for_each_entry_safe would gradually be deprecated.

6. Make no changes: keep the current code as it is and close this
email discussion.
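
As a rough illustration of what options 2 and 4 mean by defining the
temporary cursor inside the loop, here is a minimal userspace sketch.
It is not the v1 or v2 patch code: the macro name
list_for_each_mutable_sketch, struct item and the simplified list
helpers are all made up for this example, and it iterates over raw
list_head pointers so that plain C99 is enough (no typeof).

```c
#include <stddef.h>
#include <stdlib.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = NULL;
}

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Hypothetical macro: the lookahead cursor next_ is declared in the
 * for-init, so the caller passes only pos and head.
 */
#define list_for_each_mutable_sketch(pos, head)				\
	for (struct list_head *next_ = ((pos) = (head)->next)->next;	\
	     (pos) != (head);						\
	     (pos) = next_, next_ = (pos)->next)

struct item { int value; struct list_head node; };

/* Remove and free every item with an odd value; return how many went. */
static int remove_odd(struct list_head *head)
{
	struct list_head *pos;
	int removed = 0;

	list_for_each_mutable_sketch(pos, head) {
		struct item *it = container_of(pos, struct item, node);

		if (it->value % 2) {
			list_del(pos);	/* safe: next_ was saved earlier */
			free(it);
			removed++;
		}
	}
	return removed;
}

/* Build 1..5, drop the odd values, then count and free what is left. */
static int demo_remove_odd(void)
{
	struct list_head head, *pos;
	int removed, remaining = 0;

	list_init(&head);
	for (int i = 1; i <= 5; i++) {
		struct item *it = malloc(sizeof(*it));

		if (!it)
			return -1;
		it->value = i;
		list_add_tail(&it->node, &head);
	}

	removed = remove_odd(&head);
	list_for_each_mutable_sketch(pos, &head) {
		struct item *it = container_of(pos, struct item, node);

		remaining++;
		list_del(pos);
		free(it);
	}
	return removed * 10 + remaining;
}
```

The loop body never sees next_, which is the point of these options.
The flip side is the con listed above: a caller that wants to inspect
or reset the cursor cannot do so with this form.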


Which of the six solutions above do people prefer?
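
The variadic approach of options 3 and 5 could look roughly like the
userspace sketch below. It is only an assumption about how such a
macro might be built, not code from this series: node_for_each_safe,
NARGS, CONCAT and the singly linked struct node are hypothetical
names. Because this toy list has no member argument, the two forms
take two and three arguments rather than three and four.

```c
#include <stddef.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

/* Count 1..3 arguments by shifting a descending list of numbers. */
#define NARGS_(_1, _2, _3, N, ...) N
#define NARGS(...) NARGS_(__VA_ARGS__, 3, 2, 1, 0)
#define CONCAT_(a, b) a##b
#define CONCAT(a, b) CONCAT_(a, b)

/* Three-argument form: the caller supplies the temporary cursor. */
#define node_for_each_safe_3(pos, tmp, head)				\
	for ((pos) = (head), (tmp) = (pos) ? (pos)->next : NULL;	\
	     (pos);							\
	     (pos) = (tmp), (tmp) = (pos) ? (pos)->next : NULL)

/* Two-argument form: the cursor tmp_ is hidden inside the loop. */
#define node_for_each_safe_2(pos, head)					\
	for (struct node *tmp_ = ((pos) = (head)) ? (pos)->next : NULL;	\
	     (pos);							\
	     (pos) = tmp_, tmp_ = (pos) ? (pos)->next : NULL)

/* Dispatch on the number of arguments. */
#define node_for_each_safe(...) \
	CONCAT(node_for_each_safe_, NARGS(__VA_ARGS__))(__VA_ARGS__)

/* Build a list of n nodes (fewer if allocation fails). */
static struct node *build(int n)
{
	struct node *head = NULL;

	for (int i = 0; i < n; i++) {
		struct node *nd = malloc(sizeof(*nd));

		if (!nd)
			break;
		nd->value = i;
		nd->next = head;
		head = nd;
	}
	return head;
}

/* Free the whole list using the hidden-cursor form. */
static int free_all_hidden(struct node *head)
{
	struct node *pos;
	int freed = 0;

	node_for_each_safe(pos, head) {
		free(pos);
		freed++;
	}
	return freed;
}

/* Free the whole list using the explicit-cursor form. */
static int free_all_explicit(struct node *head)
{
	struct node *pos, *tmp;
	int freed = 0;

	node_for_each_safe(pos, tmp, head) {
		free(pos);
		freed++;
	}
	return freed;
}
```

Both call sites use the same macro name, so existing callers would
keep compiling while new ones drop the cursor. The cost is the extra
preprocessor indirection on every use.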

-- 
Thanks
Kaitao Cheng


^ permalink raw reply

* [PATCH] drm/virtio: fail init on display-info timeout
From: Pengpeng Hou @ 2026-06-25  3:02 UTC (permalink / raw)
  To: David Airlie, Gerd Hoffmann, Dmitry Osipenko
  Cc: Gurchetan Singh, Chia-I Wu, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Simona Vetter, dri-devel, virtualization,
	linux-kernel, pengpeng

virtio_gpu_init() sends GET_DISPLAY_INFO when scanouts are present and
waits for display_info_pending to clear. If the response never arrives,
the wait result is ignored and probe still succeeds.

Return -ETIMEDOUT on display-info timeout. Because this happens after
virtio_device_ready(), reset the device and tear down modesetting before
using the existing vbuf and virtqueue cleanup path.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
 drivers/gpu/drm/virtio/virtgpu_kms.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_kms.c b/drivers/gpu/drm/virtio/virtgpu_kms.c
index cfde9f573df6..31209bea97ae 100644
--- a/drivers/gpu/drm/virtio/virtgpu_kms.c
+++ b/drivers/gpu/drm/virtio/virtgpu_kms.c
@@ -262,11 +262,19 @@ int virtio_gpu_init(struct virtio_device *vdev, struct drm_device *dev)
 			virtio_gpu_cmd_get_edids(vgdev);
 		virtio_gpu_cmd_get_display_info(vgdev);
 		virtio_gpu_notify(vgdev);
-		wait_event_timeout(vgdev->resp_wq, !vgdev->display_info_pending,
-				   5 * HZ);
+		if (!wait_event_timeout(vgdev->resp_wq,
+					!vgdev->display_info_pending,
+					5 * HZ)) {
+			DRM_ERROR("timed out waiting for display info\n");
+			ret = -ETIMEDOUT;
+			goto err_ready;
+		}
 	}
 	return 0;
 
+err_ready:
+	virtio_reset_device(vgdev->vdev);
+	virtio_gpu_modeset_fini(vgdev);
 err_scanouts:
 	virtio_gpu_free_vbufs(vgdev);
 err_vbufs:
-- 
2.50.1 (Apple Git-155)
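
The error path added above can be modelled in userspace as follows.
This is a sketch under stated assumptions, not driver code: wait_stub()
stands in for wait_event_timeout(), which returns 0 when the timeout
elapses with the condition still false, and init_sketch(), log_step()
and the -EIO code for the earlier failure are hypothetical. It shows
why err_ready sits above err_scanouts: a later failure has more to
undo and then falls through into the older cleanup.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Records which teardown steps ran, in order, for later inspection. */
static char teardown_log[64];

static void log_step(const char *step)
{
	strncat(teardown_log, step,
		sizeof(teardown_log) - strlen(teardown_log) - 1);
}

/*
 * Stand-in for wait_event_timeout(): 0 means the timeout elapsed with
 * the condition still false, non-zero means the condition became true.
 */
static long wait_stub(bool response_arrives)
{
	return response_arrives ? 1 : 0;
}

static int init_sketch(bool modeset_ok, bool response_arrives)
{
	int ret;

	teardown_log[0] = '\0';

	/* ... vbufs are allocated before this point ... */
	if (!modeset_ok) {
		ret = -EIO;	/* hypothetical error code for this sketch */
		goto err_scanouts;
	}

	/* ... device is marked ready, display info is requested ... */
	if (!wait_stub(response_arrives)) {
		ret = -ETIMEDOUT;
		goto err_ready;
	}
	return 0;

err_ready:
	/* Only needed once the device is live and modesetting is set up. */
	log_step("reset,");
	log_step("modeset_fini,");
err_scanouts:
	log_step("free_vbufs");
	return ret;
}
```

A timeout therefore runs all three steps, while the earlier failure
runs only the last one.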


^ permalink raw reply related

* [PATCH v4] virtio_net: disable cb when NAPI is busy-polled
From: Longjun Tang @ 2026-06-25  1:37 UTC (permalink / raw)
  To: mst, xuanzhuo
  Cc: jasowang, edumazet, virtualization, netdev, tanglongjun,
	lange_tang

From: Longjun Tang <tanglongjun@kylinos.cn>

When busy-poll is active, napi_schedule_prep() returns false in
virtqueue_napi_schedule(), so virtqueue_disable_cb() is skipped.
The device may keep firing IRQs until the poll reaches
virtqueue_napi_complete(). Under load (received == budget), this leads
to a large number of spurious interrupts.

Fix it by disabling the callback at the virtnet_poll() entry.
This keeps the callback off while we poll and it is re-enabled by
virtqueue_napi_complete() when going idle.

Fixes: ceef438d613f ("virtio_net: remove custom busy_poll")
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>

---
V1 -> V2: Remain agnostic to busy polling
V2 -> V3: Add fixes tag
V3 -> V4: Update commit message and remove some comments
---
 drivers/net/virtio_net.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f4adcfee7a80..569e4db187d1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3008,6 +3008,8 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
 	unsigned int xdp_xmit = 0;
 	bool napi_complete;

+	virtqueue_disable_cb(rq->vq);
+
 	virtnet_poll_cleantx(rq, budget);

 	received = virtnet_receive(rq, budget, &xdp_xmit);
-- 
2.43.0
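
The mechanism described in the commit message can be pictured with the
toy model below. It is a deliberately simplified userspace sketch, not
virtio_net code, and every name in it (vq_model, napi_model, simulate)
is hypothetical. It assumes that busy-poll already owns NAPI, that one
batch of traffic arrives per poll round, and that the poll never
completes because received == budget. The counts it returns are
properties of the model only, not measurements.

```c
#include <stdbool.h>

struct vq_model {
	bool cb_enabled;	/* device may raise an interrupt */
	unsigned int irqs;	/* interrupts raised so far */
};

struct napi_model {
	bool scheduled;		/* NAPI is already owned by someone */
};

/* Interrupt path: the callback is disabled only if scheduling succeeds. */
static void irq_path(struct vq_model *vq, struct napi_model *napi)
{
	vq->irqs++;
	if (!napi->scheduled) {
		/* napi_schedule_prep() would succeed here */
		napi->scheduled = true;
		vq->cb_enabled = false;
	}
	/* else: busy-poll owns NAPI, so the callback stays enabled */
}

/* The device raises an interrupt only while callbacks are enabled. */
static void device_delivers(struct vq_model *vq, struct napi_model *napi)
{
	if (vq->cb_enabled)
		irq_path(vq, napi);
}

/* Return the number of interrupts seen over the given poll rounds. */
static unsigned int simulate(bool disable_at_poll_entry, unsigned int rounds)
{
	struct vq_model vq = { .cb_enabled = true, .irqs = 0 };
	struct napi_model napi = { .scheduled = true };

	for (unsigned int i = 0; i < rounds; i++) {
		if (disable_at_poll_entry)
			vq.cb_enabled = false;	/* what the patch adds */
		device_delivers(&vq, &napi);
		/* received == budget: no completion, no re-enable */
	}
	return vq.irqs;
}
```

In this model every round without the entry-time disable costs one
interrupt that nobody needed, and with it none are raised.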

^ permalink raw reply related

* Re:Re: [PATCH v3] virtio_net: disable cb when NAPI is busy-polled
From: Lange Tang @ 2026-06-25  1:28 UTC (permalink / raw)
  To: mst@redhat.com
  Cc: xuanzhuo@linux.alibaba.com, jasowang@redhat.com,
	edumazet@google.com, virtualization@lists.linux.dev,
	netdev@vger.kernel.org, Tang Longjun
In-Reply-To: <20260624030656-mutt-send-email-mst@kernel.org>

At 2026-06-24 15:08:24, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>On Wed, Jun 24, 2026 at 03:02:06PM +0800, Longjun Tang wrote:
>> From: Longjun Tang <tanglongjun@kylinos.cn>
>> 
>> When busy-poll is active, napi_schedule_prep() returns false in
>> virtqueue_napi_schedule(), so virtqueue_disable_cb() is skipped.
>> The device may keep firing irqs until reaches virtqueue_napi_complete().
>> Under load (received == budget), it will lead to a large number
>> of spurious interrupts.
>> 
>> Fix it by disabling the callback at the virtnet_poll() entry. This keeps
>> the callback off while we poll and re-enable
>
>and it is re-enabled
>
>> by virtqueue_napi_complete()
>> when going idle.
>> 
>> Fixes: ceef438d613f ("virtio_net: remove custom busy_poll")
>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
>> 
>> ---
>> V1 -> V2: Remain agnostic to busy polling
>> V2 -> V3: Add fixes tag
>> ---
>>  drivers/net/virtio_net.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>> 
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index f4adcfee7a80..0a11f2b32500 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -3008,6 +3008,11 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
>>  	unsigned int xdp_xmit = 0;
>>  	bool napi_complete;
>>  
>> +	/* Keep callbacks suppressed for the duration of this poll,
>> +	 * busy-poll need.
>
>I don't know what "busy-poll need" means. Just drop this part?
>In fact, the whole comment can go, we know virtqueue_disable_cb
>disables callbacks.

Thanks for your reply. Understood, I will fix this in the next version.
>
>> +	 */
>> +	virtqueue_disable_cb(rq->vq);
>> +
>>  	virtnet_poll_cleantx(rq, budget);
>>  
>>  	received = virtnet_receive(rq, budget, &xdp_xmit);
>> -- 
>> 2.43.0

^ permalink raw reply

* Re: [PATCH v4] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Michael S. Tsirkin @ 2026-06-24 22:10 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel
In-Reply-To: <CAMuQ4bX-iDvcUOPPY+NLz95tkRJYwWqvzAr=U48uNaub_HZLGw@mail.gmail.com>

On Wed, Jun 24, 2026 at 03:02:02PM -0700, Yousef Alhouseen wrote:
> vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
> size before computing the number of pages to pin. On 32-bit systems,
> where unsigned long is narrower than u64, that addition can overflow and
> the code can pin and map fewer pages than the requested IOTLB range.
> 
> Reject sizes that overflow the unsigned long page-count calculation.
> 
> Fixes: 22af48cf91aa ("vdpa: factor out vhost_vdpa_pa_map() and
> vhost_vdpa_pa_unmap()")

still 2 lines?

> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
> ---
> Changes in v4:
> - Keep the Fixes tag on one line.
> - Add Michael's Acked-by tag.
> 
> Changes in v3:
> - Add the Fixes tag.
> 
> Changes in v2:
> - Clarify that the overflow is on 32-bit systems.
> - Drop the unrelated memlock check change.
> 
>  drivers/vhost/vdpa.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ac55275fa..38b28ed3d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	unsigned int gup_flags = FOLL_LONGTERM;
>  	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
>  	unsigned long lock_limit, sz2pin, nchunks, i;
> +	unsigned long page_offset;
>  	u64 start = iova;
>  	long pinned;
>  	int ret = 0;
> @@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	if (perm & VHOST_ACCESS_WO)
>  		gup_flags |= FOLL_WRITE;
> 
> -	npages = PFN_UP(size + (iova & ~PAGE_MASK));
> +	page_offset = iova & ~PAGE_MASK;
> +	if (size > ULONG_MAX - page_offset) {
> +		ret = -EINVAL;
> +		goto free;
> +	}
> +
> +	npages = PFN_UP(size + page_offset);
>  	if (!npages) {
>  		ret = -EINVAL;
>  		goto free;
> -- 
> 2.54.0


^ permalink raw reply

* [PATCH v4] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Yousef Alhouseen @ 2026-06-24 22:02 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang, Eugenio Pérez
  Cc: kvm, virtualization, netdev, linux-kernel, Yousef Alhouseen

vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
size before computing the number of pages to pin. On 32-bit systems,
where unsigned long is narrower than u64, that addition can overflow and
the code can pin and map fewer pages than the requested IOTLB range.

Reject sizes that overflow the unsigned long page-count calculation.

Fixes: 22af48cf91aa ("vdpa: factor out vhost_vdpa_pa_map() and
vhost_vdpa_pa_unmap()")
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
Changes in v4:
- Keep the Fixes tag on one line.
- Add Michael's Acked-by tag.

Changes in v3:
- Add the Fixes tag.

Changes in v2:
- Clarify that the overflow is on 32-bit systems.
- Drop the unrelated memlock check change.

 drivers/vhost/vdpa.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ac55275fa..38b28ed3d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	unsigned int gup_flags = FOLL_LONGTERM;
 	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
 	unsigned long lock_limit, sz2pin, nchunks, i;
+	unsigned long page_offset;
 	u64 start = iova;
 	long pinned;
 	int ret = 0;
@@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	if (perm & VHOST_ACCESS_WO)
 		gup_flags |= FOLL_WRITE;

-	npages = PFN_UP(size + (iova & ~PAGE_MASK));
+	page_offset = iova & ~PAGE_MASK;
+	if (size > ULONG_MAX - page_offset) {
+		ret = -EINVAL;
+		goto free;
+	}
+
+	npages = PFN_UP(size + page_offset);
 	if (!npages) {
 		ret = -EINVAL;
 		goto free;
-- 
2.54.0
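
To see the truncation concretely, the following userspace sketch models
the page-count calculation with a 32-bit unsigned long. It is an
illustration under assumptions, not the kernel code: ulong32_model
stands in for unsigned long on a 32-bit system, a 4 KiB page size is
assumed, and the MODEL_* macros and both function names are
hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (UINT64_C(1) << MODEL_PAGE_SHIFT)

/* Models unsigned long on a 32-bit system. */
typedef uint32_t ulong32_model;

/* Old behaviour: the 64-bit page count is silently truncated. */
static ulong32_model npages_unchecked(uint64_t size, uint64_t iova)
{
	uint64_t off = iova & (MODEL_PAGE_SIZE - 1);
	uint64_t pages = (size + off + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;

	return (ulong32_model)pages;
}

/* New behaviour: reject sizes that do not fit before rounding up. */
static int64_t npages_checked(uint64_t size, uint64_t iova)
{
	ulong32_model off = (ulong32_model)(iova & (MODEL_PAGE_SIZE - 1));

	if (size > (uint64_t)UINT32_MAX - off)
		return -EINVAL;

	return (int64_t)((size + off + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT);
}
```

With size = 2^44 + 4096 and a page-aligned iova, the true count is
2^32 + 1 pages. The unchecked version truncates that to 1, so a single
page would be pinned for a far larger range, while the checked version
returns -EINVAL.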

^ permalink raw reply related

* Re: [PATCH v3] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Michael S. Tsirkin @ 2026-06-24 21:59 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel
In-Reply-To: <CAMuQ4bV8OeSTOVnAPRh6ygKdogFjqEiDNj1Vbh623KBBkZgxiw@mail.gmail.com>

On Wed, Jun 24, 2026 at 02:56:20PM -0700, Yousef Alhouseen wrote:
> vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
> size before computing the number of pages to pin. On 32-bit systems,
> where unsigned long is narrower than u64, that addition can overflow and
> the code can pin and map fewer pages than the requested IOTLB range.
> 
> Reject sizes that overflow the unsigned long page-count calculation.
> 
> Fixes: 22af48cf91aa ("vdpa: factor out vhost_vdpa_pa_map() and
> vhost_vdpa_pa_unmap()")

weirdly wrapped. will likely break some tools.

> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> Changes in v3:
> - Add the Fixes tag.
> 
> Changes in v2:
> - Clarify that the overflow is on 32-bit systems.
> - Drop the unrelated memlock check change.
> 
>  drivers/vhost/vdpa.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ac55275fa..38b28ed3d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	unsigned int gup_flags = FOLL_LONGTERM;
>  	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
>  	unsigned long lock_limit, sz2pin, nchunks, i;
> +	unsigned long page_offset;
>  	u64 start = iova;
>  	long pinned;
>  	int ret = 0;
> @@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	if (perm & VHOST_ACCESS_WO)
>  		gup_flags |= FOLL_WRITE;
> 
> -	npages = PFN_UP(size + (iova & ~PAGE_MASK));
> +	page_offset = iova & ~PAGE_MASK;
> +	if (size > ULONG_MAX - page_offset) {
> +		ret = -EINVAL;
> +		goto free;
> +	}
> +
> +	npages = PFN_UP(size + page_offset);
>  	if (!npages) {
>  		ret = -EINVAL;
>  		goto free;
> -- 
> 2.54.0


^ permalink raw reply

* [PATCH v3] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Yousef Alhouseen @ 2026-06-24 21:56 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang, Eugenio Pérez
  Cc: kvm, virtualization, netdev, linux-kernel, Yousef Alhouseen

vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
size before computing the number of pages to pin. On 32-bit systems,
where unsigned long is narrower than u64, that addition can overflow and
the code can pin and map fewer pages than the requested IOTLB range.

Reject sizes that overflow the unsigned long page-count calculation.

Fixes: 22af48cf91aa ("vdpa: factor out vhost_vdpa_pa_map() and
vhost_vdpa_pa_unmap()")
Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
Changes in v3:
- Add the Fixes tag.

Changes in v2:
- Clarify that the overflow is on 32-bit systems.
- Drop the unrelated memlock check change.

 drivers/vhost/vdpa.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ac55275fa..38b28ed3d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	unsigned int gup_flags = FOLL_LONGTERM;
 	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
 	unsigned long lock_limit, sz2pin, nchunks, i;
+	unsigned long page_offset;
 	u64 start = iova;
 	long pinned;
 	int ret = 0;
@@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	if (perm & VHOST_ACCESS_WO)
 		gup_flags |= FOLL_WRITE;

-	npages = PFN_UP(size + (iova & ~PAGE_MASK));
+	page_offset = iova & ~PAGE_MASK;
+	if (size > ULONG_MAX - page_offset) {
+		ret = -EINVAL;
+		goto free;
+	}
+
+	npages = PFN_UP(size + page_offset);
 	if (!npages) {
 		ret = -EINVAL;
 		goto free;
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH v2] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Michael S. Tsirkin @ 2026-06-24 21:54 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel
In-Reply-To: <CAMuQ4bURwyoJtKUskzXpbUtrxXT1vBZZxpKpnnJY6qWsLtTBMA@mail.gmail.com>

On Wed, Jun 24, 2026 at 02:51:53PM -0700, Yousef Alhouseen wrote:
> vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
> size before computing the number of pages to pin. On 32-bit systems,
> where unsigned long is narrower than u64, that addition can overflow and
> the code can pin and map fewer pages than the requested IOTLB range.
> 
> Reject sizes that overflow the unsigned long page-count calculation.
> 


And a Fixes: tag, please.

> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
> ---
> Changes in v2:
> - Clarify that the overflow is on 32-bit systems.
> - Drop the unrelated memlock check change.
> 
>  drivers/vhost/vdpa.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ac55275fa..38b28ed3d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	unsigned int gup_flags = FOLL_LONGTERM;
>  	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
>  	unsigned long lock_limit, sz2pin, nchunks, i;
> +	unsigned long page_offset;
>  	u64 start = iova;
>  	long pinned;
>  	int ret = 0;
> @@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	if (perm & VHOST_ACCESS_WO)
>  		gup_flags |= FOLL_WRITE;
> 
> -	npages = PFN_UP(size + (iova & ~PAGE_MASK));
> +	page_offset = iova & ~PAGE_MASK;
> +	if (size > ULONG_MAX - page_offset) {
> +		ret = -EINVAL;
> +		goto free;
> +	}
> +
> +	npages = PFN_UP(size + page_offset);
>  	if (!npages) {
>  		ret = -EINVAL;
>  		goto free;
> -- 
> 2.54.0


^ permalink raw reply

* [PATCH v2] vhost/vdpa: reject overflowing PA map page counts on 32-bit
From: Yousef Alhouseen @ 2026-06-24 21:51 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang, Eugenio Pérez
  Cc: kvm, virtualization, netdev, linux-kernel, Yousef Alhouseen

vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
size before computing the number of pages to pin. On 32-bit systems,
where unsigned long is narrower than u64, that addition can overflow and
the code can pin and map fewer pages than the requested IOTLB range.

Reject sizes that overflow the unsigned long page-count calculation.

Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
Changes in v2:
- Clarify that the overflow is on 32-bit systems.
- Drop the unrelated memlock check change.

 drivers/vhost/vdpa.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ac55275fa..38b28ed3d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1102,6 +1102,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	unsigned int gup_flags = FOLL_LONGTERM;
 	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
 	unsigned long lock_limit, sz2pin, nchunks, i;
+	unsigned long page_offset;
 	u64 start = iova;
 	long pinned;
 	int ret = 0;
@@ -1114,7 +1115,13 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	if (perm & VHOST_ACCESS_WO)
 		gup_flags |= FOLL_WRITE;

-	npages = PFN_UP(size + (iova & ~PAGE_MASK));
+	page_offset = iova & ~PAGE_MASK;
+	if (size > ULONG_MAX - page_offset) {
+		ret = -EINVAL;
+		goto free;
+	}
+
+	npages = PFN_UP(size + page_offset);
 	if (!npages) {
 		ret = -EINVAL;
 		goto free;
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH] vhost/vdpa: reject overflowing PA map page counts
From: Yousef Alhouseen @ 2026-06-24 21:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel
In-Reply-To: <20260624153850-mutt-send-email-mst@kernel.org>

On Wed, Jun 24, 2026 at 01:53:38PM -0400, Michael S. Tsirkin wrote:
> You should add "on 32 bit systems" - I do not see how it can
> overflow on 64 bit.

Right, the overflow I was trying to cover is the unsigned long
page-count calculation on 32-bit systems, where size can be wider than
unsigned long and the page offset is added before PFN_UP(). I should
have made that scope explicit in the changelog.

> I don't see how this can happen at all - pinned_vm is in units of pages.

Agreed, that part is not needed for this fix. I'll drop the memlock
check change and send a v2 with the changelog clarified to say this is
for 32-bit systems.

Thanks,
Yousef


On Wed, 24 Jun 2026 15:53:38 -0400, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Jun 24, 2026 at 09:06:53PM +0200, Yousef Alhouseen wrote:
> > vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
> > size before computing the number of pages to pin. If that addition wraps,
> > the code can pin and map fewer pages than the requested IOTLB range.
> >
> > Reject sizes that overflow the page-count calculation.
>
> You should add "on 32 bit systems" - I do not see how it can
> overflow on 64 bit.
>
> > Also make the
> > memlock check subtraction-based so a large page count cannot wrap the
> > pinned page total.
>
> I don't see how this can happen at all - pinned_vm is in units of pages.
>
> > Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
> > ---
> > drivers/vhost/vdpa.c | 12 ++++++++++--
> > 1 file changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index ac55275fa..090cb8693 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -1102,6 +1102,8 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> > unsigned int gup_flags = FOLL_LONGTERM;
> > unsigned long npages, cur_base, map_pfn, last_pfn = 0;
> > unsigned long lock_limit, sz2pin, nchunks, i;
> > + unsigned long page_offset;
> > + u64 pinned_vm;
> > u64 start = iova;
> > long pinned;
> > int ret = 0;
> > @@ -1114,7 +1116,12 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> > if (perm & VHOST_ACCESS_WO)
> > gup_flags |= FOLL_WRITE;
> >
> > - npages = PFN_UP(size + (iova & ~PAGE_MASK));
> > + page_offset = iova & ~PAGE_MASK;
> > + if (size > ULONG_MAX - page_offset) {
> > + ret = -EINVAL;
> > + goto free;
> > + }
> > + npages = PFN_UP(size + page_offset);
> > if (!npages) {
> > ret = -EINVAL;
> > goto free;
> > @@ -1123,7 +1130,8 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> > mmap_read_lock(dev->mm);
> >
> > lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> > - if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
> > + pinned_vm = atomic64_read(&dev->mm->pinned_vm);
> > + if (npages > lock_limit || pinned_vm > lock_limit - npages) {
> > ret = -ENOMEM;
> > goto unlock;
> > }
> > --
> > 2.54.0

^ permalink raw reply

* [syzbot] Monthly virt report (Jun 2026)
From: syzbot @ 2026-06-24 20:32 UTC (permalink / raw)
  To: linux-kernel, syzkaller-bugs, virtualization

Hello virt maintainers/developers,

This is a 31-day syzbot report for the virt subsystem.
All related reports/information can be found at:
https://syzkaller.appspot.com/upstream/s/virt

During the period, 0 new issues were detected and 0 were fixed.
In total, 5 issues are still open and 61 have already been fixed.
There are also 2 low-priority issues.

Some of the still happening issues:

Ref Crashes Repro Title
<1> 24      No    WARNING: refcount bug in call_timer_fn (4)
                  https://syzkaller.appspot.com/bug?extid=07dcf509f4c013e25dc5
<2> 3       Yes   memory leak in __vsock_create (2)
                  https://syzkaller.appspot.com/bug?extid=1b2c9c4a0f8708082678
<3> 3913    Yes   INFO: rcu detected stall in do_idle
                  https://syzkaller.appspot.com/bug?extid=385468161961cee80c31

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

To disable reminders for individual bugs, reply with the following command:
#syz set <Ref> no-reminders

To change bug's subsystems, reply with:
#syz set <Ref> subsystems: new-subsystem

You may send multiple commands in a single email message.

^ permalink raw reply

* [PATCH] drm/qxl: fix use-after-free in qxl_irq_handler on PCI
From: Óscar Megía López @ 2026-06-24 20:12 UTC (permalink / raw)
  To: virtualization, linux-kernel-mentees; +Cc: Óscar Megía López

Running the following loop:

while :; do
    echo [pci qxl id] > /sys/bus/pci/drivers/qxl/unbind
    echo [pci qxl id] > /sys/bus/pci/drivers/qxl/bind
done

triggers the following KASAN report after a few seconds:

==================================================================
BUG: KASAN: slab-use-after-free in qxl_irq_handler+0x269/0x2b0
Read of size 8 at addr ffff888001c6cd48 by task swapper/0/0

CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted
     7.1.0-10963-g1a3746ccbb0a #31 PREEMPT(lazy)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
               BIOS Arch Linux 1.17.0-2-2 04/01/2014
Call Trace:
 <IRQ>
 dump_stack_lvl+0x4d/0x70
 print_report+0x14b/0x4b0
 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
 ? profile_tick+0x56/0x90
 ? tick_nohz_handler+0x23c/0x5c0
 kasan_report+0x117/0x140
 ? qxl_irq_handler+0x269/0x2b0
 ? qxl_irq_handler+0x269/0x2b0
 ? __pfx_qxl_irq_handler+0x10/0x10
 qxl_irq_handler+0x269/0x2b0
 ? __pfx_qxl_irq_handler+0x10/0x10
 ? __pfx_qxl_irq_handler+0x10/0x10
 __handle_irq_event_percpu+0x116/0x450
 ? __pfx__raw_spin_lock+0x10/0x10
 handle_irq_event+0xa6/0x1c0
 handle_fasteoi_irq+0x271/0xb10
 ? __pfx_handle_fasteoi_irq+0x10/0x10
 __common_interrupt+0x60/0x130
 common_interrupt+0x7a/0x90
 </IRQ>
 <TASK>
  asm_common_interrupt+0x26/0x40
 RIP: 0010:pv_native_safe_halt+0xf/0x20
 Code: 42 de 00 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90
       90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d a3 cf 20 00
       fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90
       90 90 90 90 90
 RSP: 0018:ffffffffb8207e48 EFLAGS: 00000206
 RAX: ffff8880b296f000 RBX: ffffffffb82146c0 RCX: 0000000000000001
 RDX: 0000000000000001 RSI: 0000000000000004 RDI: 0000000000067a04
 RBP: fffffbfff70428d8 R08: ffffffffb7247e1d R09: 1ffff1100d846202
 R10: ffffed100d846203 R11: ffffed100d846203 R12: 0000000000000000
 R13: 0000000000000000 R14: 1ffffffff7040fcd R15: dffffc0000000000
  ? ct_kernel_exit.constprop.0+0x9d/0xc0
  default_idle+0x9/0x10
  default_idle_call+0x37/0x60
  do_idle+0x3a8/0x5d0
  ? __pfx___schedule+0x10/0x10
  ? __pfx_do_idle+0x10/0x10
  cpu_startup_entry+0x4e/0x60
  rest_init+0x11a/0x120
  start_kernel+0x382/0x390
  x86_64_start_reservations+0x24/0x30
  x86_64_start_kernel+0xd6/0xe0
  common_startup_64+0x13e/0x158
  </TASK>

The qxl_pci_remove() function does not call free_irq(), allowing the IRQ
handler to fire after the device has been torn down, accessing freed
memory (qdev->ram_header, qdev->io_base).

This patch follows the driver-unload steps described in the PCI
documentation linked below.

At the start of qxl_pci_remove(), stop the device from generating IRQs
and release the IRQ with free_irq(), so that no IRQ can fire after
teardown begins.

At the end of qxl_pci_remove(), disable the device.

Assisted-by: OpenCode:1.17.8-Big Pickle
Fixes: 48bd85808443 ("drm/qxl: Convert to Linux IRQ interfaces")
Signed-off-by: Óscar Megía López <megia.oscar@gmail.com>
Link: https://www.kernel.org/doc/html/latest/PCI/pci.html
---
 drivers/gpu/drm/qxl/qxl_drv.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/qxl/qxl_drv.c b/drivers/gpu/drm/qxl/qxl_drv.c
index 1e6a2392d7c6..1613547c1856 100644
--- a/drivers/gpu/drm/qxl/qxl_drv.c
+++ b/drivers/gpu/drm/qxl/qxl_drv.c
@@ -154,12 +154,19 @@ static void
 qxl_pci_remove(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct qxl_device *qdev = to_qxl(dev);
+
+	qdev->ram_header->int_mask = 0;
+	outb(0, qdev->io_base + QXL_IO_UPDATE_IRQ);
+	free_irq(pdev->irq, dev);
+	cancel_work_sync(&qdev->client_monitors_config_work);
 
 	drm_kms_helper_poll_fini(dev);
 	drm_dev_unregister(dev);
 	drm_atomic_helper_shutdown(dev);
 	if (pci_is_vga(pdev) && pdev->revision < 5)
 		vga_put(pdev, VGA_RSRC_LEGACY_IO);
+	pci_disable_device(pdev);
 }
 
 static void
-- 
2.54.0
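
The ordering problem can be reduced to the small userspace model below.
It is a sketch only: dev_model, fire_irq() and the two remove variants
are hypothetical stand-ins, and a counter replaces the real
use-after-free so that the example stays well defined.

```c
#include <stdbool.h>

struct dev_model {
	bool irq_registered;	/* handler is still attached to the IRQ */
	bool resources_live;	/* memory the handler reads is still valid */
	unsigned int bad_accesses;	/* stands in for the KASAN report */
};

/* An interrupt arrives; the handler runs only while it is registered. */
static void fire_irq(struct dev_model *d)
{
	if (!d->irq_registered)
		return;
	if (!d->resources_live)
		d->bad_accesses++;
}

/* Old remove path: resources go away, the handler stays attached. */
static void remove_without_free_irq(struct dev_model *d)
{
	d->resources_live = false;
}

/* Fixed remove path: detach the handler first, then tear down. */
static void remove_with_free_irq(struct dev_model *d)
{
	d->irq_registered = false;
	d->resources_live = false;
}

/* Count bad accesses when an IRQ fires after the chosen remove path. */
static unsigned int bad_accesses_after_remove(bool fixed)
{
	struct dev_model d = {
		.irq_registered = true,
		.resources_live = true,
		.bad_accesses = 0,
	};

	if (fixed)
		remove_with_free_irq(&d);
	else
		remove_without_free_irq(&d);

	fire_irq(&d);
	return d.bad_accesses;
}
```

A late interrupt is harmless only when the handler has been detached
before the data it touches is released.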


^ permalink raw reply related

* Re: [PATCH] vhost/vdpa: reject overflowing PA map page counts
From: Michael S. Tsirkin @ 2026-06-24 19:53 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel
In-Reply-To: <20260624190653.2893-1-alhouseenyousef@gmail.com>

On Wed, Jun 24, 2026 at 09:06:53PM +0200, Yousef Alhouseen wrote:
> vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
> size before computing the number of pages to pin. If that addition wraps,
> the code can pin and map fewer pages than the requested IOTLB range.
> 
> Reject sizes that overflow the page-count calculation.

You should add "on 32 bit systems" - I do not see how it can
overflow on 64 bit.

> Also make the
> memlock check subtraction-based so a large page count cannot wrap the
> pinned page total.

I don't see how this can happen at all - pinned_vm is in units of pages.

> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
> ---
>  drivers/vhost/vdpa.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ac55275fa..090cb8693 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1102,6 +1102,8 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	unsigned int gup_flags = FOLL_LONGTERM;
>  	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
>  	unsigned long lock_limit, sz2pin, nchunks, i;
> +	unsigned long page_offset;
> +	u64 pinned_vm;
>  	u64 start = iova;
>  	long pinned;
>  	int ret = 0;
> @@ -1114,7 +1116,12 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	if (perm & VHOST_ACCESS_WO)
>  		gup_flags |= FOLL_WRITE;
>  
> -	npages = PFN_UP(size + (iova & ~PAGE_MASK));
> +	page_offset = iova & ~PAGE_MASK;
> +	if (size > ULONG_MAX - page_offset) {
> +		ret = -EINVAL;
> +		goto free;
> +	}
> +	npages = PFN_UP(size + page_offset);
>  	if (!npages) {
>  		ret = -EINVAL;
>  		goto free;
> @@ -1123,7 +1130,8 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>  	mmap_read_lock(dev->mm);
>  
>  	lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> -	if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
> +	pinned_vm = atomic64_read(&dev->mm->pinned_vm);
> +	if (npages > lock_limit || pinned_vm > lock_limit - npages) {
>  		ret = -ENOMEM;
>  		goto unlock;
>  	}
> -- 
> 2.54.0


^ permalink raw reply

* [PATCH] vhost/vdpa: reject overflowing PA map page counts
From: Yousef Alhouseen @ 2026-06-24 19:06 UTC (permalink / raw)
  To: Michael S . Tsirkin, Jason Wang, Eugenio Pérez
  Cc: kvm, virtualization, netdev, linux-kernel, Yousef Alhouseen

vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
size before computing the number of pages to pin. If that addition wraps,
the code can pin and map fewer pages than the requested IOTLB range.

Reject sizes that overflow the page-count calculation. Also make the
memlock check subtraction-based so a large page count cannot wrap the
pinned page total.

Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
 drivers/vhost/vdpa.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ac55275fa..090cb8693 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1102,6 +1102,8 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	unsigned int gup_flags = FOLL_LONGTERM;
 	unsigned long npages, cur_base, map_pfn, last_pfn = 0;
 	unsigned long lock_limit, sz2pin, nchunks, i;
+	unsigned long page_offset;
+	u64 pinned_vm;
 	u64 start = iova;
 	long pinned;
 	int ret = 0;
@@ -1114,7 +1116,12 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	if (perm & VHOST_ACCESS_WO)
 		gup_flags |= FOLL_WRITE;

-	npages = PFN_UP(size + (iova & ~PAGE_MASK));
+	page_offset = iova & ~PAGE_MASK;
+	if (size > ULONG_MAX - page_offset) {
+		ret = -EINVAL;
+		goto free;
+	}
+	npages = PFN_UP(size + page_offset);
 	if (!npages) {
 		ret = -EINVAL;
 		goto free;
@@ -1123,7 +1130,8 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
 	mmap_read_lock(dev->mm);

 	lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
-	if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
+	pinned_vm = atomic64_read(&dev->mm->pinned_vm);
+	if (npages > lock_limit || pinned_vm > lock_limit - npages) {
 		ret = -ENOMEM;
 		goto unlock;
 	}
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH] scsi: virtio_scsi: fixup endian conversions for warning messages
From: Stefan Hajnoczi @ 2026-06-24 18:07 UTC (permalink / raw)
  To: Ben Dooks
  Cc: Michael S. Tsirkin, Jason Wang, Paolo Bonzini, Eugenio Pérez,
	James E.J. Bottomley, Martin K. Petersen, virtualization,
	linux-scsi, linux-kernel
In-Reply-To: <20260623132427.838900-1-ben.dooks@codethink.co.uk>

[-- Attachment #1: Type: text/plain, Size: 1277 bytes --]

On Tue, Jun 23, 2026 at 02:24:27PM +0100, Ben Dooks wrote:
> There are several places where printing functions are being passed parameters
> that have not been through endian conversion functions. Use the virtio32_to_cpu
> to fix the warnings.
> 
> Fixes the following warnings from (prototype) sparse:
> drivers/scsi/virtio_scsi.c:126:9: warning: incorrect type in argument 7 (different base types)
> drivers/scsi/virtio_scsi.c:126:9:    expected unsigned int
> drivers/scsi/virtio_scsi.c:126:9:    got restricted __virtio32 [usertype] sense_len
> drivers/scsi/virtio_scsi.c:312:17: warning: incorrect type in argument 2 (different base types)
> drivers/scsi/virtio_scsi.c:312:17:    expected unsigned int
> drivers/scsi/virtio_scsi.c:312:17:    got restricted __virtio32 [usertype] reason
> drivers/scsi/virtio_scsi.c:412:17: warning: incorrect type in argument 2 (different base types)
> drivers/scsi/virtio_scsi.c:412:17:    expected unsigned int
> drivers/scsi/virtio_scsi.c:412:17:    got restricted __virtio32 [usertype] event
> 
> Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
> ---
>  drivers/scsi/virtio_scsi.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v6 04/12] nvdimm: virtio_pmem: stop allocating child flush bio
From: Pankaj Gupta @ 2026-06-24 17:22 UTC (permalink / raw)
  To: Li Chen
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	Alison Schofield, virtualization, nvdimm, linux-kernel
In-Reply-To: <20260621130246.2973254-5-me@linux.beauty>

> pmem_submit_bio() passes the parent bio to nvdimm_flush() for
> REQ_FUA. For virtio-pmem this makes async_pmem_flush() allocate
> and submit a child PREFLUSH bio chained to the parent.
>
> That child allocation is in the block submit path. Making it
> blocking with GFP_NOIO can consume the same global bio mempool that
> submit_bio() uses, while making it GFP_ATOMIC can fail under
> pressure. A forced failure of the child allocation produced:
>
> virtio_pmem: forcing child bio allocation failure for test
> Buffer I/O error on dev pmem0, logical block 0, lost sync page write
> EXT4-fs (pmem0): I/O error while writing superblock
> EXT4-fs (pmem0): mount failed
>
> Avoid the child bio completely. Flush FUA synchronously, like
> REQ_PREFLUSH, then complete the parent after the flush. Since no
> child bio can be created, async_pmem_flush() now only issues the
> virtio flush and preserves negative errno values.

The child flush is asynchronous: it issues the flush to the host side and returns.
Until the child bio completes, guest userspace waits in a pending I/O state.
It seems the current change will affect this behavior?

A prior RFC [1] attempted to coalesce async FLUSH requests between guest and host.
If there is interest, that approach could be revisited or integrated here.

[1] https://lore.kernel.org/all/20220111161937.56272-1-pankaj.gupta.linux@gmail.com/#t

Thanks,
Pankaj

>
> Signed-off-by: Li Chen <me@linux.beauty>
> ---
> Changes in v6:
> - Replace the child bio allocation fix with synchronous FUA flushing.
>
>  drivers/nvdimm/nd_virtio.c | 22 ++++------------------
>  drivers/nvdimm/pmem.c      |  2 +-
>  2 files changed, 5 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> index 4176046627beb..4b2e9c47af0f5 100644
> --- a/drivers/nvdimm/nd_virtio.c
> +++ b/drivers/nvdimm/nd_virtio.c
> @@ -110,27 +110,13 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
>  /* The asynchronous flush callback function */
>  int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
>  {
> -       /*
> -        * Create child bio for asynchronous flush and chain with
> -        * parent bio. Otherwise directly call nd_region flush.
> -        */
> -       if (bio && bio->bi_iter.bi_sector != -1) {
> -               struct bio *child = bio_alloc(bio->bi_bdev, 0,
> -                                             REQ_OP_WRITE | REQ_PREFLUSH,
> -                                             GFP_ATOMIC);
> +       int err;
>
> -               if (!child)
> -                       return -ENOMEM;
> -               bio_clone_blkg_association(child, bio);
> -               child->bi_iter.bi_sector = -1;
> -               bio_chain(child, bio);
> -               submit_bio(child);
> -               return 0;
> -       }
> -       if (virtio_pmem_flush(nd_region))
> +       err = virtio_pmem_flush(nd_region);
> +       if (err > 0)
>                 return -EIO;
>
> -       return 0;
> +       return err;
>  };
>  EXPORT_SYMBOL_GPL(async_pmem_flush);
>  MODULE_DESCRIPTION("Virtio Persistent Memory Driver");
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 82ee1ddb3a445..058d2739c95a1 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -242,7 +242,7 @@ static void pmem_submit_bio(struct bio *bio)
>         }
>
>         if ((bio->bi_opf & REQ_FUA) && !bio->bi_status)
> -               ret = nvdimm_flush(nd_region, bio);
> +               ret = nvdimm_flush(nd_region, NULL);
>
>         if (ret)
>                 bio->bi_status = errno_to_blk_status(ret);
> --
> 2.52.0
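
As a reading aid, the return-value handling that the quoted async_pmem_flush() hunk ends up with can be sketched in userspace as below. The function name is hypothetical, and treating a positive result as a device-reported status rather than an errno is an assumption.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for the tail of async_pmem_flush() after the patch:
 * a positive result is mapped to -EIO, while zero and negative errno values
 * pass through unchanged.
 */
static int sketch_normalize_flush_result(int err)
{
	if (err > 0)
		return -EIO;
	return err;
}
```

The caller can then hand the result to errno_to_blk_status(), as pmem_submit_bio() does in the quoted diff.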

^ permalink raw reply

* Re: [PATCH v2 5/4] virtio_balloon: warn on failed buffer add in stats_handle_request()
From: Denis V. Lunev @ 2026-06-24 17:03 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Denis V. Lunev, mst; +Cc: virtualization, linux-kernel
In-Reply-To: <1dc2c6c3-ebce-4646-afc1-d83755537278@kernel.org>

On 6/24/26 18:56, David Hildenbrand (Arm) wrote:
> On 6/24/26 17:40, Denis V. Lunev wrote:
>> Like tell_host(), stats_handle_request() ignores the return value of
>> virtqueue_add_outbuf() and kicks the queue regardless. The same "we
>> should always be able to add one buffer to an empty queue" assumption
>> does not hold once the virtqueue has been broken (e.g. on device
>> shutdown), where the add fails with -EIO. Unlike tell_host() it does
>> not wait_event() afterwards so it cannot hang, but it still kicks a
>> queue with nothing queued.
>>
>> Warn and bail out on failure, mirroring tell_host() and
>> virtballoon_free_page_report().
>>
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> ---
>>  drivers/virtio/virtio_balloon.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 0866a8781f0b..454bbb77331d 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -445,6 +445,7 @@ static void stats_handle_request(struct virtio_balloon *vb)
>>  	struct virtqueue *vq;
>>  	struct scatterlist sg;
>>  	unsigned int len, num_stats;
>> +	int err;
>>  
>>  	num_stats = update_balloon_stats(vb);
>>  
>> @@ -452,7 +453,9 @@ static void stats_handle_request(struct virtio_balloon *vb)
>>  	if (!virtqueue_get_buf(vq, &len))
>>  		return;
>>  	sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
>> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> +	err = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> +	if (WARN_ON_ONCE(err))
>> +		return;
>>  	virtqueue_kick(vq);
>>  }
>>  
> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
>
> Although I would just squash #4 and #5.
>
Sure thing. I did it this way to avoid a re-submit, if that is possible.

Den

^ permalink raw reply

* Re: [PATCH v2 5/4] virtio_balloon: warn on failed buffer add in stats_handle_request()
From: David Hildenbrand (Arm) @ 2026-06-24 16:56 UTC (permalink / raw)
  To: Denis V. Lunev, mst; +Cc: virtualization, linux-kernel
In-Reply-To: <20260624154001.2733242-1-den@openvz.org>

On 6/24/26 17:40, Denis V. Lunev wrote:
> Like tell_host(), stats_handle_request() ignores the return value of
> virtqueue_add_outbuf() and kicks the queue regardless. The same "we
> should always be able to add one buffer to an empty queue" assumption
> does not hold once the virtqueue has been broken (e.g. on device
> shutdown), where the add fails with -EIO. Unlike tell_host() it does
> not wait_event() afterwards so it cannot hang, but it still kicks a
> queue with nothing queued.
> 
> Warn and bail out on failure, mirroring tell_host() and
> virtballoon_free_page_report().
> 
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> ---
>  drivers/virtio/virtio_balloon.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 0866a8781f0b..454bbb77331d 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -445,6 +445,7 @@ static void stats_handle_request(struct virtio_balloon *vb)
>  	struct virtqueue *vq;
>  	struct scatterlist sg;
>  	unsigned int len, num_stats;
> +	int err;
>  
>  	num_stats = update_balloon_stats(vb);
>  
> @@ -452,7 +453,9 @@ static void stats_handle_request(struct virtio_balloon *vb)
>  	if (!virtqueue_get_buf(vq, &len))
>  		return;
>  	sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +	err = virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +	if (WARN_ON_ONCE(err))
> +		return;
>  	virtqueue_kick(vq);
>  }
>  

Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>

Although I would just squash #4 and #5.

-- 
Cheers,

David
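
The control flow under review can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the driver: struct sketch_vq and the sketch_* functions are hypothetical stand-ins for the virtqueue API, and the one-time warning only imitates WARN_ON_ONCE().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical mock of a virtqueue: just enough state to observe the flow. */
struct sketch_vq {
	bool broken;		/* set once the queue has been marked broken */
	unsigned int queued;	/* buffers successfully added */
	unsigned int kicks;	/* notifications sent to the host */
};

/* Mock add: fails with -EIO on a broken queue, as the commit message says. */
static int sketch_add_outbuf(struct sketch_vq *vq)
{
	if (vq->broken)
		return -EIO;
	vq->queued++;
	return 0;
}

static void sketch_kick(struct sketch_vq *vq)
{
	vq->kicks++;
}

/* Patched flow: warn once and bail out instead of kicking an empty queue. */
static void sketch_stats_handle_request(struct sketch_vq *vq)
{
	static bool warned;
	int err = sketch_add_outbuf(vq);

	if (err) {
		if (!warned) {
			warned = true;
			fprintf(stderr, "sketch: add_outbuf failed: %d\n", err);
		}
		return;
	}
	sketch_kick(vq);
}
```

On a broken queue the kick count stays at zero; on a healthy one, one buffer is queued and one kick is sent.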

^ permalink raw reply
