Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH v3] i2c: virtio: retain xfer with kref to fix UAF on interrupted wait
From: Michael S. Tsirkin @ 2026-06-10  6:50 UTC (permalink / raw)
  To: Gavin Li
  Cc: linux-i2c, viresh.kumar, Chen, Jian Jun, andi.shyti,
	virtualization
In-Reply-To: <20260609134358.36583-1-gavin.li@samsara.com>

On Tue, Jun 09, 2026 at 09:43:58AM -0400, Gavin Li wrote:
> commit a663b3c47ab1 ("i2c: virtio: Avoid hang by using interruptible
> completion wait") switched virtio_i2c_complete_reqs() to
> wait_for_completion_interruptible() so a stuck device cannot hang a
> task forever. That left a use-after-free: if the wait returns early on
> a signal, virtio_i2c_xfer() frees reqs and DMA bounce buffers while the
> device may still hold virtqueue tokens pointing at &reqs[i] and DMA
> into read buffers. When those requests complete later,
> virtio_i2c_msg_done() calls complete() on freed memory.
> 
> Waiting uninterruptibly for every completion before freeing avoids the
> UAF but can hang the caller indefinitely if the virtio side never
> completes the request. The virtio spec provides no way to cancel an
> in-flight transfer, so that is not an acceptable tradeoff.

except for a queue reset, or a device reset if that's not
available.

> This commit manages the freeing of the xfer allocations via kref, and
> ensures that each in-flight request holds a reference. This fixes the
> use-after-free by ensuring that the virtio device has a valid location
> to write to until the request completes. This will cause a memory leak
> in cases where the device hangs, but that is much preferable to memory
> corruption.
> 
> Additionally, force usage of a bounce buffer even if the i2c_msg buf is
> DMA-safe, since the buffer passed to the virtqueue must remain valid
> even if the transfer is interrupted. Remove usage of
> i2c_get_dma_safe_msg_buf() since it may pass through msg->buf directly.
> This results in an extra allocation+copy when I2C_M_DMA_SAFE is used.
> 
> Signed-off-by: Gavin Li <gavin.li@samsara.com>

It's possible that copying unconditionally is the best you could do.
Some vague ideas below but it's possible they are not practical.


> ---

pls start a new thread with each version, and include a changelog there.


>  drivers/i2c/busses/i2c-virtio.c | 110 ++++++++++++++++++++++++--------
>  1 file changed, 83 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/i2c/busses/i2c-virtio.c b/drivers/i2c/busses/i2c-virtio.c
> index 5da6fef92bec3..7ee96b08f6685 100644
> --- a/drivers/i2c/busses/i2c-virtio.c
> +++ b/drivers/i2c/busses/i2c-virtio.c
> @@ -13,6 +13,7 @@
>  #include <linux/err.h>
>  #include <linux/i2c.h>
>  #include <linux/kernel.h>
> +#include <linux/kref.h>
>  #include <linux/module.h>
>  #include <linux/virtio.h>
>  #include <linux/virtio_ids.h>
> @@ -31,39 +32,72 @@ struct virtio_i2c {
>  	struct virtqueue *vq;
>  };
>  
> +struct virtio_i2c_xfer;
> +
>  /**
>   * struct virtio_i2c_req - the virtio I2C request structure
> + * @xfer: owning transfer
>   * @completion: completion of virtio I2C message
>   * @out_hdr: the OUT header of the virtio I2C message
> - * @buf: the buffer into which data is read, or from which it's written
> + * @buf: bounce buffer for i2c_msg.buf, set to NULL upon free
>   * @in_hdr: the IN header of the virtio I2C message
>   */
>  struct virtio_i2c_req {
> +	struct virtio_i2c_xfer *xfer;
>  	struct completion completion;
>  	struct virtio_i2c_out_hdr out_hdr	____cacheline_aligned;
>  	uint8_t *buf				____cacheline_aligned;
>  	struct virtio_i2c_in_hdr in_hdr		____cacheline_aligned;
>  };
>  
> +/**
> + * struct virtio_i2c_xfer - a queued I2C transfer
> + * @ref: one ref for the caller, plus one per in-flight virtqueue request
> + * @num: number of messages
> + * @reqs: the virtio I2C requests
> + */
> +struct virtio_i2c_xfer {
> +	struct kref ref;
> +	int num;
> +	struct virtio_i2c_req reqs[];
> +};
> +
> +static void virtio_i2c_xfer_release(struct kref *ref)
> +{
> +	struct virtio_i2c_xfer *xfer = container_of(ref, struct virtio_i2c_xfer, ref);
> +	int i;
> +
> +	for (i = 0; i < xfer->num; i++) {
> +		struct virtio_i2c_req *req = &xfer->reqs[i];
> +		kfree(req->buf);
> +	}
> +
> +	kfree(xfer);
> +}
> +
>  static void virtio_i2c_msg_done(struct virtqueue *vq)
>  {
>  	struct virtio_i2c_req *req;
>  	unsigned int len;
>  
> -	while ((req = virtqueue_get_buf(vq, &len)))
> +	while ((req = virtqueue_get_buf(vq, &len))) {
>  		complete(&req->completion);
> +		kref_put(&req->xfer->ref, virtio_i2c_xfer_release);
> +	}
>  }
>  
> -static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
> -				   struct virtio_i2c_req *reqs,
> +static int virtio_i2c_prepare_xfer(struct virtqueue *vq,
> +				   struct virtio_i2c_xfer *xfer,
>  				   struct i2c_msg *msgs, int num)
>  {
>  	struct scatterlist *sgs[3], out_hdr, msg_buf, in_hdr;
> +	struct virtio_i2c_req *reqs = xfer->reqs;
>  	int i;
>  
>  	for (i = 0; i < num; i++) {
>  		int outcnt = 0, incnt = 0;
>  
> +		reqs[i].xfer = xfer;
>  		init_completion(&reqs[i].completion);
>  
>  		/*
> @@ -82,9 +116,16 @@ static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
>  		sgs[outcnt++] = &out_hdr;
>  
>  		if (msgs[i].len) {
> -			reqs[i].buf = i2c_get_dma_safe_msg_buf(&msgs[i], 1);
> +			/*
> +			 * Even if msg->flags has I2C_M_DMA_SAFE set, a bounce
> +			 * buffer is required because the transfer may be
> +			 * interrupted, after which msg->buf is no longer valid.
> +			 */
> +			reqs[i].buf = kzalloc(msgs[i].len, GFP_KERNEL);

is len always reasonably small?

>  			if (!reqs[i].buf)
>  				break;
> +			if (!(msgs[i].flags & I2C_M_RD))
> +				memcpy(reqs[i].buf, msgs[i].buf, msgs[i].len);
>  
>  			sg_init_one(&msg_buf, reqs[i].buf, msgs[i].len);
>

maybe there's some way to hang on to the original memory?


  
> @@ -97,38 +138,51 @@ static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
>  		sg_init_one(&in_hdr, &reqs[i].in_hdr, sizeof(reqs[i].in_hdr));
>  		sgs[outcnt + incnt++] = &in_hdr;
>  
> +		/* This reference is released in virtio_i2c_msg_done(). */
> +		kref_get(&xfer->ref);
> +
>  		if (virtqueue_add_sgs(vq, sgs, outcnt, incnt, &reqs[i], GFP_KERNEL)) {
> -			i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], false);
> +			kref_put(&xfer->ref, virtio_i2c_xfer_release);
> +
> +			kfree(reqs[i].buf);
> +			reqs[i].buf = NULL; /* prevent free by virtio_i2c_xfer_release */
> +
>  			break;
>  		}
>  	}
>  
> +	xfer->num = i;
>  	return i;
>  }
>  
> -static int virtio_i2c_complete_reqs(struct virtqueue *vq,
> -				    struct virtio_i2c_req *reqs,
> -				    struct i2c_msg *msgs, int num)
> +static int virtio_i2c_complete_xfer(struct virtio_i2c_xfer *xfer,
> +				    struct i2c_msg *msgs)
>  {
> -	bool failed = false;
> -	int i, j = 0;
> +	struct virtio_i2c_req *reqs = xfer->reqs;
> +	int i, fail_index = -1;
>  
> -	for (i = 0; i < num; i++) {
> +	for (i = 0; i < xfer->num; i++) {
>  		struct virtio_i2c_req *req = &reqs[i];
> +		struct i2c_msg *msg = &msgs[i];
>  
> -		if (!failed) {
> -			if (wait_for_completion_interruptible(&req->completion))
> -				failed = true;
> -			else if (req->in_hdr.status != VIRTIO_I2C_MSG_OK)
> -				failed = true;
> -			else
> -				j++;
> +		if (wait_for_completion_interruptible(&req->completion))
> +			return -EINTR;
> +
> +		if (req->in_hdr.status != VIRTIO_I2C_MSG_OK) {
> +			/* Don't break yet. Try to wait until all requests complete. */
> +			if (fail_index < 0)
> +				fail_index = i;
>  		}
>  
> -		i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], !failed);
> +		if (fail_index < 0 && (msg->flags & I2C_M_RD))
> +			memcpy(msg->buf, req->buf, msg->len);
> +
> +		kfree(req->buf);
> +		req->buf = NULL; /* prevent free by virtio_i2c_xfer_release */
>  	}
>  
> -	return j;
> +	/* Return number of successful transactions */
> +	return fail_index >= 0 ? fail_index : xfer->num;
>  }
>  
>  static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
> @@ -136,14 +190,16 @@ static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
>  {
>  	struct virtio_i2c *vi = i2c_get_adapdata(adap);
>  	struct virtqueue *vq = vi->vq;
> -	struct virtio_i2c_req *reqs;
> +	struct virtio_i2c_xfer *xfer;
>  	int count;
>  
> -	reqs = kzalloc_objs(*reqs, num);
> -	if (!reqs)
> +	xfer = kzalloc(struct_size(xfer, reqs, num), GFP_KERNEL);
> +	if (!xfer)
>  		return -ENOMEM;
>  
> -	count = virtio_i2c_prepare_reqs(vq, reqs, msgs, num);
> +	kref_init(&xfer->ref);
> +
> +	count = virtio_i2c_prepare_xfer(vq, xfer, msgs, num);
>  	if (!count)
>  		goto err_free;
>  
> @@ -157,10 +213,10 @@ static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
>  	 */
>  	virtqueue_kick(vq);
>  
> -	count = virtio_i2c_complete_reqs(vq, reqs, msgs, count);
> +	count = virtio_i2c_complete_xfer(xfer, msgs);
>  
>  err_free:
> -	kfree(reqs);
> +	kref_put(&xfer->ref, virtio_i2c_xfer_release);
>  	return count;
>  }
>  
> -- 
> 2.54.0
> 


^ permalink raw reply

* Re: [PATCH v3 4/4] scsi: use percpu counters for iostat counters in struct scsi_device
From: Hannes Reinecke @ 2026-06-10  6:21 UTC (permalink / raw)
  To: Sumit Saxena, Martin K . Petersen, Jens Axboe
  Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, John Garry
In-Reply-To: <20260609121806.2121755-5-sumit.saxena@broadcom.com>

On 6/9/26 14:18, Sumit Saxena wrote:
> iorequest_cnt and iodone_cnt are updated on every command dispatch and
> completion, often from different CPUs on high queue depth workloads.
> Using adjacent atomic_t fields causes cache line contention between the
> submission and completion paths.
> 
> Extend the same treatment to ioerr_cnt and iotmo_cnt so all four iostat
> counters in struct scsi_device use struct percpu_counter.
> 
> Suggested-by: John Garry <john.g.garry@oracle.com>
> Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
> ---
>   drivers/scsi/scsi_error.c  |  4 ++--
>   drivers/scsi/scsi_lib.c    | 10 +++++-----
>   drivers/scsi/scsi_scan.c   |  8 ++++++++
>   drivers/scsi/scsi_sysfs.c  | 23 ++++++++++++++---------
>   drivers/scsi/sd.c          |  2 +-
>   include/scsi/scsi_device.h |  9 +++++----
>   6 files changed, 35 insertions(+), 21 deletions(-)
> 
Good idea.

Reviewed-by: Hannes Reinecke <hare@kernel.org>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply

* Re: [PATCH v3 3/4] block: drop shared-tag fairness throttling
From: Hannes Reinecke @ 2026-06-10  6:18 UTC (permalink / raw)
  To: Sumit Saxena, Martin K . Petersen, Jens Axboe
  Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, Bart Van Assche
In-Reply-To: <20260609121806.2121755-4-sumit.saxena@broadcom.com>

On 6/9/26 14:18, Sumit Saxena wrote:
> From: Bart Van Assche <bvanassche@acm.org>
> 
> Original patch [1] by Bart Van Assche; this version is rebased onto the
> current tree.  In testing it improves IOPS by roughly 16-18% by removing
> the fair-sharing throttle on shared tag queues.
> 
> This patch removes the following code and structure members:
> - The function hctx_may_queue().
> - blk_mq_hw_ctx.nr_active and request_queue.nr_active_requests_shared_tags
>    and also all the code that modifies these two member variables.
> 
> [1]: https://lore.kernel.org/linux-block/20240529213921.3166462-1-bvanassche@acm.org/
> 
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
> ---
>   block/blk-core.c       |   2 -
>   block/blk-mq-debugfs.c |  22 ++++++++-
>   block/blk-mq-tag.c     |   4 --
>   block/blk-mq.c         |  17 +------
>   block/blk-mq.h         | 100 -----------------------------------------
>   include/linux/blk-mq.h |   6 ---
>   include/linux/blkdev.h |   2 -
>   7 files changed, 22 insertions(+), 131 deletions(-)
> 
What tests did you perform?
I'm pretty sure you see an improvement when having just a few drives,
but what about having a lot of them (ie tens of drives)?
The whole point of this was to increase fairness between drives, so
of course removing it will make an individual drive going faster ...

Maybe it's an idea to move the fairness algorithm into an I/O scheduler;
that way we can keep the original behaviour yet get the performance
increase if people want it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply

* Re: [PATCH v3 3/4] block: drop shared-tag fairness throttling
From: Christoph Hellwig @ 2026-06-10  6:14 UTC (permalink / raw)
  To: Sumit Saxena
  Cc: Martin K . Petersen, Jens Axboe, James E . J . Bottomley,
	linux-scsi, linux-block, Adam Radford, Khalid Aziz,
	Adaptec OEM Raid Solutions, Matthew Wilcox, Hannes Reinecke,
	Juergen E . Fischer, Russell King, linux-arm-kernel, Finn Thain,
	Michael Schmitz, Anil Gurumurthy, Sudarsana Kalluru,
	Oliver Neukum, Ali Akcaagac, Jamie Lenehan, Ram Vegesna,
	target-devel, Bradley Grove, Satish Kharat, Sesidhar Baddela,
	Karan Tilak Kumar, Yihang Li, Don Brace, storagedev,
	HighPoint Linux Team, Tyrel Datwyler, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy, linuxppc-dev,
	Brian King, Lee Duncan, Chris Leech, Mike Christie, open-iscsi,
	Justin Tee, Paul Ely, Kashyap Desai, Shivasharan S,
	Chandrakanth Patil, megaraidlinux.pdl, Sathya Prakash Veerichetty,
	Sreekanth Reddy, mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani,
	Ranjan Kumar, MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori,
	YOKOTA Hiroshi, Jack Wang, Geoff Levand, Michael Reed,
	Nilesh Javali, GR-QLogic-Storage-Upstream, Narsimhulu Musini,
	K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	linux-hyperv, Michael S . Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, Bart Van Assche
In-Reply-To: <20260609121806.2121755-4-sumit.saxena@broadcom.com>

Just dropping the fairness was rejected before and there is no
explanation here on why any of that has changed.

On Tue, Jun 09, 2026 at 05:48:02PM +0530, Sumit Saxena wrote:
> From: Bart Van Assche <bvanassche@acm.org>
> 
> Original patch [1] by Bart Van Assche; this version is rebased onto the
> current tree.  In testing it improves IOPS by roughly 16-18% by removing
> the fair-sharing throttle on shared tag queues.
> 
> This patch removes the following code and structure members:
> - The function hctx_may_queue().
> - blk_mq_hw_ctx.nr_active and request_queue.nr_active_requests_shared_tags
>   and also all the code that modifies these two member variables.

.. and besides that, this commit message is still entirely useless
as it doesn't explain any of the thoughts of why this change is safe
and desirable.  While the mechanics above are totally obvious from
the code change itself.


^ permalink raw reply

* Re: [PATCH v3] i2c: virtio: retain xfer with kref to fix UAF on interrupted wait
From: Viresh Kumar @ 2026-06-10  6:03 UTC (permalink / raw)
  To: Gavin Li
  Cc: linux-i2c, Chen, Jian Jun, andi.shyti, virtualization,
	Wolfram Sang, Michael S. Tsirkin, Jason Wang
In-Reply-To: <20260609134358.36583-1-gavin.li@samsara.com>

Ccing I2C/Virtio maintainers.

On 09-06-26, 09:43, Gavin Li wrote:
> Additionally, force usage of a bounce buffer even if the i2c_msg buf is
> DMA-safe, since the buffer passed to the virtqueue must remain valid
> even if the transfer is interrupted. Remove usage of
> i2c_get_dma_safe_msg_buf() since it may pass through msg->buf directly.
> This results in an extra allocation+copy when I2C_M_DMA_SAFE is used.

Please always create separate patches for such changes. It would be better to
not fix multiple issues in the same patch. That would also allow for a clean
revert of the change, just in case.

> @@ -82,9 +116,16 @@ static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
>  		sgs[outcnt++] = &out_hdr;
>  
>  		if (msgs[i].len) {
> -			reqs[i].buf = i2c_get_dma_safe_msg_buf(&msgs[i], 1);
> +			/*
> +			 * Even if msg->flags has I2C_M_DMA_SAFE set, a bounce
> +			 * buffer is required because the transfer may be
> +			 * interrupted, after which msg->buf is no longer valid.
> +			 */
> +			reqs[i].buf = kzalloc(msgs[i].len, GFP_KERNEL);
>  			if (!reqs[i].buf)
>  				break;
> +			if (!(msgs[i].flags & I2C_M_RD))
> +				memcpy(reqs[i].buf, msgs[i].buf, msgs[i].len);
>  
>  			sg_init_one(&msg_buf, reqs[i].buf, msgs[i].len);

I don't think this is acceptable, bringing the performance down because the
remote device may misbehave.

Maybe its time for I2C and Virtio maintainers to provide some feedback here.

-- 
viresh

^ permalink raw reply

* Re: [PATCH v3 1/4] scsi: scan: allocate sdev and starget on the NUMA node of the host adapter
From: Hannes Reinecke @ 2026-06-10  6:00 UTC (permalink / raw)
  To: Sumit Saxena, Martin K . Petersen, Jens Axboe
  Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Juergen E . Fischer, Russell King, linux-arm-kernel, Finn Thain,
	Michael Schmitz, Anil Gurumurthy, Sudarsana Kalluru,
	Oliver Neukum, Ali Akcaagac, Jamie Lenehan, Ram Vegesna,
	target-devel, Bradley Grove, Satish Kharat, Sesidhar Baddela,
	Karan Tilak Kumar, Yihang Li, Don Brace, storagedev,
	HighPoint Linux Team, Tyrel Datwyler, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy, linuxppc-dev,
	Brian King, Lee Duncan, Chris Leech, Mike Christie, open-iscsi,
	Justin Tee, Paul Ely, Kashyap Desai, Shivasharan S,
	Chandrakanth Patil, megaraidlinux.pdl, Sathya Prakash Veerichetty,
	Sreekanth Reddy, mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani,
	Ranjan Kumar, MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori,
	YOKOTA Hiroshi, Jack Wang, Geoff Levand, Michael Reed,
	Nilesh Javali, GR-QLogic-Storage-Upstream, Narsimhulu Musini,
	K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	linux-hyperv, Michael S . Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, James Rizzo
In-Reply-To: <20260609121806.2121755-2-sumit.saxena@broadcom.com>

On 6/9/26 14:18, Sumit Saxena wrote:
> From: James Rizzo <james.rizzo@broadcom.com>
> 
> When a host adapter is attached to a specific NUMA node, allocating
> scsi_device and scsi_target via kzalloc() may place them on a remote
> node.  All hot-path I/O accesses to these structures then cross the NUMA
> interconnect, adding latency and consuming inter-node bandwidth.
> 
> Use kzalloc_node() with dev_to_node(shost->dma_dev) so allocations land
> on the same node as the HBA, reducing cross-node traffic and improving
> I/O performance on NUMA systems.
> 
> Signed-off-by: James Rizzo <james.rizzo@broadcom.com>
> Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
> ---
>   drivers/scsi/scsi_scan.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@kernel.org>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.com                               +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply

* Re: [PATCH v3 2/4] scsi: host: allocate struct Scsi_Host on the NUMA node of the host adapter
From: Hannes Reinecke @ 2026-06-10  5:59 UTC (permalink / raw)
  To: Sumit Saxena, Martin K . Petersen, Jens Axboe
  Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Juergen E . Fischer, Russell King, linux-arm-kernel, Finn Thain,
	Michael Schmitz, Anil Gurumurthy, Sudarsana Kalluru,
	Oliver Neukum, Ali Akcaagac, Jamie Lenehan, Ram Vegesna,
	target-devel, Bradley Grove, Satish Kharat, Sesidhar Baddela,
	Karan Tilak Kumar, Yihang Li, Don Brace, storagedev,
	HighPoint Linux Team, Tyrel Datwyler, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy, linuxppc-dev,
	Brian King, Lee Duncan, Chris Leech, Mike Christie, open-iscsi,
	Justin Tee, Paul Ely, Kashyap Desai, Shivasharan S,
	Chandrakanth Patil, megaraidlinux.pdl, Sathya Prakash Veerichetty,
	Sreekanth Reddy, mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani,
	Ranjan Kumar, MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori,
	YOKOTA Hiroshi, Jack Wang, Geoff Levand, Michael Reed,
	Nilesh Javali, GR-QLogic-Storage-Upstream, Narsimhulu Musini,
	K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	linux-hyperv, Michael S . Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, John Garry
In-Reply-To: <20260609121806.2121755-3-sumit.saxena@broadcom.com>

On 6/9/26 14:18, Sumit Saxena wrote:
> scsi_host_alloc() used kzalloc(), which always picks an arbitrary node.
> Extend the function to accept a 'struct device *dev' parameter and use
> kzalloc_node() with dev_to_node(dev) so the Scsi_Host struct lands on
> the same NUMA node as the HBA, mirroring the treatment already applied
> to struct scsi_device, struct scsi_target, and shost_data.
> 
> When dev is NULL (legacy ISA/platform drivers without a dma_dev) the
> allocation falls back to NUMA_NO_NODE, preserving existing behaviour.
> 
> Update all in-tree callers:
>    - PCI-based HBA drivers pass &pdev->dev (or the equivalent struct
>      member such as &phba->pcidev->dev, &h->pdev->dev, &ha->pdev->dev)
>      so their host struct is placed on the adapter's node.
>    - Non-PCI drivers (ISA, Amiga, ARM PCMCIA, virtio, Hyper-V, PS3, …)
>      pass NULL.
>    - libfc's libfc_host_alloc() inline helper passes NULL; FC drivers
>      that want NUMA awareness can open-code the call with their pdev.
> 
> Suggested-by: John Garry <john.g.garry@oracle.com>
> Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
> ---
>   drivers/scsi/3w-9xxx.c                    | 2 +-
>   drivers/scsi/3w-sas.c                     | 2 +-
>   drivers/scsi/3w-xxxx.c                    | 2 +-
>   drivers/scsi/53c700.c                     | 2 +-
>   drivers/scsi/BusLogic.c                   | 2 +-
>   drivers/scsi/a100u2w.c                    | 2 +-
>   drivers/scsi/a2091.c                      | 2 +-
>   drivers/scsi/a3000.c                      | 2 +-
>   drivers/scsi/aacraid/linit.c              | 2 +-
>   drivers/scsi/advansys.c                   | 6 +++---
>   drivers/scsi/aha152x.c                    | 2 +-
>   drivers/scsi/aha1542.c                    | 2 +-
>   drivers/scsi/aha1740.c                    | 2 +-
>   drivers/scsi/aic7xxx/aic79xx_osm.c        | 2 +-
>   drivers/scsi/aic7xxx/aic7xxx_osm.c        | 2 +-
>   drivers/scsi/aic94xx/aic94xx_init.c       | 2 +-
>   drivers/scsi/am53c974.c                   | 2 +-
>   drivers/scsi/arcmsr/arcmsr_hba.c          | 3 ++-
>   drivers/scsi/arm/acornscsi.c              | 2 +-
>   drivers/scsi/arm/arxescsi.c               | 2 +-
>   drivers/scsi/arm/cumana_1.c               | 2 +-
>   drivers/scsi/arm/cumana_2.c               | 2 +-
>   drivers/scsi/arm/eesox.c                  | 2 +-
>   drivers/scsi/arm/oak.c                    | 2 +-
>   drivers/scsi/arm/powertec.c               | 2 +-
>   drivers/scsi/atari_scsi.c                 | 2 +-
>   drivers/scsi/atp870u.c                    | 2 +-
>   drivers/scsi/bfa/bfad_im.c                | 2 +-
>   drivers/scsi/csiostor/csio_init.c         | 4 ++--
>   drivers/scsi/dc395x.c                     | 2 +-
>   drivers/scsi/dmx3191d.c                   | 2 +-
>   drivers/scsi/elx/efct/efct_xport.c        | 4 ++--
>   drivers/scsi/esas2r/esas2r_main.c         | 2 +-
>   drivers/scsi/fdomain.c                    | 2 +-
>   drivers/scsi/fnic/fnic_main.c             | 2 +-
>   drivers/scsi/g_NCR5380.c                  | 2 +-
>   drivers/scsi/gvp11.c                      | 2 +-
>   drivers/scsi/hisi_sas/hisi_sas_main.c     | 2 +-
>   drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    | 2 +-
>   drivers/scsi/hosts.c                      | 6 ++++--
>   drivers/scsi/hpsa.c                       | 2 +-
>   drivers/scsi/hptiop.c                     | 2 +-
>   drivers/scsi/ibmvscsi/ibmvfc.c            | 2 +-
>   drivers/scsi/ibmvscsi/ibmvscsi.c          | 2 +-
>   drivers/scsi/imm.c                        | 2 +-
>   drivers/scsi/initio.c                     | 2 +-
>   drivers/scsi/ipr.c                        | 2 +-
>   drivers/scsi/ips.c                        | 2 +-
>   drivers/scsi/isci/init.c                  | 2 +-
>   drivers/scsi/jazz_esp.c                   | 2 +-
>   drivers/scsi/libiscsi.c                   | 2 +-
>   drivers/scsi/lpfc/lpfc_init.c             | 2 +-
>   drivers/scsi/mac53c94.c                   | 2 +-
>   drivers/scsi/mac_esp.c                    | 2 +-
>   drivers/scsi/mac_scsi.c                   | 2 +-
>   drivers/scsi/megaraid.c                   | 2 +-
>   drivers/scsi/megaraid/megaraid_mbox.c     | 2 +-
>   drivers/scsi/megaraid/megaraid_sas_base.c | 2 +-
>   drivers/scsi/mesh.c                       | 2 +-
>   drivers/scsi/mpi3mr/mpi3mr_os.c           | 2 +-
>   drivers/scsi/mpt3sas/mpt3sas_scsih.c      | 4 ++--
>   drivers/scsi/mvme147.c                    | 2 +-
>   drivers/scsi/mvsas/mv_init.c              | 2 +-
>   drivers/scsi/mvumi.c                      | 2 +-
>   drivers/scsi/myrb.c                       | 2 +-
>   drivers/scsi/myrs.c                       | 2 +-
>   drivers/scsi/ncr53c8xx.c                  | 2 +-
>   drivers/scsi/nsp32.c                      | 2 +-
>   drivers/scsi/pcmcia/nsp_cs.c              | 2 +-
>   drivers/scsi/pcmcia/qlogic_stub.c         | 2 +-
>   drivers/scsi/pcmcia/sym53c500_cs.c        | 2 +-
>   drivers/scsi/pm8001/pm8001_init.c         | 2 +-
>   drivers/scsi/pmcraid.c                    | 2 +-
>   drivers/scsi/ppa.c                        | 2 +-
>   drivers/scsi/ps3rom.c                     | 2 +-
>   drivers/scsi/qla1280.c                    | 2 +-
>   drivers/scsi/qla2xxx/qla_mid.c            | 2 +-
>   drivers/scsi/qla2xxx/qla_os.c             | 2 +-
>   drivers/scsi/qlogicfas.c                  | 2 +-
>   drivers/scsi/qlogicpti.c                  | 2 +-
>   drivers/scsi/scsi_debug.c                 | 2 +-
>   drivers/scsi/sgiwd93.c                    | 2 +-
>   drivers/scsi/smartpqi/smartpqi_init.c     | 2 +-
>   drivers/scsi/snic/snic_main.c             | 2 +-
>   drivers/scsi/stex.c                       | 2 +-
>   drivers/scsi/storvsc_drv.c                | 2 +-
>   drivers/scsi/sun3_scsi.c                  | 2 +-
>   drivers/scsi/sun3x_esp.c                  | 2 +-
>   drivers/scsi/sun_esp.c                    | 2 +-
>   drivers/scsi/sym53c8xx_2/sym_glue.c       | 2 +-
>   drivers/scsi/virtio_scsi.c                | 2 +-
>   drivers/scsi/vmw_pvscsi.c                 | 2 +-
>   drivers/scsi/wd719x.c                     | 2 +-
>   drivers/scsi/xen-scsifront.c              | 2 +-
>   drivers/scsi/zorro_esp.c                  | 2 +-
>   include/scsi/libfc.h                      | 2 +-
>   include/scsi/scsi_host.h                  | 3 ++-
>   97 files changed, 107 insertions(+), 103 deletions(-)
> 
Quite a lot of churn for such a (relatively) simple change.

I think it might be better to introduce a new function
(scsi_host_alloc_node() ?) with the additional parameter,
and make scsi_host_alloc() a wrapper around that.

That will reduce the size of this patch immensely.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.com                               +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply

* Re: [PATCH] [v3] vduse: fix compat handling for VDUSE_IOTLB_GET_FD/VDUSE_VQ_GET_INFO
From: Michael S. Tsirkin @ 2026-06-10  5:26 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jason Wang, Xie Yongji, Arnd Bergmann, Eugenio Pérez,
	Xuan Zhuo, Marco Crivellari, Anders Roxell, virtualization,
	linux-kernel
In-Reply-To: <20260213154051.4172275-1-arnd@kernel.org>

On Fri, Feb 13, 2026 at 04:40:46PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> These two ioctls are incompatible on 32-bit x86 userspace, because
> the data structures are shorter than they are on 64-bit.
> 
> Add a proper .compat_ioctl handler for x86 that reads the structures
> with the smaller padding before calling the internal handlers. On
> all other architectures, CONFIG_COMPAT_FOR_U64_ALIGNMENT is disabled
> and no special handling is required.
> 
> Fixes: ad146355bfad ("vduse: Support querying information of IOVA regions")
> Fixes: c8a6153b6c59 ("vduse: Introduce VDUSE - vDPA Device in Userspace")
> Acked-by: Eugenio Pérez <eperezma@redhat.com>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> v3 changes:
>  - check CONFIG_COMPAT_FOR_U64_ALIGNMENT in preprocessor
> v2 changes:
>  - split compat handler into separate function
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 123 ++++++++++++++++++++++++++++-
>  1 file changed, 122 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 405d59610f76..e0f5a7397221 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1618,6 +1618,127 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>  	return ret;
>  }
>  
> +#ifdef CONFIG_COMPAT_FOR_U64_ALIGNMENT
> +/*
> + * i386 has different alignment constraints than x86_64,
> + * so there are only 3 bytes of padding instead of 7.
> + */
> +struct compat_vduse_iotlb_entry {
> +	compat_u64 offset;
> +	compat_u64 start;
> +	compat_u64 last;
> +	__u8 perm;
> +	__u8 padding[3];
> +};
> +#define COMPAT_VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x10, struct compat_vduse_iotlb_entry)
> +
> +struct compat_vduse_vq_info {
> +	__u32 index;
> +	__u32 num;
> +	compat_u64 desc_addr;
> +	compat_u64 driver_addr;
> +	compat_u64 device_addr;
> +	union {
> +		struct vduse_vq_state_split split;
> +		struct vduse_vq_state_packed packed;
> +	};
> +	__u8 ready;
> +	__u8 padding[3];
> +} __uapi_arch_align;

what is this __uapi_arch_align supposed to be doing?

It compiles by luck because gcc thinks it's a global variable.


> +#define COMPAT_VDUSE_VQ_GET_INFO	_IOWR(VDUSE_BASE, 0x15, struct compat_vduse_vq_info)
> +
> +static long vduse_dev_compat_ioctl(struct file *file, unsigned int cmd,
> +				   unsigned long arg)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	void __user *argp = (void __user *)arg;
> +	int ret;
> +
> +	if (unlikely(dev->broken))
> +		return -EPERM;
> +
> +	switch (cmd) {
> +	case COMPAT_VDUSE_IOTLB_GET_FD: {
> +		struct vduse_iotlb_entry_v2 entry = {0};
> +		struct file *f = NULL;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&entry, argp, _IOC_SIZE(cmd)))
> +			break;
> +
> +		ret = vduse_dev_iotlb_entry(dev, &entry, &f, NULL);
> +		if (ret)
> +			break;
> +
> +		ret = -EINVAL;
> +		if (!f)
> +			break;
> +
> +		ret = copy_to_user(argp, &entry, _IOC_SIZE(cmd));
> +		if (ret) {
> +			ret = -EFAULT;
> +			fput(f);
> +			break;
> +		}
> +		ret = receive_fd(f, NULL, perm_to_file_flags(entry.perm));
> +		fput(f);
> +		break;
> +	}
> +	case COMPAT_VDUSE_VQ_GET_INFO: {
> +		struct vduse_vq_info vq_info = {};
> +		struct vduse_virtqueue *vq;
> +		u32 index;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&vq_info, argp,
> +				   sizeof(struct compat_vduse_vq_info)))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (vq_info.index >= dev->vq_num)
> +			break;
> +
> +		index = array_index_nospec(vq_info.index, dev->vq_num);
> +		vq = dev->vqs[index];
> +		vq_info.desc_addr = vq->desc_addr;
> +		vq_info.driver_addr = vq->driver_addr;
> +		vq_info.device_addr = vq->device_addr;
> +		vq_info.num = vq->num;
> +
> +		if (dev->driver_features & BIT_ULL(VIRTIO_F_RING_PACKED)) {
> +			vq_info.packed.last_avail_counter =
> +				vq->state.packed.last_avail_counter;
> +			vq_info.packed.last_avail_idx =
> +				vq->state.packed.last_avail_idx;
> +			vq_info.packed.last_used_counter =
> +				vq->state.packed.last_used_counter;
> +			vq_info.packed.last_used_idx =
> +				vq->state.packed.last_used_idx;
> +		} else
> +			vq_info.split.avail_index =
> +				vq->state.split.avail_index;
> +
> +		vq_info.ready = vq->ready;
> +
> +		ret = -EFAULT;
> +		if (copy_to_user(argp, &vq_info,
> +		    sizeof(struct compat_vduse_vq_info)))
> +			break;
> +
> +		ret = 0;
> +		break;
> +	}
> +	default:
> +		ret = -ENOIOCTLCMD;
> +		break;
> +	}
> +
> +	return vduse_dev_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
> +}
> +#else
> +#define vduse_dev_compat_ioctl compat_ptr_ioctl
> +#endif
> +
>  static int vduse_dev_release(struct inode *inode, struct file *file)
>  {
>  	struct vduse_dev *dev = file->private_data;
> @@ -1678,7 +1799,7 @@ static const struct file_operations vduse_dev_fops = {
>  	.write_iter	= vduse_dev_write_iter,
>  	.poll		= vduse_dev_poll,
>  	.unlocked_ioctl	= vduse_dev_ioctl,
> -	.compat_ioctl	= compat_ptr_ioctl,
> +	.compat_ioctl	= vduse_dev_compat_ioctl,
>  	.llseek		= noop_llseek,
>  };
>  
> -- 
> 2.39.5


^ permalink raw reply

* Re: [PATCH splitout] mm: page_reporting: allow driver to set batch capacity
From: Michael S. Tsirkin @ 2026-06-09 22:00 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-kernel, Miaohe Lin, David Hildenbrand (Arm), Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi,
	Alexander Duyck
In-Reply-To: <aiiJPwhn6gF3ILur@gourry-fedora-PF4VCD3F>

On Tue, Jun 09, 2026 at 05:44:31PM -0400, Gregory Price wrote:
> On Tue, Jun 09, 2026 at 04:08:08PM -0400, Michael S. Tsirkin wrote:
> > On Tue, Jun 09, 2026 at 01:44:37PM -0400, Gregory Price wrote:
> > > On Tue, Jun 09, 2026 at 11:53:20AM -0400, Michael S. Tsirkin wrote:
> > > > --- a/mm/page_reporting.c
> > > > +++ b/mm/page_reporting.c
> > > > @@ -174,10 +174,10 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
> > > >  	 * list processed. This should result in us reporting all pages on
> > > >  	 * an idle system in about 30 seconds.
> > > >  	 *
> > > > -	 * The division here should be cheap since PAGE_REPORTING_CAPACITY
> > > > -	 * should always be a power of 2.
> > > > +	 * The division here uses integer division; capacity need
> > > > +	 * not be a power of 2.
> > > >  	 */
> > > > -	budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
> > > > +	budget = DIV_ROUND_UP(area->nr_free, prdev->capacity * 16);
> > > > 
> > > 
> > > Initial look - is there a div-by-0 here?  I noticed the old check
> > > prevents this from being (0 * 16), but i don't see (on first pass)
> > > the same check anywhere.
> > > 
> > > Unless this line below always forces the above to be a
> > > PAGE_REPORTING_CAPCAITY if it's set to 0.
> > 
> > It does, does it not?
> > 
> > > > +	if (!prdev->capacity || prdev->capacity > PAGE_REPORTING_CAPACITY)
> > > > +		prdev->capacity = PAGE_REPORTING_CAPACITY;
> > > > +
> > > 
> > > It's worth making this corner condition a little more obvious.
> > > 
> > > The code intends for 
> > > 
> > > if (capacity == 0)
> > >   capacity = PAGE_REPORTING_CAPACITY
> > > 
> > > but that's not reflected in the changelog as a default value.
> > > 
> > > When happens if a driver sets (capacity=0) either on purpose (???)
> > 
> > what would the purpose be? if you don't want reporting do not register.
> > 
> > > or
> > > because there's a bug (???)
> > 
> > exactly ??? since where are we practicing defensive programming in kernel
> > APIs?
> >
> > > and then page_reporting.c forces it up to
> > > 32?
> > > 
> > > There's something to improve here.
> > > 
> > > ~Gregory
> > 
> > 
> > So I'll update the commit log to mention PAGE_REPORTING_CAPACITY.
> > And maybe a comment near capacity field?
> > Should be enough?
> 
> I suppose the question is whether capacity=0 should cause a WARN (i.e.
> only a bug explains that value), or if capacity=0 means something
> special (i.e. use the default) and therefore that should be documented.
> I don't know which of these is the case, but if it's the latter than
> that deserves a comment yes.
> 
> ~Gregory

It's the default. Will document.


^ permalink raw reply

* Re: [PATCH splitout] mm: page_reporting: allow driver to set batch capacity
From: Gregory Price @ 2026-06-09 21:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Miaohe Lin, David Hildenbrand (Arm), Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi,
	Alexander Duyck
In-Reply-To: <20260609160506-mutt-send-email-mst@kernel.org>

On Tue, Jun 09, 2026 at 04:08:08PM -0400, Michael S. Tsirkin wrote:
> On Tue, Jun 09, 2026 at 01:44:37PM -0400, Gregory Price wrote:
> > On Tue, Jun 09, 2026 at 11:53:20AM -0400, Michael S. Tsirkin wrote:
> > > --- a/mm/page_reporting.c
> > > +++ b/mm/page_reporting.c
> > > @@ -174,10 +174,10 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
> > >  	 * list processed. This should result in us reporting all pages on
> > >  	 * an idle system in about 30 seconds.
> > >  	 *
> > > -	 * The division here should be cheap since PAGE_REPORTING_CAPACITY
> > > -	 * should always be a power of 2.
> > > +	 * The division here uses integer division; capacity need
> > > +	 * not be a power of 2.
> > >  	 */
> > > -	budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
> > > +	budget = DIV_ROUND_UP(area->nr_free, prdev->capacity * 16);
> > > 
> > 
> > Initial look - is there a div-by-0 here?  I noticed the old check
> > prevents this from being (0 * 16), but i don't see (on first pass)
> > the same check anywhere.
> > 
> > Unless this line below always forces the above to be a
> > PAGE_REPORTING_CAPCAITY if it's set to 0.
> 
> It does, does it not?
> 
> > > +	if (!prdev->capacity || prdev->capacity > PAGE_REPORTING_CAPACITY)
> > > +		prdev->capacity = PAGE_REPORTING_CAPACITY;
> > > +
> > 
> > It's worth making this corner condition a little more obvious.
> > 
> > The code intends for 
> > 
> > if (capacity == 0)
> >   capacity = PAGE_REPORTING_CAPACITY
> > 
> > but that's not reflected in the changelog as a default value.
> > 
> > When happens if a driver sets (capacity=0) either on purpose (???)
> 
> what would the purpose be? if you don't want reporting do not register.
> 
> > or
> > because there's a bug (???)
> 
> exactly ??? since where are we practicing defensive programming in kernel
> APIs?
>
> > and then page_reporting.c forces it up to
> > 32?
> > 
> > There's something to improve here.
> > 
> > ~Gregory
> 
> 
> So I'll update the commit log to mention PAGE_REPORTING_CAPACITY.
> And maybe a comment near capacity field?
> Should be enough?

I suppose the question is whether capacity=0 should cause a WARN (i.e.
only a bug explains that value), or if capacity=0 means something
special (i.e. use the default) and therefore that should be documented.

I don't know which of these is the case, but if it's the latter than
that deserves a comment yes.

~Gregory

^ permalink raw reply

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Michael S. Tsirkin @ 2026-06-09 21:00 UTC (permalink / raw)
  To: Zi Yan
  Cc: Miaohe Lin, David Hildenbrand (Arm), Andrew Morton, linux-kernel,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <4BA276D9-9EB9-4E2A-8A05-657ACACFF227@nvidia.com>

On Tue, Jun 09, 2026 at 04:54:01PM -0400, Zi Yan wrote:
> On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
> 
> > On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
> >> On 9 Jun 2026, at 14:39, Zi Yan wrote:
> >>
> >>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
> >>>
> >>>> On 6/9/26 20:10, Andrew Morton wrote:
> >>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>>>
> >>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
> >>>>>> update to page->flags can race with non-atomic flag operations
> >>>>>> that run under zone->lock in the buddy allocator.
> >>>>>>
> >>>>>> In particular, __free_pages_prepare() does:
> >>>>>>
> >>>>>>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>>>>>
> >>>>>> This non-atomic read-modify-write, while correctly excluding
> >>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
> >>>>>> TestSetPageHWPoison if the read happens before the poison bit
> >>>>>> is set and the write happens after.  Will only get worse if/when
> >>>>>> we add more non-atomic flag operations.
> >>>>>>
> >>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
> >>>>>> around ClearPageHWPoison in the retry path.  This
> >>>>>> serializes with all buddy flag manipulation.  The cost is
> >>>>>> negligible: one lock/unlock in an extremely rare path
> >>>>>> (hardware memory errors).
> >>>>>>
> >>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> >>>>>> in this file operate on pages already removed from the buddy
> >>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> >>>>>> need zone->lock protection.
> >>>>>
> >>>>> Sashiko is saying this doesn't do anything "Because
> >>>>> __free_pages_prepare() executes entirely locklessly".  Did it goof?
> >>>>>
> >>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
> >>>>
> >>>> Battle of the bots: it's right.
> >>>
> >>> Yep, __free_pages_prepare() changes the page flag without holding
> >>> zone->lock.
> >>
> >> __free_pages_prepare() works on frozen pages and assumes no one else
> >> touches the input page. To avoid this race, memory_failure() might
> >> want to try_get_page() before TestClearPageHWPoison(), but I am not
> >> sure if that works along with memory failure flow.
> >>
> >> Best Regards,
> >> Yan, Zi
> >
> >
> >
> > Actually memory failure already plays with this down the road no?
> >
> > So maybe it's enough to just SetPageHWPoison afterwards again?
> >
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index ee42d4361309..4758fea94a96 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
> >  	if (!res) {
> >  		if (is_free_buddy_page(p)) {
> >  			if (take_page_off_buddy(p)) {
> > +				SetPageHWPoison(p);
> >  				page_ref_inc(p);
> >  				res = MF_RECOVERED;
> >  			} else {
> >
> >
> > and maybe in a bunch of other places in there?
> 
> You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
> just set it again here?

Yea.

> Why not do it after get_hwpoison_page(), since that
> is the expected page flag?

It's still in the buddy at that point right? I'm worried buddy might
poke at flags.

> Miaohe probably can give a better answer here.
> 
> 
> Best Regards,
> Yan, Zi


^ permalink raw reply

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Zi Yan @ 2026-06-09 20:54 UTC (permalink / raw)
  To: Michael S. Tsirkin, Miaohe Lin
  Cc: David Hildenbrand (Arm), Andrew Morton, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <20260609162437-mutt-send-email-mst@kernel.org>

On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:

> On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
>> On 9 Jun 2026, at 14:39, Zi Yan wrote:
>>
>>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
>>>
>>>> On 6/9/26 20:10, Andrew Morton wrote:
>>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>
>>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
>>>>>> update to page->flags can race with non-atomic flag operations
>>>>>> that run under zone->lock in the buddy allocator.
>>>>>>
>>>>>> In particular, __free_pages_prepare() does:
>>>>>>
>>>>>>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>>>>
>>>>>> This non-atomic read-modify-write, while correctly excluding
>>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
>>>>>> TestSetPageHWPoison if the read happens before the poison bit
>>>>>> is set and the write happens after.  Will only get worse if/when
>>>>>> we add more non-atomic flag operations.
>>>>>>
>>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
>>>>>> around ClearPageHWPoison in the retry path.  This
>>>>>> serializes with all buddy flag manipulation.  The cost is
>>>>>> negligible: one lock/unlock in an extremely rare path
>>>>>> (hardware memory errors).
>>>>>>
>>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
>>>>>> in this file operate on pages already removed from the buddy
>>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
>>>>>> need zone->lock protection.
>>>>>
>>>>> Sashiko is saying this doesn't do anything "Because
>>>>> __free_pages_prepare() executes entirely locklessly".  Did it goof?
>>>>>
>>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>>>>
>>>> Battle of the bots: it's right.
>>>
>>> Yep, __free_pages_prepare() changes the page flag without holding
>>> zone->lock.
>>
>> __free_pages_prepare() works on frozen pages and assumes no one else
>> touches the input page. To avoid this race, memory_failure() might
>> want to try_get_page() before TestClearPageHWPoison(), but I am not
>> sure if that works along with memory failure flow.
>>
>> Best Regards,
>> Yan, Zi
>
>
>
> Actually memory failure already plays with this down the road no?
>
> So maybe it's enough to just SetPageHWPoison afterwards again?
>
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index ee42d4361309..4758fea94a96 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
>  	if (!res) {
>  		if (is_free_buddy_page(p)) {
>  			if (take_page_off_buddy(p)) {
> +				SetPageHWPoison(p);
>  				page_ref_inc(p);
>  				res = MF_RECOVERED;
>  			} else {
>
>
> and maybe in a bunch of other places in there?

You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
just set it again here? Why not do it after get_hwpoison_page(), since that
is the expected page flag? Miaohe probably can give a better answer here.


Best Regards,
Yan, Zi

^ permalink raw reply

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Michael S. Tsirkin @ 2026-06-09 20:34 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand (Arm), Andrew Morton, linux-kernel, Miaohe Lin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <38C84F23-E881-4DB2-86BA-93F39D44AE1B@nvidia.com>

On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
> On 9 Jun 2026, at 14:39, Zi Yan wrote:
> 
> > On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
> >
> >> On 6/9/26 20:10, Andrew Morton wrote:
> >>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>
> >>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
> >>>> update to page->flags can race with non-atomic flag operations
> >>>> that run under zone->lock in the buddy allocator.
> >>>>
> >>>> In particular, __free_pages_prepare() does:
> >>>>
> >>>>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>>>
> >>>> This non-atomic read-modify-write, while correctly excluding
> >>>> __PG_HWPOISON from the mask, can still lose a concurrent
> >>>> TestSetPageHWPoison if the read happens before the poison bit
> >>>> is set and the write happens after.  Will only get worse if/when
> >>>> we add more non-atomic flag operations.
> >>>>
> >>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
> >>>> around ClearPageHWPoison in the retry path.  This
> >>>> serializes with all buddy flag manipulation.  The cost is
> >>>> negligible: one lock/unlock in an extremely rare path
> >>>> (hardware memory errors).
> >>>>
> >>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> >>>> in this file operate on pages already removed from the buddy
> >>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> >>>> need zone->lock protection.
> >>>
> >>> Sashiko is saying this doesn't do anything "Because
> >>> __free_pages_prepare() executes entirely locklessly".  Did it goof?
> >>>
> >>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
> >>
> >> Battle of the bots: it's right.
> >
> > Yep, __free_pages_prepare() changes the page flag without holding
> > zone->lock.
> 
> __free_pages_prepare() works on frozen pages and assumes no one else
> touches the input page. To avoid this race, memory_failure() might
> want to try_get_page() before TestClearPageHWPoison(), but I am not
> sure if that works along with memory failure flow.
> 
> Best Regards,
> Yan, Zi



Actually memory failure already plays with this down the road no?

So maybe it's enough to just SetPageHWPoison afterwards again?


diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d4361309..4758fea94a96 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (!res) {
 		if (is_free_buddy_page(p)) {
 			if (take_page_off_buddy(p)) {
+				SetPageHWPoison(p);
 				page_ref_inc(p);
 				res = MF_RECOVERED;
 			} else {


and maybe in a bunch of other places in there?

-- 
MST


^ permalink raw reply related

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Michael S. Tsirkin @ 2026-06-09 20:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Miaohe Lin, David Hildenbrand (Arm), Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <20260609111020.e88f51a7b6ebc37360d66fdc@linux-foundation.org>

On Tue, Jun 09, 2026 at 11:10:20AM -0700, Andrew Morton wrote:
> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > TestSetPageHWPoison() is called without zone->lock, so its atomic
> > update to page->flags can race with non-atomic flag operations
> > that run under zone->lock in the buddy allocator.
> > 
> > In particular, __free_pages_prepare() does:
> > 
> >     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > 
> > This non-atomic read-modify-write, while correctly excluding
> > __PG_HWPOISON from the mask, can still lose a concurrent
> > TestSetPageHWPoison if the read happens before the poison bit
> > is set and the write happens after.  Will only get worse if/when
> > we add more non-atomic flag operations.
> > 
> > Fix by acquiring zone->lock around TestSetPageHWPoison and
> > around ClearPageHWPoison in the retry path.  This
> > serializes with all buddy flag manipulation.  The cost is
> > negligible: one lock/unlock in an extremely rare path
> > (hardware memory errors).
> > 
> > Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> > in this file operate on pages already removed from the buddy
> > allocator or on non-buddy pages (DAX, hugetlb), so they do not
> > need zone->lock protection.
> 
> Sashiko is saying this doesn't do anything "Because
> __free_pages_prepare() executes entirely locklessly".  Did it goof?
> 
> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>

Oh. So it only helps with the prezero patches. Maybe other places where
flags are touched locklessly. Not __free_pages_prepare. I was too
focused on that. Scrap this please. I'll try to think of something.


-- 
MST


^ permalink raw reply

* Re: [PATCH splitout] mm: page_reporting: allow driver to set batch capacity
From: Michael S. Tsirkin @ 2026-06-09 20:08 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-kernel, Miaohe Lin, David Hildenbrand (Arm), Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi,
	Alexander Duyck
In-Reply-To: <aihRBTaTUmgYmZfX@gourry-fedora-PF4VCD3F>

On Tue, Jun 09, 2026 at 01:44:37PM -0400, Gregory Price wrote:
> On Tue, Jun 09, 2026 at 11:53:20AM -0400, Michael S. Tsirkin wrote:
> > --- a/mm/page_reporting.c
> > +++ b/mm/page_reporting.c
> > @@ -174,10 +174,10 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
> >  	 * list processed. This should result in us reporting all pages on
> >  	 * an idle system in about 30 seconds.
> >  	 *
> > -	 * The division here should be cheap since PAGE_REPORTING_CAPACITY
> > -	 * should always be a power of 2.
> > +	 * The division here uses integer division; capacity need
> > +	 * not be a power of 2.
> >  	 */
> > -	budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
> > +	budget = DIV_ROUND_UP(area->nr_free, prdev->capacity * 16);
> > 
> 
> Initial look - is there a div-by-0 here?  I noticed the old check
> prevents this from being (0 * 16), but i don't see (on first pass)
> the same check anywhere.
> 
> Unless this line below always forces the above to be a
> PAGE_REPORTING_CAPCAITY if it's set to 0.

It does, does it not?

> > +	if (!prdev->capacity || prdev->capacity > PAGE_REPORTING_CAPACITY)
> > +		prdev->capacity = PAGE_REPORTING_CAPACITY;
> > +
> 
> It's worth making this corner condition a little more obvious.
> 
> The code intends for 
> 
> if (capacity == 0)
>   capacity = PAGE_REPORTING_CAPACITY
> 
> but that's not reflected in the changelog as a default value.
> 
> When happens if a driver sets (capacity=0) either on purpose (???)

what would the purpose be? if you don't want reporting do not register.

> or
> because there's a bug (???)

exactly ??? since where are we practicing defensive programming in kernel
APIs?

> and then page_reporting.c forces it up to
> 32?
> 
> There's something to improve here.
> 
> ~Gregory


So I'll update the commit log to mention PAGE_REPORTING_CAPACITY.
And maybe a comment near capacity field?
Should be enough?

-- 
MST


^ permalink raw reply

* Re: [PATCH v4 02/47] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
From: Sean Christopherson @ 2026-06-09 19:28 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Dave Hansen, x86,
	Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260602034916.GGah5SvARd77mkvxe3@fat_crate.local>

On Mon, Jun 01, 2026, Borislav Petkov wrote:
> On Fri, May 29, 2026 at 07:43:49AM -0700, Sean Christopherson wrote:
> > +static int cpuid_get_tsc_info(struct cpuid_tsc_info *info)
> > +{
> > +	unsigned int ecx_hz, edx;
> > +
> > +	memset(info, 0, sizeof(*info));
> 
> Let's not clear this unnecessarily...
> 
> > +
> > +	if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
> > +		return -ENOENT;
> 
> ... just to return here...
> 
> > +
> > +	/* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
> > +	cpuid(CPUID_LEAF_TSC, &info->denominator, &info->numerator, &ecx_hz, &edx);
> > +
> > +	if (!info->denominator || !info->numerator)
> > +		return -ENOENT;
> 
> ... or here.
> 
> We wanna clear it here, when we'll return success.

Actually, if we take the approach of relying on the user to check the return
code, then there's no need to zero the struct since all fields will be explicitly
written, especially if we drop the "tsc_khz" field.  I was zeroing the field
purely as defense in depth.

^ permalink raw reply

* Re: [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Thomas Gleixner @ 2026-06-09 19:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <aihKj-0nP7bUbNHH@google.com>

On Tue, Jun 09 2026 at 10:17, Sean Christopherson wrote:

> On Fri, Jun 05, 2026, Thomas Gleixner wrote:
>> On Fri, Jun 05 2026 at 11:04, Sean Christopherson wrote:
>> But we also should have a check in the TSC init code somewhere which
>> validates that X86_FEATURE_CONSTANT_TSC is set when
>> X86_FEATURE_TSC_KNOWN_FREQ is set. X86_FEATURE_TSC_KNOWN_FREQ is useless
>> w/o X86_FEATURE_CONSTANT_TSC.
>
> Ugh, any objection to punting on this for now?  KVM and Xen guests will trigger
> TSC_KNOWN_FREQ without CONSTANT_TSC, thanks to commits:
>
>   e10f78050323 ("kvmclock: fix TSC calibration for nested guests")
>   898ec52d2ba0 ("x86/xen/time: Set the X86_FEATURE_TSC_KNOWN_FREQ flag in xen_tsc_khz()")
>
> Hyper-V guests might as well?  Hyper-V's handling of TSC is weird, even for a
> hypervisor.

Hypervisors are ranked by weirdness? I ranked them by insanity so far.

> Even when the frequency is provided in CPUID by the hypervisor, QEMU at least
> requires a fairly explicit opt-in to advertise CONSTANT_TSC, presumably to try
> to prevent users from shooting themselves in the foot.

Bah. We really should have enforced the dependency when we introduced
KNOWN_FREQ. But that ship has sailed.

Though for correctness sake this should be fixed at some point in the
foreseeable future.

Thanks,

        tglx

^ permalink raw reply

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Zi Yan @ 2026-06-09 18:52 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, Michael S. Tsirkin, linux-kernel, Miaohe Lin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <FB250C43-790E-4AB0-BD40-74665FC601A0@nvidia.com>

On 9 Jun 2026, at 14:39, Zi Yan wrote:

> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
>
>> On 6/9/26 20:10, Andrew Morton wrote:
>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>
>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
>>>> update to page->flags can race with non-atomic flag operations
>>>> that run under zone->lock in the buddy allocator.
>>>>
>>>> In particular, __free_pages_prepare() does:
>>>>
>>>>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>>
>>>> This non-atomic read-modify-write, while correctly excluding
>>>> __PG_HWPOISON from the mask, can still lose a concurrent
>>>> TestSetPageHWPoison if the read happens before the poison bit
>>>> is set and the write happens after.  Will only get worse if/when
>>>> we add more non-atomic flag operations.
>>>>
>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
>>>> around ClearPageHWPoison in the retry path.  This
>>>> serializes with all buddy flag manipulation.  The cost is
>>>> negligible: one lock/unlock in an extremely rare path
>>>> (hardware memory errors).
>>>>
>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
>>>> in this file operate on pages already removed from the buddy
>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
>>>> need zone->lock protection.
>>>
>>> Sashiko is saying this doesn't do anything "Because
>>> __free_pages_prepare() executes entirely locklessly".  Did it goof?
>>>
>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>>
>> Battle of the bots: it's right.
>
> Yep, __free_pages_prepare() changes the page flag without holding
> zone->lock.

__free_pages_prepare() works on frozen pages and assumes no one else
touches the input page. To avoid this race, memory_failure() might
want to try_get_page() before TestClearPageHWPoison(), but I am not
sure if that works along with memory failure flow.

Best Regards,
Yan, Zi

^ permalink raw reply

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Zi Yan @ 2026-06-09 18:39 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, Michael S. Tsirkin, linux-kernel, Miaohe Lin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <8c1f468e-b50a-487a-a267-8d1ea5a61c87@kernel.org>

On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:

> On 6/9/26 20:10, Andrew Morton wrote:
>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
>>> update to page->flags can race with non-atomic flag operations
>>> that run under zone->lock in the buddy allocator.
>>>
>>> In particular, __free_pages_prepare() does:
>>>
>>>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>
>>> This non-atomic read-modify-write, while correctly excluding
>>> __PG_HWPOISON from the mask, can still lose a concurrent
>>> TestSetPageHWPoison if the read happens before the poison bit
>>> is set and the write happens after.  Will only get worse if/when
>>> we add more non-atomic flag operations.
>>>
>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
>>> around ClearPageHWPoison in the retry path.  This
>>> serializes with all buddy flag manipulation.  The cost is
>>> negligible: one lock/unlock in an extremely rare path
>>> (hardware memory errors).
>>>
>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
>>> in this file operate on pages already removed from the buddy
>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
>>> need zone->lock protection.
>>
>> Sashiko is saying this doesn't do anything "Because
>> __free_pages_prepare() executes entirely locklessly".  Did it goof?
>>
>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>
> Battle of the bots: it's right.

Yep, __free_pages_prepare() changes the page flag without holding
zone->lock.

Best Regards,
Yan, Zi

^ permalink raw reply

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: David Hildenbrand (Arm) @ 2026-06-09 18:38 UTC (permalink / raw)
  To: Andrew Morton, Michael S. Tsirkin
  Cc: linux-kernel, Miaohe Lin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <20260609111020.e88f51a7b6ebc37360d66fdc@linux-foundation.org>

On 6/9/26 20:10, Andrew Morton wrote:
> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
>> TestSetPageHWPoison() is called without zone->lock, so its atomic
>> update to page->flags can race with non-atomic flag operations
>> that run under zone->lock in the buddy allocator.
>>
>> In particular, __free_pages_prepare() does:
>>
>>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>
>> This non-atomic read-modify-write, while correctly excluding
>> __PG_HWPOISON from the mask, can still lose a concurrent
>> TestSetPageHWPoison if the read happens before the poison bit
>> is set and the write happens after.  Will only get worse if/when
>> we add more non-atomic flag operations.
>>
>> Fix by acquiring zone->lock around TestSetPageHWPoison and
>> around ClearPageHWPoison in the retry path.  This
>> serializes with all buddy flag manipulation.  The cost is
>> negligible: one lock/unlock in an extremely rare path
>> (hardware memory errors).
>>
>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
>> in this file operate on pages already removed from the buddy
>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
>> need zone->lock protection.
> 
> Sashiko is saying this doesn't do anything "Because
> __free_pages_prepare() executes entirely locklessly".  Did it goof?
> 
> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com

Battle of the bots: it's right.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Andrew Morton @ 2026-06-09 18:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Miaohe Lin, David Hildenbrand (Arm), Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
In-Reply-To: <df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com>

On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:

> TestSetPageHWPoison() is called without zone->lock, so its atomic
> update to page->flags can race with non-atomic flag operations
> that run under zone->lock in the buddy allocator.
> 
> In particular, __free_pages_prepare() does:
> 
>     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> 
> This non-atomic read-modify-write, while correctly excluding
> __PG_HWPOISON from the mask, can still lose a concurrent
> TestSetPageHWPoison if the read happens before the poison bit
> is set and the write happens after.  Will only get worse if/when
> we add more non-atomic flag operations.
> 
> Fix by acquiring zone->lock around TestSetPageHWPoison and
> around ClearPageHWPoison in the retry path.  This
> serializes with all buddy flag manipulation.  The cost is
> negligible: one lock/unlock in an extremely rare path
> (hardware memory errors).
> 
> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> in this file operate on pages already removed from the buddy
> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> need zone->lock protection.

Sashiko is saying this doesn't do anything "Because
__free_pages_prepare() executes entirely locklessly".  Did it goof?

https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com



^ permalink raw reply

* [PATCH] drm/virtio: Use common error handling code in two functions
From: Markus Elfring @ 2026-06-09 18:08 UTC (permalink / raw)
  To: virtualization, dri-devel, Chia-I Wu, Dmitry Osipenko,
	David Airlie, Gerd Hoffmann, Gurchetan Singh, Maarten Lankhorst,
	Maxime Ripard, Simona Vetter, Thomas Zimmermann
  Cc: LKML, kernel-janitors

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Tue, 9 Jun 2026 20:00:07 +0200

Use additional labels so that a bit of exception handling can be better
reused at the end of two function implementations.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 drivers/gpu/drm/virtio/virtgpu_vq.c   |  7 +++----
 drivers/gpu/drm/virtio/virtgpu_vram.c | 16 ++++++++--------
 2 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_vq.c b/drivers/gpu/drm/virtio/virtgpu_vq.c
index 67865810a2e7..05b19c73103a 100644
--- a/drivers/gpu/drm/virtio/virtgpu_vq.c
+++ b/drivers/gpu/drm/virtio/virtgpu_vq.c
@@ -318,15 +318,14 @@ static struct sg_table *vmalloc_to_sgt(char *data, uint32_t size, int *sg_ents)
 
 	*sg_ents = DIV_ROUND_UP(size, PAGE_SIZE);
 	ret = sg_alloc_table(sgt, *sg_ents, GFP_KERNEL);
-	if (ret) {
-		kfree(sgt);
-		return NULL;
-	}
+	if (ret)
+		goto free_sgt;
 
 	for_each_sgtable_sg(sgt, sg, i) {
 		pg = vmalloc_to_page(data);
 		if (!pg) {
 			sg_free_table(sgt);
+free_sgt:
 			kfree(sgt);
 			return NULL;
 		}
diff --git a/drivers/gpu/drm/virtio/virtgpu_vram.c b/drivers/gpu/drm/virtio/virtgpu_vram.c
index 4ae3cbc35dd3..ec5b669fccfa 100644
--- a/drivers/gpu/drm/virtio/virtgpu_vram.c
+++ b/drivers/gpu/drm/virtio/virtgpu_vram.c
@@ -212,16 +212,12 @@ int virtio_gpu_vram_create(struct virtio_gpu_device *vgdev,
 
 	/* Create fake offset */
 	ret = drm_gem_create_mmap_offset(obj);
-	if (ret) {
-		kfree(vram);
-		return ret;
-	}
+	if (ret)
+		goto free_vram;
 
 	ret = virtio_gpu_resource_id_get(vgdev, &vram->base.hw_res_handle);
-	if (ret) {
-		kfree(vram);
-		return ret;
-	}
+	if (ret)
+		goto free_vram;
 
 	virtio_gpu_cmd_resource_create_blob(vgdev, &vram->base, params, NULL,
 					    0);
@@ -237,6 +233,10 @@ int virtio_gpu_vram_create(struct virtio_gpu_device *vgdev,
 
 	*bo_ptr = &vram->base;
 	return 0;
+
+free_vram:
+	kfree(vram);
+	return ret;
 }
 
 void virtio_gpu_vram_map_deferred(struct virtio_gpu_object_vram *vram)
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH net] vsock/virtio: restore msg_iter on transmission failure
From: Octavian Purdila @ 2026-06-09 17:58 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, syzbot+28e5f3d207b14bae122a, Stefan Hajnoczi,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Arseniy Krasnov, kvm, virtualization, linux-kernel
In-Reply-To: <aifL0f_QO1kocccU@sgarzare-redhat>

On Tue, Jun 9, 2026 at 1:48 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> On Tue, Jun 09, 2026 at 12:48:05AM +0000, Octavian Purdila wrote:
> >When transmission fails in virtio_transport_send_pkt_info, the msg_iter
> >might have been partially advanced. If we don't restore it, the next
> >attempt to send data will use an incorrect iterator state, leading to
> >desync and warnings like "send_pkt() returns 0, but X expected".
>
> Thanks for the fix! I have some comments.

Thank you for the quick review!

> >diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> >index b10666937c490..588623a3e2bbc 100644
> >--- a/net/vmw_vsock/virtio_transport_common.c
> >+++ b/net/vmw_vsock/virtio_transport_common.c
> >@@ -367,6 +367,10 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >       do {
> >               struct sk_buff *skb;
> >               size_t skb_len;
> >+              struct iov_iter saved_iter;
>
> trivial: reverse xmas tree:
> https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
>

I'll fix it in v2, sorry for missing this.

> >+
> >+              if (info->msg)
> >+                      saved_iter = info->msg->msg_iter;
>
> What about using iov_iter_save_state()/iov_iter_restore() ?
>
> IIUC we may need to export iov_iter_restore(), so not a strong opinion,
> but it looks better to use those API IMHO.
>

I agree,  I'll add the export as a separate patch in v2 and move to
using these APIs.

> >
> >               skb_len = min(max_skb_len, rest_len);
> >
> >@@ -375,6 +379,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >                                                src_cid, src_port,
> >                                                dst_cid, dst_port);
>
> What about adding a comment on top of virtio_transport_alloc_skb() call
> (or when we save the state) to explain that in specific cases it can
> advance the msg_iter ?
>

Good point, I will add a comment explaining this in v2.

> >               if (!skb) {
> >+                      if (info->msg)
> >+                              info->msg->msg_iter = saved_iter;
> >                       ret = -ENOMEM;
> >                       break;
> >               }
> >@@ -382,8 +388,11 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >               virtio_transport_inc_tx_pkt(vvs, skb);
> >
> >               ret = t_ops->send_pkt(skb, info->net);
> >-              if (ret < 0)
> >+              if (ret < 0) {
> >+                      if (info->msg)
> >+                              info->msg->msg_iter = saved_iter;
>
> Also, what about having a single restore point after the loop?
>
> I mean something like this (untested):
>

Yes, that looks much better. I tested it locally and it works fine. Thanks!

I'll follow-up with v2 in a day or two.

^ permalink raw reply

* Re: [PATCH splitout] mm: page_reporting: allow driver to set batch capacity
From: Gregory Price @ 2026-06-09 17:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Miaohe Lin, David Hildenbrand (Arm), Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi,
	Alexander Duyck
In-Reply-To: <b25ea6c63503f24c8b5b64910a44dc03b5aa610d.1781020302.git.mst@redhat.com>

On Tue, Jun 09, 2026 at 11:53:20AM -0400, Michael S. Tsirkin wrote:
> --- a/mm/page_reporting.c
> +++ b/mm/page_reporting.c
> @@ -174,10 +174,10 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
>  	 * list processed. This should result in us reporting all pages on
>  	 * an idle system in about 30 seconds.
>  	 *
> -	 * The division here should be cheap since PAGE_REPORTING_CAPACITY
> -	 * should always be a power of 2.
> +	 * The division here uses integer division; capacity need
> +	 * not be a power of 2.
>  	 */
> -	budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
> +	budget = DIV_ROUND_UP(area->nr_free, prdev->capacity * 16);
> 

Initial look - is there a div-by-0 here?  I noticed the old check
prevents this from being (0 * 16), but i don't see (on first pass)
the same check anywhere.

Unless this line below always forces the above to be a
PAGE_REPORTING_CAPCAITY if it's set to 0.

> +	if (!prdev->capacity || prdev->capacity > PAGE_REPORTING_CAPACITY)
> +		prdev->capacity = PAGE_REPORTING_CAPACITY;
> +

It's worth making this corner condition a little more obvious.

The code intends for 

if (capacity == 0)
  capacity = PAGE_REPORTING_CAPACITY

but that's not reflected in the changelog as a default value.

When happens if a driver sets (capacity=0) either on purpose (???) or
because there's a bug (???) and then page_reporting.c forces it up to
32?

There's something to improve here.

~Gregory

^ permalink raw reply

* Re: [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Sean Christopherson @ 2026-06-09 17:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <87a4t86a0l.ffs@fw13>

On Fri, Jun 05, 2026, Thomas Gleixner wrote:
> On Fri, Jun 05 2026 at 11:04, Sean Christopherson wrote:
> But we also should have a check in the TSC init code somewhere which
> validates that X86_FEATURE_CONSTANT_TSC is set when
> X86_FEATURE_TSC_KNOWN_FREQ is set. X86_FEATURE_TSC_KNOWN_FREQ is useless
> w/o X86_FEATURE_CONSTANT_TSC.

Ugh, any objection to punting on this for now?  KVM and Xen guests will trigger
TSC_KNOWN_FREQ without CONSTANT_TSC, thanks to commits:

  e10f78050323 ("kvmclock: fix TSC calibration for nested guests")
  898ec52d2ba0 ("x86/xen/time: Set the X86_FEATURE_TSC_KNOWN_FREQ flag in xen_tsc_khz()")

Hyper-V guests might as well?  Hyper-V's handling of TSC is weird, even for a
hypervisor.

Even when the frequency is provided in CPUID by the hypervisor, QEMU at least
requires a fairly explicit opt-in to advertise CONSTANT_TSC, presumably to try
to prevent users from shooting themselves in the foot.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox