From: Alex Williamson <alex@shazbot.org>
To: <ankita@nvidia.com>
Cc: <vsethi@nvidia.com>, <jgg@nvidia.com>, <mochs@nvidia.com>,
	<jgg@ziepe.ca>, <skolothumtho@nvidia.com>, <cjia@nvidia.com>,
	<zhiw@nvidia.com>, <kjaju@nvidia.com>, <yishaih@nvidia.com>,
	<kevin.tian@intel.com>, <kvm@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>,
	alex@shazbot.org
Subject: Re: [PATCH RFC v2 12/15] vfio/nvgrace-egm: Introduce ioctl to share retired pages
Date: Wed, 4 Mar 2026 16:00:55 -0700
Message-ID: <20260304160055.38ea91be@shazbot.org>
In-Reply-To: <20260223155514.152435-13-ankita@nvidia.com>

On Mon, 23 Feb 2026 15:55:11 +0000
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> nvgrace-egm module stores the list of retired page offsets to be made
> available for usermode processes. Introduce an ioctl to share the
> information with the userspace.
> 
> The ioctl is called by usermode apps such as QEMU to get the retired
> page offsets. The usermode apps are expected to take appropriate action
> to communicate the list to the VM.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  MAINTAINERS                        |  1 +
>  drivers/vfio/pci/nvgrace-gpu/egm.c | 67 ++++++++++++++++++++++++++++++
>  include/uapi/linux/egm.h           | 28 +++++++++++++
>  3 files changed, 96 insertions(+)
>  create mode 100644 include/uapi/linux/egm.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1fc551d7d667..94cf15a1e82c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -27389,6 +27389,7 @@ M:	Ankit Agrawal <ankita@nvidia.com>
>  L:	kvm@vger.kernel.org
>  S:	Supported
>  F:	drivers/vfio/pci/nvgrace-gpu/egm.c
> +F:	include/uapi/linux/egm.h
>  
>  VFIO PCI DEVICE SPECIFIC DRIVERS
>  R:	Jason Gunthorpe <jgg@nvidia.com>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
> index 077de3833046..918979d8fcd4 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -5,6 +5,7 @@
>  
>  #include <linux/vfio_pci_core.h>
>  #include <linux/nvgrace-egm.h>
> +#include <linux/egm.h>
>  
>  #define MAX_EGM_NODES 4
>  
> @@ -119,11 +120,77 @@ static int nvgrace_egm_mmap(struct file *file, struct vm_area_struct *vma)
>  			       vma->vm_page_prot);
>  }
>  
> +static long nvgrace_egm_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> +	unsigned long minsz = offsetofend(struct egm_retired_pages_list, count);
> +	struct egm_retired_pages_list info;
> +	void __user *uarg = (void __user *)arg;
> +	struct chardev *egm_chardev = file->private_data;
> +
> +	if (copy_from_user(&info, uarg, minsz))
> +		return -EFAULT;
> +
> +	if (info.argsz < minsz || !egm_chardev)
> +		return -EINVAL;

How could we get here with !egm_chardev?

> +
> +	switch (cmd) {
> +	case EGM_RETIRED_PAGES_LIST:
> +		int ret;
> +		unsigned long retired_page_struct_size = sizeof(struct egm_retired_pages_info);
> +		struct egm_retired_pages_info tmp;
> +		struct h_node *cur_page;
> +		struct hlist_node *tmp_node;
> +		unsigned long bkt;
> +		int count = 0, index = 0;

Declarations under a case label need braces around the block.  Ordering
could be improved.

> +
> +		hash_for_each_safe(egm_chardev->htbl, bkt, tmp_node, cur_page, node)
> +			count++;

Why not keep track of the count as they're added?

Neither loop needs the _safe variant since we're not removing
entries.
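
To illustrate both points, here is a rough user-space sketch, not the
driver code.  The names page_node, page_table, page_table_add and
page_table_copy are hypothetical stand-ins for h_node and the hash
table; a plain singly linked list replaces the kernel hashtable.  The
count is bumped at insert time, and the read-only walk uses an ordinary
iterator because nothing is unlinked inside the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's h_node and hash table. */
struct page_node {
	uint64_t mem_offset;
	struct page_node *next;
};

struct page_table {
	struct page_node *head;
	unsigned int count;	/* maintained on insert, never recomputed */
};

/* Link a new entry and account for it in the same place. */
static void page_table_add(struct page_table *t, struct page_node *n)
{
	assert(t && n);
	n->next = t->head;
	t->head = n;
	t->count++;
}

/*
 * Read-only walk.  No entry is unlinked or freed inside the loop, so
 * there is no need to stash the next pointer before the body runs,
 * which is all a _safe iterator adds.
 */
static unsigned int page_table_copy(const struct page_table *t,
				    uint64_t *out, unsigned int max)
{
	const struct page_node *cur;
	unsigned int i = 0;

	for (cur = t->head; cur && i < max; cur = cur->next)
		out[i++] = cur->mem_offset;

	return i;
}
```

With the count kept next to the table, the size check in the ioctl
becomes a field read instead of a full walk.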

> +
> +		if (info.argsz < (minsz + count * retired_page_struct_size)) {
> +			info.argsz = minsz + count * retired_page_struct_size;
> +			info.count = 0;

vfio returns success when there's not enough space for compatibility
for new capabilities.  For a new ioctl just set argsz and count and
return -ENOSPC.
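
A minimal user-space model of that convention, under stated
assumptions: retired_page and retired_list mirror the uapi structs in
this patch, while model_get_retired_pages and fetch_retired_pages are
hypothetical names, and plain memory accesses stand in for
copy_from_user/copy_to_user.  It is only meant to show the handshake:
a short buffer gets argsz and count filled in plus -ENOSPC, and the
caller retries with the reported size.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space mirrors of the uapi structs in this patch. */
struct retired_page {
	uint64_t offset;
	uint64_t size;
};

struct retired_list {
	uint32_t argsz;
	uint32_t count;
	struct retired_page pages[];
};

/*
 * Model of the handler.  'buf' stands in for the user buffer and
 * 'src'/'nr' for the driver's table.  In a real handler the header
 * would still have to be copied back before returning -ENOSPC.
 */
static int model_get_retired_pages(struct retired_list *buf,
				   const struct retired_page *src,
				   uint32_t nr)
{
	size_t minsz = offsetof(struct retired_list, pages);
	size_t need = minsz + (size_t)nr * sizeof(*src);

	assert(buf != NULL);
	if (buf->argsz < minsz)
		return -EINVAL;

	buf->count = nr;
	if (buf->argsz < need) {
		buf->argsz = (uint32_t)need;	/* tell the caller what to allocate */
		return -ENOSPC;
	}

	if (nr)
		memcpy(buf->pages, src, (size_t)nr * sizeof(*src));
	buf->argsz = (uint32_t)need;
	return 0;
}

/* Caller side: probe with a bare header, then retry with the reported size. */
static struct retired_list *fetch_retired_pages(const struct retired_page *src,
						uint32_t nr)
{
	struct retired_list *list = calloc(1, sizeof(*list));
	int ret;

	if (!list)
		return NULL;
	list->argsz = sizeof(*list);

	while ((ret = model_get_retired_pages(list, src, nr)) == -ENOSPC) {
		uint32_t need = list->argsz;
		struct retired_list *bigger = realloc(list, need);

		if (!bigger) {
			free(list);
			return NULL;
		}
		list = bigger;
		list->argsz = need;
	}

	if (ret) {
		free(list);
		return NULL;
	}
	return list;
}
```

The retry loop also covers a table that grows between the probe and
the second call, since the caller simply gets -ENOSPC again with a
larger argsz.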

> +			goto done;
> +		} else {

We don't need an else if the previous branch unconditionally goes
somewhere else.

> +			hash_for_each_safe(egm_chardev->htbl, bkt, tmp_node, cur_page, node) {
> +				/*
> +				 * This check fails if there was an ECC error
> +				 * after the usermode app read the count of
> +				 * bad pages through this ioctl.
> +				 */
> +				if (minsz + index * retired_page_struct_size >= info.argsz) {
> +					info.argsz = minsz + index * retired_page_struct_size;
> +					info.count = index;

If only we had locking to prevent such races...

> +					goto done;
> +				}
> +
> +				tmp.offset = cur_page->mem_offset;
> +				tmp.size = PAGE_SIZE;

Is firmware recording 4K or 64K pages in this table?

The above comment alludes to runtime ECC faults; are those a different
page size from the granularity firmware reports in the table?

> +
> +				ret = copy_to_user(uarg + minsz +
> +						   index * retired_page_struct_size,
> +						   &tmp, retired_page_struct_size);
> +				if (ret)
> +					return -EFAULT;
> +				index++;
> +			}
> +
> +			info.count = index;
> +		}
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +done:
> +	return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
> +}
> +
>  static const struct file_operations file_ops = {
>  	.owner = THIS_MODULE,
>  	.open = nvgrace_egm_open,
>  	.release = nvgrace_egm_release,
>  	.mmap = nvgrace_egm_mmap,
> +	.unlocked_ioctl = nvgrace_egm_ioctl,
>  };
>  
>  static void egm_chardev_release(struct device *dev)
> diff --git a/include/uapi/linux/egm.h b/include/uapi/linux/egm.h
> new file mode 100644
> index 000000000000..4d3a2304d4f0
> --- /dev/null
> +++ b/include/uapi/linux/egm.h
> @@ -0,0 +1,28 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved

2026

> + */
> +
> +#ifndef _UAPI_LINUX_EGM_H
> +#define _UAPI_LINUX_EGM_H
> +
> +#include <linux/types.h>
> +
> +#define EGM_TYPE ('E')

Arbitrarily chosen?  Update ioctl-number.rst?

> +
> +struct egm_retired_pages_info {
> +	__aligned_u64 offset;
> +	__aligned_u64 size;
> +};
> +
> +struct egm_retired_pages_list {
> +	__u32 argsz;
> +	/* out */
> +	__u32 count;
> +	/* out */
> +	struct egm_retired_pages_info retired_pages[];
> +};

I imagine you want some uapi description of this ioctl.  Thanks,

Alex

> +
> +#define EGM_RETIRED_PAGES_LIST     _IO(EGM_TYPE, 100)
> +
> +#endif /* _UAPI_LINUX_EGM_H */


