Linux Documentation
* Re: [RFC PATCH 2/2] kasan: hw_tags: Add boot option to elide free time poisoning
From: Dev Jain @ 2026-06-19  4:44 UTC
  To: Isaac Manjarres
  Cc: ryabinin.a.a, akpm, corbet, glider, andreyknvl, dvyukov,
	vincenzo.frascino, kasan-dev, linux-mm, linux-kernel, skhan,
	workflows, linux-doc, linux-arm-kernel, ryan.roberts,
	anshuman.khandual, kaleshsingh, 21cnbao, david, will,
	catalin.marinas
In-Reply-To: <aiyi5flNkNNm0pSR@google.com>



On 13/06/26 5:53 am, Isaac Manjarres wrote:
> On Fri, Jun 12, 2026 at 04:44:24AM +0000, Dev Jain wrote:
>> diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
>> index fc9169a547662..4fa8abb312faa 100644
>> --- a/mm/kasan/kasan.h
>> +++ b/mm/kasan/kasan.h
>>  #ifdef CONFIG_KASAN_GENERIC
>> @@ -478,6 +489,16 @@ static inline u8 kasan_random_tag(void) { return 0; }
>>  
>>  static inline void kasan_poison(const void *addr, size_t size, u8 value, bool init)
>>  {
>> +	if (kasan_tag_only_on_alloc_enabled()) {
>> +		if ((value != KASAN_SLAB_REDZONE) && (value != KASAN_PAGE_REDZONE)) {
>> +			if (init)
>> +				memset((void *)kasan_reset_tag(addr), 0, size);
>> +			return;
>> +		}
>> +	}
>> +
>> +	value |= 0xF0;
>> +
> 
> I wonder if it would make more sense to have this as:
> 
> if (kasan_tag_only_on_alloc_enabled() && (value == KASAN_SLAB_FREE ||
>     value == KASAN_PAGE_FREE)) {
> 	if (init)
> 		memset((void *)kasan_reset_tag(addr), 0, size);
> 	return;
> }
> 
> That seems a bit clearer to me as to what it is that you're doing, and
> also makes it so that you don't have to do any bit manipulation
> on the value when you're filling in the redzones.

Ah so you mean, we can define KASAN_SLAB_FREE and KASAN_PAGE_FREE to be
different values, leaving KASAN_SLAB_REDZONE and KASAN_PAGE_REDZONE to
be 0xFE, the poison value. Yep I'll do that.
> 
> Thanks,
> Isaac
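
The check suggested above can be modelled outside the kernel as a small
predicate. This is only a minimal sketch, assuming the free markers get
values distinct from the redzone value as described in the reply. The
0xFD and 0xFC constants and the poison_is_elided() helper are hypothetical
and chosen purely for illustration; only the 0xFE redzone value comes from
the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* From the thread: redzones keep the 0xFE poison value. */
#define KASAN_SLAB_REDZONE	0xFE
#define KASAN_PAGE_REDZONE	0xFE

/* Hypothetical distinct values for the free markers (not from the patch). */
#define KASAN_SLAB_FREE		0xFD
#define KASAN_PAGE_FREE		0xFC

/*
 * Returns true when free-time poisoning would be skipped: the boot option
 * is enabled and the value marks freed slab or page memory. Redzone values
 * never match, so redzones are still poisoned without any bit manipulation.
 */
static bool poison_is_elided(bool tag_only_on_alloc, uint8_t value)
{
	return tag_only_on_alloc &&
	       (value == KASAN_SLAB_FREE || value == KASAN_PAGE_FREE);
}
```

With distinct values, the caller only needs an equality test on the free
markers, and the redzone path stays untouched.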



* Re: htmldocs: Documentation/scheduler/sched-arch.rst:108: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils]
From: Shrikanth Hegde @ 2026-06-19  4:25 UTC
  To: Randy Dunlap, kernel test robot; +Cc: oe-kbuild-all, linux-doc
In-Reply-To: <299a6f3b-708d-490e-8866-f15bf851cf83@infradead.org>

Hi Randy, thanks for going through.

On 6/19/26 1:03 AM, Randy Dunlap wrote:
> 
> 
> On 6/17/26 10:19 PM, Shrikanth Hegde wrote:
>>
>>
>> On 6/18/26 10:40 AM, kernel test robot wrote:
>>> tree:   https://github.com/intel-lab-lkp/linux/commits/Shrikanth-Hegde/sched-debug-Remove-unused-schedstats/20260618-031604
>>> head:   bcb0c494e4af36dd6306a5a1839a0c03046053af
>>> commit: 4c29e4f3ba22adc04fc456620f2c6abf539d76df sched/docs: Document cpu_preferred_mask and Preferred CPU concept
>>> date:   10 hours ago
>>> compiler: clang version 22.1.8 (https://github.com/llvm/llvm-project ca7933e47d3a3451d81e72ac174dcb5aa28b59d1)
>>> docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
>>> reproduce: (https://download.01.org/0day-ci/archive/20260618/202606180717.yNM0yb41-lkp@intel.com/reproduce)
>>>
>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>> the same patch/commit), kindly add following tags
>>> | Reported-by: kernel test robot <lkp@intel.com>
>>> | Closes: https://lore.kernel.org/oe-kbuild-all/202606180717.yNM0yb41-lkp@intel.com/
>>>
>>> All warnings (new ones prefixed by >>):
>>>
>>>      Checksumming on output with GSO
>>>      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [docutils]
>>>      MAINTAINERS:40: WARNING: Inline strong start-string without end-string. [docutils]
> 
>>>      Documentation/scheduler/sched-arch.rst:107: ERROR: Unexpected indentation. [docutils]
>>>>> Documentation/scheduler/sched-arch.rst:108: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils]
>>>      Documentation/userspace-api/landlock:504: ./security/landlock/errata/abi-4.h:5: ERROR: Unexpected section title.
>>>
>>>
>>> vim +108 Documentation/scheduler/sched-arch.rst
>>>
>>>      102
>>>      103    Notes:
>>>      104    1. This feature is available under CONFIG_PREFERRED_CPU
>>>      105    2. This feature works for FAIR class only.
>>>      106    3. A task pinned, which can't be moved to preferred CPUs will continue
>>>      107       to run based on its affinity. But no load balancing happens
>>
>> is it flagging here due to missing . ?
> 
> No, but you could add that anyway.
> 
>>>    > 108    4. If needed, steal time based governors/arch dependent method
>>>      109       could be used to cater to different types of cpu numbers.
>>>      110       Arch can do so by implementing its own hooks.
>>>      111    5. Decision to use/not use is driven by kernel. Hence it shouldn't
>>>      112       break user affinities. One of the main reason why CPU hotplug
>>>      113       or Isolated cpuset partitions was not a solution.
>>>      114
> It wants a blank line between each list item (if the list items are multi-line).
> For the list above this one (3 items, all single line), blank lines aren't needed.
> [These comments come from testing, not reading specs.]
> 

Ah ok. I was wondering why it flagged only that.

> I made these changes and a couple of others to make the rendered html look
> reasonable.
> 
> Use (or not).

Sure. thanks.

> ---
> From: Shrikanth Hegde <sshegde@linux.ibm.com>
> To: linux-kernel@vger.kernel.org, mingo@kernel.org, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, yury.norov@gmail.com, kprateek.nayak@amd.com, iii@linux.ibm.com
> Cc: sshegde@linux.ibm.com, tglx@kernel.org, gregkh@linuxfoundation.org, pbonzini@redhat.com, seanjc@google.com, vschneid@redhat.com, huschle@linux.ibm.com, rostedt@goodmis.org, dietmar.eggemann@arm.com, mgorman@suse.de, bsegall@google.com, maddy@linux.ibm.com, srikar@linux.ibm.com, hdanton@sina.com, chleroy@kernel.org, vineeth@bitbyteword.org, frederic@kernel.org, arighi@nvidia.com, pauld@redhat.com, christian.loehle@arm.com, tj@kernel.org, tommaso.cucinotta@gmail.com, maz@kernel.org, rafael@kernel.org
> Subject: [PATCH v4 02/20] sched/docs: Document cpu_preferred_mask and Preferred CPU concept
> Date: Wed, 17 Jun 2026 23:11:21 +0530
> Message-ID: <20260617174139.155540-3-sshegde@linux.ibm.com>
> 
> 
> Add documentation for the new cpumask called cpu_preferred_mask. This could
> help users understand what this mask is and the concept behind it.
> 
> Document how to enable it and implementation aspects of it.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> v3->v4:
> - update docs to reflect preferred is subset of active.
> 
>   Documentation/scheduler/sched-arch.rst |   61 ++++++++++++++++++++++-
>   1 file changed, 59 insertions(+), 2 deletions(-)
> 
> --- linux-next.orig/Documentation/scheduler/sched-arch.rst
> +++ linux-next/Documentation/scheduler/sched-arch.rst
> @@ -6,7 +6,8 @@ CPU Scheduler implementation hints for a
>   
>   Context switch
>   ==============
> -1. Runqueue locking
> +Runqueue locking
> +
>   By default, the switch_to arch function is called with the runqueue
>   locked. This is usually not a problem unless switch_to may need to
>   take the runqueue lock. This is usually due to a wake up operation in
> @@ -62,11 +63,67 @@ Your cpu_idle routines need to obey the
>   arch/x86/kernel/process.c has examples of both polling and
>   sleeping idle functions.
>   
> +Preferred CPUs
> +==============
> +
> +In virtualised environments it is possible to overcommit CPU resources,
> +i.e. the sum of virtual CPUs (vCPUs) of all VMs is greater than the number of
> +physical CPUs (pCPUs). Under such conditions, when all or many VMs have high
> +utilization, the hypervisor cannot satisfy the CPU requirement and has to
> +context switch within or across VMs, i.e. it needs to preempt one vCPU to run
> +another. This is called vCPU preemption, and it is more expensive than a
> +task context switch within a vCPU.
> +
> +In such cases it is better to reduce the combined vCPU demand from all VMs
> +by not using some of the vCPUs. vCPUs on which a workload can be safely
> +scheduled without increasing contention for pCPUs are called
> +"Preferred CPUs".
> +
> +In most cases preferred CPUs will be the same as active CPUs. When there is
> +pCPU contention, preferred CPUs will be reduced based on the amount of steal
> +time. When the pCPU contention goes away, as indicated by steal time, preferred
> +CPUs will become the same as active CPUs again. The feature has to be enabled
> +by writing 1 to /sys/kernel/debug/sched/steal_monitor/enable.
> +
> +By design, preferred CPUs are always a subset of active CPUs.
> +With CONFIG_PREFERRED_CPU=n, they are the same as active CPUs.
> +
> +Scheduling decisions such as wakeup, pushing the task, etc. need this
> +CPU state info. It is maintained in cpu_preferred_mask.
> +
> +vCPUs which are not in cpu_preferred_mask should be treated as vCPUs which
> +should not be used at this moment, provided that doesn't break user affinity.
> +This is achieved by:
> +
> +1. Selecting a preferred CPU at wakeup.
> +2. Pushing the task away from a non-preferred CPU at tick.
> +3. Selecting only preferred CPUs for load balance.
> +
> +/sys/devices/system/cpu/preferred prints the current cpu_preferred_mask in
> +cpulist format.
> +
> +Notes:
> +
> +1. This feature is available under CONFIG_PREFERRED_CPU.
> +
> +2. This feature works for FAIR class only.
> +
> +3. A pinned task which can't be moved to preferred CPUs will continue
> +   to run based on its affinity, but no load balancing happens.
> +
> +4. If needed, steal time based governors/arch dependent method
> +   could be used to cater to different types of cpu numbers.
> +   Arch can do so by implementing its own hooks.
> +
> +5. The decision to use or not use a CPU is driven by the kernel, hence it
> +   shouldn't break user affinities. This is one of the main reasons why CPU
> +   hotplug or isolated cpuset partitions were not a solution.
>   
>   Possible arch/ problems
>   =======================
>   
>   Possible arch problems I found (and either tried to fix or didn't):
>   
> -sparc - IRQs on at this point(?), change local_irq_save to _disable.
> +sparc:
> +      - IRQs on at this point(?), change local_irq_save to _disable.
>         - TODO: needs secondary CPUs to disable preempt (See #1)
> 



* [PATCH v2 4/4] usb: gadget: f_fs: Introduce rw_proxy file descriptors
From: Neill Kapron @ 2026-06-19  4:06 UTC
  To: gregkh, corbet, skhan
  Cc: linux-usb, linux-doc, linux-kernel, kernel-team, Neill Kapron
In-Reply-To: <20260619040609.4010746-1-nkapron@google.com>

Currently, FunctionFS exposes each USB endpoint as a separate,
unidirectional file descriptor (e.g., `ep1` for IN, `ep2` for OUT).
While this mirrors the underlying hardware structure, it forces
userspace daemons implementing bidirectional protocols to manage
multiple file descriptors. When dealing with legacy protocols which
require exposing a single, bi-directional fd to userspace, this becomes
problematic.

This patch introduces the `FUNCTIONFS_RW_PROXY_EPS` UAPI flag. When
passed in the descriptor header during initialization, FunctionFS
provisions a "rw_proxy" bidirectional file descriptor (e.g., `ep1_rw`)
alongside every pair of IN/OUT endpoints.

Implementation details:
- RW proxy files act as a pure VFS alias, proxying operations
  directly to the base ffs_epfile instances. A `read()` proxies to
  the OUT endpoint's file, and a `write()` proxies to the IN file.
- Because operations are proxied natively, they reuse the underlying
  base endpoint's lock (`epfile->mutex`) and tracking state. This
  serializes concurrent I/O in each direction, preventing buffer
  corruption or races even if userspace mixes transfers across both the
  rw_proxy and base files. Reads and writes take the mutexes of different
  base endpoints, so full-duplex synchronous operations can still run
  concurrently instead of serializing on a single lock.
- Control operations (like IOCTLs) and intentional stalls (via
  reverse-direction I/O) must still be issued on the base endpoints, as the
  rw_proxy returns `-ENOTTY` for IOCTLs and cannot trigger stalls.

Assisted-by: Antigravity:gemini-3.1-pro
Signed-off-by: Neill Kapron <nkapron@google.com>
---
 Documentation/usb/functionfs.rst    | 56 +++++++++++++++++++
 drivers/usb/gadget/function/f_fs.c  | 87 ++++++++++++++++++++++++-----
 drivers/usb/gadget/function/u_fs.h  |  8 ++-
 include/uapi/linux/usb/functionfs.h |  1 +
 4 files changed, 136 insertions(+), 16 deletions(-)

diff --git a/Documentation/usb/functionfs.rst b/Documentation/usb/functionfs.rst
index 582e53549d5b..b189cf5626ba 100644
--- a/Documentation/usb/functionfs.rst
+++ b/Documentation/usb/functionfs.rst
@@ -96,6 +96,58 @@ One such IOCTL is:
     * ``-ENODEV``: The FunctionFS instance is not active.
     * ``-EINVAL``: The endpoint is not an IN endpoint.
     * ``-EFAULT``: Invalid user space pointer for the argument.
+
+RW Proxy Endpoints
+==================
+
+If the ``FUNCTIONFS_RW_PROXY_EPS`` flag is passed in the descriptor header
+(requires ``FUNCTIONFS_DESCRIPTORS_MAGIC_V2``), FunctionFS will provision a
+bidirectional rw_proxy file descriptor (e.g., "ep1_rw") alongside each pair
+of IN and OUT endpoints. The rw_proxy file aliases the underlying hardware
+endpoints, allowing userspace to use a single file descriptor for both reading
+(OUT) and writing (IN).
+
+This flag requires the total number of hardware endpoints to be an even number.
+FunctionFS will automatically walk the provided endpoints and group them into
+adjacent pairs (e.g., ep1 and ep2 form the first pair, ep3 and ep4 form the
+second pair). Each pair must consist of exactly one IN endpoint and one OUT
+endpoint.
+
+For each valid pair, a rw_proxy file is created and named after the first
+endpoint in the pair with a "_rw" suffix. For example, if ep1 and ep2 are
+paired, a rw_proxy file named "ep1_rw" is created. If ep3 and ep4 are paired,
+"ep3_rw" is created.
+
+If the ``FUNCTIONFS_VIRTUAL_ADDR`` flag is also enabled, the endpoints will be
+named using their physical endpoint address in hexadecimal instead of their
+index. RW proxy files will inherit this naming convention. For example, if the
+first endpoint of a pair maps to address 0x02, the rw_proxy file will be
+named "ep02_rw".
+
+When this flag is enabled, userspace has the choice of performing data transfers
+via the single rw_proxy file descriptor or the two base file descriptors. The
+rw_proxy file descriptor acts as a pure VFS alias that proxies all operations
+directly to the underlying base file descriptors.
+
+Because it is a pure proxy, there are no data races or buffer corruptions if
+userspace uses both the rw_proxy endpoint and the base endpoints concurrently.
+The native mutexes of the base endpoints perfectly serialize all concurrent
+transfers. However, userspace should generally pick one method and stick to it
+to avoid interleaving its own data stream.
+
+- **IOCTLs (Clear Halt, etc.):** RW proxy endpoints do not support IOCTLs and
+  will return ``-ENOTTY``. To clear a host-initiated halt, userspace must issue
+  the ``FUNCTIONFS_CLEAR_HALT`` ioctl directly on the corresponding base
+  endpoint file descriptor.
+- **Intentional Stalls:** The traditional mechanism for intentionally halting an
+  endpoint by issuing a reverse-direction data operation (e.g., attempting to
+  read from an IN endpoint) continues to work, but it must be issued on the
+  base endpoint. RW proxy endpoints cannot be used to trigger a stall because
+  they are fully bidirectional.
+
+Note that DMABUF data transfers (``FUNCTIONFS_DMABUF_TRANSFER``) are unsupported
+via the rw_proxy endpoint because it does not support IOCTLs. If DMABUF
+transfers are required, users must use the standard base endpoints.
 DMABUF interface
 ================
 
@@ -103,6 +155,10 @@ FunctionFS additionally supports a DMABUF based interface, where the
 userspace can attach DMABUF objects (externally created) to an endpoint,
 and subsequently use them for data transfers.
 
+Note: The DMABUF interface is unsupported on rw_proxy endpoints. See
+the RW Proxy Endpoints section for details on using DMABUF alongside
+the ``FUNCTIONFS_RW_PROXY_EPS`` flag.
+
 A userspace application can then use this interface to share DMABUF
 objects between several interfaces, allowing it to transfer data in a
 zero-copy fashion, for instance between IIO and the USB stack.
diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c
index 07aba722dd5b..c5647febf3ea 100644
--- a/drivers/usb/gadget/function/f_fs.c
+++ b/drivers/usb/gadget/function/f_fs.c
@@ -159,7 +159,9 @@ struct ffs_epfile {
 	struct mutex			mutex;
 
 	struct ffs_data			*ffs;
-	struct ffs_ep			*ep;	/* P: ffs->eps_lock */
+	struct ffs_ep			*ep;		/* P: ffs->eps_lock */
+	struct ffs_epfile		*epfile_in;	/* P: ffs->eps_lock */
+	struct ffs_epfile		*epfile_out;	/* P: ffs->eps_lock */
 
 	/*
 	 * Buffer for holding data from partial reads which may happen since
@@ -219,12 +221,13 @@ struct ffs_epfile {
 	struct ffs_buffer		*read_buffer;
 #define READ_BUFFER_DROP ((struct ffs_buffer *)ERR_PTR(-ESHUTDOWN))
 
-	char				name[5];
+	char				name[8];
 
 	unsigned char			in;	/* P: ffs->eps_lock */
 	unsigned char			isoc;	/* P: ffs->eps_lock */
 
 	u8				zlp_enabled; /* P: ffs->eps_lock */
+	bool				is_rw_proxy;
 
 	/* Protects dmabufs */
 	struct mutex			dmabufs_mutex;
@@ -978,9 +981,8 @@ static ssize_t __ffs_epfile_read_data(struct ffs_epfile *epfile,
 	return ret;
 }
 
-static struct ffs_ep *ffs_epfile_wait_ep(struct file *file)
+static struct ffs_ep *ffs_epfile_wait_ep(struct ffs_epfile *epfile, struct file *file)
 {
-	struct ffs_epfile *epfile = file->private_data;
 	struct ffs_ep *ep;
 	int ret;
 
@@ -1007,17 +1009,22 @@ static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data)
 	char *data = NULL;
 	ssize_t ret, data_len = -EINVAL;
 	int halt;
+	bool is_rw_proxy = epfile->is_rw_proxy;
 
 	/* Are we still active? */
 	if (WARN_ON(epfile->ffs->state != FFS_ACTIVE))
 		return -ENODEV;
 
-	ep = ffs_epfile_wait_ep(file);
+	/* Proxy to base endpoint if rw_proxy */
+	if (is_rw_proxy)
+		epfile = io_data->read ? epfile->epfile_out : epfile->epfile_in;
+
+	ep = ffs_epfile_wait_ep(epfile, file);
 	if (IS_ERR(ep))
 		return PTR_ERR(ep);
 
 	/* Do we halt? */
-	halt = (!io_data->read == !epfile->in);
+	halt = is_rw_proxy ? 0 : (!io_data->read == !epfile->in);
 	if (halt && epfile->isoc)
 		return -EINVAL;
 
@@ -1115,7 +1122,7 @@ static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data)
 			req->num_sgs = 0;
 		}
 
-		req->zero = epfile->zlp_enabled;
+		req->zero = !io_data->read ? epfile->zlp_enabled : 0;
 		req->length = data_len;
 
 		io_data->buf = data;
@@ -1168,7 +1175,7 @@ static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data)
 			req->num_sgs = 0;
 		}
 
-		req->zero = epfile->zlp_enabled;
+		req->zero = !io_data->read ? epfile->zlp_enabled : 0;
 		req->length = data_len;
 
 		io_data->buf = data;
@@ -1646,7 +1653,7 @@ static int ffs_dmabuf_transfer(struct file *file,
 
 	priv = attach->importer_priv;
 
-	ep = ffs_epfile_wait_ep(file);
+	ep = ffs_epfile_wait_ep(epfile, file);
 	if (IS_ERR(ep)) {
 		ret = PTR_ERR(ep);
 		goto err_attachment_put;
@@ -1764,6 +1771,9 @@ static long ffs_epfile_ioctl(struct file *file, unsigned code,
 	if (WARN_ON(epfile->ffs->state != FFS_ACTIVE))
 		return -ENODEV;
 
+	if (epfile->is_rw_proxy)
+		return -ENOTTY;
+
 	switch (code) {
 	case FUNCTIONFS_DMABUF_ATTACH:
 	{
@@ -1814,7 +1824,7 @@ static long ffs_epfile_ioctl(struct file *file, unsigned code,
 	}
 
 	/* Wait for endpoint to be enabled */
-	ep = ffs_epfile_wait_ep(file);
+	ep = ffs_epfile_wait_ep(epfile, file);
 	if (IS_ERR(ep))
 		return PTR_ERR(ep);
 
@@ -2212,7 +2222,7 @@ static void ffs_data_closed(struct ffs_data *ffs)
 
 		if (epfiles)
 			ffs_epfiles_destroy(ffs->sb, epfiles,
-					 ffs->eps_count);
+					 ffs->epfiles_count);
 
 		if (ffs->setup_state == FFS_SETUP_PENDING)
 			__ffs_ep0_stall(ffs);
@@ -2270,7 +2280,7 @@ static void ffs_data_clear(struct ffs_data *ffs)
 	 * copy of epfile will save us from use-after-free.
 	 */
 	if (epfiles) {
-		ffs_epfiles_destroy(ffs->sb, epfiles, ffs->eps_count);
+		ffs_epfiles_destroy(ffs->sb, epfiles, ffs->epfiles_count);
 		ffs->epfiles = NULL;
 	}
 
@@ -2368,11 +2378,16 @@ static void functionfs_unbind(struct ffs_data *ffs)
 static int ffs_epfiles_create(struct ffs_data *ffs)
 {
 	struct ffs_epfile *epfile, *epfiles;
-	unsigned i, count;
+	unsigned int i, count, epfiles_count;
 	int err;
 
 	count = ffs->eps_count;
-	epfiles = kzalloc_objs(*epfiles, count);
+	epfiles_count = count;
+	if (ffs->user_flags & FUNCTIONFS_RW_PROXY_EPS)
+		epfiles_count += count / 2;
+	ffs->epfiles_count = epfiles_count;
+
+	epfiles = kzalloc_objs(*epfiles, epfiles_count);
 	if (!epfiles)
 		return -ENOMEM;
 
@@ -2395,6 +2410,32 @@ static int ffs_epfiles_create(struct ffs_data *ffs)
 		}
 	}
 
+	if (ffs->user_flags & FUNCTIONFS_RW_PROXY_EPS) {
+		struct ffs_epfile *comp = epfiles + count;
+
+		for (i = 0; i < count; i += 2, ++comp) {
+			struct ffs_epfile *ep1 = &epfiles[i];
+			struct ffs_epfile *ep2 = &epfiles[i + 1];
+			bool ep1_in = ffs->eps_addrmap[i + 1] & USB_ENDPOINT_DIR_MASK;
+
+			comp->ffs = ffs;
+			comp->is_rw_proxy = true;
+			comp->epfile_in = ep1_in ? ep1 : ep2;
+			comp->epfile_out = ep1_in ? ep2 : ep1;
+			mutex_init(&comp->mutex);
+			mutex_init(&comp->dmabufs_mutex);
+			INIT_LIST_HEAD(&comp->dmabufs);
+			snprintf(comp->name, sizeof(comp->name), "%s_rw",
+				 epfiles[i].name);
+			err = ffs_sb_create_file(ffs->sb, comp->name,
+						 comp, &ffs_epfile_operations);
+			if (err) {
+				ffs_epfiles_destroy(ffs->sb, epfiles, count + (i / 2));
+				return err;
+			}
+		}
+	}
+
 	ffs->epfiles = epfiles;
 	return 0;
 }
@@ -2972,7 +3013,8 @@ static int __ffs_data_got_descs(struct ffs_data *ffs,
 			      FUNCTIONFS_VIRTUAL_ADDR |
 			      FUNCTIONFS_EVENTFD |
 			      FUNCTIONFS_ALL_CTRL_RECIP |
-			      FUNCTIONFS_CONFIG0_SETUP)) {
+			      FUNCTIONFS_CONFIG0_SETUP |
+			      FUNCTIONFS_RW_PROXY_EPS)) {
 			ret = -ENOSYS;
 			goto error;
 		}
@@ -3060,6 +3102,21 @@ static int __ffs_data_got_descs(struct ffs_data *ffs,
 		goto error;
 	}
 
+	if (ffs->user_flags & FUNCTIONFS_RW_PROXY_EPS) {
+		if (ffs->eps_count % 2) {
+			ret = -EINVAL;
+			goto error;
+		}
+
+		for (i = 1; i < ffs->eps_count; i += 2) {
+			if ((ffs->eps_addrmap[i] & USB_ENDPOINT_DIR_MASK) ==
+			    (ffs->eps_addrmap[i + 1] & USB_ENDPOINT_DIR_MASK)) {
+				ret = -EINVAL;
+				goto error;
+			}
+		}
+	}
+
 	ffs->raw_descs_data	= _data;
 	ffs->raw_descs		= raw_descs;
 	ffs->raw_descs_length	= data - raw_descs;
diff --git a/drivers/usb/gadget/function/u_fs.h b/drivers/usb/gadget/function/u_fs.h
index 6a80182aadd7..c280c495fbd2 100644
--- a/drivers/usb/gadget/function/u_fs.h
+++ b/drivers/usb/gadget/function/u_fs.h
@@ -252,8 +252,14 @@ struct ffs_data {
 
 	unsigned short			strings_count;
 	unsigned short			interfaces_count;
+
+	/*
+	 * eps_count tracks the number of underlying hardware endpoints.
+	 * epfiles_count tracks the total number of VFS endpoint files.
+	 * When rw_proxy files are provisioned, epfiles_count > eps_count.
+	 */
 	unsigned short			eps_count;
-	unsigned short			_pad1;
+	unsigned short			epfiles_count;
 
 	/* filled by __ffs_data_got_strings() */
 	/* ids in stringtabs are set in functionfs_bind() */
diff --git a/include/uapi/linux/usb/functionfs.h b/include/uapi/linux/usb/functionfs.h
index 06134dca4e8a..290308c9dd92 100644
--- a/include/uapi/linux/usb/functionfs.h
+++ b/include/uapi/linux/usb/functionfs.h
@@ -25,6 +25,7 @@ enum functionfs_flags {
 	FUNCTIONFS_EVENTFD = 32,
 	FUNCTIONFS_ALL_CTRL_RECIP = 64,
 	FUNCTIONFS_CONFIG0_SETUP = 128,
+	FUNCTIONFS_RW_PROXY_EPS = 256,
 };
 
 /* Descriptor of an non-audio endpoint */
-- 
2.54.0.1136.gdb2ca164c4-goog
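
The pairing and naming rules described in the documentation hunk of this
patch can be sketched in plain userspace C. This is an illustrative model,
not the kernel code: the helper names rw_proxy_pairs_valid(),
rw_proxy_epfiles_count() and rw_proxy_name() are hypothetical, and the
address array is 0-based here whereas the kernel's eps_addrmap is 1-based.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bit 7 of bEndpointAddress: set means IN, clear means OUT. */
#define USB_ENDPOINT_DIR_MASK	0x80

/*
 * Model of the pairing rule: the endpoint count must be even and every
 * adjacent pair must hold exactly one IN and one OUT endpoint.
 */
static bool rw_proxy_pairs_valid(const uint8_t *addrs, size_t count)
{
	size_t i;

	if (count % 2)
		return false;

	for (i = 0; i < count; i += 2) {
		if ((addrs[i] & USB_ENDPOINT_DIR_MASK) ==
		    (addrs[i + 1] & USB_ENDPOINT_DIR_MASK))
			return false;
	}
	return true;
}

/* Total endpoint files: one per endpoint plus one rw_proxy file per pair. */
static size_t rw_proxy_epfiles_count(size_t eps_count)
{
	return eps_count + eps_count / 2;
}

/* Build the rw_proxy name from the pair's first endpoint, e.g. "ep1_rw". */
static int rw_proxy_name(char *buf, size_t len, const char *first_ep_name)
{
	return snprintf(buf, len, "%s_rw", first_ep_name);
}
```

The longest name under FUNCTIONFS_VIRTUAL_ADDR, such as "ep02_rw", is seven
characters plus the terminating NUL, which matches the name[] array growing
from 5 to 8 bytes in the patch.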



* [PATCH v2 3/4] usb: gadget: f_fs: Add zero-length packet ioctl
From: Neill Kapron @ 2026-06-19  4:06 UTC
  To: gregkh, corbet, skhan
  Cc: linux-usb, linux-doc, linux-kernel, kernel-team, Neill Kapron
In-Reply-To: <20260619040609.4010746-1-nkapron@google.com>

When transferring data from a USB gadget to a host, a transfer is
considered complete when the host receives a short packet (a packet
smaller than wMaxPacketSize). If the total transfer length is an
exact multiple of wMaxPacketSize, a Zero-Length Packet (ZLP) must
be appended to signal the end of the transfer.

FunctionFS currently provides no mechanism for userspace to instruct
the kernel to set the `req->zero` flag on transfers. Userspace
workarounds, such as manually submitting separate 0-byte requests,
may not be available for legacy protocols which must maintain write
behavior compatibility when moved to functionfs implementations.

To resolve this, introduce a new ioctl, FUNCTIONFS_ENDPOINT_ENABLE_ZLP,
which takes a pointer to a __u32 flag. When enabled, all subsequent
transfers on that endpoint will have the `req->zero` flag set, allowing
the underlying USB Device Controller (UDC) hardware to automatically
append a ZLP only when the transfer length is an exact multiple of
wMaxPacketSize. For logical transfers chunked across multiple requests,
userspace can dynamically toggle this flag, enabling it only prior to
submitting the final chunk.

The flag defaults to false to maintain backward compatibility. Once
enabled, the state is persistent for the lifetime of the endpoint and
will not be reset by opening or closing the endpoint file descriptors.

Assisted-by: Gemini-CLI:gemini-3.1-pro-preview
Signed-off-by: Neill Kapron <nkapron@google.com>
---
 Documentation/usb/functionfs.rst    | 24 ++++++++++++++++++++++++
 drivers/usb/gadget/function/f_fs.c  | 25 ++++++++++++++++++++++++-
 include/uapi/linux/usb/functionfs.h | 23 +++++++++++++++++++++++
 3 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/Documentation/usb/functionfs.rst b/Documentation/usb/functionfs.rst
index f7487b0d8057..582e53549d5b 100644
--- a/Documentation/usb/functionfs.rst
+++ b/Documentation/usb/functionfs.rst
@@ -72,6 +72,30 @@ have been written to their ep0's.
 Conversely, the gadget is unregistered after the first USB function
 closes its endpoints.
 
+Endpoint IOCTLs
+===============
+
+FunctionFS supports additional IOCTLs that can be performed on data endpoints
+(i.e. not ep0). For a full list of these IOCTLs, please refer to the documentation
+in ``include/uapi/linux/usb/functionfs.h``.
+
+One such IOCTL is:
+
+  ``FUNCTIONFS_ENDPOINT_ENABLE_ZLP(__u32 *)``
+    Enable or disable automatic zero-length packet (ZLP) appending for the
+    endpoint. The argument is a pointer to a __u32: 0 to disable, non-zero to
+    enable. When enabled, the kernel will automatically append a ZLP at the end
+    of a transfer if the payload length is an exact multiple of the endpoint's
+    max packet size. This is useful for compatibility with legacy protocols
+    which require automatic ZLP appending to data written from userspace. This
+    IOCTL can only be used on IN endpoints. It can be called at any time after
+    the FunctionFS instance is active, even before the host has connected or
+    enabled the endpoint. Returns zero on success, or a negative errno value on
+    error:
+
+    * ``-ENODEV``: The FunctionFS instance is not active.
+    * ``-EINVAL``: The endpoint is not an IN endpoint.
+    * ``-EFAULT``: Invalid user space pointer for the argument.
 DMABUF interface
 ================
 
diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c
index 374ab36eaaa3..07aba722dd5b 100644
--- a/drivers/usb/gadget/function/f_fs.c
+++ b/drivers/usb/gadget/function/f_fs.c
@@ -224,7 +224,7 @@ struct ffs_epfile {
 	unsigned char			in;	/* P: ffs->eps_lock */
 	unsigned char			isoc;	/* P: ffs->eps_lock */
 
-	unsigned char			_pad;
+	u8				zlp_enabled; /* P: ffs->eps_lock */
 
 	/* Protects dmabufs */
 	struct mutex			dmabufs_mutex;
@@ -1114,6 +1114,8 @@ static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data)
 			req->buf = data;
 			req->num_sgs = 0;
 		}
+
+		req->zero = epfile->zlp_enabled;
 		req->length = data_len;
 
 		io_data->buf = data;
@@ -1165,6 +1167,8 @@ static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data)
 			req->buf = data;
 			req->num_sgs = 0;
 		}
+
+		req->zero = epfile->zlp_enabled;
 		req->length = data_len;
 
 		io_data->buf = data;
@@ -1707,6 +1711,7 @@ static int ffs_dmabuf_transfer(struct file *file,
 
 	/* Now that the dma_fence is in place, queue the transfer. */
 
+	usb_req->zero = epfile->zlp_enabled;
 	usb_req->length = req->length;
 	usb_req->buf = NULL;
 	usb_req->sg = priv->sgt->sgl;
@@ -1754,6 +1759,7 @@ static long ffs_epfile_ioctl(struct file *file, unsigned code,
 	struct ffs_epfile *epfile = file->private_data;
 	struct ffs_ep *ep;
 	int ret;
+	__u32 enable_zlp = 0;
 
 	if (WARN_ON(epfile->ffs->state != FFS_ACTIVE))
 		return -ENODEV;
@@ -1786,6 +1792,23 @@ static long ffs_epfile_ioctl(struct file *file, unsigned code,
 
 		return ffs_dmabuf_transfer(file, &req);
 	}
+	/*
+	 * We handle this IOCTL before ffs_epfile_wait_ep() to allow userspace
+	 * to configure ZLP behavior immediately without blocking indefinitely
+	 * while waiting for the USB host to connect and enable the endpoint.
+	 */
+	case FUNCTIONFS_ENDPOINT_ENABLE_ZLP:
+		if (!epfile->in)
+			return -EINVAL;
+
+		if (copy_from_user(&enable_zlp, (void __user *)value, sizeof(enable_zlp)))
+			return -EFAULT;
+
+		spin_lock_irq(&epfile->ffs->eps_lock);
+		epfile->zlp_enabled = !!enable_zlp;
+		spin_unlock_irq(&epfile->ffs->eps_lock);
+
+		return 0;
 	default:
 		break;
 	}
diff --git a/include/uapi/linux/usb/functionfs.h b/include/uapi/linux/usb/functionfs.h
index beef1752e36e..06134dca4e8a 100644
--- a/include/uapi/linux/usb/functionfs.h
+++ b/include/uapi/linux/usb/functionfs.h
@@ -414,4 +414,27 @@ struct usb_functionfs_event {
 #define FUNCTIONFS_DMABUF_TRANSFER	_IOW('g', 133, \
 					     struct usb_ffs_dmabuf_transfer_req)
 
+/*
+ * Enable or disable automatic zero-length packet (ZLP) appending for the
+ * endpoint. The argument is a pointer to a __u32: 0 to disable, non-zero to
+ * enable.
+ *
+ * When enabled, the kernel will automatically append a ZLP at the end of
+ * a transfer if the payload length is an exact multiple of the endpoint's
+ * max packet size.
+ *
+ * This is useful for compatibility with legacy protocols which require
+ * automatic ZLP appending to data written from userspace.
+ *
+ * This ioctl can only be used on IN endpoints. It can be called at any time
+ * after the FunctionFS instance is active, even before the host has connected
+ * or enabled the endpoint.
+ *
+ * Returns zero on success, or a negative errno value on error:
+ * -ENODEV: The FunctionFS instance is not active.
+ * -EINVAL: The endpoint is not an IN endpoint.
+ * -EFAULT: Invalid user space pointer for the argument.
+ */
+#define	FUNCTIONFS_ENDPOINT_ENABLE_ZLP	_IOW('g', 134, __u32)
+
 #endif /* _UAPI__LINUX_FUNCTIONFS_H__ */
-- 
2.54.0.1136.gdb2ca164c4-goog
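
From userspace, the ioctl added by this patch might be used roughly as
follows. This is a hedged sketch: the ioctl number is copied from the patch
because released headers do not carry it, uint32_t stands in for __u32, and
the helpers ffs_set_zlp() and transfer_needs_zlp() are hypothetical names.
Treating an empty transfer as not needing an extra ZLP is an assumption of
this sketch, not something the patch states.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Defined by the patch; repeated here since released headers lack it. */
#ifndef FUNCTIONFS_ENDPOINT_ENABLE_ZLP
#define FUNCTIONFS_ENDPOINT_ENABLE_ZLP	_IOW('g', 134, uint32_t)
#endif

/*
 * A ZLP is needed only when a non-empty transfer ends exactly on a packet
 * boundary, so the host never sees a short packet to end the transfer.
 */
static bool transfer_needs_zlp(size_t len, size_t max_packet)
{
	return max_packet != 0 && len != 0 && len % max_packet == 0;
}

/*
 * Enable or disable automatic ZLP appending on an IN endpoint descriptor.
 * Returns 0 on success, or -1 with errno set by ioctl().
 */
static int ffs_set_zlp(int ep_fd, bool enable)
{
	uint32_t flag = enable ? 1 : 0;

	return ioctl(ep_fd, FUNCTIONFS_ENDPOINT_ENABLE_ZLP, &flag);
}
```

For a logical transfer split across several write() calls, a caller would
leave the flag off for the leading chunks and call ffs_set_zlp(fd, true)
just before the final one, as the commit message describes.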



* [PATCH v2 2/4] usb: gadget: f_fs: Tie read_buffer lifetime to ffs_epfile
From: Neill Kapron @ 2026-06-19  4:06 UTC
  To: gregkh, corbet, skhan, Felipe Balbi, Michal Nazarewicz
  Cc: linux-usb, linux-doc, linux-kernel, kernel-team, Neill Kapron,
	stable
In-Reply-To: <20260619040609.4010746-1-nkapron@google.com>

Currently, ffs_epfile_release unconditionally frees the endpoint's
read_buffer when a file descriptor is closed. If userspace explicitly
opens the endpoint multiple times and closes one, the read_buffer is
destroyed. This can lead to silent data loss if other file descriptors
are still actively reading from the endpoint.

By tying the lifetime of the read_buffer to the ffs_epfile structure itself
(which is destroyed when the functionfs instance is torn down in
ffs_epfiles_destroy), we eliminate the brittle dependency on open/release
calls while correctly matching the conceptual lifetime of unread data on
the hardware endpoint.

Fixes: 9353afbbfa7b ("usb: gadget: f_fs: buffer data from ‘oversized’ OUT requests")
Cc: stable@vger.kernel.org
Assisted-by: Antigravity:gemini-3.1-pro
Signed-off-by: Neill Kapron <nkapron@google.com>
---
 drivers/usb/gadget/function/f_fs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c
index 38e36faefe92..374ab36eaaa3 100644
--- a/drivers/usb/gadget/function/f_fs.c
+++ b/drivers/usb/gadget/function/f_fs.c
@@ -1374,7 +1374,6 @@ ffs_epfile_release(struct inode *inode, struct file *file)
 
 	mutex_unlock(&epfile->dmabufs_mutex);
 
-	__ffs_epfile_read_buffer_free(epfile);
 	ffs_data_closed(epfile->ffs);
 
 	return 0;
@@ -2390,6 +2389,7 @@ static void ffs_epfiles_destroy(struct super_block *sb,
 
 	for (; count; --count, ++epfile) {
 		BUG_ON(mutex_is_locked(&epfile->mutex));
+		__ffs_epfile_read_buffer_free(epfile);
 		simple_remove_by_name(root, epfile->name, clear_one);
 	}
 
-- 
2.54.0.1136.gdb2ca164c4-goog



* [PATCH v2 1/4] usb: gadget: f_fs: Initialize epfile->in early to fix endpoint direction checks
From: Neill Kapron @ 2026-06-19  4:06 UTC (permalink / raw)
  To: gregkh, corbet, skhan, Paul Cercueil, Simona Vetter,
	Christian König
  Cc: linux-usb, linux-doc, linux-kernel, kernel-team, Neill Kapron,
	stable
In-Reply-To: <20260619040609.4010746-1-nkapron@google.com>

When parsing endpoint descriptors, ffs_data_got_descs() generates the
eps_addrmap which contains the endpoint direction. However, epfile->in
was previously only populated in ffs_func_eps_enable() which executes
upon USB host connection. As a result, early userspace ioctls like
FUNCTIONFS_DMABUF_ATTACH that run before the host connects would see
epfile->in as 0, leading to incorrect DMA directions.

By moving the initialization to ffs_epfiles_create(), epfile->in is
accurate before userspace opens the endpoint files.
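
The direction test itself can be sketched as below. This is only an
illustration, assuming the standard USB convention that bit 7 of
bEndpointAddress is the direction bit (the value the kernel names
USB_ENDPOINT_DIR_MASK, 0x80). EP_DIR_MASK and ep_addr_is_in() are
hypothetical names, defined locally so the snippet builds outside the
kernel tree:

```c
#include <stdint.h>

/*
 * Bit 7 of bEndpointAddress is the direction bit. Assumed to match the
 * kernel's USB_ENDPOINT_DIR_MASK; redefined here under a local name.
 */
#define EP_DIR_MASK 0x80u

/* Hypothetical helper: returns 1 for an IN endpoint, 0 for an OUT endpoint. */
static int ep_addr_is_in(uint8_t ep_addr)
{
	return (ep_addr & EP_DIR_MASK) ? 1 : 0;
}
```

Because the address is known as soon as the descriptors are parsed, the
result does not depend on the host having connected.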

Fixes: 7b07a2a7ca02 ("usb: gadget: functionfs: Add DMABUF import interface")
Cc: stable@vger.kernel.org
Assisted-by: Antigravity:gemini-3.1-pro
Signed-off-by: Neill Kapron <nkapron@google.com>
---
 drivers/usb/gadget/function/f_fs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c
index 75912ce6ab55..38e36faefe92 100644
--- a/drivers/usb/gadget/function/f_fs.c
+++ b/drivers/usb/gadget/function/f_fs.c
@@ -2364,6 +2364,7 @@ static int ffs_epfiles_create(struct ffs_data *ffs)
 			sprintf(epfile->name, "ep%02x", ffs->eps_addrmap[i]);
 		else
 			sprintf(epfile->name, "ep%u", i);
+		epfile->in = (ffs->eps_addrmap[i] & USB_ENDPOINT_DIR_MASK) ? 1 : 0;
 		err = ffs_sb_create_file(ffs->sb, epfile->name,
 					 epfile, &ffs_epfile_operations);
 		if (err) {
@@ -2453,7 +2454,6 @@ static int ffs_func_eps_enable(struct ffs_function *func)
 		ret = usb_ep_enable(ep->ep);
 		if (!ret) {
 			epfile->ep = ep;
-			epfile->in = usb_endpoint_dir_in(ep->ep->desc);
 			epfile->isoc = usb_endpoint_xfer_isoc(ep->ep->desc);
 		} else {
 			break;
-- 
2.54.0.1136.gdb2ca164c4-goog



* [PATCH v2 0/4] usb: gadget: f_fs: Add R/W proxy EPs and ZLP support
From: Neill Kapron @ 2026-06-19  4:06 UTC (permalink / raw)
  To: gregkh, corbet, skhan
  Cc: linux-usb, linux-doc, linux-kernel, kernel-team, Neill Kapron

We are working to deprecate a widely used, out-of-tree gadget driver by
moving the functionality to userspace via functionfs. To do so, we have to
maintain strict compatibility with the legacy driver, as there are many
third-party applications which can’t be modified and rely on this
interface. Specifically, the following requirements must be met:

- The function must expose a single file descriptor to userspace for both
  reads and writes,
- It must block on writes when it cannot handle more data,
- It must handle arbitrary write transaction sizes,
- It must automatically append a zero length packet (ZLP) when the write
  transaction length is an exact multiple of the max packet size.

Initially, we pursued a compatibility layer in userspace which implemented
a socket pair to combine the OUT and IN endpoint files. This approach
proved problematic for several reasons.

To preserve the write transaction boundary for ZLP calculation, we
attempted to use SOCK_SEQPACKET. This created problems as larger
transactions required contiguous buffers to be allocated, and, even if we
relaxed the arbitrary write size requirement and limited writes to 1MB,
the socket would occasionally return -ENOBUFS to the end user if a write
operation was attempted when other sockets on the system were consuming
more than 7MB of the 8MB wmem_max limit.

After significant investigation including switching to SOCK_STREAM and
attempting a heuristic timeout approach, we decided the best path forward
was to pursue a native proxy endpoint approach in the functionfs driver
itself.

This patchset introduces the `FUNCTIONFS_RW_PROXY_EPS` flag to functionfs
which, when set, creates an additional proxy file for reading or writing
to a pair of endpoints. In an attempt to limit the change to the UAPI
surface, we added several constraints to this proxy file. We chose not to
handle ioctls on this proxy file, as the current ioctls do not have a
directionality associated with them, and would require essentially
creating duplicate ioctls with a direction argument. To use this flag, an
even number of in/out endpoints must be created, and a proxy ep is created
for each pair of endpoints in the descriptors.

With this new r/w proxy ep, we are able to transparently hand it to the
end application. However, to match the legacy driver’s ZLP functionality,
a new ioctl is added, `FUNCTIONFS_ENDPOINT_ENABLE_ZLP`. This allows the
userspace functionfs daemon to write the necessary descriptors, configure
the auto ZLP functionality on the IN EP, then hand off the proxy ep to the
application. When enabled, functionfs sets the req->zero flag. The UDC
driver then automatically adds the ZLP if the transaction length % max
packet size is 0.
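
The ZLP condition described above can be sketched as follows. This is an
illustration rather than the UDC driver's code: transfer_needs_zlp() is a
hypothetical helper, and treating an empty write as needing no extra
packet is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: true when a transfer of @len bytes ends exactly on
 * a packet boundary, so the host needs a trailing zero-length packet to
 * see that the transfer is complete.
 */
static bool transfer_needs_zlp(size_t len, size_t max_packet)
{
	/* Assumption: an empty write or an unconfigured size needs no ZLP. */
	if (max_packet == 0 || len == 0)
		return false;

	return (len % max_packet) == 0;
}
```

For example, with a 512-byte max packet size a 1024-byte write would be
followed by a ZLP, while a 1000-byte write already ends in a short packet
and needs none.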

In addition, several bugfix patches are included:
- A patch which fixes an issue if certain ioctls (like the new
  `FUNCTIONFS_ENDPOINT_ENABLE_ZLP` or `FUNCTIONFS_DMABUF_ATTACH`) are
  called prior to the host being connected.
- A patch which moves the read buffer lifecycle from ffs_epfile_release()
  to ffs_epfiles_destroy(), fixing an issue where endpoints which have been
  opened more than once have their read buffer freed on the first close.

This patchset has been tested on a kernel based on 7.1-rc7, as well as a
backported version tested on 6.1. Existing functionfs implementations
continue to work without modification, and the new functionality passes
tests designed for our legacy kernel driver implementation.

---
Changes in V2:
- Added `Cc: stable...` tag to epfile-in early initialization bugfix
- Added `Tie read_buffer lifetime to ffs_epfile` bugfix change to
  address read buffer lifecycle
- Removed 'opened_count' and associated logic to track file open/close
- Reduced `name` char array from 10 to 8 to match size required
- Updated cover letter to reflect above
---

Neill Kapron (4):
  usb: gadget: f_fs: Initialize epfile->in early to fix endpoint
    direction checks
  usb: gadget: f_fs: Tie read_buffer lifetime to ffs_epfile
  usb: gadget: f_fs: Add zero-length packet ioctl
  usb: gadget: f_fs: Introduce rw_proxy file descriptors

 Documentation/usb/functionfs.rst    |  80 ++++++++++++++++++++
 drivers/usb/gadget/function/f_fs.c  | 112 ++++++++++++++++++++++++----
 drivers/usb/gadget/function/u_fs.h  |   8 +-
 include/uapi/linux/usb/functionfs.h |  24 ++++++
 4 files changed, 207 insertions(+), 17 deletions(-)

-- 
2.54.0.1136.gdb2ca164c4-goog



* Re: [PATCH v3 08/12] fs/resctrl: Make info/kernel_mode writable and identify the bound group
From: Babu Moger @ 2026-06-19  1:29 UTC (permalink / raw)
  To: Reinette Chatre, corbet, tony.luck, Dave.Martin, james.morse,
	tglx, bp, dave.hansen
  Cc: skhan, x86, mingo, hpa, akpm, rdunlap, pawan.kumar.gupta,
	feng.tang, dapeng1.mi, kees, elver, lirongqing, paulmck, bhelgaas,
	seanjc, alexandre.chartre, yazen.ghannam, peterz, chang.seok.bae,
	kim.phillips, xin, naveen, thomas.lendacky, linux-doc,
	linux-kernel, eranian, peternewman
In-Reply-To: <57f6324b-6340-4633-b3a0-b40683a5ec12@intel.com>

Hi Reinette,

On 6/16/26 18:42, Reinette Chatre wrote:
> Hi Babu,
> 
> On 4/30/26 4:24 PM, Babu Moger wrote:
>> info/kernel_mode lists the kernel-mode CLOSID/RMID policies the kernel
> 
> (also here please drop the x86 specific details and consider the resctrl
> fs changes to be valid from MPAM perspective also)

Sure.

> 
>> supports and the one currently active, but user space has no way to
>> switch policies or rebind to a different rdtgroup, and the file does
>> not name the group that owns the kernel CLOSID/RMID.
> 
> This adds a new feature. No need to describe this change as a bugfix.

ok.

> 
>>
>> Make info/kernel_mode writable.  The format used by both read and
>> write is one line per mode:
> 
> This sounds like multiple modes can be written to the file as long as they
> are separated by newline? I do not think it should be needed to support
> write of more than one mode at a time.

Will change it.

> 
>>
>>    inherit_ctrl_and_mon:group=none
>>    [global_assign_ctrl_inherit_mon_per_cpu:group=g1//]
>>    global_assign_ctrl_assign_mon_per_cpu:group=none
>>
>> The active mode is wrapped in "[...]" and ":group=<ctrl>/<mon>/" names
>> the bound rdtgroup ("//" for the default control group).  Inactive
>> modes report ":group=none".  Documented in
>> Documentation/filesystems/resctrl.rst.
> 
> Above describes the output of the file. This changelog can just focus on
> what needs to be supported when user space writes to the file.
> 
>>
>> The write path strims input, strips the optional "[...]", validates
> 
> strims?
> 
> Wait, why support the brackets as input? This seems unnecessary.

Will remove it.

> 
>> the mode against resctrl_kcfg.kmode, and resolves the optional
>> ":group=" suffix via the new helper rdtgroup_by_kmode_path().  An
>> omitted suffix or an INHERIT-mode write binds to the default group.
>> On success, rdtgroup_config_kmode_clear() tears down the previous
>> binding and rdtgroup_config_kmode() programs the new one before
>> resctrl_kcfg.k_rdtgrp and resctrl_kcfg.kmode_cur are updated under
>> rdtgroup_mutex.  Allocation failures in the helpers are propagated so
>> the write fails atomically.
> 
> This also reads like it just describes the code.
> 

Will re-write it.

>>
>> Add struct rdtgroup fields kmode and kmode_cpu_mask to track the
>> per-group binding.
> 
> Please do not just describe the code but *why* this change is needed and
> what it means and how it is used.

ok.

> 
>>
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> ---
>> v3: New patch to handle the changed interface file info/kernel_mode.
>> ---
>>   Documentation/filesystems/resctrl.rst |  51 ++++
>>   fs/resctrl/internal.h                 |   6 +
>>   fs/resctrl/rdtgroup.c                 | 375 +++++++++++++++++++++++++-
>>   3 files changed, 431 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
>> index b003bed339fd..89fbf8b4fb2a 100644
>> --- a/Documentation/filesystems/resctrl.rst
>> +++ b/Documentation/filesystems/resctrl.rst
>> @@ -522,6 +522,57 @@ conveyed in the error returns from file operations. E.g.
>>   	# cat info/last_cmd_status
>>   	mask f7 has non-consecutive 1-bits
>>   
>> +"kernel_mode":
> 
> (dropping the documentation here since I believe earlier comments apply)

Will rewrite the documentation. Will drop the x86 specific details.

> 
> ...
> 
>> diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
>> index 1a9b29119f88..9435ce663f54 100644
>> --- a/fs/resctrl/internal.h
>> +++ b/fs/resctrl/internal.h
>> @@ -216,6 +216,10 @@ struct mongroup {
>>    * @mon:			mongroup related data
>>    * @mode:			mode of resource group
>>    * @mba_mbps_event:		input monitoring event id when mba_sc is enabled
>> + * @kmode:			true if this group is currently bound as the kernel-mode
>> + *				CLOSID/RMID owner (resctrl_kcfg.k_rdtgrp)
> 
> (drop CLOSID/RMID)

ack.

> 
>> + * @kmode_cpu_mask:		CPUs scoped for this group's kernel-mode binding;
>> + *				when empty, all online CPUs are used
> 
> Why does "empty" signify "all online CPUs"? This complicates implementation and
> creates different interface from existing CPUs interface of resource groups.

Will change it. It will display all the bound CPUs.

> 
>>    * @plr:			pseudo-locked region
>>    */
>>   struct rdtgroup {
>> @@ -229,6 +233,8 @@ struct rdtgroup {
>>   	struct mongroup			mon;
>>   	enum rdtgrp_mode		mode;
>>   	enum resctrl_event_id		mba_mbps_event;
>> +	bool				kmode;
>> +	struct cpumask			kmode_cpu_mask;
>>   	struct pseudo_lock_region	*plr;
>>   };
>>   
>> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
>> index 9cdcfa64c4a2..5383b4eb23ed 100644
>> --- a/fs/resctrl/rdtgroup.c
>> +++ b/fs/resctrl/rdtgroup.c
>> @@ -1055,6 +1055,378 @@ static int resctrl_kernel_mode_show(struct kernfs_open_file *of,
>>   	return 0;
>>   }
>>   
>> +/**
>> + * rdtgroup_config_kmode() - Push @rdtgrp's kernel CLOSID/RMID to hardware
>> + * @rdtgrp:	Resctrl group whose CLOSID/RMID should be programmed.
>> + *
>> + * Derives CLOSID/RMID from @rdtgrp->type:
>> + *   - RDTMON_GROUP: parent control group's CLOSID with the monitor group's RMID.
> 
> This seems unnecessary since when a monitor group is created its closid is inherited
> from its control group?

ok.

> 
>> + *   - RDTCTRL_GROUP: the control group's own CLOSID and default RMID.
>> + *
>> + * Calls resctrl_arch_configure_kmode() with the kernel-mode binding enabled
>> + * on the online subset of @rdtgrp->kmode_cpu_mask (or all online CPUs when
>> + * that mask is empty), and disabled on the complementary online CPUs so
>> + * stale enable bits from a previously bound group are cleared in the same
>> + * reprogram step.  The caller (resctrl_kernel_mode_write()) is responsible
>> + * for validating that the (kmode, group type) pair is permitted before
>> + * invoking this helper.
>> + *
>> + * Context: Caller must hold rdtgroup_mutex.
> 
> Please use lockdep_assert_held(&rdtgroup_mutex) instead. See "Documenting locking requirements"
> in Documentation/process/maintainer-tip.rst

ok. Sure.

> 
>> + *
>> + * Return: 0 on success, -EINVAL for a pseudo-locked group, -ENOMEM if
>> + * cpumask allocation fails.
>> + */
>> +static int rdtgroup_config_kmode(struct rdtgroup *rdtgrp)
>> +{
>> +	cpumask_var_t enable_mask, disable_mask;
>> +	u32 closid, rmid;
>> +	bool need_disable;
> 
> (needs reverse fir)

Sure.

> 
>> +
>> +	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
>> +		rdt_last_cmd_puts("Resource group is pseudo-locked\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (!zalloc_cpumask_var(&enable_mask, GFP_KERNEL))
>> +		return -ENOMEM;
>> +
>> +	need_disable = !cpumask_empty(&rdtgrp->kmode_cpu_mask);
> 
> As I understand rdtgroup_config_kmode() is called when the kernel mode is switched.
> Also, earlier patches made it explicit that "Default scope is all online CPUs".
> 
> It is not clear to me how kmode_cpu_mask is initialized here ... it almost seems as though
> if a resource was associated with a mode at some point and received some CPU changes then
> when the mode switches between some other resource groups and then back to the original
> then the old cpu_mask will be used on the mode switch. Should the resource group's cpu_mask
> not be re-initialized to all online CPUs? If done then all of this cpu_mask wrangling seems
> unnecessary to me, just use all online CPUs?
> 

Yes. That is correct. It needs to be changed.

> 
>> +	if (need_disable && !zalloc_cpumask_var(&disable_mask, GFP_KERNEL)) {
>> +		free_cpumask_var(enable_mask);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	if (rdtgrp->type == RDTMON_GROUP) {
>> +		closid = rdtgrp->mon.parent->closid;
>> +		rmid = rdtgrp->mon.rmid;
>> +	} else {
>> +		closid = rdtgrp->closid;
>> +		rmid = rdtgrp->mon.rmid;
>> +	}
> 
> Considering MON group inherits the CLOSID of its parent, can above be simplified
> to just be?
> 	closid = rdtgrp->closid;
> 	rmid = rdtgrp->mon.rmid;
> 

Yes.




>> +
>> +	/*
>> +	 * Empty kmode_cpu_mask: enable on every online CPU.  Otherwise enable
>> +	 * only CPUs in the group mask and explicitly clear on other online CPUs
>> +	 * so a previously bound group's enable bits don't linger.
>> +	 */
>> +	if (!need_disable) {
>> +		cpumask_copy(enable_mask, cpu_online_mask);
>> +	} else {
>> +		cpumask_copy(enable_mask, &rdtgrp->kmode_cpu_mask);
>> +		cpumask_andnot(disable_mask, cpu_online_mask, &rdtgrp->kmode_cpu_mask);
>> +	}
>> +
>> +	if (!cpumask_empty(enable_mask))
>> +		resctrl_arch_configure_kmode(enable_mask, closid, rmid, true);
>> +
>> +	if (need_disable && !cpumask_empty(disable_mask))
>> +		resctrl_arch_configure_kmode(disable_mask, closid, rmid, false);
>> +
>> +	rdtgrp->kmode = true;
>> +
>> +	free_cpumask_var(enable_mask);
>> +	if (need_disable)
>> +		free_cpumask_var(disable_mask);
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * rdtgroup_config_kmode_clear() - Tear down the kernel-mode binding on @rdtgrp
>> + * @rdtgrp:	Resctrl group whose kernel-mode binding is being released.
>> + *		May be %NULL when no group is currently bound, in which case
>> + *		this is a no-op.
>> + * @kmode:	Kernel-mode policy currently active on @rdtgrp, as a
>> + *		BIT(&enum resctrl_kernel_modes) value.  When this is
>> + *		BIT(INHERIT_CTRL_AND_MON) the hardware tear-down is skipped
>> + *		because no MSR was previously programmed.
>> + *
>> + * Disables the kernel-mode binding on the CPUs @rdtgrp covers (its
>> + * @kmode_cpu_mask, or all online CPUs when that mask is empty) and resets
>> + * the per-group bookkeeping (@kmode and @kmode_cpu_mask).  This is the
>> + * disable counterpart of rdtgroup_config_kmode() and exists so that a write
>> + * that transitions the active mode to BIT(INHERIT_CTRL_AND_MON) -- which
>> + * skips rdtgroup_config_kmode() entirely -- still tears down the previously
>> + * bound group instead of leaving stale enable bits behind.
>> + *
>> + * On allocation failure the function returns -ENOMEM and leaves both the
>> + * hardware state and @rdtgrp's bookkeeping unchanged so the caller can fail
>> + * the operation atomically and last_cmd_status reflects reality.
>> + *
>> + * Context: Caller must hold rdtgroup_mutex.
>> + *
>> + * Return: 0 on success (including the @rdtgrp == %NULL and INHERIT cases),
>> + * -ENOMEM if cpumask allocation fails.
>> + */
>> +static int rdtgroup_config_kmode_clear(struct rdtgroup *rdtgrp, int kmode)
>> +{
>> +	cpumask_var_t disable_mask;
>> +	u32 closid, rmid;
>> +
>> +	if (!rdtgrp)
>> +		return 0;
>> +
>> +	if (kmode == BIT(INHERIT_CTRL_AND_MON))
>> +		goto out_clear;
>> +
>> +	if (!zalloc_cpumask_var(&disable_mask, GFP_KERNEL))
>> +		return -ENOMEM;
>> +
>> +	if (rdtgrp->type == RDTMON_GROUP) {
>> +		closid = rdtgrp->mon.parent->closid;
>> +		rmid = rdtgrp->mon.rmid;
>> +	} else {
>> +		closid = rdtgrp->closid;
>> +		rmid = rdtgrp->mon.rmid;
>> +	}
> 

I can directly use it like below. I don't need to check for RDTMON_GROUP.

	closid = rdtgrp->closid;
  	rmid = rdtgrp->mon.rmid;


> Same comment as above ... but actually, why is closid/rmid needed at all? This
> function is intended to *reset* the kernel mode so needing a valid/active closid and
> rmid does not look right.

This is a bit tricky. I may need CLOSID/RMID in 
resctrl_arch_configure_kmode(). According to the specification, only the 
PLZA_EN field is allowed to differ across CPUs where PLZA is enabled; 
all other fields must remain consistent across CPUs within the same 
domain. If CLOSID/RMID are not passed, it could result in inconsistent 
values across CPUs.

> 
>> +
>> +	if (cpumask_empty(&rdtgrp->kmode_cpu_mask))
>> +		cpumask_copy(disable_mask, cpu_online_mask);
>> +	else
>> +		cpumask_copy(disable_mask, &rdtgrp->kmode_cpu_mask);
> 
> Having kmode_cpu_mask accurately reflect the online CPUs will simplify this to
> not need any of this wrangling and kmode_cpu_mask can just be used directly.

Agree.

> 
>> +
>> +	resctrl_arch_configure_kmode(disable_mask, closid, rmid, false);
>> +	free_cpumask_var(disable_mask);
>> +
>> +out_clear:
>> +	cpumask_clear(&rdtgrp->kmode_cpu_mask);
>> +	rdtgrp->kmode = false;
>> +	return 0;
>> +}
>> +
>> +/**
>> + * rdtgroup_by_kmode_path() - Resolve a "<ctrl>/<mon>/" path to an rdtgroup
>> + * @ctrl_name:	Control-group name, or "" for the default control group.
>> + * @mon_name:	Monitor-group name, or "" to select the control group itself.
>> + *
>> + * Matches the path syntax emitted by resctrl_kernel_mode_show():
>> + *   "//"            - the default control group
>> + *   "<ctrl>//"      - control group @ctrl_name
>> + *   "/<mon>/"       - monitor group @mon_name under the default control group
>> + *   "<ctrl>/<mon>/" - monitor group @mon_name under control group @ctrl_name
>> + *
>> + * Context: Caller must hold rdtgroup_mutex.
> 
> (lockdep)
> 

Sure.

>> + *
>> + * Return: Pointer to the matching rdtgroup, &rdtgroup_default when both
>> + * names are empty (the show form "//"), or NULL if no such group exists.
>> + */
>> +static struct rdtgroup *rdtgroup_by_kmode_path(const char *ctrl_name,
>> +					       const char *mon_name)
>> +{
>> +	struct rdtgroup *rdtg, *parent = NULL, *crg;
>> +
>> +	/* Show emits "//" for the default control group; round-trip it here. */
>> +	if (!*ctrl_name && !*mon_name)
>> +		return &rdtgroup_default;
>> +
>> +	/* Control-group-only form: "<ctrl>//". */
>> +	if (!*mon_name) {
>> +		list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
>> +			if (rdtg->type != RDTCTRL_GROUP)
>> +				continue;
>> +			if (!strcmp(rdt_kn_name(rdtg->kn), ctrl_name))
>> +				return rdtg;
>> +		}
>> +		return NULL;
>> +	}
>> +
>> +	/* Monitor-group form: locate the parent control group first. */
>> +	if (!*ctrl_name) {
>> +		parent = &rdtgroup_default;
>> +	} else {
>> +		list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
>> +			if (rdtg->type != RDTCTRL_GROUP)
>> +				continue;
>> +			if (!strcmp(rdt_kn_name(rdtg->kn), ctrl_name)) {
>> +				parent = rdtg;
>> +				break;
>> +			}
>> +		}
>> +		if (!parent)
>> +			return NULL;
>> +	}
>> +
>> +	list_for_each_entry(crg, &parent->mon.crdtgrp_list, mon.crdtgrp_list)
>> +		if (!strcmp(rdt_kn_name(crg->kn), mon_name))
>> +			return crg;
>> +	return NULL;
>> +}
>> +
>> +/**
>> + * resctrl_kernel_mode_write() - Select kernel mode and bind group via info/kernel_mode
>> + * @of:		kernfs file handle.
>> + * @buf:	One line in the same format emitted by resctrl_kernel_mode_show(),
>> + *		i.e. "<mode>[:group=<ctrl>/<mon>/]" with an optional surrounding
>> + *		"[...]"; must end with a newline.  The ":group=<spec>" suffix is
>> + *		optional -- when omitted the default control group
>> + *		(&rdtgroup_default) is used.
>> + * @nbytes:	Length of @buf.
>> + * @off:	File offset (unused).
>> + *
>> + * Parses @buf, validates that <mode> is listed in resctrl_mode_str[] and is
>> + * supported by the platform (resctrl_kcfg.kmode), resolves <ctrl>/<mon>/ to
>> + * an existing rdtgroup (or picks &rdtgroup_default if no group was specified
>> + * or if the new mode is INHERIT), clears any previous binding via
>> + * rdtgroup_config_kmode_clear(), programs hardware via
>> + * rdtgroup_config_kmode() when @kmode is not BIT(INHERIT_CTRL_AND_MON), and
>> + * on success updates resctrl_kcfg.k_rdtgrp and resctrl_kcfg.kmode_cur.  The
>> + * display-only "group=none" form produced by show for inactive modes is
>> + * rejected.  Errors are reported in last_cmd_status.
>> + *
>> + * Return: @nbytes on success, negative errno with last_cmd_status set on error.
>> + */
>> +static ssize_t resctrl_kernel_mode_write(struct kernfs_open_file *of,
>> +					 char *buf, size_t nbytes, loff_t off)
>> +{
>> +	char *mode_str, *group_str, *slash;
>> +	const char *ctrl_name, *mon_name;
>> +	struct rdtgroup *rdtgrp;
>> +	int ret = 0;
>> +	size_t len;
>> +	u32 kmode;
>> +	int i;
>> +
>> +	if (nbytes == 0 || buf[nbytes - 1] != '\n')
>> +		return -EINVAL;
>> +	buf[nbytes - 1] = '\0';
>> +
>> +	/* Tolerate surrounding whitespace before the bracket/mode parsing. */
>> +	buf = strim(buf);
>> +	len = strlen(buf);
>> +
>> +	/* Strip the optional "[...]" that show uses to mark the active line. */
>> +	if (len >= 2 && buf[0] == '[' && buf[len - 1] == ']') {
>> +		buf[len - 1] = '\0';
>> +		buf++;
>> +		len -= 2;
>> +	}
> 
> I do not think the brackets should be valid input.

Sure.

> 
>> +
>> +	/*
>> +	 * Split "<mode>:group=<spec>"; the ":group=<spec>" suffix is optional
>> +	 * and when omitted the default control group (&rdtgroup_default) is used.
>> +	 */
>> +	group_str = strstr(buf, ":group=");
>> +	if (group_str) {
>> +		*group_str = '\0';
>> +		group_str += strlen(":group=");
>> +	}
>> +	mode_str = buf;
>> +
>> +	mutex_lock(&rdtgroup_mutex);
>> +	rdt_last_cmd_clear();
>> +
>> +	for (i = 0; i < RESCTRL_NUM_KERNEL_MODES; i++)
>> +		if (!strcmp(mode_str, resctrl_mode_str[i]))
>> +			break;
>> +	if (i == RESCTRL_NUM_KERNEL_MODES) {
>> +		rdt_last_cmd_puts("Unknown kernel mode\n");
>> +		ret = -EINVAL;
>> +		goto out_unlock;
>> +	}
>> +
>> +	if (!(resctrl_kcfg.kmode & BIT(i))) {
>> +		rdt_last_cmd_puts("Kernel mode not available\n");
>> +		ret = -EINVAL;
>> +		goto out_unlock;
>> +	}
>> +
>> +	kmode = BIT(i);
> 
> Can kmode be of enum type to be assigned the actual enum value to avoid all these BIT(enum value) usages?

You mean?

enum resctrl_kernel_modes {
	INHERIT_CTRL_AND_MON		= 1U << 0,  /* 1 */
	GLOBAL_ASSIGN_CTRL_INHERIT_MON	= 1U << 1,  /* 2 */
	GLOBAL_ASSIGN_CTRL_ASSIGN_MON	= 1U << 2,  /* 4 */
};

#define RESCTRL_NUM_KERNEL_MODES  3
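
If the enumerators carry the bit values directly, as in the enum above,
the availability check could drop the BIT() wrapper. A rough sketch of
one reading of the suggestion; kmode_supported() is a hypothetical
helper and @supported stands in for a bitmask such as
resctrl_kcfg.kmode:

```c
#include <stdbool.h>

/* Bit-valued enumerators, copied from the proposal above. */
enum resctrl_kernel_modes {
	INHERIT_CTRL_AND_MON		= 1U << 0,
	GLOBAL_ASSIGN_CTRL_INHERIT_MON	= 1U << 1,
	GLOBAL_ASSIGN_CTRL_ASSIGN_MON	= 1U << 2,
};

/* Hypothetical helper: is @mode present in the @supported bitmask? */
static bool kmode_supported(unsigned int supported,
			    enum resctrl_kernel_modes mode)
{
	return (supported & mode) != 0;
}
```

Comparisons such as "kmode == INHERIT_CTRL_AND_MON" would then also work
without BIT().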

> 
>> +
>> +	if (!group_str) {
>> +		/* No ":group=" suffix: fall back to the default control group. */
>> +		rdtgrp = &rdtgroup_default;
>> +	} else if (!strcmp(group_str, "none")) {
>> +		/* Display-only placeholder emitted by show; not selectable. */
>> +		rdt_last_cmd_puts("Cannot bind to 'none' group\n");
>> +		ret = -EINVAL;
>> +		goto out_unlock;
>> +	} else {
>> +		/* Require exactly "<ctrl>/<mon>/" - two '/' with the second terminating. */
> 
> User should not be expected to provide monitor group when the monitoring is inherited.

Yes. Will add that check later. I will update the comment here.

> 
>> +		slash = strchr(group_str, '/');
>> +		if (!slash) {
>> +			rdt_last_cmd_puts("Group must be <ctrl>/<mon>/\n");
>> +			ret = -EINVAL;
>> +			goto out_unlock;
>> +		}
>> +		*slash = '\0';
>> +		ctrl_name = group_str;
>> +		mon_name = slash + 1;
>> +		slash = strchr(mon_name, '/');
>> +		if (!slash || slash[1] != '\0') {
>> +			rdt_last_cmd_puts("Group must be <ctrl>/<mon>/\n");
>> +			ret = -EINVAL;
>> +			goto out_unlock;
>> +		}
>> +		*slash = '\0';
>> +
>> +		rdtgrp = rdtgroup_by_kmode_path(ctrl_name, mon_name);
>> +		if (!rdtgrp) {
>> +			rdt_last_cmd_puts("Group not found\n");
>> +			ret = -EINVAL;
>> +			goto out_unlock;
>> +		}
>> +	}
>> +
>> +	/*
>> +	 * INHERIT mode binds nothing; force the bound group to the default so
>> +	 * round-trips with show (which prints "group=//") are stable and any
>> +	 * user-supplied :group= suffix is silently normalised.
>> +	 */
>> +	if (kmode == BIT(INHERIT_CTRL_AND_MON))
>> +		rdtgrp = &rdtgroup_default;
> 
> rdtgrp = NULL ?
> 

Yes.

>> +
>> +	/* No-op if the same mode is already active on the same group. */
>> +	if (resctrl_kcfg.kmode_cur == kmode && resctrl_kcfg.k_rdtgrp == rdtgrp)
>> +		goto out_unlock;
>> +
>> +	/*
>> +	 * global_assign_ctrl_assign_mon_per_cpu binds one CLOSID and RMID for
>> +	 * all kernel work (Documentation/filesystems/resctrl.rst uses
>> +	 * "<ctrl>/<mon>/", i.e. an RDTMON_GROUP).
>> +	 *
>> +	 * global_assign_ctrl_inherit_mon_per_cpu assigns one CLOSID globally
>> +	 * while leaving RMID inheritance to user contexts; that uses the
>> +	 * control group's CLOSID slot only, i.e. an RDTCTRL_GROUP.
>> +	 */
>> +	if (kmode == BIT(GLOBAL_ASSIGN_CTRL_ASSIGN_MON_PER_CPU) &&
>> +	    rdtgrp->type != RDTMON_GROUP) {
>> +		rdt_last_cmd_puts("global_assign_ctrl_assign_mon_per_cpu requires a monitor group\n");
>> +		ret = -EINVAL;
>> +		goto out_unlock;
>> +	}
>> +	if (kmode == BIT(GLOBAL_ASSIGN_CTRL_INHERIT_MON_PER_CPU) &&
>> +	    rdtgrp->type != RDTCTRL_GROUP) {
>> +		rdt_last_cmd_puts("global_assign_ctrl_inherit_mon_per_cpu requires a control group\n");
>> +		ret = -EINVAL;
>> +		goto out_unlock;
>> +	}
>> +
>> +	/* Switching to a different group: release the old binding first. */
>> +	if (resctrl_kcfg.k_rdtgrp != rdtgrp) {
>> +		ret = rdtgroup_config_kmode_clear(resctrl_kcfg.k_rdtgrp,
>> +						  resctrl_kcfg.kmode_cur);
>> +		if (ret) {
>> +			rdt_last_cmd_puts("Failed to release previous kernel-mode binding\n");
>> +			goto out_unlock;
>> +		}
>> +	}
>> +
>> +	if (kmode != BIT(INHERIT_CTRL_AND_MON)) {
>> +		ret = rdtgroup_config_kmode(rdtgrp);
>> +		if (ret) {
>> +			rdt_last_cmd_puts("Kernel mode change failed\n");
> 
> If it fails here then previous binding was released successfully but new binding failed. What is
> state of system?

Yeah. Need to change this one. Will use Qinyun Tan's patch.

Thanks
Babu



* [PATCH v8 46/46] KVM: selftests: Update private memory exits test to work with per-gmem attributes
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Sean Christopherson <seanjc@google.com>

Skip setting memory to private in the private memory exits test when using
per-gmem memory attributes, as memory is initialized to private by default
for guest_memfd, and using vm_mem_set_private() on a guest_memfd instance
requires creating guest_memfd with GUEST_MEMFD_FLAG_MMAP (which is totally
doable, but would need to be conditional and is ultimately unnecessary).

Expect an emulated MMIO instead of a memory fault exit when attributes are
per-gmem, as deleting the memslot effectively drops the private status,
i.e. the GPA becomes shared and thus supports emulated MMIO.

Skip the "memslot not private" test entirely, as private vs. shared state
for x86 software-protected VMs comes from the memory attributes themselves,
and so when doing in-place conversions there can never be a disconnect
between the expected and actual states.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/x86/private_mem_kvm_exits_test.c | 36 ++++++++++++++++++----
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c b/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
index 10db9fe6d9063..70ed16066c63e 100644
--- a/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
+++ b/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
@@ -62,8 +62,9 @@ static void test_private_access_memslot_deleted(void)
 
 	virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
 
-	/* Request to access page privately */
-	vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+	/* Request to access page privately. */
+	if (!kvm_has_gmem_attributes)
+		vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
 
 	pthread_create(&vm_thread, NULL,
 		       (void *(*)(void *))run_vcpu_get_exit_reason,
@@ -74,10 +75,26 @@ static void test_private_access_memslot_deleted(void)
 	pthread_join(vm_thread, &thread_return);
 	exit_reason = (u32)(u64)thread_return;
 
-	TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
-	TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
-	TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
-	TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+	/*
+	 * If attributes are tracked per-gmem, deleting the memslot that points
+	 * at the gmem instance effectively makes the memory shared, and so the
+	 * read should trigger emulated MMIO.
+	 *
+	 * If attributes are tracked per-VM, deleting the memslot shouldn't
+	 * affect the private attribute, and so KVM should generate a memory
+	 * fault exit (emulated MMIO on private GPAs is disallowed).
+	 */
+	if (kvm_has_gmem_attributes) {
+		TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MMIO);
+		TEST_ASSERT_EQ(vcpu->run->mmio.phys_addr, EXITS_TEST_GPA);
+		TEST_ASSERT_EQ(vcpu->run->mmio.len, sizeof(u64));
+		TEST_ASSERT_EQ(vcpu->run->mmio.is_write, false);
+	} else {
+		TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+		TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
+		TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
+		TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+	}
 
 	kvm_vm_free(vm);
 }
@@ -88,6 +105,13 @@ static void test_private_access_memslot_not_private(void)
 	struct kvm_vcpu *vcpu;
 	u32 exit_reason;
 
+	/*
+	 * Accessing non-private memory as private with a software-protected VM
+	 * isn't possible when doing in-place conversions.
+	 */
+	if (kvm_has_gmem_attributes)
+		return;
+
 	vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
 					   guest_repeatedly_read);
 

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




* [PATCH v8 45/46] KVM: selftests: Update private_mem_conversions_test to mmap() guest_memfd
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Ackerley Tng <ackerleytng@google.com>

Update the private memory conversions selftest to also test conversions
that are done "in-place" via per-guest_memfd memory attributes. In-place
conversions require the host to be able to mmap() the guest_memfd so that
the host and guest can share the same backing physical memory.

This includes several updates that are conditioned on the system
supporting per-guest_memfd attributes (kvm_has_gmem_attributes):

1. Set up guest_memfd requesting MMAP and INIT_SHARED.

2. With in-place conversions, the host's mapping points directly to the
   guest's memory. When the guest converts a region to private, host access
   to that region is blocked. Update the test to expect a SIGBUS when
   attempting to access the host virtual address (HVA) of private memory.

3. Use vm_mem_set_memory_attributes(), which chooses how to set memory
   attributes based on whether kvm_has_gmem_attributes.

Restrict the test to using VM_MEM_SRC_SHMEM because guest_memfd's required
mmap() flags and page sizes happen to align with those of
VM_MEM_SRC_SHMEM. As long as VM_MEM_SRC_SHMEM is used for src_type,
vm_mem_add() works as intended.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../kvm/x86/private_mem_conversions_test.c         | 44 ++++++++++++++++++----
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
index 289ad10063fca..4308c67952310 100644
--- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
@@ -306,9 +306,12 @@ static void handle_exit_hypercall(struct kvm_vcpu *vcpu)
 	if (do_fallocate)
 		vm_guest_mem_fallocate(vm, gpa, size, map_shared);
 
-	if (set_attributes)
-		vm_set_memory_attributes(vm, gpa, size,
-					 map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE);
+	if (set_attributes) {
+		u64 attrs = map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+		vm_mem_set_memory_attributes(vm, gpa, size, attrs);
+	}
+
 	run->hypercall.ret = 0;
 }
 
@@ -352,8 +355,20 @@ static void *__test_mem_conversions(void *__vcpu)
 				size_t nr_bytes = min_t(size_t, vm->page_size, size - i);
 				u8 *hva = addr_gpa2hva(vm, gpa + i);
 
-				/* In all cases, the host should observe the shared data. */
-				memcmp_h(hva, gpa + i, uc.args[3], nr_bytes);
+				/*
+				 * When using per-guest_memfd memory attributes,
+				 * i.e. in-place conversion, host accesses will
+				 * point at guest memory and should SIGBUS when
+				 * guest memory is private.  When using per-VM
+				 * attributes, i.e. separate backing for shared
+				 * vs. private, the host should always observe
+				 * the shared data.
+				 */
+				if (kvm_has_gmem_attributes &&
+				    uc.args[0] == SYNC_PRIVATE)
+					TEST_EXPECT_SIGBUS(READ_ONCE(*hva));
+				else
+					memcmp_h(hva, gpa + i, uc.args[3], nr_bytes);
 
 				/* For shared, write the new pattern to guest memory. */
 				if (uc.args[0] == SYNC_SHARED)
@@ -382,6 +397,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, u32 nr_v
 	const size_t slot_size = memfd_size / nr_memslots;
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 	pthread_t threads[KVM_MAX_VCPUS];
+	u64 gmem_flags;
 	struct kvm_vm *vm;
 	int memfd, i;
 
@@ -397,12 +413,17 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, u32 nr_v
 
 	vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
 
-	memfd = vm_create_guest_memfd(vm, memfd_size, 0);
+	if (kvm_has_gmem_attributes)
+		gmem_flags = GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED;
+	else
+		gmem_flags = 0;
+
+	memfd = vm_create_guest_memfd(vm, memfd_size, gmem_flags);
 
 	for (i = 0; i < nr_memslots; i++)
 		vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
 			   BASE_DATA_SLOT + i, slot_size / vm->page_size,
-			   KVM_MEM_GUEST_MEMFD, memfd, slot_size * i, 0);
+			   KVM_MEM_GUEST_MEMFD, memfd, slot_size * i, gmem_flags);
 
 	for (i = 0; i < nr_vcpus; i++) {
 		gpa_t gpa =  BASE_DATA_GPA + i * per_cpu_size;
@@ -452,17 +473,24 @@ static void usage(const char *cmd)
 
 int main(int argc, char *argv[])
 {
-	enum vm_mem_backing_src_type src_type = DEFAULT_VM_MEM_SRC;
+	enum vm_mem_backing_src_type src_type;
 	u32 nr_memslots = 1;
 	u32 nr_vcpus = 1;
 	int opt;
 
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
 
+	src_type = kvm_has_gmem_attributes ? VM_MEM_SRC_SHMEM :
+					     DEFAULT_VM_MEM_SRC;
+
 	while ((opt = getopt(argc, argv, "hm:s:n:")) != -1) {
 		switch (opt) {
 		case 's':
 			src_type = parse_backing_src_type(optarg);
+			TEST_ASSERT(!kvm_has_gmem_attributes ||
+				    src_type == VM_MEM_SRC_SHMEM,
+				    "Testing in-place conversions, only %s mem_type supported\n",
+				    vm_mem_backing_src_alias(VM_MEM_SRC_SHMEM)->name);
 			break;
 		case 'n':
 			nr_vcpus = atoi_positive("nr_vcpus", optarg);

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




* [PATCH v8 44/46] KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Ackerley Tng <ackerleytng@google.com>

The TEST_EXPECT_SIGBUS macro is not thread-safe as it uses a global
sigjmp_buf and installs a global SIGBUS signal handler. If multiple threads
execute the macro concurrently, they will race on installing the signal
handler and stomp on other threads' jump buffers, leading to incorrect test
behavior.

Make TEST_EXPECT_SIGBUS thread-safe with the following changes:

Share the KVM tests' global signal handler. sigaction() is process-wide,
so if each thread installs and later restores its own handler, one thread
can remove the handler that another thread just installed, and that
thread's SIGBUS is then reported as an unexpected signal.

The alternative of layering signal handlers was considered, but calling
sigaction() within TEST_EXPECT_SIGBUS() necessarily creates a race. To
avoid adding new setup and teardown routines to do sigaction() and keep
usage of TEST_EXPECT_SIGBUS() simple, share the KVM tests' global signal
handler.

Opportunistically rename report_unexpected_signal to
catchall_signal_handler.

To continue to only expect SIGBUS within specific regions of code, use a
thread-specific variable, expecting_sigbus, to replace installing and
removing signal handlers.

Make the jump buffer that holds each thread's saved execution context,
expect_sigbus_jmpbuf (a sigjmp_buf), a thread-specific variable.

As part of TEST_EXPECT_SIGBUS(), assert the prerequisite for this setup,
that the current signal handler is the catchall_signal_handler.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/include/test_util.h | 32 +++++++++++++------------
 tools/testing/selftests/kvm/lib/kvm_util.c      | 18 ++++++++++----
 tools/testing/selftests/kvm/lib/test_util.c     |  7 ------
 3 files changed, 30 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 51287fac8138a..bd75162ec868d 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -82,21 +82,23 @@ do {									\
 	__builtin_unreachable(); \
 } while (0)
 
-extern sigjmp_buf expect_sigbus_jmpbuf;
-void expect_sigbus_handler(int signum);
-
-#define TEST_EXPECT_SIGBUS(action)						\
-do {										\
-	struct sigaction sa_old, sa_new = {					\
-		.sa_handler = expect_sigbus_handler,				\
-	};									\
-										\
-	sigaction(SIGBUS, &sa_new, &sa_old);					\
-	if (sigsetjmp(expect_sigbus_jmpbuf, 1) == 0) {				\
-		action;								\
-		TEST_FAIL("'%s' should have triggered SIGBUS", #action);	\
-	}									\
-	sigaction(SIGBUS, &sa_old, NULL);					\
+extern __thread sigjmp_buf expect_sigbus_jmpbuf;
+extern __thread volatile sig_atomic_t expecting_sigbus;
+extern void catchall_signal_handler(int signum);
+
+#define TEST_EXPECT_SIGBUS(action)					\
+do {									\
+	struct sigaction __sa = {};					\
+									\
+	TEST_ASSERT_EQ(sigaction(SIGBUS, NULL, &__sa), 0);		\
+	TEST_ASSERT_EQ(__sa.sa_handler, &catchall_signal_handler);	\
+									\
+	expecting_sigbus = true;					\
+	if (sigsetjmp(expect_sigbus_jmpbuf, 1) == 0) {			\
+		action;							\
+		TEST_FAIL("'%s' should have triggered SIGBUS", #action);\
+	}								\
+	expecting_sigbus = false;					\
 } while (0)
 
 size_t parse_size(const char *size);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 6b304e8a0e0d5..b4f104436875b 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -2292,13 +2292,20 @@ __weak void kvm_selftest_arch_init(void)
 {
 }
 
-static void report_unexpected_signal(int signum)
+__thread sigjmp_buf expect_sigbus_jmpbuf;
+__thread volatile sig_atomic_t expecting_sigbus;
+
+void catchall_signal_handler(int signum)
 {
+	switch (signum) {
+	case SIGBUS: {
+		if (expecting_sigbus)
+			siglongjmp(expect_sigbus_jmpbuf, 1);
+
+		TEST_FAIL("Unexpected SIGBUS (%d)\n", signum);
+	}
 #define KVM_CASE_SIGNUM(sig)					\
 	case sig: TEST_FAIL("Unexpected " #sig " (%d)\n", signum)
-
-	switch (signum) {
-	KVM_CASE_SIGNUM(SIGBUS);
 	KVM_CASE_SIGNUM(SIGSEGV);
 	KVM_CASE_SIGNUM(SIGILL);
 	KVM_CASE_SIGNUM(SIGFPE);
@@ -2310,12 +2317,13 @@ static void report_unexpected_signal(int signum)
 void __attribute((constructor)) kvm_selftest_init(void)
 {
 	struct sigaction sig_sa = {
-		.sa_handler = report_unexpected_signal,
+		.sa_handler = catchall_signal_handler,
 	};
 
 	/* Tell stdout not to buffer its content. */
 	setbuf(stdout, NULL);
 
+	expecting_sigbus = false;
 	sigaction(SIGBUS, &sig_sa, NULL);
 	sigaction(SIGSEGV, &sig_sa, NULL);
 	sigaction(SIGILL, &sig_sa, NULL);
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index bab1bd2b775b6..30eb701e4becd 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -18,13 +18,6 @@
 
 #include "test_util.h"
 
-sigjmp_buf expect_sigbus_jmpbuf;
-
-void __attribute__((used)) expect_sigbus_handler(int signum)
-{
-	siglongjmp(expect_sigbus_jmpbuf, 1);
-}
-
 /*
  * Random number generator that is usable from guest code. This is the
  * Park-Miller LCG using standard constants.

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




* [PATCH v8 42/46] KVM: selftests: Provide common function to set memory attributes
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Sean Christopherson <seanjc@google.com>

Introduce vm_mem_set_memory_attributes(), which handles setting of memory
attributes for a range of guest physical addresses, regardless of whether
the attributes should be set via guest_memfd or via the memory attributes
at the VM level.

Refactor existing vm_mem_set_{shared,private} functions to use the new
function. Opportunistically update the size parameter to use size_t instead
of u64.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/include/kvm_util.h | 46 +++++++++++++++++++-------
 1 file changed, 34 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 3a6b1fa7f26ef..db1442da21bb1 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -454,18 +454,6 @@ static inline void vm_set_memory_attributes(struct kvm_vm *vm, gpa_t gpa,
 	vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
 }
 
-static inline void vm_mem_set_private(struct kvm_vm *vm, gpa_t gpa,
-				      u64 size)
-{
-	vm_set_memory_attributes(vm, gpa, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
-}
-
-static inline void vm_mem_set_shared(struct kvm_vm *vm, gpa_t gpa,
-				     u64 size)
-{
-	vm_set_memory_attributes(vm, gpa, size, 0);
-}
-
 static inline int __gmem_set_memory_attributes(int fd, u64 offset,
 					       size_t size, u64 attributes,
 					       u64 *error_offset)
@@ -532,6 +520,40 @@ static inline void gmem_set_shared(int fd, u64 offset, size_t size)
 	gmem_set_memory_attributes(fd, offset, size, 0);
 }
 
+static inline void vm_mem_set_memory_attributes(struct kvm_vm *vm, gpa_t gpa,
+						size_t size, u64 attrs)
+{
+	if (kvm_has_gmem_attributes) {
+		gpa_t end = gpa + size;
+		off_t fd_offset;
+		gpa_t addr;
+		size_t len;
+		int fd;
+
+		for (addr = gpa; addr < end; addr += len) {
+			fd = kvm_gpa_to_guest_memfd(vm, addr, &fd_offset, &len);
+			len = min(end - addr, len);
+
+			gmem_set_memory_attributes(fd, fd_offset, len, attrs);
+		}
+	} else {
+		vm_set_memory_attributes(vm, gpa, size, attrs);
+	}
+}
+
+static inline void vm_mem_set_private(struct kvm_vm *vm, gpa_t gpa,
+				      size_t size)
+{
+	vm_mem_set_memory_attributes(vm, gpa, size,
+				     KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+static inline void vm_mem_set_shared(struct kvm_vm *vm, gpa_t gpa,
+				     size_t size)
+{
+	vm_mem_set_memory_attributes(vm, gpa, size, 0);
+}
+
 void vm_guest_mem_fallocate(struct kvm_vm *vm, gpa_t gpa, u64 size,
 			    bool punch_hole);
 

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog
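
As a rough illustration of the per-gmem branch of
vm_mem_set_memory_attributes() (not the selftest code itself), the sketch
below splits a GPA range into one chunk per backing region and clamps each
chunk to the smaller of "bytes left in the request" and "bytes left in the
region". struct region, struct chunk, find_region() and split_range() are
hypothetical names for this example only; the real helper looks regions up
with kvm_gpa_to_guest_memfd() and calls gmem_set_memory_attributes() once
per chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memslot backed by a guest_memfd. */
struct region {
	uint64_t gpa;		/* First guest physical address covered. */
	uint64_t size;		/* Number of bytes covered. */
	int fd;			/* Backing guest_memfd. */
	uint64_t fd_offset;	/* File offset that corresponds to gpa. */
};

/* One per-fd call that the loop would issue. */
struct chunk {
	int fd;
	uint64_t fd_offset;
	uint64_t len;
};

static const struct region *find_region(const struct region *regions,
					size_t nr_regions, uint64_t gpa)
{
	size_t i;

	for (i = 0; i < nr_regions; i++) {
		if (gpa >= regions[i].gpa &&
		    gpa - regions[i].gpa < regions[i].size)
			return &regions[i];
	}
	return NULL;
}

/*
 * Split [gpa, gpa + size) into per-region chunks.  Returns the number of
 * chunks written to out, or -1 if an address is not covered by any region
 * or out is too small.
 */
static int split_range(const struct region *regions, size_t nr_regions,
		       uint64_t gpa, uint64_t size,
		       struct chunk *out, size_t max_chunks)
{
	uint64_t end = gpa + size;
	uint64_t addr, len;
	int n = 0;

	for (addr = gpa; addr < end; addr += len) {
		const struct region *r = find_region(regions, nr_regions, addr);
		uint64_t off, remaining;

		if (!r || (size_t)n == max_chunks)
			return -1;

		off = addr - r->gpa;
		remaining = r->size - off;
		len = remaining < end - addr ? remaining : end - addr;

		out[n].fd = r->fd;
		out[n].fd_offset = r->fd_offset + off;
		out[n].len = len;
		n++;
	}
	return n;
}
```

A request that straddles two regions therefore turns into two calls, each
against the fd and file offset of the region that backs that part of the
range.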




* [PATCH v8 43/46] KVM: selftests: Check fd/flags provided to mmap() when setting up memslot
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Sean Christopherson <seanjc@google.com>

Assert that a valid fd provided to mmap() is accompanied by MAP_SHARED.

With an invalid fd (usually used for anonymous mappings), there are no
constraints on mmap() flags.

Add this check to make sure that when a guest_memfd is used as region->fd,
the flags passed to mmap() include MAP_SHARED.

Signed-off-by: Sean Christopherson <seanjc@google.com>
[Rephrase assertion message.]
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/lib/kvm_util.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 0b2256ea65ff9..6b304e8a0e0d5 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1110,6 +1110,9 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 					     src_type == VM_MEM_SRC_SHARED_HUGETLB);
 	}
 
+	TEST_ASSERT(region->fd == -1 || backing_src_is_shared(src_type),
+		    "A valid fd provided to mmap() must be accompanied by MAP_SHARED.");
+
 	region->mmap_start = __kvm_mmap(region->mmap_size, PROT_READ | PROT_WRITE,
 					vm_mem_backing_src_alias(src_type)->flag,
 					region->fd, mmap_offset);

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog
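
The rule this patch enforces can be written down as a small predicate.
This is only a sketch: mmap_args_ok() is a hypothetical name, and it
inspects the raw mmap() flags, whereas the selftest derives the answer from
the backing source type via backing_src_is_shared(). The rule is a policy
of the selftest library; mmap() itself accepts MAP_PRIVATE mappings of
ordinary files.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/*
 * An fd of -1 (no backing file) places no constraint on the flags; any
 * other fd must be mapped with MAP_SHARED.
 */
static bool mmap_args_ok(int fd, int flags)
{
	return fd == -1 || (flags & MAP_SHARED);
}
```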




* [PATCH v8 41/46] KVM: selftests: Provide function to look up guest_memfd details from gpa
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Ackerley Tng <ackerleytng@google.com>

Introduce a new helper, kvm_gpa_to_guest_memfd(), to find the
guest_memfd-related details of a memory region that contains a given guest
physical address (GPA).

The function returns the file descriptor for the memfd, the offset into
the file that corresponds to the GPA, and the number of bytes remaining
in the region from that GPA.

kvm_gpa_to_guest_memfd() was factored out from vm_guest_mem_fallocate();
refactor vm_guest_mem_fallocate() to use the new helper.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/include/kvm_util.h |  3 +++
 tools/testing/selftests/kvm/lib/kvm_util.c     | 37 ++++++++++++++++----------
 2 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 79ab64ac8b869..3a6b1fa7f26ef 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -428,6 +428,9 @@ static inline void vm_enable_cap(struct kvm_vm *vm, u32 cap, u64 arg0)
 	vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
 }
 
+int kvm_gpa_to_guest_memfd(struct kvm_vm *vm, gpa_t gpa, off_t *fd_offset,
+			   size_t *nr_bytes);
+
 /*
  * KVM_SET_MEMORY_ATTRIBUTES{,2} overwrites _all_ attributes.  These
  * flows need significant enhancements to support multiple attributes.
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 524ef97d634bf..0b2256ea65ff9 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1305,27 +1305,20 @@ void vm_guest_mem_fallocate(struct kvm_vm *vm, u64 base, u64 size,
 			    bool punch_hole)
 {
 	const int mode = FALLOC_FL_KEEP_SIZE | (punch_hole ? FALLOC_FL_PUNCH_HOLE : 0);
-	struct userspace_mem_region *region;
 	u64 end = base + size;
-	gpa_t gpa, len;
 	off_t fd_offset;
-	int ret;
+	int fd, ret;
+	size_t len;
+	gpa_t gpa;
 
 	for (gpa = base; gpa < end; gpa += len) {
-		u64 offset;
-
-		region = userspace_mem_region_find(vm, gpa, gpa);
-		TEST_ASSERT(region && region->region.flags & KVM_MEM_GUEST_MEMFD,
-			    "Private memory region not found for GPA 0x%lx", gpa);
+		fd = kvm_gpa_to_guest_memfd(vm, gpa, &fd_offset, &len);
+		len = min(end - gpa, len);
 
-		offset = gpa - region->region.guest_phys_addr;
-		fd_offset = region->region.guest_memfd_offset + offset;
-		len = min_t(u64, end - gpa, region->region.memory_size - offset);
-
-		ret = fallocate(region->region.guest_memfd, mode, fd_offset, len);
+		ret = fallocate(fd, mode, fd_offset, len);
 		TEST_ASSERT(!ret, "fallocate() failed to %s at %lx (len = %lu), fd = %d, mode = %x, offset = %lx",
 			    punch_hole ? "punch hole" : "allocate", gpa, len,
-			    region->region.guest_memfd, mode, fd_offset);
+			    fd, mode, fd_offset);
 	}
 }
 
@@ -1662,6 +1655,22 @@ void *addr_gpa2alias(struct kvm_vm *vm, gpa_t gpa)
 	return (void *) ((uintptr_t) region->host_alias + offset);
 }
 
+int kvm_gpa_to_guest_memfd(struct kvm_vm *vm, gpa_t gpa, off_t *fd_offset,
+			   size_t *nr_bytes)
+{
+	struct userspace_mem_region *region;
+	gpa_t gpa_offset;
+
+	region = userspace_mem_region_find(vm, gpa, gpa);
+	TEST_ASSERT(region && region->region.flags & KVM_MEM_GUEST_MEMFD,
+		    "guest_memfd memory region not found for GPA 0x%lx", gpa);
+
+	gpa_offset = gpa - region->region.guest_phys_addr;
+	*fd_offset = region->region.guest_memfd_offset + gpa_offset;
+	*nr_bytes = region->region.memory_size - gpa_offset;
+	return region->region.guest_memfd;
+}
+
 /* Create an interrupt controller chip for the specified VM. */
 void vm_create_irqchip(struct kvm_vm *vm)
 {

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




* [PATCH v8 40/46] KVM: selftests: Reset shared memory after hole-punching
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Ackerley Tng <ackerleytng@google.com>

private_mem_conversions_test used to reset the shared memory that was used
for the test to an initial pattern at the end of each test iteration. Then,
it would punch out the pages, which would zero memory.

Without in-place conversion, the reset writes shared memory and the hole
punch zeroes private memory, which restores the state at the beginning of
the for loop.

With in-place conversion, the reset writes memory as shared and the hole
punch zeroes the same physical memory, undoing the reset that was done
before the hole punch.

Move the resetting after the hole-punching, and reset the entire
PER_CPU_DATA_SIZE instead of just the tested range.

With in-place conversion, this zeroes and then resets the same physical
memory. Without in-place conversion, the private memory is zeroed, and the
shared memory is reset to init_p.

This is sufficient since at each test stage, the memory is assumed to start
as shared, and private memory is always assumed to start zeroed. Conversion
zeroes memory, so the future test stages will work as expected.

Fixes: 43f623f350ce1 ("KVM: selftests: Add x86-only selftest for private memory conversions")
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/x86/private_mem_conversions_test.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
index 861baff201e78..289ad10063fca 100644
--- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
@@ -202,15 +202,18 @@ static void guest_test_explicit_conversion(u64 base_gpa, bool do_fallocate)
 		guest_sync_shared(gpa, size, p3, p4);
 		memcmp_g(gpa, p4, size);
 
-		/* Reset the shared memory back to the initial pattern. */
-		memset((void *)gpa, init_p, size);
-
 		/*
 		 * Free (via PUNCH_HOLE) *all* private memory so that the next
 		 * iteration starts from a clean slate, e.g. with respect to
 		 * whether or not there are pages/folios in guest_mem.
 		 */
 		guest_map_shared(base_gpa, PER_CPU_DATA_SIZE, true);
+
+		/*
+		 * Hole-punching above zeroed private memory. Reset shared
+		 * memory in preparation for the next GUEST_STAGE.
+		 */
+		memset((void *)base_gpa, init_p, PER_CPU_DATA_SIZE);
 	}
 }
 

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog
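
The ordering problem described above can be reproduced with a toy model in
plain C. This is a sketch under simplifying assumptions, not KVM code:
struct guest_mem, reset_shared(), punch_hole(), reset_then_punch() and
punch_then_reset() are invented for the example, "in place" is modelled as
the shared and private views aliasing one buffer, and hole-punching is
modelled as zeroing the private view.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { REGION_SIZE = 64 };

/* Toy model: with in_place set, the private view aliases the shared one. */
struct guest_mem {
	bool in_place;
	unsigned char shared[REGION_SIZE];
	unsigned char priv[REGION_SIZE];
};

/* Writing the initial pattern always goes through the shared view. */
static void reset_shared(struct guest_mem *m, unsigned char pattern)
{
	memset(m->shared, pattern, REGION_SIZE);
}

/* Hole-punching zeroes whatever backs the private view. */
static void punch_hole(struct guest_mem *m)
{
	memset(m->in_place ? m->shared : m->priv, 0, REGION_SIZE);
}

/* Old order: returns the shared byte the next iteration would observe. */
static unsigned char reset_then_punch(bool in_place, unsigned char init_p)
{
	struct guest_mem m = { .in_place = in_place };

	reset_shared(&m, init_p);
	punch_hole(&m);
	return m.shared[0];
}

/* New order from this patch: punch first, then reset. */
static unsigned char punch_then_reset(bool in_place, unsigned char init_p)
{
	struct guest_mem m = { .in_place = in_place };

	punch_hole(&m);
	reset_shared(&m, init_p);
	return m.shared[0];
}
```

In this model only the punch-then-reset order leaves the initial pattern in
shared memory for both configurations.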




* [PATCH v8 39/46] KVM: selftests: Test conversion with elevated page refcount
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Ackerley Tng <ackerleytng@google.com>

Add a selftest to verify that converting a shared guest_memfd page to a
private page fails if the page has an elevated reference count.

When KVM converts a shared page to a private one, it expects the page to
have a reference count equal to the reference counts taken by the
filemap. If another kernel subsystem holds a reference to the page, the
conversion must be aborted.

The test asserts that both bulk and single-page conversion attempts
correctly fail with EAGAIN for the pinned page. After the page is unpinned,
the test verifies that subsequent conversions succeed.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../kvm/x86/guest_memfd_conversions_test.c         | 56 ++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
index 99b0023609670..4ebbd29029526 100644
--- a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
@@ -441,6 +441,62 @@ GMEM_CONVERSION_TEST_INIT_SHARED(forked_accesses)
 #undef TEST_STATE_AWAIT
 }
 
+static void test_convert_to_private_fails(test_data_t *t, u64 pgoff,
+					  size_t nr_pages,
+					  u64 expected_error_offset)
+{
+	/* +1 to make it anything but expected_error_offset. */
+	u64 error_offset = expected_error_offset + 1;
+	u64 offset = pgoff * page_size;
+	int ret;
+
+	do {
+		ret = __gmem_set_private(t->gmem_fd, offset,
+					 nr_pages * page_size, &error_offset);
+	} while (ret == -1 && errno == EINTR);
+	TEST_ASSERT(ret == -1 && errno == EAGAIN,
+		    "Wanted EAGAIN on page %lu, got %d (ret = %d)", pgoff,
+		    errno, ret);
+	TEST_ASSERT_EQ(error_offset, expected_error_offset);
+}
+
+GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(elevated_refcount, 4)
+{
+	int i;
+
+	pin_pages(t->mem + test_page * page_size, page_size);
+
+	for (i = 0; i < nr_pages; i++)
+		test_shared(t, i, 0, 'A', 'B');
+
+	/*
+	 * Converting in bulk should fail as long as any page in the range has
+	 * unexpected refcounts.
+	 */
+	test_convert_to_private_fails(t, 0, nr_pages, test_page * page_size);
+
+	for (i = 0; i < nr_pages; i++) {
+		/*
+		 * Converting page-wise should also fail as long as any page in
+		 * the range has unexpected refcounts.
+		 */
+		if (i == test_page)
+			test_convert_to_private_fails(t, i, 1, test_page * page_size);
+		else
+			test_convert_to_private(t, i, 'B', 'C');
+	}
+
+	unpin_pages();
+
+	gmem_set_private(t->gmem_fd, 0, nr_pages * page_size);
+
+	for (i = 0; i < nr_pages; i++) {
+		char expected = i == test_page ? 'B' : 'C';
+
+		test_private(t, i, expected, 'D');
+	}
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog



^ permalink raw reply related

* [PATCH v8 38/46] KVM: selftests: Add helpers to pin pages with CONFIG_GUP_TEST
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Ackerley Tng <ackerleytng@google.com>

Add helper functions to allow KVM selftests to pin memory using
CONFIG_GUP_TEST. This is useful for testing scenarios where some page has
an increased refcount, such as in guest_memfd in-place conversion tests.

The helpers open /sys/kernel/debug/gup_test and invoke the
PIN_LONGTERM_TEST_START and PIN_LONGTERM_TEST_STOP ioctls. Since this
functionality depends on the kernel being built with CONFIG_GUP_TEST,
pin_pages() opens the file with __open_path_or_exit(), which exits the
test with a hint to check CONFIG_GUP_TEST if the file cannot be opened.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/include/kvm_util.h |  3 +++
 tools/testing/selftests/kvm/lib/kvm_util.c     | 23 +++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 323d06b5699ec..79ab64ac8b869 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -1195,6 +1195,9 @@ static inline int pin_self_to_any_cpu(void)
 	return pin_task_to_any_cpu(pthread_self());
 }
 
+void pin_pages(void *vaddr, uint64_t size);
+void unpin_pages(void);
+
 void kvm_print_vcpu_pinning_help(void);
 void kvm_parse_vcpu_pinning(const char *pcpus_string, u32 vcpu_to_pcpu[],
 			    int nr_vcpus);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index b73817f7bc803..524ef97d634bf 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -18,6 +18,8 @@
 #include <unistd.h>
 #include <linux/kernel.h>
 
+#include "../../../../mm/gup_test.h"
+
 #define KVM_UTIL_MIN_PFN	2
 
 u32 guest_random_seed;
@@ -639,6 +641,27 @@ int __pin_task_to_cpu(pthread_t task, int cpu)
 	return pthread_setaffinity_np(task, sizeof(cpuset), &cpuset);
 }
 
+static int gup_test_fd = -1;
+
+void pin_pages(void *vaddr, uint64_t size)
+{
+	const struct pin_longterm_test args = {
+		.addr = (uint64_t)vaddr,
+		.size = size,
+		.flags = PIN_LONGTERM_TEST_FLAG_USE_WRITE,
+	};
+
+	gup_test_fd = __open_path_or_exit("/sys/kernel/debug/gup_test", O_RDWR,
+					  "Is CONFIG_GUP_TEST enabled?");
+
+	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_START, &args), 0);
+}
+
+void unpin_pages(void)
+{
+	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_STOP), 0);
+}
+
 static u32 parse_pcpu(const char *cpu_str, const cpu_set_t *allowed_mask)
 {
 	u32 pcpu = atoi_non_negative("CPU number", cpu_str);

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog



^ permalink raw reply related

* [PATCH v8 37/46] KVM: selftests: Test that shared/private status is consistent across processes
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Sean Christopherson <seanjc@google.com>

Add a test to verify that a guest_memfd's shared/private status is
consistent across processes, and that any shared pages previously mapped in
any process are unmapped from all processes.

The test forks a child process after creating the shared guest_memfd
region so that the second process exists alongside the main process for the
entire test.

The processes then take turns to access memory to check that the
shared/private status is consistent across processes.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../kvm/x86/guest_memfd_conversions_test.c         | 118 +++++++++++++++++++++
 1 file changed, 118 insertions(+)

diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
index f03af2c46426f..99b0023609670 100644
--- a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
@@ -2,6 +2,8 @@
 /*
  * Copyright (c) 2024, Google LLC.
  */
+#include <pthread.h>
+#include <time.h>
 #include <sys/mman.h>
 #include <unistd.h>
 
@@ -323,6 +325,122 @@ GMEM_CONVERSION_TEST_INIT_SHARED(truncate)
 	test_private(t, 0, 0, 'A');
 }
 
+/* Test that shared/private memory protections work and are seen from any process. */
+GMEM_CONVERSION_TEST_INIT_SHARED(forked_accesses)
+{
+	enum test_state {
+		STATE_INIT,
+		STATE_CHECK_SHARED,
+		STATE_DONE_CHECKING_SHARED,
+		STATE_CHECK_PRIVATE,
+		STATE_DONE_CHECKING_PRIVATE,
+	};
+
+	struct sync_state {
+		pthread_mutex_t mutex;
+		pthread_cond_t cond;
+		enum test_state step;
+	} *sync;
+
+	pthread_mutexattr_t mattr;
+	pthread_condattr_t cattr;
+	pid_t child_pid, parent_pid;
+	int status;
+
+	sync = kvm_mmap(sizeof(*sync), PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_ANONYMOUS, -1);
+
+	pthread_mutexattr_init(&mattr);
+	pthread_mutexattr_setpshared(&mattr, PTHREAD_PROCESS_SHARED);
+	pthread_mutex_init(&sync->mutex, &mattr);
+	pthread_mutexattr_destroy(&mattr);
+
+	pthread_condattr_init(&cattr);
+	pthread_condattr_setpshared(&cattr, PTHREAD_PROCESS_SHARED);
+	pthread_cond_init(&sync->cond, &cattr);
+	pthread_condattr_destroy(&cattr);
+
+	sync->step = STATE_INIT;
+
+#define TEST_STATE_AWAIT(__state)						\
+	do {									\
+		pthread_mutex_lock(&sync->mutex);				\
+		while (sync->step != (__state)) {				\
+			struct timespec ts, stop;				\
+			int ret;						\
+										\
+			clock_gettime(CLOCK_REALTIME, &ts);			\
+			stop = timespec_add_ns(ts, 100 * 1000000UL);		\
+										\
+			ret = pthread_cond_timedwait(&sync->cond, &sync->mutex, &stop); \
+			if (ret == ETIMEDOUT) {					\
+				bool alive = (child_pid == 0) ?			\
+					     (getppid() == parent_pid) :		\
+					     (waitpid(child_pid, NULL, WNOHANG) == 0); \
+				TEST_ASSERT(alive, "Other process exited prematurely"); \
+			} else {						\
+				TEST_ASSERT(!ret, "pthread_cond_timedwait failed"); \
+			}							\
+		}								\
+		pthread_mutex_unlock(&sync->mutex);				\
+	} while (0)
+
+#define TEST_STATE_SET(__state)							\
+	do {									\
+		pthread_mutex_lock(&sync->mutex);				\
+		sync->step = (__state);						\
+		pthread_cond_broadcast(&sync->cond);				\
+		pthread_mutex_unlock(&sync->mutex);				\
+	} while (0)
+
+	parent_pid = getpid();
+	child_pid = fork();
+	TEST_ASSERT(child_pid != -1, "fork failed");
+
+	if (child_pid == 0) {
+		const char inconsequential = 0xdd;
+
+		TEST_STATE_AWAIT(STATE_CHECK_SHARED);
+
+		/*
+		 * This maps the pages into the child process as well, and tests
+		 * that the conversion process will unmap the guest_memfd memory
+		 * from all processes.
+		 */
+		host_do_rmw(t->mem, 0, 0xB, 0xC);
+
+		TEST_STATE_SET(STATE_DONE_CHECKING_SHARED);
+		TEST_STATE_AWAIT(STATE_CHECK_PRIVATE);
+
+		TEST_EXPECT_SIGBUS(READ_ONCE(t->mem[0]));
+		TEST_EXPECT_SIGBUS(WRITE_ONCE(t->mem[0], inconsequential));
+
+		TEST_STATE_SET(STATE_DONE_CHECKING_PRIVATE);
+		exit(0);
+	}
+
+	test_shared(t, 0, 0, 0xA, 0xB);
+
+	TEST_STATE_SET(STATE_CHECK_SHARED);
+	TEST_STATE_AWAIT(STATE_DONE_CHECKING_SHARED);
+
+	test_convert_to_private(t, 0, 0xC, 0xD);
+
+	TEST_STATE_SET(STATE_CHECK_PRIVATE);
+	TEST_STATE_AWAIT(STATE_DONE_CHECKING_PRIVATE);
+
+	TEST_ASSERT_EQ(waitpid(child_pid, &status, 0), child_pid);
+	TEST_ASSERT(WIFEXITED(status) && WEXITSTATUS(status) == 0,
+		    "Child exited with unexpected status");
+
+	pthread_mutex_destroy(&sync->mutex);
+	pthread_cond_destroy(&sync->cond);
+	kvm_munmap(sync, sizeof(*sync));
+
+#undef TEST_STATE_SET
+#undef TEST_STATE_AWAIT
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog



^ permalink raw reply related

* [PATCH v8 36/46] KVM: selftests: Test that truncation does not change shared/private status
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Ackerley Tng <ackerleytng@google.com>

Add a test to verify that deallocating a page in a guest memfd region via
fallocate() with FALLOC_FL_PUNCH_HOLE does not alter the shared or private
status of the corresponding memory range.

When a page backing a guest memfd mapping is deallocated, e.g., by punching
a hole or truncating the file, and then subsequently faulted back in, the
new page must inherit the correct shared/private status tracked by
guest_memfd.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/x86/guest_memfd_conversions_test.c       | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
index 0b024fb7227f0..f03af2c46426f 100644
--- a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
@@ -10,6 +10,7 @@
 #include <linux/sizes.h>
 
 #include "kvm_util.h"
+#include "kvm_syscalls.h"
 #include "kselftest_harness.h"
 #include "test_util.h"
 #include "ucall_common.h"
@@ -309,6 +310,19 @@ GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(unallocated_folios, 8)
 		test_convert_to_shared(t, i, 'B', 'C', 'D');
 }
 
+/* Truncation should not affect shared/private status. */
+GMEM_CONVERSION_TEST_INIT_SHARED(truncate)
+{
+	host_do_rmw(t->mem, 0, 0, 'A');
+	kvm_fallocate(t->gmem_fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0, page_size);
+	host_do_rmw(t->mem, 0, 0, 'A');
+
+	test_convert_to_private(t, 0, 'A', 'B');
+
+	kvm_fallocate(t->gmem_fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0, page_size);
+	test_private(t, 0, 0, 'A');
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog



^ permalink raw reply related

* [PATCH v8 34/46] KVM: selftests: Test conversion before allocation
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Ackerley Tng <ackerleytng@google.com>

Add two test cases to the guest_memfd conversions selftest to cover
the scenario where a conversion is requested before any memory has been
allocated in the guest_memfd region.

The KVM_SET_MEMORY_ATTRIBUTES2 ioctl can be called on a memory region at
any time. If the guest had not yet faulted in any pages for that region,
the kernel must record the conversion request and apply the requested state
when the pages are eventually allocated.

The new tests cover both conversion directions.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/x86/guest_memfd_conversions_test.c       | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
index 8e17d5c08aeb8..b43ac196330f1 100644
--- a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
@@ -265,6 +265,20 @@ GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(indexing, 4)
 #undef combine
 }
 
+/*
+ * Test that even if there are no folios yet, conversion requests are recorded
+ * in guest_memfd.
+ */
+GMEM_CONVERSION_TEST_INIT_SHARED(before_allocation_shared)
+{
+	test_convert_to_private(t, 0, 0, 'A');
+}
+
+GMEM_CONVERSION_TEST_INIT_PRIVATE(before_allocation_private)
+{
+	test_convert_to_shared(t, 0, 0, 'A', 'B');
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog



^ permalink raw reply related

* [PATCH v8 35/46] KVM: selftests: Convert with allocated folios in different layouts
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Ackerley Tng <ackerleytng@google.com>

Add a guest_memfd selftest to verify that memory conversions work
correctly with allocated folios in different layouts.

By iterating through which pages are initially faulted, the test covers
various layouts of contiguous allocated and unallocated regions, exercising
conversion with different range layouts.
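
To make the covered layouts concrete, the following sketch enumerates
which pages are allocated for a given test_page. The constants 8 and 4
come from the test in this patch, but page_is_faulted(), layout_string()
and count_allocated_runs() are hypothetical helpers written only for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_PAGES		8
#define SECOND_PAGE_TO_FAULT	4

/* A page is allocated if it was faulted in before the conversion. */
static bool page_is_faulted(int test_page, int i)
{
	return i == test_page || i == SECOND_PAGE_TO_FAULT;
}

/* Fill out with 'A' for allocated pages and '.' for unallocated ones. */
static void layout_string(int test_page, char out[NR_PAGES + 1])
{
	int i;

	assert(test_page >= 0 && test_page < NR_PAGES);

	for (i = 0; i < NR_PAGES; i++)
		out[i] = page_is_faulted(test_page, i) ? 'A' : '.';
	out[NR_PAGES] = '\0';
}

/* Count maximal runs of contiguous allocated pages in the layout. */
static int count_allocated_runs(int test_page)
{
	int i, runs = 0;

	for (i = 0; i < NR_PAGES; i++) {
		bool prev = i > 0 && page_is_faulted(test_page, i - 1);

		if (page_is_faulted(test_page, i) && !prev)
			runs++;
	}
	return runs;
}
```

For test_page = 0 this gives "A...A..." (two separate allocated pages),
for test_page = 3 it gives "...AA..." (one two-page run), and for
test_page = 4 it gives "....A..." (a single allocated page).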

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../kvm/x86/guest_memfd_conversions_test.c         | 30 ++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
index b43ac196330f1..0b024fb7227f0 100644
--- a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
@@ -279,6 +279,36 @@ GMEM_CONVERSION_TEST_INIT_PRIVATE(before_allocation_private)
 	test_convert_to_shared(t, 0, 0, 'A', 'B');
 }
 
+/*
+ * Test that when some of the folios in the conversion range are allocated,
+ * conversion requests are handled correctly in guest_memfd.  Vary the ranges
+ * allocated before conversion, using test_page, to cover various layouts of
+ * contiguous allocated and unallocated regions.
+ */
+GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(unallocated_folios, 8)
+{
+	const int second_page_to_fault = 4;
+	int i;
+
+	/*
+	 * Fault 2 of the pages to test filemap range operations except when
+	 * test_page == second_page_to_fault.
+	 */
+	host_do_rmw(t->mem, test_page, 0, 'A');
+	if (test_page != second_page_to_fault)
+		host_do_rmw(t->mem, second_page_to_fault, 0, 'A');
+
+	gmem_set_private(t->gmem_fd, 0, nr_pages * page_size);
+	for (i = 0; i < nr_pages; ++i) {
+		char expected = (i == test_page || i == second_page_to_fault) ? 'A' : 0;
+
+		test_private(t, i, expected, 'B');
+	}
+
+	for (i = 0; i < nr_pages; ++i)
+		test_convert_to_shared(t, i, 'B', 'C', 'D');
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog



^ permalink raw reply related

* [PATCH v8 31/46] KVM: selftests: Test basic single-page conversion flow
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Ackerley Tng <ackerleytng@google.com>

Add a selftest for the guest_memfd memory attribute conversion ioctls.
The test starts the guest_memfd as all-private (the default state), and
verifies the basic flow of converting a single page to shared and then back
to private.

Add infrastructure that upcoming patches will extend to cover other
conversion flows.

Add the test as an x86-specific test since guest_memfd's testing
vehicle (KVM_X86_SW_PROTECTED_VM) is x86-specific.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/Makefile.kvm           |   1 +
 .../kvm/x86/guest_memfd_conversions_test.c         | 199 +++++++++++++++++++++
 2 files changed, 200 insertions(+)

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 4ace12606e937..b0e64a6dde21a 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -152,6 +152,7 @@ TEST_GEN_PROGS_x86 += x86/max_vcpuid_cap_test
 TEST_GEN_PROGS_x86 += x86/triple_fault_event_test
 TEST_GEN_PROGS_x86 += x86/recalc_apic_map_test
 TEST_GEN_PROGS_x86 += x86/aperfmperf_test
+TEST_GEN_PROGS_x86 += x86/guest_memfd_conversions_test
 TEST_GEN_PROGS_x86 += access_tracking_perf_test
 TEST_GEN_PROGS_x86 += coalesced_io_test
 TEST_GEN_PROGS_x86 += dirty_log_perf_test
diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
new file mode 100644
index 0000000000000..8e09e241723e5
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024, Google LLC.
+ */
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <linux/align.h>
+#include <linux/kvm.h>
+#include <linux/sizes.h>
+
+#include "kvm_util.h"
+#include "kselftest_harness.h"
+#include "test_util.h"
+#include "ucall_common.h"
+
+FIXTURE(gmem_conversions) {
+	struct kvm_vcpu *vcpu;
+	int gmem_fd;
+	/* HVA of the first byte of the memory mmap()-ed from gmem_fd. */
+	char *mem;
+};
+
+typedef FIXTURE_DATA(gmem_conversions) test_data_t;
+
+FIXTURE_SETUP(gmem_conversions) { }
+
+static size_t page_size;
+
+static void guest_do_rmw(void);
+#define GUEST_MEMFD_SHARING_TEST_GVA 0x90000000ULL
+
+/*
+ * Defer setup until the individual test is invoked so that tests can specify
+ * the number of pages and flags for the guest_memfd instance.
+ */
+static void gmem_conversions_do_setup(test_data_t *t, int nr_pages,
+				      int gmem_flags)
+{
+	const struct vm_shape shape = {
+		.mode = VM_MODE_DEFAULT,
+		.type = KVM_X86_SW_PROTECTED_VM,
+	};
+	/*
+	 * Use high GPA above APIC_DEFAULT_PHYS_BASE to avoid clashing with
+	 * APIC_DEFAULT_PHYS_BASE.
+	 */
+	const gpa_t gpa = SZ_4G;
+	const u32 slot = 1;
+	struct kvm_vm *vm;
+
+	vm = __vm_create_shape_with_one_vcpu(shape, &t->vcpu, nr_pages, guest_do_rmw);
+
+	vm_mem_add(vm, VM_MEM_SRC_SHMEM, gpa, slot, nr_pages,
+		   KVM_MEM_GUEST_MEMFD, -1, 0, gmem_flags);
+
+	t->gmem_fd = kvm_slot_to_fd(vm, slot);
+	t->mem = addr_gpa2hva(vm, gpa);
+	virt_map(vm, GUEST_MEMFD_SHARING_TEST_GVA, gpa, nr_pages);
+}
+
+static void gmem_conversions_do_teardown(test_data_t *t)
+{
+	/* No need to close gmem_fd, it's owned by the VM structure. */
+	kvm_vm_free(t->vcpu->vm);
+}
+
+FIXTURE_TEARDOWN(gmem_conversions)
+{
+	gmem_conversions_do_teardown(self);
+}
+
+/*
+ * In these test definition macros, __nr_pages and nr_pages are used to set up
+ * the total number of pages in the guest_memfd under test. This will be
+ * available in the test definitions as nr_pages.
+ */
+
+#define __GMEM_CONVERSION_TEST(test, __nr_pages, flags)				\
+static void __gmem_conversions_##test(test_data_t *t, int nr_pages);		\
+										\
+TEST_F(gmem_conversions, test)							\
+{										\
+	gmem_conversions_do_setup(self, __nr_pages, flags);			\
+	__gmem_conversions_##test(self, __nr_pages);				\
+}										\
+static void __gmem_conversions_##test(test_data_t *t, int nr_pages)		\
+
+#define GMEM_CONVERSION_TEST(test, __nr_pages, flags)				\
+	__GMEM_CONVERSION_TEST(test, __nr_pages, (flags) | GUEST_MEMFD_FLAG_MMAP)
+
+#define __GMEM_CONVERSION_TEST_INIT_PRIVATE(test, __nr_pages)			\
+	GMEM_CONVERSION_TEST(test, __nr_pages, 0)
+
+#define GMEM_CONVERSION_TEST_INIT_PRIVATE(test)					\
+	__GMEM_CONVERSION_TEST_INIT_PRIVATE(test, 1)
+
+struct guest_check_data {
+	void *mem;
+	char expected_val;
+	char write_val;
+};
+static struct guest_check_data guest_data;
+
+static void guest_do_rmw(void)
+{
+	for (;;) {
+		char *mem = READ_ONCE(guest_data.mem);
+
+		GUEST_ASSERT_EQ(READ_ONCE(*mem), READ_ONCE(guest_data.expected_val));
+		WRITE_ONCE(*mem, READ_ONCE(guest_data.write_val));
+
+		GUEST_SYNC(0);
+	}
+}
+
+static void run_guest_do_rmw(struct kvm_vcpu *vcpu, u64 pgoff,
+			     char expected_val, char write_val)
+{
+	struct ucall uc;
+	int r;
+
+	guest_data.mem = (void *)GUEST_MEMFD_SHARING_TEST_GVA + pgoff * page_size;
+	guest_data.expected_val = expected_val;
+	guest_data.write_val = write_val;
+	sync_global_to_guest(vcpu->vm, guest_data);
+
+	do {
+		r = __vcpu_run(vcpu);
+	} while (r == -1 && errno == EINTR);
+
+	TEST_ASSERT_EQ(r, 0);
+
+	switch (get_ucall(vcpu, &uc)) {
+	case UCALL_ABORT:
+		REPORT_GUEST_ASSERT(uc);
+	case UCALL_SYNC:
+		break;
+	default:
+		TEST_FAIL("Unexpected ucall %lu", uc.cmd);
+	}
+}
+
+static void host_do_rmw(char *mem, u64 pgoff, char expected_val,
+			char write_val)
+{
+	TEST_ASSERT_EQ(READ_ONCE(mem[pgoff * page_size]), expected_val);
+	WRITE_ONCE(mem[pgoff * page_size], write_val);
+}
+
+static void test_private(test_data_t *t, u64 pgoff, char starting_val,
+			 char write_val)
+{
+	TEST_EXPECT_SIGBUS(WRITE_ONCE(t->mem[pgoff * page_size], write_val));
+	run_guest_do_rmw(t->vcpu, pgoff, starting_val, write_val);
+	TEST_EXPECT_SIGBUS(READ_ONCE(t->mem[pgoff * page_size]));
+}
+
+static void test_convert_to_private(test_data_t *t, u64 pgoff,
+				    char starting_val, char write_val)
+{
+	gmem_set_private(t->gmem_fd, pgoff * page_size, page_size);
+	test_private(t, pgoff, starting_val, write_val);
+}
+
+static void test_shared(test_data_t *t, u64 pgoff, char starting_val,
+			char host_write_val, char write_val)
+{
+	host_do_rmw(t->mem, pgoff, starting_val, host_write_val);
+	run_guest_do_rmw(t->vcpu, pgoff, host_write_val, write_val);
+	TEST_ASSERT_EQ(READ_ONCE(t->mem[pgoff * page_size]), write_val);
+}
+
+static void test_convert_to_shared(test_data_t *t, u64 pgoff,
+				   char starting_val, char host_write_val,
+				   char write_val)
+{
+	gmem_set_shared(t->gmem_fd, pgoff * page_size, page_size);
+	test_shared(t, pgoff, starting_val, host_write_val, write_val);
+}
+
+GMEM_CONVERSION_TEST_INIT_PRIVATE(init_private)
+{
+	test_private(t, 0, 0, 'A');
+	test_convert_to_shared(t, 0, 'A', 'B', 'C');
+	test_convert_to_private(t, 0, 'C', 'E');
+}
+
+
+int main(int argc, char *argv[])
+{
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES) &
+		     KVM_MEMORY_ATTRIBUTE_PRIVATE);
+
+	page_size = getpagesize();
+
+	return test_harness_run(argc, argv);
+}

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog



^ permalink raw reply related

* [PATCH v8 33/46] KVM: selftests: Test conversion precision in guest_memfd
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Ackerley Tng <ackerleytng@google.com>

The existing guest_memfd conversion tests only use single-page memory
regions. This provides no coverage for multi-page guest_memfd objects,
specifically whether KVM correctly handles the page index for conversion
operations. An incorrect implementation could, for example, always operate
on the first page regardless of the index provided.

Add a new test case to verify that conversions between private and shared
memory correctly target the specified page within a multi-page guest_memfd.

This test also verifies the precision of memory conversions by converting a
single page and then iterating through all other pages to ensure they remain
in their original state.
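
A minimal sketch of why per-page values catch indexing bugs, assuming a
plain byte buffer in place of guest_memfd. tag_value(), buggy_page(),
correct_page() and indexing_is_precise() are hypothetical names used only
here. Each page gets a value that encodes its own index, so an
implementation that always resolves to the first page is detected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Pack a page index (high nibble) and a step number (low nibble). */
static unsigned char tag_value(unsigned int page, unsigned int step)
{
	assert(page < 16 && step < 16);
	return (unsigned char)((page << 4) | step);
}

typedef unsigned char *(*lookup_fn)(unsigned char *mem, size_t page_size,
				    unsigned int page);

/* Correct lookup: resolve the index to the start of that page. */
static unsigned char *correct_page(unsigned char *mem, size_t page_size,
				   unsigned int page)
{
	return mem + (size_t)page * page_size;
}

/* Simulated bug: ignore the index and always return the first page. */
static unsigned char *buggy_page(unsigned char *mem, size_t page_size,
				 unsigned int page)
{
	(void)page_size;
	(void)page;
	return mem;
}

/*
 * Write a distinct tag through the lookup, highest index first, then
 * check every page directly. Returns false if any page holds the wrong tag.
 */
static bool indexing_is_precise(lookup_fn lookup, unsigned char *mem,
				size_t page_size, unsigned int nr_pages)
{
	unsigned int i;

	for (i = nr_pages; i-- > 0;)
		*lookup(mem, page_size, i) = tag_value(i, 2);

	for (i = 0; i < nr_pages; i++) {
		if (mem[(size_t)i * page_size] != tag_value(i, 2))
			return false;
	}
	return true;
}
```

With a zeroed four-page buffer, indexing_is_precise() returns true for
correct_page() and false for buggy_page(), because pages 1 to 3 never
receive their tags.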

To support this test, add a new GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED
macro that handles setting up and tearing down the VM for each page
iteration. The teardown logic is adjusted to prevent a double-free in this
new scenario.
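
The teardown change follows the usual idempotent-cleanup idiom, sketched
below under stated assumptions: struct toy_fixture, toy_setup(),
toy_teardown() and the nr_frees counter are hypothetical, with a malloc'd
int standing in for the VM owned by t->vcpu.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_fixture {
	int *resource;	/* stands in for t->vcpu in the selftest */
};

/* Counts real frees so a caller can check that none happened twice. */
static int nr_frees;

static void toy_setup(struct toy_fixture *f)
{
	assert(!f->resource);	/* previous iteration must have torn down */
	f->resource = malloc(sizeof(*f->resource));
	assert(f->resource);
}

static void toy_teardown(struct toy_fixture *f)
{
	/* Already torn down by the per-iteration call: nothing to free. */
	if (!f->resource)
		return;

	free(f->resource);
	f->resource = NULL;
	nr_frees++;
}
```

Calling toy_setup() and toy_teardown() once per iteration and then
toy_teardown() one final time, as the harness does after the test body,
frees each allocation exactly once.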

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../kvm/x86/guest_memfd_conversions_test.c         | 66 ++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
index 5b070d3374eae..8e17d5c08aeb8 100644
--- a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
@@ -61,8 +61,13 @@ static void gmem_conversions_do_setup(test_data_t *t, int nr_pages,
 
 static void gmem_conversions_do_teardown(test_data_t *t)
 {
+	/* Use NULL to avoid second free in FIXTURE_TEARDOWN (multipage tests). */
+	if (!t->vcpu)
+		return;
+
 	/* No need to close gmem_fd, it's owned by the VM structure. */
 	kvm_vm_free(t->vcpu->vm);
+	t->vcpu = NULL;
 }
 
 FIXTURE_TEARDOWN(gmem_conversions)
@@ -101,6 +106,29 @@ static void __gmem_conversions_##test(test_data_t *t, int nr_pages)		\
 #define GMEM_CONVERSION_TEST_INIT_SHARED(test)					\
 	__GMEM_CONVERSION_TEST_INIT_SHARED(test, 1)
 
+/*
+ * Repeats test over nr_pages in a guest_memfd of size nr_pages, providing each
+ * test iteration with test_page, the index of the page under test in
+ * guest_memfd. test_page takes values 0..(nr_pages - 1) inclusive.
+ */
+#define GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(test, __nr_pages)		\
+static void __gmem_conversions_multipage_##test(test_data_t *t, int nr_pages,	\
+						const int test_page);		\
+										\
+TEST_F(gmem_conversions, test)							\
+{										\
+	const u64 flags = GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED; \
+	int i;									\
+										\
+	for (i = 0; i < __nr_pages; ++i) {					\
+		gmem_conversions_do_setup(self, __nr_pages, flags);		\
+		__gmem_conversions_multipage_##test(self, __nr_pages, i);	\
+		gmem_conversions_do_teardown(self);				\
+	}									\
+}										\
+static void __gmem_conversions_multipage_##test(test_data_t *t, int nr_pages,	\
+						const int test_page)
+
 struct guest_check_data {
 	void *mem;
 	char expected_val;
@@ -199,6 +227,44 @@ GMEM_CONVERSION_TEST_INIT_SHARED(init_shared)
 	test_convert_to_shared(t, 0, 'C', 'D', 'E');
 }
 
+GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(indexing, 4)
+{
+	int i;
+
+	/* Get a char that varies with both i and n. */
+#define combine(x, n) (((x) << 4) + (n))
+#define i_(n) (combine(i, n))
+#define t_(n) (combine(test_page, n))
+
+	/*
+	 * Start with the highest index, to catch any errors when, perhaps, the
+	 * first page is returned even for the last index.
+	 */
+	for (i = nr_pages - 1; i >= 0; --i)
+		test_shared(t, i, 0, i_(0), i_(2));
+
+	test_convert_to_private(t, test_page, t_(2), t_(3));
+
+	for (i = 0; i < nr_pages; ++i) {
+		if (i == test_page)
+			test_private(t, test_page, t_(3), t_(4));
+		else
+			test_shared(t, i, i_(2), i_(3), i_(4));
+	}
+
+	test_convert_to_shared(t, test_page, t_(4), t_(5), t_(6));
+
+	for (i = 0; i < nr_pages; ++i) {
+		char expected = i == test_page ? t_(6) : i_(4);
+
+		test_shared(t, i, expected, i_(7), i_(8));
+	}
+
+#undef t_
+#undef i_
+#undef combine
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




* [PATCH v8 32/46] KVM: selftests: Test conversion flow when INIT_SHARED
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Ackerley Tng <ackerleytng@google.com>

Add a test case to verify that conversions between private and shared
memory work correctly when the memory is initially created as shared.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../testing/selftests/kvm/x86/guest_memfd_conversions_test.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
index 8e09e241723e5..5b070d3374eae 100644
--- a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
@@ -95,6 +95,12 @@ static void __gmem_conversions_##test(test_data_t *t, int nr_pages)		\
 #define GMEM_CONVERSION_TEST_INIT_PRIVATE(test)					\
 	__GMEM_CONVERSION_TEST_INIT_PRIVATE(test, 1)
 
+#define __GMEM_CONVERSION_TEST_INIT_SHARED(test, __nr_pages)			\
+	GMEM_CONVERSION_TEST(test, __nr_pages, GUEST_MEMFD_FLAG_INIT_SHARED)
+
+#define GMEM_CONVERSION_TEST_INIT_SHARED(test)					\
+	__GMEM_CONVERSION_TEST_INIT_SHARED(test, 1)
+
 struct guest_check_data {
 	void *mem;
 	char expected_val;
@@ -186,6 +192,12 @@ GMEM_CONVERSION_TEST_INIT_PRIVATE(init_private)
 	test_convert_to_private(t, 0, 'C', 'E');
 }
 
+GMEM_CONVERSION_TEST_INIT_SHARED(init_shared)
+{
+	test_shared(t, 0, 0, 'A', 'B');
+	test_convert_to_private(t, 0, 'B', 'C');
+	test_convert_to_shared(t, 0, 'C', 'D', 'E');
+}
 
 int main(int argc, char *argv[])
 {

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




* [PATCH v8 30/46] KVM: selftests: Add helpers for calling ioctls on guest_memfd
From: Ackerley Tng via B4 Relay @ 2026-06-19  0:32 UTC (permalink / raw)
  To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, Baoquan He
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260618-gmem-inplace-conversion-v8-0-9d2959357853@google.com>

From: Sean Christopherson <seanjc@google.com>

Add helper functions to kvm_util.h to support calling ioctls, specifically
KVM_SET_MEMORY_ATTRIBUTES2, on a guest_memfd file descriptor.

Introduce gmem_ioctl() and __gmem_ioctl() macros, modeled after the
existing vm_ioctl() helpers, to provide a standard way to call ioctls
on a guest_memfd.

Add gmem_set_memory_attributes() and its derivatives (gmem_set_private(),
gmem_set_shared()) to set memory attributes on a guest_memfd region.
Also provide "__" variants that return the ioctl error code instead of
aborting the test. These helpers will be used by upcoming guest_memfd
tests.

To avoid code duplication, factor out the check for supported memory
attributes into a new macro, TEST_ASSERT_SUPPORTED_ATTRIBUTES, and use
it in both the existing vm_set_memory_attributes() and the new
gmem_set_memory_attributes() helpers.
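
The split between returning and asserting variants can be illustrated with
a small user-space model. Everything in it is an assumption made for the
example: struct model_attrs is not the real uapi structure, model_ioctl() is
a fake whose failure rule (reject unaligned ranges and report the offset) is
invented, and model_try_set_attrs() plays the role that the "__" helpers
play in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical ioctl argument; not the real kvm_memory_attributes2. */
struct model_attrs {
	uint64_t offset;
	uint64_t size;
	uint64_t error_offset;
};

/* Fake ioctl: rejects unaligned ranges and records where it failed. */
static int model_ioctl(struct model_attrs *attr)
{
	if (attr->offset % MODEL_PAGE_SIZE || attr->size % MODEL_PAGE_SIZE) {
		attr->error_offset = attr->offset;
		errno = EINVAL;
		return -1;
	}
	return 0;
}

/* Returning variant: hands the result back so a test can expect failure. */
static int model_try_set_attrs(uint64_t offset, uint64_t size,
			       uint64_t *error_offset)
{
	struct model_attrs attr = { .offset = offset, .size = size };
	int r = model_ioctl(&attr);

	/* Copy error_offset regardless of r so the caller can inspect it. */
	if (error_offset)
		*error_offset = attr.error_offset;

	return r;
}

/* Asserting variant: any failure aborts, for tests that expect success. */
static void model_set_attrs(uint64_t offset, uint64_t size)
{
	int r = model_try_set_attrs(offset, size, NULL);

	assert(!r);
}
```

Negative tests would call the returning variant and check both the return
value and the reported offset, while ordinary setup code would use the
asserting variant to keep call sites short.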

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/include/kvm_util.h | 94 +++++++++++++++++++++++---
 1 file changed, 86 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 0cacf3698b259..323d06b5699ec 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -392,6 +392,16 @@ static __always_inline void static_assert_is_vcpu(struct kvm_vcpu *vcpu) { }
 	__TEST_ASSERT_VM_VCPU_IOCTL(!ret, #cmd, ret, (vcpu)->vm);	\
 })
 
+#define __gmem_ioctl(gmem_fd, cmd, arg)				\
+	kvm_do_ioctl(gmem_fd, cmd, arg)
+
+#define gmem_ioctl(gmem_fd, cmd, arg)				\
+({								\
+	int ret = __gmem_ioctl(gmem_fd, cmd, arg);		\
+								\
+	TEST_ASSERT(!ret, __KVM_IOCTL_ERROR(#cmd, ret));	\
+})
+
 /*
  * Looks up and returns the value corresponding to the capability
  * (KVM_CAP_*) given by cap.
@@ -418,8 +428,16 @@ static inline void vm_enable_cap(struct kvm_vm *vm, u32 cap, u64 arg0)
 	vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
 }
 
+/*
+ * KVM_SET_MEMORY_ATTRIBUTES{,2} overwrites _all_ attributes.  These
+ * flows need significant enhancements to support multiple attributes.
+ */
+#define TEST_ASSERT_SUPPORTED_ATTRIBUTES(attributes)				\
+	TEST_ASSERT(!(attributes) || (attributes) == KVM_MEMORY_ATTRIBUTE_PRIVATE,	\
+		    "Update me to support multiple attributes!")
+
 static inline void vm_set_memory_attributes(struct kvm_vm *vm, gpa_t gpa,
-					    u64 size, u64 attributes)
+					    size_t size, u64 attributes)
 {
 	struct kvm_memory_attributes attr = {
 		.attributes = attributes,
@@ -428,17 +446,11 @@ static inline void vm_set_memory_attributes(struct kvm_vm *vm, gpa_t gpa,
 		.flags = 0,
 	};
 
-	/*
-	 * KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes.  These flows
-	 * need significant enhancements to support multiple attributes.
-	 */
-	TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,
-		    "Update me to support multiple attributes!");
+	TEST_ASSERT_SUPPORTED_ATTRIBUTES(attributes);
 
 	vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
 }
 
-
 static inline void vm_mem_set_private(struct kvm_vm *vm, gpa_t gpa,
 				      u64 size)
 {
@@ -451,6 +463,72 @@ static inline void vm_mem_set_shared(struct kvm_vm *vm, gpa_t gpa,
 	vm_set_memory_attributes(vm, gpa, size, 0);
 }
 
+static inline int __gmem_set_memory_attributes(int fd, u64 offset,
+					       size_t size, u64 attributes,
+					       u64 *error_offset)
+{
+	struct kvm_memory_attributes2 attr = {
+		.attributes = attributes,
+		.offset = offset,
+		.size = size,
+		.flags = 0,
+		.error_offset = 0,
+	};
+	int r;
+
+	r = __gmem_ioctl(fd, KVM_SET_MEMORY_ATTRIBUTES2, &attr);
+
+	/* Copy error_offset regardless of r so caller can check. */
+	if (error_offset)
+		*error_offset = attr.error_offset;
+
+	return r;
+}
+
+static inline int __gmem_set_private(int fd, u64 offset, size_t size,
+				     u64 *error_offset)
+{
+	return __gmem_set_memory_attributes(fd, offset, size,
+					    KVM_MEMORY_ATTRIBUTE_PRIVATE,
+					    error_offset);
+}
+
+static inline int __gmem_set_shared(int fd, u64 offset, size_t size,
+				    u64 *error_offset)
+{
+	return __gmem_set_memory_attributes(fd, offset, size, 0,
+					    error_offset);
+}
+
+static inline void gmem_set_memory_attributes(int fd, u64 offset,
+					      size_t size, u64 attributes)
+{
+	struct kvm_memory_attributes2 attr = {
+		.attributes = attributes,
+		.offset = offset,
+		.size = size,
+		.flags = 0,
+	};
+
+	TEST_ASSERT_SUPPORTED_ATTRIBUTES(attributes);
+
+	__TEST_REQUIRE(kvm_check_cap(KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES) > 0,
+		       "No valid attributes for guest_memfd ioctl!");
+
+	gmem_ioctl(fd, KVM_SET_MEMORY_ATTRIBUTES2, &attr);
+}
+
+static inline void gmem_set_private(int fd, u64 offset, size_t size)
+{
+	gmem_set_memory_attributes(fd, offset, size,
+				   KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+static inline void gmem_set_shared(int fd, u64 offset, size_t size)
+{
+	gmem_set_memory_attributes(fd, offset, size, 0);
+}
+
 void vm_guest_mem_fallocate(struct kvm_vm *vm, gpa_t gpa, u64 size,
 			    bool punch_hole);
 

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog




