* [PATCH v5 16/20] iommu/dma: Check atomic pool allocation result directly
From: Aneesh Kumar K.V (Arm) @ 2026-05-22 4:28 UTC (permalink / raw)
To: iommu, linux-arm-kernel, linux-kernel, linux-coco
Cc: Aneesh Kumar K.V (Arm), Robin Murphy, Marek Szyprowski,
Will Deacon, Marc Zyngier, Steven Price, Suzuki K Poulose,
Catalin Marinas, Jiri Pirko, Jason Gunthorpe, Mostafa Saleh,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260522042815.370873-1-aneesh.kumar@kernel.org>
The non-blocking, non-coherent allocation path uses dma_alloc_from_pool(),
which returns the allocated page and fills cpu_addr only on success.
Do not rely on cpu_addr to detect allocation failure in this path. Check
the returned page directly before using it for the IOMMU mapping.
Fixes: 9420139f516d ("dma-pool: fix coherent pool allocations for IOMMU mappings")
Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
---
drivers/iommu/dma-iommu.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 725c7adb0a8d..52c599f4472c 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1671,13 +1671,16 @@ void *iommu_dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
}
if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
- !gfpflags_allow_blocking(gfp) && !coherent)
+ !gfpflags_allow_blocking(gfp) && !coherent) {
page = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &cpu_addr,
gfp, attrs, NULL);
- else
+ if (!page)
+ return NULL;
+ } else {
cpu_addr = iommu_dma_alloc_pages(dev, size, &page, gfp, attrs);
- if (!cpu_addr)
- return NULL;
+ if (!cpu_addr)
+ return NULL;
+ }
*handle = __iommu_dma_map(dev, page_to_phys(page), size, ioprot,
dev->coherent_dma_mask);
--
2.43.0
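
The rule this patch applies can be sketched in userspace C, outside the kernel. The names below (pool_alloc(), map_buffer(), struct fake_page) are hypothetical stand-ins, not kernel APIs. The sketch assumes only what the commit message states: the allocator fills the out-parameter on success only, so a stale value may remain there on failure and the caller has to test the returned pointer instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page returned by a pool allocator. */
struct fake_page {
	char data[64];
};

static struct fake_page pool_slot;

/* Returns the backing object; writes *cpu_addr only on success. */
static struct fake_page *pool_alloc(int should_fail, void **cpu_addr)
{
	if (should_fail)
		return NULL;	/* *cpu_addr is left untouched here */
	*cpu_addr = pool_slot.data;
	return &pool_slot;
}

/* Returns 0 on success, -1 on allocation failure. */
static int map_buffer(int should_fail, void **out)
{
	void *cpu_addr = &pool_slot;	/* models a stale, non-NULL value */
	struct fake_page *page;

	assert(out != NULL);
	page = pool_alloc(should_fail, &cpu_addr);
	if (!page)		/* test the return value, not cpu_addr */
		return -1;
	*out = cpu_addr;
	return 0;
}
```

A caller that tested cpu_addr instead of page would treat the failed allocation above as a success, because the stale value is non-NULL.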
* [PATCH v5 17/20] dma: swiotlb: free dynamic pools from process context
From: Aneesh Kumar K.V (Arm) @ 2026-05-22 4:28 UTC (permalink / raw)
To: iommu, linux-arm-kernel, linux-kernel, linux-coco
Cc: Aneesh Kumar K.V (Arm), Robin Murphy, Marek Szyprowski,
Will Deacon, Marc Zyngier, Steven Price, Suzuki K Poulose,
Catalin Marinas, Jiri Pirko, Jason Gunthorpe, Mostafa Saleh,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260522042815.370873-1-aneesh.kumar@kernel.org>
swiotlb_dyn_free() is used after removing a dynamic swiotlb pool from
RCU-protected lists. It can call swiotlb_free_tlb(), which may need to
restore the encryption state of an unencrypted pool with
set_memory_encrypted() before freeing the pages.
RCU callbacks run in atomic context, but set_memory_encrypted() is not
guaranteed to be atomic-safe on all architectures. For example, page
attribute updates may allocate page tables or take sleeping locks.
Use queue_rcu_work() for dynamic pool freeing instead. This keeps the RCU
grace period before freeing a published pool, while running the actual pool
teardown from workqueue context. Use the same helper for the transient-pool
error path, since that path may also be reached from atomic DMA mapping
context.
Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
---
include/linux/swiotlb.h | 4 ++--
kernel/dma/swiotlb.c | 19 +++++++++++--------
2 files changed, 13 insertions(+), 10 deletions(-)
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 4dcbf3931be1..526f82e9da45 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -64,7 +64,7 @@ extern void __init swiotlb_update_mem_attributes(void);
* @areas: Array of memory area descriptors.
* @slots: Array of slot descriptors.
* @node: Member of the IO TLB memory pool list.
- * @rcu: RCU head for swiotlb_dyn_free().
+ * @dyn_free: RCU work item used to free the pool from process context.
* @transient: %true if transient memory pool.
*/
struct io_tlb_pool {
@@ -79,7 +79,7 @@ struct io_tlb_pool {
struct io_tlb_slot *slots;
#ifdef CONFIG_SWIOTLB_DYNAMIC
struct list_head node;
- struct rcu_head rcu;
+ struct rcu_work dyn_free;
bool transient;
bool unencrypted;
#endif
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index f4e8b241a1c4..4c56f64602ea 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -774,13 +774,10 @@ static void swiotlb_dyn_alloc(struct work_struct *work)
add_mem_pool(mem, pool);
}
-/**
- * swiotlb_dyn_free() - RCU callback to free a memory pool
- * @rcu: RCU head in the corresponding struct io_tlb_pool.
- */
-static void swiotlb_dyn_free(struct rcu_head *rcu)
+static void swiotlb_dyn_free_work(struct work_struct *work)
{
- struct io_tlb_pool *pool = container_of(rcu, struct io_tlb_pool, rcu);
+ struct io_tlb_pool *pool =
+ container_of(to_rcu_work(work), struct io_tlb_pool, dyn_free);
size_t slots_size = array_size(sizeof(*pool->slots), pool->nslabs);
size_t tlb_size = pool->end - pool->start;
@@ -789,6 +786,12 @@ static void swiotlb_dyn_free(struct rcu_head *rcu)
kfree(pool);
}
+static void swiotlb_schedule_dyn_free(struct io_tlb_pool *pool)
+{
+ INIT_RCU_WORK(&pool->dyn_free, swiotlb_dyn_free_work);
+ queue_rcu_work(system_wq, &pool->dyn_free);
+}
+
/**
* __swiotlb_find_pool() - find the IO TLB pool for a physical address
* @dev: Device which has mapped the DMA buffer.
@@ -835,7 +838,7 @@ static void swiotlb_del_pool(struct device *dev, struct io_tlb_pool *pool)
list_del_rcu(&pool->node);
spin_unlock_irqrestore(&dev->dma_io_tlb_lock, flags);
- call_rcu(&pool->rcu, swiotlb_dyn_free);
+ swiotlb_schedule_dyn_free(pool);
}
#endif /* CONFIG_SWIOTLB_DYNAMIC */
@@ -1276,7 +1279,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
index = swiotlb_search_pool_area(dev, pool, 0, orig_addr, tbl_dma_addr,
alloc_size, alloc_align_mask);
if (index < 0) {
- swiotlb_dyn_free(&pool->rcu);
+ swiotlb_schedule_dyn_free(pool);
return -1;
}
--
2.43.0
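
The deferral described in this patch can be modelled in userspace C as follows. This is a rough sketch under stated assumptions: struct work, schedule_pool_free() and run_pending() are hypothetical names, the list stands in for a workqueue, and the RCU grace period that queue_rcu_work() provides is not modelled at all. It only shows the split between a non-blocking "schedule" step and a later step that does the actual teardown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct work {
	struct work *next;
	void (*fn)(struct work *);
};

struct pool {
	int id;
	struct work free_work;	/* embedded work item, as in the patch */
};

static struct work *pending;	/* LIFO list of deferred work */
static int freed_count;

/* Runs later, from a context where blocking would be acceptable. */
static void pool_free_work(struct work *w)
{
	/* Recover the enclosing pool from the embedded member. */
	struct pool *p = (struct pool *)((char *)w -
					 offsetof(struct pool, free_work));

	free(p);
	freed_count++;
}

/* Only links a node, so it never blocks. */
static void schedule_pool_free(struct pool *p)
{
	assert(p != NULL);
	p->free_work.fn = pool_free_work;
	p->free_work.next = pending;
	pending = &p->free_work;
}

/* Drains the list; stands in for the workqueue thread. */
static void run_pending(void)
{
	while (pending) {
		struct work *w = pending;

		pending = w->next;
		w->fn(w);
	}
}
```

Nothing is freed at the time schedule_pool_free() is called; the pool stays valid until run_pending() executes the queued item.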
* [PATCH v5 18/20] dma: swiotlb: handle set_memory_decrypted() failures
From: Aneesh Kumar K.V (Arm) @ 2026-05-22 4:28 UTC (permalink / raw)
To: iommu, linux-arm-kernel, linux-kernel, linux-coco
Cc: Aneesh Kumar K.V (Arm), Robin Murphy, Marek Szyprowski,
Will Deacon, Marc Zyngier, Steven Price, Suzuki K Poulose,
Catalin Marinas, Jiri Pirko, Jason Gunthorpe, Mostafa Saleh,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260522042815.370873-1-aneesh.kumar@kernel.org>
Check the return value when converting swiotlb pools between encrypted and
decrypted mappings. If the default pool cannot be decrypted after early
initialization, mark the pool fully used so it cannot satisfy future bounce
allocations.
For late initialization, return the set_memory_decrypted() failure. For
restricted DMA pools, fail device initialization if the reserved pool
cannot be decrypted.
This prevents swiotlb from using pools whose encryption attributes do not
match their metadata, and avoids returning pages with uncertain encryption
state back to the allocator.
Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
---
kernel/dma/swiotlb.c | 80 +++++++++++++++++++++++++++++++++++---------
1 file changed, 65 insertions(+), 15 deletions(-)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 4c56f64602ea..14d834ca298b 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -248,6 +248,23 @@ static inline unsigned long nr_slots(u64 val)
return DIV_ROUND_UP(val, IO_TLB_SIZE);
}
+static void swiotlb_mark_pool_used(struct io_tlb_pool *pool)
+{
+ unsigned long i;
+
+ for (i = 0; i < pool->nareas; i++) {
+ pool->areas[i].index = 0;
+ pool->areas[i].used = pool->area_nslabs;
+ }
+
+ for (i = 0; i < pool->nslabs; i++) {
+ pool->slots[i].list = 0;
+ pool->slots[i].orig_addr = INVALID_PHYS_ADDR;
+ pool->slots[i].alloc_size = 0;
+ pool->slots[i].pad_slots = 0;
+ }
+}
+
/*
* Early SWIOTLB allocation may be too early to allow an architecture to
* perform the desired operations. This function allows the architecture to
@@ -272,8 +289,16 @@ void __init swiotlb_update_mem_attributes(void)
return;
bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
- if (io_tlb_default_mem.unencrypted)
- set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
+ if (io_tlb_default_mem.unencrypted) {
+ int ret;
+
+ ret = set_memory_decrypted((unsigned long)mem->vaddr,
+ bytes >> PAGE_SHIFT);
+ if (ret) {
+ pr_warn("Failed to decrypt default memory pool, disabling it\n");
+ swiotlb_mark_pool_used(mem);
+ }
+ }
}
static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
@@ -442,9 +467,10 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
{
struct io_tlb_pool *mem = &io_tlb_default_mem.defpool;
unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
+ unsigned int order, area_order, slot_order;
+ bool leak_pages = false;
unsigned int nareas;
unsigned char *vstart = NULL;
- unsigned int order, area_order;
bool retried = false;
int rc = 0;
@@ -504,6 +530,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
(PAGE_SIZE << order) >> 20);
}
+ rc = -ENOMEM;
nareas = limit_nareas(default_nareas, nslabs);
area_order = get_order(array_size(sizeof(*mem->areas), nareas));
mem->areas = (struct io_tlb_area *)
@@ -511,14 +538,20 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
if (!mem->areas)
goto error_area;
+ slot_order = get_order(array_size(sizeof(*mem->slots), nslabs));
mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
- get_order(array_size(sizeof(*mem->slots), nslabs)));
+ slot_order);
if (!mem->slots)
goto error_slots;
- if (io_tlb_default_mem.unencrypted)
- set_memory_decrypted((unsigned long)vstart,
- (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
+ if (io_tlb_default_mem.unencrypted) {
+ rc = set_memory_decrypted((unsigned long)vstart,
+ (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
+ if (rc) {
+ leak_pages = true;
+ goto error_decrypt;
+ }
+ }
swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
nareas);
@@ -527,16 +560,20 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
swiotlb_print_info();
return 0;
+error_decrypt:
+ free_pages((unsigned long)mem->slots, slot_order);
error_slots:
free_pages((unsigned long)mem->areas, area_order);
error_area:
- free_pages((unsigned long)vstart, order);
- return -ENOMEM;
+ if (!leak_pages)
+ free_pages((unsigned long)vstart, order);
+ return rc;
}
void __init swiotlb_exit(void)
{
struct io_tlb_pool *mem = &io_tlb_default_mem.defpool;
+ bool leak_pages = false;
unsigned long tbl_vaddr;
size_t tbl_size, slots_size;
unsigned int area_order;
@@ -552,19 +589,23 @@ void __init swiotlb_exit(void)
tbl_size = PAGE_ALIGN(mem->end - mem->start);
slots_size = PAGE_ALIGN(array_size(sizeof(*mem->slots), mem->nslabs));
- if (io_tlb_default_mem.unencrypted)
- set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
+ if (io_tlb_default_mem.unencrypted) {
+ if (set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT))
+ leak_pages = true;
+ }
if (mem->late_alloc) {
area_order = get_order(array_size(sizeof(*mem->areas),
mem->nareas));
free_pages((unsigned long)mem->areas, area_order);
- free_pages(tbl_vaddr, get_order(tbl_size));
+ if (!leak_pages)
+ free_pages(tbl_vaddr, get_order(tbl_size));
free_pages((unsigned long)mem->slots, get_order(slots_size));
} else {
memblock_free(mem->areas,
array_size(sizeof(*mem->areas), mem->nareas));
- memblock_phys_free(mem->start, tbl_size);
+ if (!leak_pages)
+ memblock_phys_free(mem->start, tbl_size);
memblock_free(mem->slots, slots_size);
}
@@ -1938,9 +1979,18 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
* restricted mem pool is decrypted by default
*/
if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+ int ret;
+
mem->unencrypted = true;
- set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
- rmem->size >> PAGE_SHIFT);
+ ret = set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
+ rmem->size >> PAGE_SHIFT);
+ if (ret) {
+ dev_err(dev, "Failed to decrypt restricted DMA pool\n");
+ kfree(pool->areas);
+ kfree(pool->slots);
+ kfree(mem);
+ return ret;
+ }
} else {
mem->unencrypted = false;
}
--
2.43.0
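
The effect of marking a pool fully used can be illustrated with a simplified userspace model. The structures and function names below are hypothetical and far smaller than the real struct io_tlb_pool; the sketch assumes only that an allocation is refused when the request does not fit in the remaining capacity of any area.

```c
#include <assert.h>
#include <stddef.h>

struct area {
	unsigned int used;	/* slots currently handed out */
};

struct pool {
	unsigned int nareas;
	unsigned int area_nslabs;	/* capacity of each area */
	struct area *areas;
};

/* Returns the index of the first area with room, or -1 if none. */
static int pool_find_area(struct pool *p, unsigned int nslots)
{
	unsigned int i;

	assert(p != NULL);
	for (i = 0; i < p->nareas; i++) {
		if (nslots <= p->area_nslabs - p->areas[i].used) {
			p->areas[i].used += nslots;
			return (int)i;
		}
	}
	return -1;
}

/* Make every area look full so later requests cannot be satisfied. */
static void pool_mark_used(struct pool *p)
{
	unsigned int i;

	for (i = 0; i < p->nareas; i++)
		p->areas[i].used = p->area_nslabs;
}
```

After pool_mark_used(), even a one-slot request fails, which is the property the patch relies on to keep a pool with mismatched encryption state out of use without tearing it down.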
* [PATCH v5 19/20] dma: free atomic pool pages by physical address
From: Aneesh Kumar K.V (Arm) @ 2026-05-22 4:28 UTC (permalink / raw)
To: iommu, linux-arm-kernel, linux-kernel, linux-coco
Cc: Aneesh Kumar K.V (Arm), Robin Murphy, Marek Szyprowski,
Will Deacon, Marc Zyngier, Steven Price, Suzuki K Poulose,
Catalin Marinas, Jiri Pirko, Jason Gunthorpe, Mostafa Saleh,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260522042815.370873-1-aneesh.kumar@kernel.org>
dma_direct_alloc_pages() may satisfy atomic allocations from the coherent
atomic pools. The pool allocation is keyed by the virtual address stored in
the gen_pool, but the pages API returns only the backing struct page.
On architectures with CONFIG_DMA_DIRECT_REMAP, atomic pool chunks are added
to the gen_pool using their remapped virtual address.
dma_direct_free_pages() reconstructs a linear-map address with
page_address(page) and passes that to dma_free_from_pool(). That address
does not match the gen_pool virtual range, so the pool lookup can fail and
the code can fall through to freeing a pool-owned page through the normal
page allocator path.
Add a page-based pool free helper that looks up the owning pool chunk by
physical address, translates it back to the gen_pool virtual address, and
frees that address to the pool. Use it from dma_direct_free_pages() while
keeping the existing virtual-address helper for coherent allocation frees.
Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
---
include/linux/dma-map-ops.h | 1 +
kernel/dma/direct.c | 4 +--
kernel/dma/pool.c | 54 +++++++++++++++++++++++++++++++++++++
3 files changed, 57 insertions(+), 2 deletions(-)
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 696b2c3a2305..8be059e69935 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -215,6 +215,7 @@ struct page *dma_alloc_from_pool(struct device *dev, size_t size,
void **cpu_addr, gfp_t flags, unsigned long attrs,
bool (*phys_addr_ok)(struct device *, phys_addr_t, size_t));
bool dma_free_from_pool(struct device *dev, void *start, size_t size);
+bool dma_free_from_pool_page(struct device *dev, struct page *page, size_t size);
int dma_direct_set_offset(struct device *dev, phys_addr_t cpu_start,
dma_addr_t dma_start, u64 size);
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 907c6084c616..488d53ed21f3 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -488,9 +488,9 @@ void dma_direct_free_pages(struct device *dev, size_t size,
*/
bool mark_mem_encrypted = force_dma_unencrypted(dev);
- /* If cpu_addr is not from an atomic pool, dma_free_from_pool() fails */
+ /* If page is not from an atomic pool, dma_free_from_pool_page() fails */
if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
- dma_free_from_pool(dev, vaddr, size))
+ dma_free_from_pool_page(dev, page, size))
return;
phys = page_to_phys(page);
diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
index e7df8d279e75..43b8101d860f 100644
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -356,3 +356,57 @@ bool dma_free_from_pool(struct device *dev, void *start, size_t size)
return false;
}
+
+struct dma_pool_phys_match {
+ phys_addr_t phys;
+ size_t size;
+ unsigned long addr;
+ bool found;
+};
+
+static void dma_pool_find_phys(struct gen_pool *pool, struct gen_pool_chunk *chunk,
+ void *data)
+{
+ struct dma_pool_phys_match *match = data;
+ phys_addr_t end = match->phys + match->size - 1;
+ phys_addr_t chunk_end;
+
+ if (match->found)
+ return;
+
+ chunk_end = chunk->phys_addr + (chunk->end_addr - chunk->start_addr);
+ if (match->phys < chunk->phys_addr || end > chunk_end)
+ return;
+
+ match->addr = chunk->start_addr + (match->phys - chunk->phys_addr);
+ match->found = true;
+}
+
+static bool dma_free_from_pool_phys(struct dma_gen_pool *dma_pool, phys_addr_t phys,
+ size_t size)
+{
+ struct dma_pool_phys_match match = {
+ .phys = phys,
+ .size = size,
+ };
+
+ gen_pool_for_each_chunk(dma_pool->pool, dma_pool_find_phys, &match);
+ if (!match.found)
+ return false;
+
+ gen_pool_free(dma_pool->pool, match.addr, size);
+ return true;
+}
+
+bool dma_free_from_pool_page(struct device *dev, struct page *page, size_t size)
+{
+ struct dma_gen_pool *dma_pool = NULL;
+ phys_addr_t phys = page_to_phys(page);
+
+ while ((dma_pool = dma_guess_pool(dma_pool, 0))) {
+ if (dma_free_from_pool_phys(dma_pool, phys, size))
+ return true;
+ }
+
+ return false;
+}
--
2.43.0
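
The physical-to-virtual translation used by dma_pool_find_phys() can be tried out in userspace with a simplified chunk descriptor. This is a sketch, not the gen_pool implementation: struct chunk and both functions are hypothetical, and the sketch assumes, as the arithmetic in the patch implies, that end_addr is the inclusive last address of the chunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct chunk {
	uint64_t phys_addr;	/* physical start of the chunk */
	uintptr_t start_addr;	/* virtual start of the chunk */
	uintptr_t end_addr;	/* virtual end, inclusive */
};

/* Translate [phys, phys + size) to a virtual address if it lies in c. */
static bool chunk_phys_to_virt(const struct chunk *c, uint64_t phys,
			       size_t size, uintptr_t *virt)
{
	uint64_t end, chunk_end;

	assert(size > 0);
	end = phys + size - 1;
	chunk_end = c->phys_addr + (c->end_addr - c->start_addr);
	if (phys < c->phys_addr || end > chunk_end)
		return false;

	*virt = c->start_addr + (uintptr_t)(phys - c->phys_addr);
	return true;
}

/* Try each chunk in turn, as the per-chunk callback does in the patch. */
static bool pool_phys_to_virt(const struct chunk *chunks, size_t nchunks,
			      uint64_t phys, size_t size, uintptr_t *virt)
{
	size_t i;

	for (i = 0; i < nchunks; i++) {
		if (chunk_phys_to_virt(&chunks[i], phys, size, virt))
			return true;
	}
	return false;
}
```

A range that starts inside a chunk but runs past its end is rejected, so a buffer that only partly overlaps a pool chunk is not treated as pool-owned.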
* [PATCH v5 20/20] swiotlb: Preserve allocation virtual address for dynamic pools
From: Aneesh Kumar K.V (Arm) @ 2026-05-22 4:28 UTC (permalink / raw)
To: iommu, linux-arm-kernel, linux-kernel, linux-coco
Cc: Aneesh Kumar K.V (Arm), Robin Murphy, Marek Szyprowski,
Will Deacon, Marc Zyngier, Steven Price, Suzuki K Poulose,
Catalin Marinas, Jiri Pirko, Jason Gunthorpe, Mostafa Saleh,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260522042815.370873-1-aneesh.kumar@kernel.org>
swiotlb_alloc_tlb() can allocate from the DMA atomic pool when a decrypted
pool is needed from atomic context. With CONFIG_DMA_DIRECT_REMAP, the
atomic pool is backed by remapped virtual addresses, which are not the same
as the direct-map addresses returned by phys_to_virt().
swiotlb_init_io_tlb_pool() currently reconstructs the pool virtual address
from the physical start address. For atomic-pool backed allocations this
stores the wrong address in pool->vaddr. Later, swiotlb_free_tlb() passes
that address to dma_free_from_pool(), which will fail to recognize the
chunk.
Pass the virtual address returned by the allocation path into
swiotlb_init_io_tlb_pool(), and store that address in pool->vaddr. This
keeps the pool free path using the same virtual address as the allocator.
Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
---
kernel/dma/swiotlb.c | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 14d834ca298b..e4bd8c9eaeda 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -302,9 +302,9 @@ void __init swiotlb_update_mem_attributes(void)
}
static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
- unsigned long nslabs, bool late_alloc, unsigned int nareas)
+ void *vaddr, unsigned long nslabs, bool late_alloc,
+ unsigned int nareas)
{
- void *vaddr = phys_to_virt(start);
unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
mem->nslabs = nslabs;
@@ -445,7 +445,7 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
return;
}
- swiotlb_init_io_tlb_pool(mem, __pa(tlb), nslabs, false, nareas);
+ swiotlb_init_io_tlb_pool(mem, __pa(tlb), tlb, nslabs, false, nareas);
add_mem_pool(&io_tlb_default_mem, mem);
if (flags & SWIOTLB_VERBOSE)
@@ -553,7 +553,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
}
}
- swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
+ swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), vstart, nslabs, true,
nareas);
add_mem_pool(&io_tlb_default_mem, mem);
@@ -664,25 +664,26 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes,
* @phys_limit: Maximum allowed physical address of the buffer.
* @attrs: DMA attributes for the allocation.
* @gfp: GFP flags for the allocation.
+ * @vaddr: Receives the virtual address for the allocated buffer.
*
* Return: Allocated pages, or %NULL on allocation failure.
*/
static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
- u64 phys_limit, unsigned long attrs, gfp_t gfp)
+ u64 phys_limit, unsigned long attrs, gfp_t gfp, void **vaddr)
{
struct page *page;
+ *vaddr = NULL;
+
/*
* Allocate from the atomic pools if memory is encrypted and
* the allocation is atomic, because decrypting may block.
*/
if (!gfpflags_allow_blocking(gfp) && (attrs & DMA_ATTR_CC_SHARED)) {
- void *vaddr;
-
if (!IS_ENABLED(CONFIG_DMA_COHERENT_POOL))
return NULL;
- return dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
+ return dma_alloc_from_pool(dev, bytes, vaddr, gfp,
attrs, dma_coherent_ok);
}
@@ -705,6 +706,8 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
return NULL;
}
+ if (page)
+ *vaddr = phys_to_virt(page_to_phys(page));
return page;
}
@@ -750,6 +753,7 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
{
struct io_tlb_pool *pool;
unsigned int slot_order;
+ void *tlb_vaddr;
struct page *tlb;
size_t pool_size;
size_t tlb_size;
@@ -767,7 +771,8 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
pool->unencrypted = !!(attrs & DMA_ATTR_CC_SHARED);
tlb_size = nslabs << IO_TLB_SHIFT;
- while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, attrs, gfp))) {
+ while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, attrs, gfp,
+ &tlb_vaddr))) {
if (nslabs <= minslabs)
goto error_tlb;
nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
@@ -781,12 +786,12 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
if (!pool->slots)
goto error_slots;
- swiotlb_init_io_tlb_pool(pool, page_to_phys(tlb), nslabs, true, nareas);
+ swiotlb_init_io_tlb_pool(pool, page_to_phys(tlb), tlb_vaddr, nslabs,
+ true, nareas);
return pool;
error_slots:
- swiotlb_free_tlb(page_address(tlb), tlb_size,
- !!(attrs & DMA_ATTR_CC_SHARED));
+ swiotlb_free_tlb(tlb_vaddr, tlb_size, !!(attrs & DMA_ATTR_CC_SHARED));
error_tlb:
kfree(pool);
error:
@@ -1995,7 +2000,8 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
mem->unencrypted = false;
}
- swiotlb_init_io_tlb_pool(pool, rmem->base, nslabs,
+ swiotlb_init_io_tlb_pool(pool, rmem->base, phys_to_virt(rmem->base),
+ nslabs,
false, nareas);
mem->force_bounce = true;
mem->for_alloc = true;
--
2.43.0
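
Why the free path must reuse the allocator's own virtual address can be shown with a small userspace model. All names and constants here are hypothetical: remap_alloc() and remap_free() stand in for an allocator keyed by a remapped virtual address, and linear_virt() stands in for a phys_to_virt() style direct-map translation. The sketch assumes a single-slot allocator for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REMAP_PHYS	UINT64_C(0x80000000)	/* hypothetical physical base */
#define REMAP_VADDR	((uintptr_t)0x40000000)	/* hypothetical remapped address */
#define LINEAR_OFFSET	((uintptr_t)0x10000000)	/* hypothetical direct-map offset */

static bool remap_busy;

/* Direct-map style translation; does not know about the remap. */
static uintptr_t linear_virt(uint64_t phys)
{
	return (uintptr_t)phys + LINEAR_OFFSET;
}

/* Hands out the single slot and reports both of its addresses. */
static bool remap_alloc(uint64_t *phys, uintptr_t *vaddr)
{
	assert(phys != NULL && vaddr != NULL);
	if (remap_busy)
		return false;
	remap_busy = true;
	*phys = REMAP_PHYS;
	*vaddr = REMAP_VADDR;
	return true;
}

/* Frees only when given the address the allocator handed out. */
static bool remap_free(uintptr_t vaddr)
{
	if (!remap_busy || vaddr != REMAP_VADDR)
		return false;
	remap_busy = false;
	return true;
}
```

Freeing with the address reconstructed from the physical start fails, while freeing with the stored allocation address succeeds, which mirrors the reason for passing vaddr through to swiotlb_init_io_tlb_pool().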
* Re: [PATCH v5 1/2] dma-mapping: introduce DMA_ATTR_CC_SHARED for shared memory
From: Aneesh Kumar K.V @ 2026-05-22 4:39 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jiri Pirko, dri-devel, linaro-mm-sig, iommu, linux-media,
sumit.semwal, benjamin.gaignard, Brian.Starkey, jstultz,
tjmercier, christian.koenig, m.szyprowski, robin.murphy, leon,
sean.anderson, ptesarik, catalin.marinas, suzuki.poulose,
steven.price, thomas.lendacky, john.allen, ashish.kalra,
suravee.suthikulpanit, linux-coco
In-Reply-To: <20260521175420.GA7702@ziepe.ca>
Jason Gunthorpe <jgg@ziepe.ca> writes:
> On Thu, May 21, 2026 at 09:05:39PM +0530, Aneesh Kumar K.V wrote:
>> I am wondering whether this is better
>>
>> static inline dma_addr_t dma_direct_map_phys(struct device *dev,
>> phys_addr_t phys, size_t size, enum dma_data_direction dir,
>> unsigned long attrs, bool flush)
>> {
>> dma_addr_t dma_addr;
>>
>> /*
>> * For a device requiring unencrypted DMA, MMIO memory is treated
>> * as shared.
>> */
>> if (force_dma_unencrypted(dev) && (attrs & DMA_ATTR_MMIO))
>> attrs |= DMA_ATTR_CC_SHARED;
>
> It is an option, I would be happier if we went and fixed the few
> callers to properly pass the shared. CC did this with the
> pgprot_decrypted() stuff, same reasoning:
>
> diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
> index bfdb9ed7074116..e77f6404caa3db 100644
> --- a/block/blk-mq-dma.c
> +++ b/block/blk-mq-dma.c
> @@ -90,7 +90,7 @@ static bool blk_dma_map_direct(struct request *req, struct device *dma_dev,
> unsigned int attrs = 0;
>
> if (iter->p2pdma.map == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
> - attrs |= DMA_ATTR_MMIO;
> + attrs |= iter->p2pdma.mem->dma_mapping_flags;
>
> iter->addr = dma_map_phys(dma_dev, vec->paddr, vec->len,
> rq_dma_dir(req), attrs);
> @@ -115,7 +115,7 @@ static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev,
> iter->len = dma_iova_size(state);
>
> if (iter->p2pdma.map == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
> - attrs |= DMA_ATTR_MMIO;
> + attrs |= iter->p2pdma.mem->dma_mapping_flags;
>
> do {
> error = dma_iova_link(dma_dev, state, vec->paddr, mapped,
> diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
> index 794acff2546a34..96022fadc48245 100644
> --- a/drivers/dma-buf/dma-buf-mapping.c
> +++ b/drivers/dma-buf/dma-buf-mapping.c
> @@ -147,7 +147,7 @@ struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
> ret = dma_iova_link(attach->dev, dma->state,
> phys_vec[i].paddr, 0,
> phys_vec[i].len, dir,
> - DMA_ATTR_MMIO);
> + provider->dma_mapping_flags);
> if (ret)
> goto err_unmap_dma;
>
> @@ -155,7 +155,7 @@ struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
> } else {
> addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
> phys_vec[i].len, dir,
> - DMA_ATTR_MMIO);
> + provider->dma_mapping_flags);
> ret = dma_mapping_error(attach->dev, addr);
> if (ret)
> goto err_unmap_dma;
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 7c898542af8d5e..e4229b4d35c767 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -282,6 +282,8 @@ int pcim_p2pdma_init(struct pci_dev *pdev)
> continue;
>
> p2p->mem[i].owner = &pdev->dev;
> + p2p->mem[i].dma_mapping_flags =
> + DMA_ATTR_MMIO | DMA_ATTR_CC_SHARED;
> p2p->mem[i].bus_offset =
> pci_bus_address(pdev, i) - pci_resource_start(pdev, i);
> }
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 873de20a224759..402dc5e5d62b0a 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -21,10 +21,12 @@ struct scatterlist;
> *
> * A p2pdma provider is a range of MMIO address space available to the CPU.
> * @owner: Device to which this provider belongs.
> + * @dma_mapping_flags: DMA attributes to use for host bridge mappings.
> * @bus_offset: Bus offset for p2p communication.
> */
> struct p2pdma_provider {
> struct device *owner;
> + unsigned long dma_mapping_flags;
> u64 bus_offset;
> };
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 5955f2f0c83db1..c3f445acddf873 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -811,7 +811,7 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
> case PCI_P2PDMA_MAP_NONE:
> break;
> case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> - attrs |= DMA_ATTR_MMIO;
> + attrs |= p2pdma_state->mem->dma_mapping_flags;
> pfns[idx] |= HMM_PFN_P2PDMA;
> break;
> case PCI_P2PDMA_MAP_BUS_ADDR:
Can I convert this as an independent patch with your SOB?
-aneesh
* [ANN] Linux Security Summit Europe 2026 CfP
From: Reshetova, Elena @ 2026-05-22 5:56 UTC (permalink / raw)
To: linux-security-module@vger.kernel.org
Cc: Linux Security Summit Program Committee, Friend,
linux-integrity@vger.kernel.org, lwn@lwn.net,
linux-crypto@vger.kernel.org, keyrings@vger.kernel.org,
linux-coco@lists.linux.dev, kernel-hardening@lists.openwall.com
====================================================================
ANNOUNCEMENT AND CALL FOR PARTICIPATION
LINUX SECURITY SUMMIT EUROPE 2026
Thursday, 8 October
Prague, Czechia
====================================================================
DESCRIPTION
Linux Security Summit Europe (LSS-EU) 2026 is a technical forum for
collaboration between Linux developers, researchers, and end-users.
Its primary aim is to foster community efforts in deeply analyzing and
solving Linux operating system security challenges, including those in the
Linux kernel.
This year LSS-EU is a single-day event happening right after Linux Plumbers 2026
https://lpc.events/
Proposals to LSS-EU should be submitted via:
https://events.linuxfoundation.org/linux-security-summit-europe/program/cfp/
SUGGESTED TOPICS
* Access Control
* Case Studies
* Cryptography and Key Management
* Emerging Technologies, Threats & Techniques
* Hardware Security
* IoT and Embedded Security
* Integrity Policy and Enforcement
* Open Source Supply Chain for the Linux OS
* Security Tools
* Security UX
* Linux OS Hardening
* Virtualization and Containers
DATES TO REMEMBER:
* CFP Close: Sunday, 28 June at 11:59 PM CEST (UTC +2) / 2:59 PM PDT (UTC -7)
* CFP Notifications: Tuesday, 14 July
* Schedule Announced: Wednesday, 15 July
* Event Date: Thursday, 8 October
WHO SHOULD ATTEND
We're seeking a diverse range of attendees and welcome participation by
people involved in Linux security development, operations, and research.
LSS is a unique global event that provides the opportunity to present and
discuss your work or research with key Linux security community members and
maintainers. It's also useful for those who wish to keep up with the latest
in Linux security development and to provide input to the development
process.
MASTODON
For event updates and announcements, follow:
https://social.kernel.org/LinuxSecSummit
#linuxsecuritysummit
PROGRAM COMMITTEE
The program committee for LSS 2026 is:
* James Morris, Microsoft
* Serge Hallyn, Geico
* Paul Moore, Microsoft
* Stephen Smalley, NSA
* Elena Reshetova, Intel
* John Johansen, Canonical
* Kees Cook, Google
* Casey Schaufler
* Mimi Zohar, IBM
* David A. Wheeler, Linux Foundation
The program committee may be contacted as a group via email:
lss-pc () lists.linuxfoundation.org
* Re: [PATCH v14 04/44] arm64: RMI: Add SMC definitions for calling the RMM
From: Marc Zyngier @ 2026-05-22 9:58 UTC (permalink / raw)
To: Steven Price
Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <3261b04f-1a0c-451d-8981-1e2bccc8a9ca@arm.com>
On Thu, 21 May 2026 16:33:09 +0100,
Steven Price <steven.price@arm.com> wrote:
>
> On 21/05/2026 13:40, Marc Zyngier wrote:
> > On Wed, 13 May 2026 14:17:12 +0100,
> > Steven Price <steven.price@arm.com> wrote:
> >>
> >> The RMM (Realm Management Monitor) provides functionality that can be
> >> accessed by SMC calls from the host.
> >>
> >> The SMC definitions are based on DEN0137[1] version 2.0-bet1
> >>
> >> [1] https://developer.arm.com/documentation/den0137/2-0bet1/
> >>
> >> Signed-off-by: Steven Price <steven.price@arm.com>
> >> ---
> >> Changes since v13:
> >> * Updated to RMM spec v2.0-bet1
> >> Changes since v12:
> >> * Updated to RMM spec v2.0-bet0
> >> Changes since v9:
> >> * Corrected size of 'ripas_value' in struct rec_exit. The spec states
> >> this is an 8-bit type with padding afterwards (rather than a u64).
> >> Changes since v8:
> >> * Added RMI_PERMITTED_GICV3_HCR_BITS to define which bits the RMM
> >> permits to be modified.
> >> Changes since v6:
> >> * Renamed REC_ENTER_xxx defines to include 'FLAG' to make it obvious
> >> these are flag values.
> >> Changes since v5:
> >> * Sorted the SMC #defines by value.
> >> * Renamed SMI_RxI_CALL to SMI_RMI_CALL since the macro is only used for
> >> RMI calls.
> >> * Renamed REC_GIC_NUM_LRS to REC_MAX_GIC_NUM_LRS since the actual
> >> number of available list registers could be lower.
> >> * Provided a define for the reserved fields of FeatureRegister0.
> >> * Fix inconsistent names for padding fields.
> >> Changes since v4:
> >> * Update to point to final released RMM spec.
> >> * Minor rearrangements.
> >> Changes since v3:
> >> * Update to match RMM spec v1.0-rel0-rc1.
> >> Changes since v2:
> >> * Fix specification link.
> >> * Rename rec_entry->rec_enter to match spec.
> >> * Fix size of pmu_ovf_status to match spec.
> >> ---
> >> arch/arm64/include/asm/rmi_smc.h | 448 +++++++++++++++++++++++++++++++
> >> 1 file changed, 448 insertions(+)
> >> create mode 100644 arch/arm64/include/asm/rmi_smc.h
> >>
> >> diff --git a/arch/arm64/include/asm/rmi_smc.h b/arch/arm64/include/asm/rmi_smc.h
> >> new file mode 100644
> >> index 000000000000..a09b7a631fef
> >> --- /dev/null
> >> +++ b/arch/arm64/include/asm/rmi_smc.h
> >> @@ -0,0 +1,448 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +/*
> >> + * Copyright (C) 2023-2026 ARM Ltd.
> >> + *
> >> + * The values and structures in this file are from the Realm Management Monitor
> >> + * specification (DEN0137) version 2.0-bet1:
> >> + * https://developer.arm.com/documentation/den0137/2-0bet1/
> >
> > How long is this spec going to be available on the ARM web site, which
> > has a tendency of being reorganised every other week? And there is
> > already a beta2.
>
> Obviously I can't predict the next reorganisation - but at least it's a
> link that could be fed into archive.org or similar.
I found that the PDF spec was less susceptible to creative nonsense,
and people can download it for future reference, whereas ARM has
happily *deleted* specs from the website over time (try to find PSCI
0.1, for example...).
[...]
> >> +struct realm_params {
> >> + union { /* 0x0 */
> >> + struct {
> >> + u64 flags;
> >> + u64 s2sz;
> >> + u64 sve_vl;
> >> + u64 num_bps;
> >> + u64 num_wps;
> >> + u64 pmu_num_ctrs;
> >> + u64 hash_algo;
> >> + u64 num_aux_planes;
> >> + };
> >> + u8 padding0[0x400];
> >
> > SZ_1K? And similarly all over the shop?
>
> I'm a bit less sure that makes the code more readable - these structures
> are a bit of a pain because they are somewhat sparse. I've left a
> comment where the beginning of each union is, and personally I find it
> easier to see 0x0 + 0x400 == 0x400 rather than trying to work out what
> SZ_1K is in hex. This is particularly the case in terms of:
>
> > struct rec_params {
> > union { /* 0x0 */
> > u64 flags;
> > u8 padding0[0x100];
> > };
> > union { /* 0x100 */
> > u64 mpidr;
> > u8 padding1[0x100];
> > };
> > union { /* 0x200 */
> > u64 pc;
> > u8 padding2[0x100];
> > };
> > union { /* 0x300 */
> > u64 gprs[REC_CREATE_NR_GPRS];
> > u8 padding3[0xd00];
> > };
> > };
>
> Where 0xd00 doesn't even have a corresponding SZ_ define.
Indeed, but it is (SZ_4K - SZ_256 * 3). And a lot of these structures
seem to be designed to form a 4kB blob. I'm sure we can make use of
that information (BUILD_BUG_ON?).
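A minimal sketch of what such a build-time check could look like, written
as standalone C11 rather than kernel code. The struct name
rec_params_sketch, the local SZ_* defines and the value chosen for
REC_CREATE_NR_GPRS are assumptions made for illustration only; in the
kernel the same checks would presumably use static_assert() or
BUILD_BUG_ON() with the real SZ_* constants.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Local stand-ins for the kernel's SZ_* constants. */
#define SZ_256 0x100
#define SZ_4K  0x1000

/* Assumed value, for illustration only. */
#define REC_CREATE_NR_GPRS 8

struct rec_params_sketch {
	union { /* 0x0 */
		uint64_t flags;
		uint8_t padding0[SZ_256];
	};
	union { /* 0x100 */
		uint64_t mpidr;
		uint8_t padding1[SZ_256];
	};
	union { /* 0x200 */
		uint64_t pc;
		uint8_t padding2[SZ_256];
	};
	union { /* 0x300 */
		uint64_t gprs[REC_CREATE_NR_GPRS];
		/* Remainder of the 4kB blob: 0x1000 - 0x300 == 0xd00 */
		uint8_t padding3[SZ_4K - SZ_256 * 3];
	};
};

/* Each field must sit at the offset given in its comment. */
static_assert(offsetof(struct rec_params_sketch, mpidr) == 0x100, "mpidr offset");
static_assert(offsetof(struct rec_params_sketch, pc) == 0x200, "pc offset");
static_assert(offsetof(struct rec_params_sketch, gprs) == 0x300, "gprs offset");

/* The whole structure must form exactly one 4kB blob. */
static_assert(sizeof(struct rec_params_sketch) == SZ_4K, "rec_params must be 4kB");
```

With checks like these, a mistyped padding size fails the build instead of
silently shifting every later field.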
>
> The RMM deals with this with macro magic:
>
> > struct rmi_rec_params {
> > /* Flags */
> > SET_MEMBER_RMI(unsigned long flags, 0, 0x100); /* Offset 0 */
> > /* MPIDR of the REC */
> > SET_MEMBER_RMI(unsigned long mpidr, 0x100, 0x200); /* 0x100 */
> > /* Program counter */
> > SET_MEMBER_RMI(unsigned long pc, 0x200, 0x300); /* 0x200 */
> > /* General-purpose registers */
> > SET_MEMBER_RMI(unsigned long gprs[REC_CREATE_NR_GPRS], 0x300, 0x1000); /* 0x300 */
> > };
>
> where the offsets are just directly encoded in the macro - but it's not
> an especially robust macro and I'm not convinced it's more readable.
I think this is just as horrible, but at least it seems to take the
boundaries of the structure into account.
>
> I'm happy to hear other suggestions on how to encode this neatly.
Honestly, I wouldn't mind having the structures described in a more
abstract way and then pre-processed to generate the include files. If
the architectural MRS wasn't so huge, I would have added it to the
kernel and used that directly for KVM.
>
> > I haven't checked the details of the encodings (life is too short),
> > but I wonder how much of this exists as an MRS and could be
> > automatically generated?
>
> Automatically generating this would be good - I'm not sure whether we
> have a (public) source available to generate from at the moment. I have
> tried to methodically work through the spec when updating this file, but
> as Gavin has already pointed out there was at least one mistake (in
> currently unused definitions) this time.
I'm slightly baffled that even the RMM is written this way. Given the
formalism used in the RMM spec, I was expecting that you'd have a
bunch of JSON at hand and able to generate any output from that. Doing
this stuff by hand is both incredibly dull work *and* extremely error
prone.
Thanks,
M.
--
Jazz isn't dead. It just smells funny.
^ permalink raw reply
* Re: [PATCH v14 09/44] arm64: RMI: Provide functions to delegate/undelegate ranges of memory
From: Marc Zyngier @ 2026-05-22 10:02 UTC (permalink / raw)
To: Suzuki K Poulose
Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Will Deacon,
James Morse, Oliver Upton, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <a81c3688-30f1-43e4-8d57-1d08a6e563af@arm.com>
On Thu, 21 May 2026 17:01:37 +0100,
Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
>
> On 21/05/2026 14:59, Marc Zyngier wrote:
> > On Wed, 13 May 2026 14:17:17 +0100,
> > Steven Price <steven.price@arm.com> wrote:
> >>
> >> The RMM requires memory to be 'delegated' to it so that it can be used
> >> either for a realm guest or for various tracking purposes within the RMM
> >> (e.g. for metadata or page tables). Memory that has been delegated
> >> cannot be accessed by the host (it will result in a Granule Protection
> >> Fault).
> >>
> >> Undelegation may fail if the memory is still in use by the RMM. This
> >> shouldn't happen (Linux should ensure it has destroyed the RMM objects
> >> before attempting to undelegate). In the event that it does happen this
> >> points to a programming bug and the only reasonable approach is for the
> >> physical pages to be leaked - it is up to the caller of
> >> rmi_undelegate_range() to handle this.
> >>
> >> Signed-off-by: Steven Price <steven.price@arm.com>
> >> ---
> >> v14:
> >> * Split into separate patch and moved out of KVM
> >> ---
> >> arch/arm64/include/asm/rmi_cmds.h | 13 +++++++++++
> >> arch/arm64/kernel/rmi.c | 36 +++++++++++++++++++++++++++++++
> >> 2 files changed, 49 insertions(+)
> >>
> >> diff --git a/arch/arm64/include/asm/rmi_cmds.h b/arch/arm64/include/asm/rmi_cmds.h
> >> index 9078a2920a7c..eb213c8e6f26 100644
> >> --- a/arch/arm64/include/asm/rmi_cmds.h
> >> +++ b/arch/arm64/include/asm/rmi_cmds.h
> >> @@ -33,6 +33,19 @@ struct rmi_sro_state {
> >> } while (RMI_RETURN_STATUS(res.a0) == RMI_BUSY || \
> >> RMI_RETURN_STATUS(res.a0) == RMI_BLOCKED)
> >> +int rmi_delegate_range(phys_addr_t phys, unsigned long size);
> >> +int rmi_undelegate_range(phys_addr_t phys, unsigned long size);
> >> +
> >> +static inline int rmi_delegate_page(phys_addr_t phys)
> >> +{
> >> + return rmi_delegate_range(phys, PAGE_SIZE);
> >> +}
> >> +
> >> +static inline int rmi_undelegate_page(phys_addr_t phys)
> >> +{
> >> + return rmi_undelegate_range(phys, PAGE_SIZE);
> >> +}
> >> +
> >> bool rmi_is_available(void);
> >> unsigned long rmi_sro_execute(struct rmi_sro_state *sro, gfp_t
> >> gfp);
> >> diff --git a/arch/arm64/kernel/rmi.c b/arch/arm64/kernel/rmi.c
> >> index 52a415e99500..08cef54acadb 100644
> >> --- a/arch/arm64/kernel/rmi.c
> >> +++ b/arch/arm64/kernel/rmi.c
> >> @@ -12,6 +12,42 @@ static bool arm64_rmi_is_available;
> >> unsigned long rmm_feat_reg0;
> >> unsigned long rmm_feat_reg1;
> >> +int rmi_delegate_range(phys_addr_t phys, unsigned long size)
> >> +{
> >> + unsigned long ret = 0;
> >> + unsigned long top = phys + size;
> >> + unsigned long out_top;
> >> +
> >> + while (phys < top) {
> >> + ret = rmi_granule_range_delegate(phys, top, &out_top);
> >> + if (ret == RMI_SUCCESS)
> >> + phys = out_top;
> >> + else if (ret != RMI_BUSY && ret != RMI_BLOCKED)
> >> + return ret;
> >> + }
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +int rmi_undelegate_range(phys_addr_t phys, unsigned long size)
> >> +{
> >> + unsigned long ret = 0;
> >> + unsigned long top = phys + size;
> >> + unsigned long out_top;
> >> +
> >> + WARN_ON(size == 0);
> >
> > I find it odd to warn on size = 0. After all, free(NULL) is not an
> > error. But even then, you continue feeding this to the RMM.
> >
> > You also don't seem to be bothered with that on the delegation side...
> >
> >> +
> >> + while (phys < top) {
> >> + ret = rmi_granule_range_undelegate(phys, top, &out_top);
> >> + if (ret == RMI_SUCCESS)
> >> + phys = out_top;
> >
> > and size==0 doesn't violate any of the failure conditions listed in
> > B4.5.18.2 (beta2). Will you end up looping around forever?
>
> That is not true? It triggers the top_bound error condition for both.
>
>
> pre: UInt(top) <= UInt(base)
> post: result.status == RMI_ERROR_INPUT
News flash, I can't read. Ignore me.
M.
--
Jazz isn't dead. It just smells funny.
^ permalink raw reply
* Re: [PATCH v6 21/43] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Sean Christopherson @ 2026-05-22 13:08 UTC (permalink / raw)
To: Ackerley Tng
Cc: Fuad Tabba, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgFB8ydih9JTmsH06H32j38tH-iViZqN_eZ_gQAmXpw+Dw@mail.gmail.com>
On Thu, May 21, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
> > On Thu, May 21, 2026, Fuad Tabba wrote:
> >> Hi,
> >>
> >> On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
> > diff --git include/linux/kvm_host.h include/linux/kvm_host.h
> > index 61a3430957f2..b83cda2870ba 100644
> > --- include/linux/kvm_host.h
> > +++ include/linux/kvm_host.h
> > @@ -2596,7 +2596,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
> > typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > struct page *page, void *opaque);
> >
> > -long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
> > +long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
> > + long npages, bool writable,
>
> What do you think of need_writable_src instead of just writable for the
> variable name?
How about "may_write_src" or "may_writeback_src"?
> > kvm_gmem_populate_cb post_populate, void *opaque);
> > #endif
> >
> > diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> > index a35a55571a2d..6553d4e032ce 100644
> > --- virt/kvm/guest_memfd.c
> > +++ virt/kvm/guest_memfd.c
> > @@ -858,7 +858,8 @@ static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> > return ret;
> > }
> >
> > -long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
> > +long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
> > + long npages, bool writable,
> > kvm_gmem_populate_cb post_populate, void *opaque)
> > {
> > struct kvm_memory_slot *slot;
> > @@ -892,8 +893,9 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> >
> > if (src) {
> > unsigned long uaddr = (unsigned long)src + i * PAGE_SIZE;
> > + unsigned int flags = writable ? FOLL_WRITE : 0;
>
> How about using FOLL_WRITE | FOLL_NOFAULT so if it weren't writable to
> start with, don't CoW, just error out?
Eh, I don't see any value in erroring out if userspace is doing something
unusual. If breaking CoW was actually problematic somehow, then sure. But AFAICT
it's overall harmless.
> Like you said above the CPUID page provided as src_page would have been
> written to before, so it should have been mapped as writable.
^ permalink raw reply
* Re: [PATCH v5 1/2] dma-mapping: introduce DMA_ATTR_CC_SHARED for shared memory
From: Jason Gunthorpe @ 2026-05-22 13:22 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Jiri Pirko, dri-devel, linaro-mm-sig, iommu, linux-media,
sumit.semwal, benjamin.gaignard, Brian.Starkey, jstultz,
tjmercier, christian.koenig, m.szyprowski, robin.murphy, leon,
sean.anderson, ptesarik, catalin.marinas, suzuki.poulose,
steven.price, thomas.lendacky, john.allen, ashish.kalra,
suravee.suthikulpanit, linux-coco
In-Reply-To: <yq5aqzn45a81.fsf@kernel.org>
On Fri, May 22, 2026 at 10:09:26AM +0530, Aneesh Kumar K.V wrote:
> Can I convert this as an independent patch with your SOB?
Sure,
Maybe you can ignore it for your series, the intersection of CC and
P2P is non-existent right now. HW doesn't support it. Make the DMA API
follow the design assuming this patch is applied.
Jason
^ permalink raw reply
* Re: [PATCH v5 02/20] [DO NOT MERGE] s390: Expose protected virtualization through cc_platform_has()
From: JAEHOON KIM @ 2026-05-22 15:35 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
linux-coco
Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
Jason Gunthorpe, Mostafa Saleh, Petr Tesarik,
Alexey Kardashevskiy, Dan Williams, Xu Yilun, linuxppc-dev,
linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86, Halil Pasic,
Matthew Rosato
In-Reply-To: <20260522042815.370873-3-aneesh.kumar@kernel.org>
On 5/21/2026 11:27 PM, Aneesh Kumar K.V (Arm) wrote:
> Protected virtualization guests use memory encryption, so advertise that to
> the rest of the kernel through cc_platform_has(CC_ATTR_MEM_ENCRYPT).
>
> s390 already forces DMA mappings to be unencrypted for protected
> virtualization guests through force_dma_unencrypted(). Add
> ARCH_HAS_CC_PLATFORM and provide the matching cc_platform_has()
> implementation.
>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> Cc: Jaehoon Kim <jhkim@linux.ibm.com>
> ---
> arch/s390/Kconfig | 1 +
> arch/s390/mm/init.c | 14 ++++++++++++++
> 2 files changed, 15 insertions(+)
>
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index ecbcbb781e40..9b5e6029e043 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -87,6 +87,7 @@ config S390
> select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
> select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
> select ARCH_HAS_CC_CAN_LINK
> + select ARCH_HAS_CC_PLATFORM
> select ARCH_HAS_CPU_FINALIZE_INIT
> select ARCH_HAS_CURRENT_STACK_POINTER
> select ARCH_HAS_DEBUG_VIRTUAL
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 1f72efc2a579..ad3c6d92b801 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -50,6 +50,7 @@
> #include <linux/virtio_anchor.h>
> #include <linux/virtio_config.h>
> #include <linux/execmem.h>
> +#include <linux/cc_platform.h>
>
> pgd_t swapper_pg_dir[PTRS_PER_PGD] __section(".bss..swapper_pg_dir");
> pgd_t invalid_pg_dir[PTRS_PER_PGD] __section(".bss..invalid_pg_dir");
> @@ -140,6 +141,19 @@ bool force_dma_unencrypted(struct device *dev)
> return is_prot_virt_guest();
> }
>
> +
> +bool cc_platform_has(enum cc_attr attr)
> +{
> + switch (attr) {
> + case CC_ATTR_MEM_ENCRYPT:
> + return is_prot_virt_guest();
> +
> + default:
> + return false;
> + }
> +}
> +EXPORT_SYMBOL_GPL(cc_platform_has);
> +
> /* protected virtualization */
> static void __init pv_init(void)
> {
Hello Aneesh,
Thanks for adding this s390 support patch.
The previous v4 series broke virtio initialization and caused boot
failures on s390. With this patch in v5, the issue is completely
resolved and virtio devices now initialize successfully and are
working well.
I'm going to do some more testing and will let you know if I run
into any issues.
Thanks,
Jaehoon.
^ permalink raw reply
* Re: [PATCH v2 2/2] x86/tdx: Fix zero-extension for 32-bit port I/O
From: Kiryl Shutsemau @ 2026-05-22 16:22 UTC (permalink / raw)
To: Dave Hansen
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H . Peter Anvin, Rick Edgecombe, Kuppuswamy Sathyanarayanan,
Kai Huang, Borys Tsyrulnikov, linux-kernel, linux-coco, kvm,
stable
In-Reply-To: <bf92ebbf-8d70-406a-aea1-c11ca576de90@intel.com>
On Tue, May 12, 2026 at 06:14:13PM -0700, Dave Hansen wrote:
> On 4/28/26 05:56, Kiryl Shutsemau (Meta) wrote:
> > + if (size == 4)
> > + regs->ax = 0;
> > + else
> > + regs->ax &= ~mask;
>
> I haven't thought about this _that_ much, but this feels wrong. Why is
> 4 so special cased?
>
> Also, what _are_ the limits on the registers that 'in' can be used on?
>
> RAX - n/a, no 64-bit I/O
> EAX - size=4
> AX - size=2
> AH - n/a no encoding for inb
> AL - size=1
>
> I'd find this much easier to grasp if there was a nice table of what the
> registers, sizes, and masks ended up being usable. As usual, x86 is
> "fun" here.
How about this for the comment:
/*
* IN writes the result into a sub-register of RAX. Only the
* 32-bit form zero-extends; the smaller forms leave the upper
* bits untouched:
*
* insn dest size bits written bits preserved
* inb AL 1 RAX[ 7: 0] RAX[63: 8]
* inw AX 2 RAX[15: 0] RAX[63:16]
* inl EAX 4 RAX[63: 0] (none, zero-extended)
*
* 'mask' only covers the low 'size' bytes, which is exactly
* the range affected for size 1 and 2. For size 4 the write
* also clears RAX[63:32], so widen the clear-mask.
*/
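The table above can be sketched as a small standalone helper. This is an
illustration only, not the code from the patch: in_merge_rax() is a
hypothetical name, and it assumes the emulated value is merged into RAX
after the old bits have been cleared.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: merge the result of an emulated IN of 'size'
 * bytes into the previous value of RAX and return the new RAX.
 */
static uint64_t in_merge_rax(uint64_t rax, uint64_t val, int size)
{
	uint64_t mask;

	assert(size == 1 || size == 2 || size == 4);

	/* Covers only the low 'size' bytes. */
	mask = (1ULL << (size * 8)) - 1;

	if (size == 4)
		rax = 0;        /* inl: RAX[63:32] is cleared as well */
	else
		rax &= ~mask;   /* inb/inw: upper bits are preserved */

	return rax | (val & mask);
}
```

For example, starting from an all-ones RAX, a one-byte read of 0x12 leaves
the upper 56 bits set, while a four-byte read of 0x12345678 leaves only
the low 32 bits populated.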
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [PATCH v2 1/4] x86/tdx: Use PFN directly for mapping guest private memory
From: Kiryl Shutsemau @ 2026-05-22 16:31 UTC (permalink / raw)
To: Yan Zhao
Cc: dave.hansen, pbonzini, seanjc, tglx, mingo, bp, x86, linux-kernel,
kvm, linux-coco, kai.huang, rick.p.edgecombe, yilun.xu,
vannapurve, ackerleytng, sagis, binbin.wu, xiaoyao.li,
isaku.yamahata
In-Reply-To: <20260430014929.24210-1-yan.y.zhao@intel.com>
On Thu, Apr 30, 2026 at 09:49:29AM +0800, Yan Zhao wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Remove struct page assumptions/constraints in the SEAMCALL wrapper APIs for
> mapping guest private memory and have them take PFN directly.
>
> Having core TDX make assumptions that guest private memory must be backed
> by struct page (and/or folio) will create subtle dependencies on how
> KVM/guest_memfd allocates/manages memory (e.g., whether it uses memory
> allocated from core MM, if the memory is refcounted, or if the folio is
> split) that are easily avoided. [1].
>
> KVM's MMUs work with PFNs. This is very much an intentional design choice.
> It ensures that the KVM MMUs remain flexible and are not too tied to the
> regular CPU MMUs and the kernel code around them. Using 'struct page' for
> TDX guest memory is not a good fit anywhere near the KVM MMU code [2].
>
> Use "kvm_pfn_t pfn" for type safety. Using this KVM type is appropriate
> since APIs tdh_mem_page_add() and tdh_mem_page_aug() are exported to KVM
> only.
>
> [ Yan: Replace "u64 pfn" with "kvm_pfn_t pfn" ]
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Link: https://lore.kernel.org/all/aWgyhmTJphGQqO0Y@google.com [1]
> Link: https://lore.kernel.org/all/ac7V0g2q2hN3dU5u@google.com [2]
Acked-by: Kiryl Shutsemau <kas@kernel.org>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Kiryl Shutsemau @ 2026-05-22 16:39 UTC (permalink / raw)
To: Yan Zhao
Cc: dave.hansen, pbonzini, seanjc, tglx, mingo, bp, x86, linux-kernel,
kvm, linux-coco, kai.huang, rick.p.edgecombe, yilun.xu,
vannapurve, ackerleytng, sagis, binbin.wu, xiaoyao.li,
isaku.yamahata
In-Reply-To: <20260430014948.24226-1-yan.y.zhao@intel.com>
On Thu, Apr 30, 2026 at 09:49:48AM +0800, Yan Zhao wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Remove struct page assumptions/constraints in APIs for unmapping guest
> private memory and have them take physical address directly.
>
> Having core TDX make assumptions that guest private memory must be backed
> by struct page (and/or folio) will create subtle dependencies on how
> KVM/guest_memfd allocates/manages memory (e.g., whether it uses memory
> allocated from core MM, if the memory is refcounted, or if the folio is
> split) that are easily avoided. [1].
>
> KVM's MMUs work with PFNs. This is very much an intentional design choice.
> It ensures that the KVM MMUs remain flexible and are not too tightly tied
> to the regular CPU MMUs and the kernel code around them. Using
> "struct page" for TDX guest memory is not a good fit anywhere near the KVM
> MMU code [2].
>
> Therefore, for unmapping guest private memory: export
> tdx_quirk_reset_paddr() for direct KVM invocation, and convert the SEAMCALL
> wrapper API tdh_phymem_page_wbinvd_hkid() to take PFN as input (thus
> updating mk_keyed_paddr() and tdh_phymem_page_wbinvd_tdr()).
>
> Intentionally have KVM pass PAGE_SIZE (rather than KVM_HPAGE_SIZE(level))
> to tdx_quirk_reset_paddr() in tdx_sept_remove_private_spte() to avoid
> mixing in huge page changes. The KVM_BUG_ON() check for !PG_LEVEL_4K in
> tdx_sept_remove_private_spte() justifies using PAGE_SIZE.
>
> Do not convert tdx_reclaim_page() to use PFN as input since it currently
> does not remove guest private memory.
>
> Use "kvm_pfn_t pfn" for type safety. Using this KVM type is appropriate
> since APIs tdh_phymem_page_wbinvd_hkid() and tdx_quirk_reset_paddr() are
> exported to KVM only.
>
> [Yan: Use kvm_pfn_t,exclude tdx_reclaim_page(),use tdx_quirk_reset_paddr()]
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Link: https://lore.kernel.org/all/aWgyhmTJphGQqO0Y@google.com [1]
> Link: https://lore.kernel.org/all/ac7V0g2q2hN3dU5u@google.com [2]
Acked-by: Kiryl Shutsemau <kas@kernel.org>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [PATCH v2 3/4] x86/tdx: Drop exported function tdx_quirk_reset_page()
From: Kiryl Shutsemau @ 2026-05-22 16:39 UTC (permalink / raw)
To: Yan Zhao
Cc: dave.hansen, pbonzini, seanjc, tglx, mingo, bp, x86, linux-kernel,
kvm, linux-coco, kai.huang, rick.p.edgecombe, yilun.xu,
vannapurve, ackerleytng, sagis, binbin.wu, xiaoyao.li,
isaku.yamahata
In-Reply-To: <20260430015001.24242-1-yan.y.zhao@intel.com>
On Thu, Apr 30, 2026 at 09:50:01AM +0800, Yan Zhao wrote:
> KVM invokes tdx_quirk_reset_page() to reset TDX control pages (including
> S-EPT pages, TDR page, etc.), as all those pages are allocated by KVM TDX
> and thus always have struct page.
>
> However, it's also reasonable for KVM to reset those TDX control pages via
> tdx_quirk_reset_paddr() directly, eliminating the need to export two
> parallel APIs. Keeping tdx_quirk_reset_page() as a one-line helper in the
> header file is also unnecessary.
>
> No functional change intended.
>
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Acked-by: Kiryl Shutsemau <kas@kernel.org>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [PATCH v2 4/4] x86/virt/tdx: Move mk_keyed_paddr() to tdx.c due to no external users
From: Kiryl Shutsemau @ 2026-05-22 16:41 UTC (permalink / raw)
To: Yan Zhao
Cc: dave.hansen, pbonzini, seanjc, tglx, mingo, bp, x86, linux-kernel,
kvm, linux-coco, kai.huang, rick.p.edgecombe, yilun.xu,
vannapurve, ackerleytng, sagis, binbin.wu, xiaoyao.li,
isaku.yamahata
In-Reply-To: <20260430015014.24261-1-yan.y.zhao@intel.com>
On Thu, Apr 30, 2026 at 09:50:14AM +0800, Yan Zhao wrote:
> Move mk_keyed_paddr() from tdx.h to tdx.c to avoid unnecessary header
> inclusion and improve encapsulation since there are no users outside of
> tdx.c.
>
> No functional change intended.
Add a new line before SoB.
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Otherwise:
Acked-by: Kiryl Shutsemau <kas@kernel.org>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [PATCH] x86/tdx: Fix zero-extension for CPUID emulation
From: Kiryl Shutsemau @ 2026-05-22 16:54 UTC (permalink / raw)
To: Dave Hansen
Cc: Edgecombe, Rick P, linux-coco@lists.linux.dev, clopez@suse.de,
x86@kernel.org, ak@linux.intel.com, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com, mingo@redhat.com,
linux-kernel@vger.kernel.org, Luck, Tony, tglx@kernel.org,
stable@vger.kernel.org, kvm@vger.kernel.org
In-Reply-To: <7f7b8bfd-f39e-417c-991f-d224d58cb52a@intel.com>
On Tue, May 12, 2026 at 03:14:54PM -0700, Dave Hansen wrote:
> On 5/12/26 14:48, Edgecombe, Rick P wrote:
> >> - regs->ax = args.r12;
> >> - regs->bx = args.r13;
> >> - regs->cx = args.r14;
> >> - regs->dx = args.r15;
> >> + regs->ax = lower_32_bits(args.r12);
> >> + regs->bx = lower_32_bits(args.r13);
> >> + regs->cx = lower_32_bits(args.r14);
> >> + regs->dx = lower_32_bits(args.r15);
> >>
> > Can you explain the impact here? Why should the guest fixup what the VMM
> > emulates?
>
> Oh boy.
>
> args.r12-15 come from the VMM, right? So the VMM can put whatever it
> wants in there.
>
> CPUID (the instruction) is defined to fill in eax/ebx/ecx/edx. Those are
> 32-bit registers so the normal register rules apply: "32-bit operands
> generate a 32-bit result, zero-extended to a 64-bit result in the
> destination general-purpose register."
>
> So a properly-behaving CPUID implementation will always end up with the
> top 32 bits empty on the four CPUID registers after a CPUID is executed.
>
> The VMM here obviously might be naughty and might put gunk in
> args.r12/r13/r14/r15 that gets copied to ptregs->ax/bx/cx/dx which are
> 'unsigned long' on 64-bit.
>
> The end result is that a TDX guest can use CPUID and end up having bits
> set in rax/rbx/rcx/rdx that are architecturally impossible. This patch
> is effectively fixing up the VMM naughtiness before the guest CPUID
> instance can see it.
>
> Does anybody disagree with any of that?
Not really.
But note that the exposure is minimal as we do not issue hypercalls to
VMM for anything outside of hypervisor range. I am not sure stable@ is
justified, but worth fixing.
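For reference, a standalone sketch of the truncation being discussed.
lower_32_bits_sketch() is a local stand-in for the kernel's
lower_32_bits() helper, and cpuid_sanitize() is a hypothetical name used
only for this illustration.

```c
#include <stdint.h>

/* Local stand-in for the kernel's lower_32_bits(). */
static inline uint32_t lower_32_bits_sketch(uint64_t n)
{
	return (uint32_t)n;
}

/*
 * Hypothetical: turn a 64-bit value supplied by the VMM into what a
 * real CPUID would leave in a 64-bit register, i.e. the low 32 bits
 * zero-extended.
 */
static uint64_t cpuid_sanitize(uint64_t from_vmm)
{
	return lower_32_bits_sketch(from_vmm);
}
```

Any bits the VMM sets above bit 31 are dropped, so the guest only ever
sees architecturally possible values.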
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [PATCH v13 07/22] KVM: selftests: Introduce structures for TDX guest boot parameters
From: Yosry Ahmed @ 2026-05-22 17:43 UTC (permalink / raw)
To: Lisa Wang
Cc: Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton,
Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-7-6983ae4c3a4d@google.com>
> +static void __attribute__((used)) common(void)
> +{
> + OFFSET(TD_BOOT_PARAMETERS_CR0, td_boot_parameters, cr0);
> + OFFSET(TD_BOOT_PARAMETERS_CR3, td_boot_parameters, cr3);
> + OFFSET(TD_BOOT_PARAMETERS_CR4, td_boot_parameters, cr4);
> + OFFSET(TD_BOOT_PARAMETERS_GDT, td_boot_parameters, gdtr);
> + OFFSET(TD_BOOT_PARAMETERS_IDT, td_boot_parameters, idtr);
> + OFFSET(TD_BOOT_PARAMETERS_PER_VCPU, td_boot_parameters, per_vcpu);
> + OFFSET(TD_PER_VCPU_PARAMETERS_ESP_GVA, td_per_vcpu_parameters, esp_gva);
> + OFFSET(TD_PER_VCPU_PARAMETERS_GUEST_CODE, td_per_vcpu_parameters,
> + guest_code);
> + DEFINE(SIZEOF_TD_PER_VCPU_PARAMETERS,
> + sizeof(struct td_per_vcpu_parameters));
> +}
This is neat.
Sean, is this the preferred way to expose offsets to asm files (or asm
code blocks) -- as opposed to say using .equ [*]?
If yes, I can rework my nVMX GPR fixes to use the same approach for
register offsets. I wonder if the non-TDX part of this patch (i.e.
Makefile stuff) can be split, then patch 6 and the Makefile stuff can
land independently and allow development on top.
I can also split them out and include them in the next version of my
series, then whichever series lands first will land the offsets
support.
WDYT?
[*]https://lore.kernel.org/kvm/20260518202514.2037078-2-yosry@kernel.org/
^ permalink raw reply
* [PATCH 2/3] KVM: guest_memfd: Fix possible signed integer overflow
From: Ackerley Tng via B4 Relay @ 2026-05-22 20:45 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Kiryl Shutsemau, Rick Edgecombe, Vishal Annapurve, Yan Zhao,
Michael Roth, Isaku Yamahata, Chao Peng, Xiaoyao Li, Zongyao Chen
Cc: kvm, linux-kernel, linux-coco, Yu Zhang, Fuad Tabba, Ackerley Tng
In-Reply-To: <20260522-fix-sev-gmem-post-populate-v1-0-9fc8d6437b65@google.com>
From: Sean Christopherson <seanjc@google.com>
The caller, kvm_set_memory_region(), checks for overflow of the u64
guest_memfd_offset. When guest_memfd_offset is passed to kvm_gmem_bind(),
it is cast into a signed 64-bit integer.
Hence, a large 64-bit offset could become a negative loff_t, which could
defeat the overflow checks.
Make kvm_gmem_bind() take u64 instead of loff_t to consistently deal with
unsigned values to avoid this issue.
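A standalone sketch of the signedness problem, for illustration only.
loff_sketch_t, as_signed_offset() and range_fits() are hypothetical names
and are not part of this patch; range_fits() shows one way to write a
range check on unsigned values that cannot wrap, which is not necessarily
how the kernel does it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for loff_t, assumed here to be a signed 64-bit type. */
typedef int64_t loff_sketch_t;

/*
 * What a callee sees when a u64 offset is passed through a signed
 * parameter. The conversion is implementation-defined in ISO C before
 * C23; GCC and Clang wrap modulo 2^64, so values >= 2^63 turn negative.
 */
static loff_sketch_t as_signed_offset(uint64_t offset)
{
	return (loff_sketch_t)offset;
}

/*
 * Hypothetical unsigned check for offset + size <= limit, written so
 * that no intermediate result can wrap around.
 */
static bool range_fits(uint64_t offset, uint64_t size, uint64_t limit)
{
	return offset <= limit && size <= limit - offset;
}
```

Keeping the offset unsigned from the caller down to the check means only
one kind of wraparound has to be reasoned about.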
Fixes: a7800aa80ea4d ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Use size_t for size instead of u64]
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
virt/kvm/guest_memfd.c | 7 +++----
virt/kvm/kvm_mm.h | 2 +-
2 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 07d8db344872b..d203135969d13 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -640,9 +640,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
}
int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
- unsigned int fd, loff_t offset)
+ unsigned int fd, u64 offset)
{
- loff_t size = slot->npages << PAGE_SHIFT;
+ size_t size = slot->npages << PAGE_SHIFT;
unsigned long start, end;
struct gmem_file *f;
struct inode *inode;
@@ -664,8 +664,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
inode = file_inode(file);
- if (offset < 0 || !PAGE_ALIGNED(offset) ||
- offset + size > i_size_read(inode))
+ if (!PAGE_ALIGNED(offset) || offset + size > i_size_read(inode))
goto err;
filemap_invalidate_lock(inode->i_mapping);
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 9fcc5d5b7f8d0..23813d74ce709 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -72,7 +72,7 @@ int kvm_gmem_init(struct module *module);
void kvm_gmem_exit(void);
int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args);
int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
- unsigned int fd, loff_t offset);
+ unsigned int fd, u64 offset);
void kvm_gmem_unbind(struct kvm_memory_slot *slot);
#else
static inline int kvm_gmem_init(struct module *module)
--
2.54.0.794.g4f17f83d09-goog
^ permalink raw reply related
* [PATCH 0/3] guest_memfd fixes for bind and populate
From: Ackerley Tng via B4 Relay @ 2026-05-22 20:45 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Kiryl Shutsemau, Rick Edgecombe, Vishal Annapurve, Yan Zhao,
Michael Roth, Isaku Yamahata, Chao Peng, Xiaoyao Li, Zongyao Chen
Cc: kvm, linux-kernel, linux-coco, Yu Zhang, Fuad Tabba, Ackerley Tng
This series is a group of fixes for the bind and populate flows for
guest_memfd, and fixes some issues reported by Sashiko after reviewing the
guest_memfd in-place conversions series [1] and another fixup series Sean
posted [3].
Sashiko pointed out
+ Possible write to read-only page [1]
=> Fixed in patch 1
+ Signed integer overflow in kvm_gmem_bind() twice: [2][3]
=> Fixed in patch 2
+ Unchecked xa_store_range() [3]
=> Fixed in patch 3
[1] https://lore.kernel.org/all/CA+EHjTwrygfMrZZSw4y7-ry8fidW2x0C7iuF2Q=dnPNHUmNtUg@mail.gmail.com/
[2] https://lore.kernel.org/all/CA+EHjTxcadguOfOo7RpJVtAzcY5JAFZTbrAT_wcN6akMi8gCUg@mail.gmail.com/
[3] https://lore.kernel.org/all/20260522180530.EE9101F00A3E@smtp.kernel.org/
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
Ackerley Tng (1):
KVM: guest_memfd: Handle errors from xa_store_range() when binding
Sean Christopherson (2):
KVM: guest_memfd: Use write permissions when GUP-ing source pages
KVM: guest_memfd: Fix possible signed integer overflow
arch/x86/kvm/svm/sev.c | 1 +
arch/x86/kvm/vmx/tdx.c | 2 +-
include/linux/kvm_host.h | 3 ++-
virt/kvm/guest_memfd.c | 18 ++++++++++--------
virt/kvm/kvm_mm.h | 2 +-
5 files changed, 15 insertions(+), 11 deletions(-)
---
base-commit: b7fbe9a1bf9ee6c967ef77d366ca58c35fcf1887
change-id: 20260522-fix-sev-gmem-post-populate-a36bef7f0698
Best regards,
--
Ackerley Tng <ackerleytng@google.com>
* [PATCH 1/3] KVM: guest_memfd: Use write permissions when GUP-ing source pages
From: Ackerley Tng via B4 Relay @ 2026-05-22 20:45 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Kiryl Shutsemau, Rick Edgecombe, Vishal Annapurve, Yan Zhao,
Michael Roth, Isaku Yamahata, Chao Peng, Xiaoyao Li, Zongyao Chen
Cc: kvm, linux-kernel, linux-coco, Yu Zhang, Fuad Tabba, Ackerley Tng
In-Reply-To: <20260522-fix-sev-gmem-post-populate-v1-0-9fc8d6437b65@google.com>
From: Sean Christopherson <seanjc@google.com>
sev_gmem_post_populate() may write to the source page if there was an error
while performing SNP_LAUNCH_UPDATE.
Since GUP requested only reads, there is a chance sev_gmem_post_populate()
could be writing to some read-only page.
sev_gmem_post_populate() will only ever write the source page if the type
of page being LAUNCH_UPDATEd is a CPUID page. Hence, request a writable
page only when loading the CPUID page.
Since TDX never writes to the source page, always pass false to
kvm_gmem_populate().
With this, even if a read-only mapping or the global zero page is provided
as the source page, GUP will do a copy-on-write, making the page writable
before the write happens in sev_gmem_post_populate().
Fixes: 2a62345b30529 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
arch/x86/kvm/svm/sev.c | 1 +
arch/x86/kvm/vmx/tdx.c | 2 +-
include/linux/kvm_host.h | 3 ++-
virt/kvm/guest_memfd.c | 6 ++++--
4 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 940b97d4a8523..2f254c447923e 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2469,6 +2469,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
sev_populate_args.type = params.type;
count = kvm_gmem_populate(kvm, params.gfn_start, src, npages,
+ params.type == KVM_SEV_SNP_PAGE_TYPE_CPUID,
sev_gmem_post_populate, &sev_populate_args);
if (count < 0) {
argp->error = sev_populate_args.fw_error;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b8c3d3d8bbfe5..00dcfcbc47f68 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3185,7 +3185,7 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
};
gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
u64_to_user_ptr(region.source_addr),
- 1, tdx_gmem_post_populate, &arg);
+ 1, false, tdx_gmem_post_populate, &arg);
if (gmem_ret < 0) {
ret = gmem_ret;
break;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4c14aee1fb063..2c5ad9a6d5ce8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2596,7 +2596,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
struct page *page, void *opaque);
-long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
+long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
+ long npages, bool may_writeback_src,
kvm_gmem_populate_cb post_populate, void *opaque);
#endif
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 69c9d6d546b28..07d8db344872b 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -858,7 +858,8 @@ static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
return ret;
}
-long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
+long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
+ long npages, bool may_writeback_src,
kvm_gmem_populate_cb post_populate, void *opaque)
{
struct kvm_memory_slot *slot;
@@ -892,8 +893,9 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
if (src) {
unsigned long uaddr = (unsigned long)src + i * PAGE_SIZE;
+ unsigned int flags = may_writeback_src ? FOLL_WRITE : 0;
- ret = get_user_pages_fast(uaddr, 1, 0, &src_page);
+ ret = get_user_pages_fast(uaddr, 1, flags, &src_page);
if (ret < 0)
break;
if (ret != 1) {
--
2.54.0.794.g4f17f83d09-goog
* [PATCH 3/3] KVM: guest_memfd: Handle errors from xa_store_range() when binding
From: Ackerley Tng via B4 Relay @ 2026-05-22 20:45 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Kiryl Shutsemau, Rick Edgecombe, Vishal Annapurve, Yan Zhao,
Michael Roth, Isaku Yamahata, Chao Peng, Xiaoyao Li, Zongyao Chen
Cc: kvm, linux-kernel, linux-coco, Yu Zhang, Fuad Tabba, Ackerley Tng
In-Reply-To: <20260522-fix-sev-gmem-post-populate-v1-0-9fc8d6437b65@google.com>
From: Ackerley Tng <ackerleytng@google.com>
Unhandled errors from xa_store_range() mean kvm_gmem_bind() might falsely
report success, leading to false assumptions in guest_memfd's lifecycle
later.
Handle these errors by checking the result and returning the error to userspace.
Fixes: a7800aa80ea4d ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
virt/kvm/guest_memfd.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index d203135969d13..104f0f3d6a0b3 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -648,6 +648,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
struct inode *inode;
struct file *file;
int r = -EINVAL;
+ void *result;
BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
@@ -688,7 +689,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
if (kvm_gmem_supports_mmap(inode))
slot->flags |= KVM_MEMSLOT_GMEM_ONLY;
- xa_store_range(&f->bindings, start, end - 1, slot, GFP_KERNEL);
+ result = xa_store_range(&f->bindings, start, end - 1, slot, GFP_KERNEL);
filemap_invalidate_unlock(inode->i_mapping);
/*
@@ -696,7 +697,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
* not the other way 'round. Active bindings are invalidated if the
* file is closed before memslots are destroyed.
*/
- r = 0;
+ r = xa_is_err(result) ? xa_err(result) : 0;
err:
fput(file);
return r;
--
2.54.0.794.g4f17f83d09-goog
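Note for readers unfamiliar with how the XArray reports failures through its return pointer: the sketch below is a simplified userspace model, not the kernel implementation. It encodes a negative errno into a tagged pointer value and decodes it again. The bit layout (errno shifted left by two with the low bits set to 2) and the 4095 errno limit reflect my understanding of include/linux/xarray.h and should be treated as assumptions. All sketch_* names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed upper bound on errno values, mirroring the kernel's MAX_ERRNO. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno as a tagged "internal" entry (model only). */
static void *sketch_mk_err(int err)
{
	return (void *)((((uintptr_t)(intptr_t)err) << 2) | 2);
}

/* True if the entry is one of the encoded error values. */
static bool sketch_is_err(const void *entry)
{
	uintptr_t v = (uintptr_t)entry;

	return (v & 3) == 2 &&
	       v >= (uintptr_t)sketch_mk_err(-SKETCH_MAX_ERRNO);
}

/*
 * Decode the entry: a negative errno for error entries, 0 otherwise.
 * Relies on arithmetic right shift of negative values, which is
 * implementation-defined in C but what mainstream compilers do.
 */
static int sketch_err(const void *entry)
{
	if (sketch_is_err(entry))
		return (int)((intptr_t)entry >> 2);
	return 0;
}
```

In this model the decode step already yields 0 for a non-error entry, so a caller can assign its result directly to the return code.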
* Re: [PATCH v6 01/43] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
From: Ackerley Tng @ 2026-05-22 21:45 UTC (permalink / raw)
To: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-1-91ab5a8b19a4@google.com>
Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
writes:
>
> [...snip...]
>
> +static int kvm_gmem_init_inode(struct inode *inode, loff_t size, u64 flags)
> +{
>
> [...snip...]
>
> + filemap_invalidate_lock(inode->i_mapping);
> + r = mas_store_gfp(&mas, xa_mk_value(attrs), GFP_KERNEL);
Sashiko says using GFP_KERNEL with this attributes maple_tree could
allow a process creating a very fragmented maple tree to consume lots of
memory not charged to some memcg and proposed using GFP_KERNEL_ACCOUNT.
The problem with using GFP_KERNEL_ACCOUNT is that the maple tree nodes
are allocated from a shared kmem_cache maple_node_cache. Allocating the
maple tree nodes using GFP_KERNEL_ACCOUNT would mean that the node could
be reused by other maple trees unrelated to this process, and so the
nodes might long outlive the process using this guest_memfd, keeping the
memcg alive far longer than the VM.
For now I think it's okay to stick with GFP_KERNEL? Does anyone else
have suggestions on how to solve this?
> + filemap_invalidate_unlock(inode->i_mapping);
> +
> + return r;
> +}
>
> [...snip...]
>
* [PATCH v2 0/5] guest_memfd fixes for bind and populate
From: Ackerley Tng via B4 Relay @ 2026-05-22 22:46 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Kiryl Shutsemau, Rick Edgecombe, Vishal Annapurve, Yan Zhao,
Michael Roth, Isaku Yamahata, Chao Peng, Xiaoyao Li, Zongyao Chen
Cc: kvm, linux-kernel, linux-coco, Yu Zhang, Fuad Tabba, Ackerley Tng
This series is a group of fixes for the bind and populate flows for
guest_memfd, and fixes some issues reported by Sashiko after reviewing the
guest_memfd in-place conversions series [1] and another fixup series Sean
posted [3].
Changes in v2:
+ Add patches 4 and 5 to fix more issues, see below
+ Also update stub for kvm_gmem_bind()
Sashiko pointed out
+ Possible write to read-only page [1]
=> Fixed in patch 1
+ Signed integer overflow in kvm_gmem_bind() twice: [2][3]
=> Fixed in patch 2
+ Unchecked xa_store_range() [3]
=> Fixed in patch 3
+ Ordering issue with kmap_* and kunmap_* in sev_gmem_post_populate() [4]
=> Fixed in patch 4
+ Source page not marked dirty in sev_gmem_post_populate() [5]
=> Fixed in patch 5
[1] https://lore.kernel.org/all/CA+EHjTwrygfMrZZSw4y7-ry8fidW2x0C7iuF2Q=dnPNHUmNtUg@mail.gmail.com/
[2] https://lore.kernel.org/all/CA+EHjTxcadguOfOo7RpJVtAzcY5JAFZTbrAT_wcN6akMi8gCUg@mail.gmail.com/
[3] https://lore.kernel.org/all/20260522180530.EE9101F00A3E@smtp.kernel.org/
[4] https://sashiko.dev/#/patchset/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4%40google.com?part=21
[5] https://sashiko.dev/#/patchset/20260522-fix-sev-gmem-post-populate-v1-0-9fc8d6437b65%40google.com?part=1
v1: https://lore.kernel.org/r/20260522-fix-sev-gmem-post-populate-v1-0-9fc8d6437b65@google.com
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
Ackerley Tng (3):
KVM: guest_memfd: Handle errors from xa_store_range() when binding
KVM: SNP: Fix kunmap_local() unmapping order
KVM: SNP: Mark source page dirty in sev_gmem_post_populate
Sean Christopherson (2):
KVM: guest_memfd: Use write permissions when GUP-ing source pages
KVM: guest_memfd: Fix possible signed integer overflow
arch/x86/kvm/svm/sev.c | 6 ++++--
arch/x86/kvm/vmx/tdx.c | 2 +-
include/linux/kvm_host.h | 3 ++-
virt/kvm/guest_memfd.c | 24 ++++++++++++++++--------
virt/kvm/kvm_mm.h | 4 ++--
5 files changed, 25 insertions(+), 14 deletions(-)
---
base-commit: b7fbe9a1bf9ee6c967ef77d366ca58c35fcf1887
change-id: 20260522-fix-sev-gmem-post-populate-a36bef7f0698
Best regards,
--
Ackerley Tng <ackerleytng@google.com>