* Re: [PATCH v6 14/20] dma-direct: return struct page from dma_direct_alloc_from_pool()
From: Petr Tesarik @ 2026-06-09 13:12 UTC
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Mostafa Saleh, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86, stable, Michael Kelley
In-Reply-To: <20260604083959.1265923-15-aneesh.kumar@kernel.org>
On Thu, 4 Jun 2026 14:09:53 +0530
"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:
> Commit 5b138c534fda ("dma-direct: factor out a dma_direct_alloc_from_pool
> helper") changed dma_direct_alloc_from_pool() to return the CPU address
> from dma_alloc_from_pool(). That fits dma_direct_alloc(), but
> dma_direct_alloc_pages() also uses the helper and expects a struct page *.
>
> Fix this by making dma_direct_alloc_from_pool() return the struct page *
> again, and pass the CPU address back through an out-parameter for the
> dma_direct_alloc() caller.
>
> Fixes: 5b138c534fda ("dma-direct: factor out a dma_direct_alloc_from_pool helper")
> Cc: stable@vger.kernel.org
While I totally agree with the reasoning and the fix, it's interesting
that this bug has apparently been present in the kernel for 5+ years
without anybody hitting nasty memory corruption.
How can that be? Is the buggy code path never actually used in practice?
Does it hint at a missed opportunity to simplify the code?
Anyway, these thoughts are intended for a possible future
cleanup. For now, let's apply the fix as is, of course.
Petr T
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
> kernel/dma/direct.c | 21 ++++++++++++---------
> 1 file changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 4e446aa4130e..e0ab9ff3f1d6 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -157,24 +157,24 @@ static bool dma_direct_use_pool(struct device *dev, gfp_t gfp)
> return !gfpflags_allow_blocking(gfp) && !is_swiotlb_for_alloc(dev);
> }
>
> -static void *dma_direct_alloc_from_pool(struct device *dev, size_t size,
> - dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
> +static struct page *dma_direct_alloc_from_pool(struct device *dev, size_t size,
> + dma_addr_t *dma_handle, void **cpu_addr, gfp_t gfp,
> + unsigned long attrs)
> {
> struct page *page;
> u64 phys_limit;
> - void *ret;
>
> if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_DMA_COHERENT_POOL)))
> return NULL;
>
> gfp |= dma_direct_optimal_gfp_mask(dev, &phys_limit);
> - page = dma_alloc_from_pool(dev, size, &ret, gfp, attrs,
> + page = dma_alloc_from_pool(dev, size, cpu_addr, gfp, attrs,
> dma_coherent_ok);
> if (!page)
> return NULL;
> *dma_handle = phys_to_dma_direct(dev, page_to_phys(page),
> !!(attrs & DMA_ATTR_CC_SHARED));
> - return ret;
> + return page;
> }
>
> static void *dma_direct_alloc_no_mapping(struct device *dev, size_t size,
> @@ -270,9 +270,12 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> * the atomic pools instead if we aren't allowed block.
> */
> if ((remap || (attrs & DMA_ATTR_CC_SHARED)) &&
> - dma_direct_use_pool(dev, gfp))
> - return dma_direct_alloc_from_pool(dev, size, dma_handle,
> - gfp, attrs);
> + dma_direct_use_pool(dev, gfp)) {
> + page = dma_direct_alloc_from_pool(dev, size,
> + dma_handle, &cpu_addr,
> + gfp, attrs);
> + return page ? cpu_addr : NULL;
> + }
>
> if (is_swiotlb_for_alloc(dev)) {
> page = dma_direct_alloc_swiotlb(dev, size, attrs);
> @@ -445,7 +448,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>
> if ((attrs & DMA_ATTR_CC_SHARED) && dma_direct_use_pool(dev, gfp))
> return dma_direct_alloc_from_pool(dev, size, dma_handle,
> - gfp, attrs);
> + &cpu_addr, gfp, attrs);
>
> if (is_swiotlb_for_alloc(dev)) {
> page = dma_direct_alloc_swiotlb(dev, size, attrs);
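Illustration (not part of the patch): the calling convention described in the
commit message can be sketched in user space as follows. This is a minimal
sketch under stated assumptions, not kernel code. The names fake_page,
pool_alloc_page(), alloc_cpu_addr() and alloc_page_desc() are hypothetical
stand-ins. Only the shape is taken from the patch: the helper returns the page
and passes the CPU address back through an out-parameter.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS 2

/* Hypothetical stand-in for struct page; not the kernel type. */
struct fake_page {
	unsigned int index;
};

static struct fake_page pool_pages[POOL_SLOTS];
static char pool_mem[POOL_SLOTS][64];	/* memory the "pages" describe */
static unsigned int pool_next;

/*
 * Return the page descriptor (NULL on failure). The CPU address is written
 * to *cpu_addr only on success, so callers must test the return value.
 */
static struct fake_page *pool_alloc_page(void **cpu_addr)
{
	struct fake_page *page;

	assert(cpu_addr != NULL);
	if (pool_next >= POOL_SLOTS)
		return NULL;	/* pool exhausted, *cpu_addr left untouched */

	page = &pool_pages[pool_next];
	page->index = pool_next;
	*cpu_addr = pool_mem[pool_next];
	pool_next++;
	return page;
}

/* Caller that wants the CPU address, in the role of dma_direct_alloc(). */
static void *alloc_cpu_addr(void)
{
	void *cpu_addr;
	struct fake_page *page = pool_alloc_page(&cpu_addr);

	return page ? cpu_addr : NULL;
}

/* Caller that wants the page, in the role of dma_direct_alloc_pages(). */
static struct fake_page *alloc_page_desc(void)
{
	void *cpu_addr;	/* required by the helper, unused by this caller */

	return pool_alloc_page(&cpu_addr);
}
```

The descriptor array and the backing memory are deliberately separate objects
here, so mixing up the two pointer types produces a visibly wrong address.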
* [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
On PowerPC Book3S64, the MMU is selected at runtime, so macros like PMD_SHIFT are
effectively runtime variables in the Book3S64 code. The THP swap code uses these
macros, e.g. to size some of its array data structures based on PMD_ORDER.
This patch series makes that usage depend on the runtime value.
Sayali did some performance runs of this on Book3S64 with Radix, and it gives a
40-50% performance improvement. We also plan to run it with Hash and will
update the results soon.
Note that this patch series is based on linux-next (next-20260608).
Ritesh Harjani (IBM) (4):
include/linux/swap.h: Remove unused leftovers
mm, swap: make SWAPFILE_CLUSTER runtime
mm, swap: make SWAP_NR_ORDERS runtime
powerpc: Kconfig: Enable THP_SWAP on Book3S64
arch/powerpc/platforms/Kconfig.cputype | 1 +
include/linux/swap.h | 17 +---
mm/swap.h | 5 +-
mm/swap_table.h | 6 +-
mm/swapfile.c | 132 ++++++++++++++++++-------
5 files changed, 106 insertions(+), 55 deletions(-)
--
2.39.5
* [RFC 1/4] include/linux/swap.h: Remove unused leftovers
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
In-Reply-To: <cover.1781000840.git.ritesh.list@gmail.com>
This removes unused leftovers, most of which are forward structure
declarations. It also removes the SWAP_BATCH macro, which isn't used
anywhere in the code.
These were found during manual code review.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
include/linux/swap.h | 7 -------
1 file changed, 7 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8f0f68e245ba..46c25523d7b8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -16,10 +16,6 @@
#include <uapi/linux/mempolicy.h>
#include <asm/page.h>
-struct notifier_block;
-
-struct bio;
-
#define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */
#define SWAP_FLAG_PRIO_MASK 0x7fff
#define SWAP_FLAG_DISCARD 0x10000 /* enable discard for swap */
@@ -29,7 +25,6 @@ struct bio;
#define SWAP_FLAGS_VALID (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
SWAP_FLAG_DISCARD_PAGES)
-#define SWAP_BATCH 64
static inline int current_is_kswapd(void)
{
@@ -175,7 +170,6 @@ static inline void mm_account_reclaimed_pages(unsigned long pages)
struct address_space;
struct sysinfo;
-struct writeback_control;
struct zone;
/*
@@ -442,7 +436,6 @@ extern sector_t swapdev_block(int, pgoff_t);
extern int __swap_count(swp_entry_t entry);
extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
extern int swp_swapcount(swp_entry_t entry);
-struct backing_dev_info;
extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
sector_t swap_folio_sector(struct folio *folio);
--
2.39.5
* [RFC 2/4] mm, swap: make SWAPFILE_CLUSTER runtime
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
In-Reply-To: <cover.1781000840.git.ritesh.list@gmail.com>
This makes SWAPFILE_CLUSTER a runtime value. On architectures like powerpc
book3s64, HPAGE_PMD_NR is derived at runtime depending upon which MMU is
chosen.
Hence this patch initializes SWAPFILE_CLUSTER at runtime and also
modifies swap_table and swap_memcg_table, which earlier used this
macro to define the number of table entries.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
mm/swap.h | 5 +++--
mm/swap_table.h | 6 ++----
mm/swapfile.c | 27 ++++++++++++++++++++++-----
3 files changed, 27 insertions(+), 11 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index 77d2d14eda42..956879a69ddd 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -26,11 +26,12 @@ extern int page_cluster;
#define SWAP_TABLE_HAS_ZEROFLAG ((BITS_PER_LONG - SWAP_CACHE_PFN_MARK_BITS - \
SWAP_CACHE_PFN_BITS) > SWAP_COUNT_MIN_BITS)
+extern unsigned int swap_slots_in_cluster __read_mostly;
+#define SWAPFILE_CLUSTER swap_slots_in_cluster
+
#ifdef CONFIG_THP_SWAP
-#define SWAPFILE_CLUSTER HPAGE_PMD_NR
#define swap_entry_order(order) (order)
#else
-#define SWAPFILE_CLUSTER 256
#define swap_entry_order(order) 0
#endif
diff --git a/mm/swap_table.h b/mm/swap_table.h
index e6613e62f8d0..90e2a7852300 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -8,16 +8,14 @@
/* A typical flat array in each cluster as swap table */
struct swap_table {
- atomic_long_t entries[SWAPFILE_CLUSTER];
+ DECLARE_FLEX_ARRAY(atomic_long_t, entries);
};
/* For storing memcg private id */
struct swap_memcg_table {
- unsigned short id[SWAPFILE_CLUSTER];
+ DECLARE_FLEX_ARRAY(unsigned short, id);
};
-#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
-
/*
* A swap table entry represents the status of a swap slot on a swap
* (physical or virtual) device. The swap table in each cluster is a
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 78b49b0658ad..016a5aa0cb93 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -129,6 +129,17 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
.lock = INIT_LOCAL_LOCK(),
};
+unsigned int swap_slots_in_cluster __read_mostly;
+bool swap_table_use_page __read_mostly;
+
+static unsigned int generic_swap_slots_in_clusters(void)
+{
+ if (IS_ENABLED(CONFIG_THP_SWAP))
+ return HPAGE_PMD_NR;
+ else
+ return 256;
+}
+
/* May return NULL on invalid type, caller must check for NULL return */
static struct swap_info_struct *swap_type_to_info(int type)
{
@@ -437,7 +448,7 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
return;
rcu_assign_pointer(ci->table, NULL);
- if (!SWP_TABLE_USE_PAGE) {
+ if (!swap_table_use_page) {
kmem_cache_free(swap_table_cachep, table);
return;
}
@@ -456,7 +467,7 @@ static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp)
if (rcu_access_pointer(ci->table))
return 0;
- if (SWP_TABLE_USE_PAGE) {
+ if (swap_table_use_page) {
folio = folio_alloc(gfp | __GFP_ZERO, 0);
if (folio)
table = folio_address(folio);
@@ -471,7 +482,8 @@ static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp)
#ifdef CONFIG_MEMCG
if (!mem_cgroup_disabled()) {
VM_WARN_ON_ONCE(ci->memcg_table);
- ci->memcg_table = kzalloc_obj(*ci->memcg_table, gfp);
+ ci->memcg_table = kzalloc_flex(*ci->memcg_table, id,
+ SWAPFILE_CLUSTER, gfp);
if (!ci->memcg_table) {
swap_cluster_free_table(ci);
return -ENOMEM;
@@ -3912,14 +3924,19 @@ static int __init swapfile_init(void)
{
swapfile_maximum_size = arch_max_swapfile_size();
+ swap_slots_in_cluster = generic_swap_slots_in_clusters();
+ swap_table_use_page =
+ (swap_slots_in_cluster * sizeof(atomic_long_t) == PAGE_SIZE);
+
/*
* Once a cluster is freed, it's swap table content is read
* only, and all swap cache readers (swap_cache_*) verifies
* the content before use. So it's safe to use RCU slab here.
*/
- if (!SWP_TABLE_USE_PAGE)
+ if (!swap_table_use_page)
swap_table_cachep = kmem_cache_create("swap_table",
- sizeof(struct swap_table),
+ struct_size_t(struct swap_table, entries,
+ SWAPFILE_CLUSTER),
0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
#ifdef CONFIG_MIGRATION
--
2.39.5
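Illustration (not part of the patch): the move from a fixed-size array to a
flexible array sized at runtime can be sketched in user space as below. This is
a minimal sketch under stated assumptions, not the kernel implementation.
struct id_table, its nr field, id_table_size(), id_table_alloc() and
table_fits_exactly_one_page() are hypothetical names. The size helper only
imitates the overflow-checked arithmetic that struct_size_t() and kzalloc_flex()
provide in the patch, and the last function mirrors the runtime test that
replaces the compile-time SWP_TABLE_USE_PAGE check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical user-space analogue of struct swap_memcg_table. */
struct id_table {
	size_t nr;		/* entry count, added for this sketch only */
	unsigned short id[];	/* C99 flexible array member */
};

/* Header plus nr entries, or SIZE_MAX if the multiplication would overflow. */
static size_t id_table_size(size_t nr)
{
	if (nr > (SIZE_MAX - sizeof(struct id_table)) / sizeof(unsigned short))
		return SIZE_MAX;
	return sizeof(struct id_table) + nr * sizeof(unsigned short);
}

/* Allocate a zeroed table whose length is known only at runtime. */
static struct id_table *id_table_alloc(size_t nr)
{
	size_t bytes = id_table_size(nr);
	struct id_table *t;

	if (bytes == SIZE_MAX)
		return NULL;
	t = calloc(1, bytes);
	if (t)
		t->nr = nr;
	return t;
}

/* Runtime version of "does the table occupy exactly one page?". */
static int table_fits_exactly_one_page(size_t slots, size_t entry_size,
				       size_t page_size)
{
	assert(entry_size != 0);
	return slots * entry_size == page_size;
}
```

For example, 512 slots of 8 bytes fill a 4096-byte page exactly, while 256
slots of 8 bytes do not, so the second case would take the slab-cache path.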
* [RFC 3/4] mm, swap: make SWAP_NR_ORDERS runtime
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
In-Reply-To: <cover.1781000840.git.ritesh.list@gmail.com>
SWAP_NR_ORDERS is currently a compile-time constant defined as PMD_ORDER
+ 1 when CONFIG_THP_SWAP=y, else 1.
This patch converts SWAP_NR_ORDERS and all the relevant code paths to
make it runtime dependent. This is needed for architectures like powerpc
book3s64, where PMD_ORDER is decided at runtime depending upon which MMU
is chosen (Radix / Hash).
One thing to note here: if any of the allocations required in the
swapfile_init() call (which is a subsys_initcall) fails, then we have no
option but to panic. This is in line with how memory allocation failures
in other subsys_initcall()s are handled.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
include/linux/swap.h | 10 ++---
mm/swapfile.c | 105 ++++++++++++++++++++++++++++++-------------
2 files changed, 78 insertions(+), 37 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46c25523d7b8..063ab7c4d4a5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -224,9 +224,9 @@ enum {
#define SWAP_ENTRY_INVALID 0
#ifdef CONFIG_THP_SWAP
-#define SWAP_NR_ORDERS (PMD_ORDER + 1)
+#define swap_nr_orders() ((unsigned int)(PMD_ORDER + 1))
#else
-#define SWAP_NR_ORDERS 1
+#define swap_nr_orders() (1U)
#endif
/*
@@ -234,7 +234,7 @@ enum {
* The purpose is to optimize SWAP throughput on these device.
*/
struct swap_sequential_cluster {
- unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
+ DECLARE_FLEX_ARRAY(unsigned int, next); /* Likely next allocation offset */
};
/*
@@ -250,9 +250,9 @@ struct swap_info_struct {
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct list_head free_clusters; /* free clusters list */
struct list_head full_clusters; /* full clusters list */
- struct list_head nonfull_clusters[SWAP_NR_ORDERS];
+ struct list_head *nonfull_clusters;
/* list of cluster that contains at least one free slot */
- struct list_head frag_clusters[SWAP_NR_ORDERS];
+ struct list_head *frag_clusters;
/* list of cluster that are fragmented or contented */
unsigned int pages; /* total of usable pages of swap */
atomic_long_t inuse_pages; /* number of those currently in use */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 016a5aa0cb93..0a78802528cf 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -118,16 +118,12 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
atomic_t nr_rotate_swap = ATOMIC_INIT(0);
struct percpu_swap_cluster {
- struct swap_info_struct *si[SWAP_NR_ORDERS];
- unsigned long offset[SWAP_NR_ORDERS];
+ struct swap_info_struct **si;
+ unsigned long *offset;
local_lock_t lock;
};
-static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
- .si = { NULL },
- .offset = { SWAP_ENTRY_INVALID },
- .lock = INIT_LOCAL_LOCK(),
-};
+static struct percpu_swap_cluster __percpu *percpu_swap_cluster;
unsigned int swap_slots_in_cluster __read_mostly;
bool swap_table_use_page __read_mostly;
@@ -545,7 +541,7 @@ swap_cluster_populate(struct swap_info_struct *si,
* Only cluster isolation from the allocator does table allocation.
* Swap allocator uses percpu clusters and holds the local lock.
*/
- lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
+ lockdep_assert_held(&this_cpu_ptr(percpu_swap_cluster)->lock);
if (!(si->flags & SWP_SOLIDSTATE))
lockdep_assert_held(&si->global_cluster_lock);
lockdep_assert_held(&ci->lock);
@@ -562,7 +558,7 @@ swap_cluster_populate(struct swap_info_struct *si,
spin_unlock(&ci->lock);
if (!(si->flags & SWP_SOLIDSTATE))
spin_unlock(&si->global_cluster_lock);
- local_unlock(&percpu_swap_cluster.lock);
+ local_unlock(&percpu_swap_cluster->lock);
ret = swap_cluster_alloc_table(ci, __GFP_HIGH | __GFP_NOMEMALLOC |
GFP_KERNEL);
@@ -575,7 +571,7 @@ swap_cluster_populate(struct swap_info_struct *si,
* could happen with ignoring the percpu cluster is fragmentation,
* which is acceptable since this fallback and race is rare.
*/
- local_lock(&percpu_swap_cluster.lock);
+ local_lock(&percpu_swap_cluster->lock);
if (!(si->flags & SWP_SOLIDSTATE))
spin_lock(&si->global_cluster_lock);
spin_lock(&ci->lock);
@@ -1016,8 +1012,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
relocate_cluster(si, ci);
swap_cluster_unlock(ci);
if (si->flags & SWP_SOLIDSTATE) {
- this_cpu_write(percpu_swap_cluster.offset[order], next);
- this_cpu_write(percpu_swap_cluster.si[order], si);
+ struct percpu_swap_cluster *pcp_sc = this_cpu_ptr(percpu_swap_cluster);
+
+ pcp_sc->offset[order] = next;
+ pcp_sc->si[order] = si;
} else {
si->global_cluster->next[order] = next;
}
@@ -1178,7 +1176,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
goto done;
/* Order 0 stealing from higher order */
- for (int o = 1; o < SWAP_NR_ORDERS; o++) {
+ for (int o = 1; o < swap_nr_orders(); o++) {
/*
* Clusters here have at least one usable slots and can't fail order 0
* allocation, but reclaim may drop si->lock and race with another user.
@@ -1376,13 +1374,14 @@ static bool swap_alloc_fast(struct folio *folio)
struct swap_cluster_info *ci;
struct swap_info_struct *si;
unsigned int offset;
+ struct percpu_swap_cluster *pcp_sc = this_cpu_ptr(percpu_swap_cluster);
/*
* Once allocated, swap_info_struct will never be completely freed,
* so checking it's liveness by get_swap_device_info is enough.
*/
- si = this_cpu_read(percpu_swap_cluster.si[order]);
- offset = this_cpu_read(percpu_swap_cluster.offset[order]);
+ si = pcp_sc->si[order];
+ offset = pcp_sc->offset[order];
if (!si || !offset || !get_swap_device_info(si))
return false;
@@ -1770,10 +1769,10 @@ int folio_alloc_swap(struct folio *folio)
}
again:
- local_lock(&percpu_swap_cluster.lock);
+ local_lock(&percpu_swap_cluster->lock);
if (!swap_alloc_fast(folio))
swap_alloc_slow(folio);
- local_unlock(&percpu_swap_cluster.lock);
+ local_unlock(&percpu_swap_cluster->lock);
if (!order && unlikely(!folio_test_swapcache(folio))) {
if (swap_sync_discard())
@@ -2166,6 +2165,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
unsigned long pcp_offset, offset = SWAP_ENTRY_INVALID;
struct swap_cluster_info *ci;
swp_entry_t entry = {0};
+ struct percpu_swap_cluster *pcp_sc;
if (!si)
goto fail;
@@ -2174,9 +2174,10 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
* Try the local cluster first if it matches the device. If
* not, try grab a new cluster and override local cluster.
*/
- local_lock(&percpu_swap_cluster.lock);
- pcp_si = this_cpu_read(percpu_swap_cluster.si[0]);
- pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
+ local_lock(&percpu_swap_cluster->lock);
+ pcp_sc = this_cpu_ptr(percpu_swap_cluster);
+ pcp_si = pcp_sc->si[0];
+ pcp_offset = pcp_sc->offset[0];
if (pcp_si == si && pcp_offset) {
ci = swap_cluster_lock(si, pcp_offset);
if (cluster_is_usable(ci, 0))
@@ -2186,7 +2187,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
}
if (!offset)
offset = cluster_alloc_swap_entry(si, NULL);
- local_unlock(&percpu_swap_cluster.lock);
+ local_unlock(&percpu_swap_cluster->lock);
if (offset)
entry = swp_entry(si->type, offset);
@@ -3029,6 +3030,16 @@ static void wait_for_allocation(struct swap_info_struct *si)
}
}
+static void free_swap_info_arrays(struct swap_info_struct *si)
+{
+ kfree(si->global_cluster);
+ si->global_cluster = NULL;
+ kfree(si->nonfull_clusters);
+ si->nonfull_clusters = NULL;
+ kfree(si->frag_clusters);
+ si->frag_clusters = NULL;
+}
+
static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
unsigned long maxpages)
{
@@ -3057,17 +3068,17 @@ static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
static void flush_percpu_swap_cluster(struct swap_info_struct *si)
{
int cpu, i;
- struct swap_info_struct **pcp_si;
+ struct percpu_swap_cluster *pcp_sc;
for_each_possible_cpu(cpu) {
- pcp_si = per_cpu_ptr(percpu_swap_cluster.si, cpu);
+ pcp_sc = per_cpu_ptr(percpu_swap_cluster, cpu);
/*
* Invalidate the percpu swap cluster cache, si->users
* is dead, so no new user will point to it, just flush
* any existing user.
*/
- for (i = 0; i < SWAP_NR_ORDERS; i++)
- cmpxchg(&pcp_si[i], si, NULL);
+ for (i = 0; i < swap_nr_orders(); i++)
+ cmpxchg(&pcp_sc->si[i], si, NULL);
}
}
@@ -3179,8 +3190,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
arch_swap_invalidate_area(p->type);
zswap_swapoff(p->type);
mutex_unlock(&swapon_mutex);
- kfree(p->global_cluster);
- p->global_cluster = NULL;
+ free_swap_info_arrays(p);
free_swap_cluster_info(cluster_info, maxpages);
inode = mapping->host;
@@ -3531,6 +3541,7 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
struct swap_cluster_info *cluster_info;
int err = -ENOMEM;
unsigned long i;
+ unsigned int nr_orders = swap_nr_orders();
cluster_info = kvzalloc_objs(*cluster_info, nr_clusters);
if (!cluster_info)
@@ -3539,11 +3550,19 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
for (i = 0; i < nr_clusters; i++)
spin_lock_init(&cluster_info[i].lock);
+ si->nonfull_clusters = kmalloc_objs(*si->nonfull_clusters, nr_orders);
+ if (!si->nonfull_clusters)
+ goto err;
+
+ si->frag_clusters = kmalloc_objs(*si->frag_clusters, nr_orders);
+ if (!si->frag_clusters)
+ goto err;
+
if (!(si->flags & SWP_SOLIDSTATE)) {
- si->global_cluster = kmalloc_obj(*si->global_cluster);
+ si->global_cluster = kmalloc_flex(*si->global_cluster, next, nr_orders);
if (!si->global_cluster)
goto err;
- for (i = 0; i < SWAP_NR_ORDERS; i++)
+ for (i = 0; i < nr_orders; i++)
si->global_cluster->next[i] = SWAP_ENTRY_INVALID;
spin_lock_init(&si->global_cluster_lock);
}
@@ -3579,7 +3598,7 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
INIT_LIST_HEAD(&si->full_clusters);
INIT_LIST_HEAD(&si->discard_clusters);
- for (i = 0; i < SWAP_NR_ORDERS; i++) {
+ for (i = 0; i < nr_orders; i++) {
INIT_LIST_HEAD(&si->nonfull_clusters[i]);
INIT_LIST_HEAD(&si->frag_clusters[i]);
}
@@ -3599,6 +3618,7 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
si->cluster_info = cluster_info;
return 0;
err:
+ free_swap_info_arrays(si);
free_swap_cluster_info(cluster_info, maxpages);
return err;
}
@@ -3807,8 +3827,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
bad_swap_unlock_inode:
inode_unlock(inode);
bad_swap:
- kfree(si->global_cluster);
- si->global_cluster = NULL;
+ free_swap_info_arrays(si);
inode = NULL;
destroy_swap_extents(si, swap_file);
free_swap_cluster_info(si->cluster_info, si->max);
@@ -3922,6 +3941,10 @@ void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
static int __init swapfile_init(void)
{
+ unsigned int nr_orders = swap_nr_orders();
+ struct percpu_swap_cluster *pcp_sc;
+ int cpu;
+
swapfile_maximum_size = arch_max_swapfile_size();
swap_slots_in_cluster = generic_swap_slots_in_clusters();
@@ -3939,6 +3962,24 @@ static int __init swapfile_init(void)
SWAPFILE_CLUSTER),
0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
+ percpu_swap_cluster = alloc_percpu(struct percpu_swap_cluster);
+ if (!percpu_swap_cluster)
+ panic("%s: alloc_percpu failed for percpu_swap_cluster\n", __func__);
+
+ for_each_possible_cpu(cpu) {
+ int node = cpu_to_mem(cpu);
+
+ pcp_sc = per_cpu_ptr(percpu_swap_cluster, cpu);
+ local_lock_init(&pcp_sc->lock);
+ pcp_sc->si = kcalloc_node(nr_orders, sizeof(*pcp_sc->si),
+ GFP_KERNEL, node);
+ pcp_sc->offset = kcalloc_node(nr_orders, sizeof(*pcp_sc->offset),
+ GFP_KERNEL, node);
+ if (!pcp_sc->si || !pcp_sc->offset)
+ panic("%s: per-CPU kcalloc failed for cpu:%d, node:%d\n",
+ __func__, cpu, node);
+ }
+
#ifdef CONFIG_MIGRATION
if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
swap_migration_ad_supported = true;
--
2.39.5
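Illustration (not part of the patch): replacing embedded per-order arrays with
pointers means every setup path needs a matching cleanup that is safe to call
after a partial failure. The sketch below shows that pattern in user space
under stated assumptions. struct list_node, struct dev_info,
dev_info_free_arrays() and dev_info_setup() are hypothetical names that only
mirror the roles of the list heads, swap_info_struct, free_swap_info_arrays()
and setup_swap_clusters_info() in the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal doubly linked list head, standing in for struct list_head. */
struct list_node {
	struct list_node *prev, *next;
};

/* Hypothetical analogue of the per-order list arrays in swap_info_struct. */
struct dev_info {
	struct list_node *nonfull;	/* one list head per order */
	struct list_node *frag;		/* one list head per order */
};

/* Safe on partially initialized or already freed state: free(NULL) is a no-op. */
static void dev_info_free_arrays(struct dev_info *di)
{
	free(di->nonfull);
	di->nonfull = NULL;
	free(di->frag);
	di->frag = NULL;
}

/* Allocate both arrays for a runtime order count; 0 on success, -1 on failure. */
static int dev_info_setup(struct dev_info *di, unsigned int nr_orders)
{
	unsigned int i;

	assert(nr_orders > 0);
	di->nonfull = calloc(nr_orders, sizeof(*di->nonfull));
	di->frag = calloc(nr_orders, sizeof(*di->frag));
	if (!di->nonfull || !di->frag) {
		dev_info_free_arrays(di);
		return -1;
	}

	/* An empty list head points at itself in both directions. */
	for (i = 0; i < nr_orders; i++) {
		di->nonfull[i].prev = di->nonfull[i].next = &di->nonfull[i];
		di->frag[i].prev = di->frag[i].next = &di->frag[i];
	}
	return 0;
}
```

Resetting the pointers to NULL in the free helper is what lets several error
paths call it without tracking which allocation failed.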
* [RFC 4/4] powerpc: Kconfig: Enable THP_SWAP on Book3S64
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
In-Reply-To: <cover.1781000840.git.ritesh.list@gmail.com>
This enables THP_SWAP support for Book3S64.
Performance testing of this patch series on Book3S64 with zram has shown
an improvement of around 40-50% with Radix. We will be doing some performance
testing on Hash too and will update the results soon.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
arch/powerpc/platforms/Kconfig.cputype | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index bac02c83bb3e..48f74bd22343 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -113,6 +113,7 @@ config PPC_THP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+ select ARCH_WANTS_THP_SWAP if TRANSPARENT_HUGEPAGE
choice
prompt "CPU selection"
--
2.39.5
* Re: [PATCH v6 16/20] dma: swiotlb: free dynamic pools from process context
From: Petr Tesarik @ 2026-06-09 13:23 UTC
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Mostafa Saleh, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86, Michael Kelley
In-Reply-To: <20260604083959.1265923-17-aneesh.kumar@kernel.org>
On Thu, 4 Jun 2026 14:09:55 +0530
"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:
> swiotlb_dyn_free() is used after removing a dynamic swiotlb pool from
> RCU-protected lists. It can call swiotlb_free_tlb(), which may need to
> restore the encryption state of an unencrypted pool with
> set_memory_encrypted() before freeing the pages.
>
> RCU callbacks run in atomic context, but set_memory_encrypted() is not
> guaranteed to be atomic-safe on all architectures. For example, page
> attribute updates may allocate page tables or take sleeping locks.
Good catch!
> Use queue_rcu_work() for dynamic pool freeing instead. This keeps the RCU
> grace period before freeing a published pool, while running the actual pool
> teardown from workqueue context. Use the same helper for the transient-pool
> error path, since that path may also be reached from atomic DMA mapping
> context.
Strictly speaking, it's not necessary, because this is in the error
path just after allocating a transient pool. There are only two
possible scenarios:
a. The transient buffer was allocated from a sleeping context, and then
it's also OK to decrypt memory.
b. The transient buffer was allocated in atomic context, but then it was
allocated from a coherent pool and it is returned to that pool
rather than decrypted.
However, it's also fine to queue an RCU work. The logic is definitely
cleaner and easier to maintain.
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Reviewed-by: Petr Tesarik <ptesarik@suse.com>
Petr T
> ---
> include/linux/swiotlb.h | 4 ++--
> kernel/dma/swiotlb.c | 19 +++++++++++--------
> 2 files changed, 13 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 4dcbf3931be1..526f82e9da45 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -64,7 +64,7 @@ extern void __init swiotlb_update_mem_attributes(void);
> * @areas: Array of memory area descriptors.
> * @slots: Array of slot descriptors.
> * @node: Member of the IO TLB memory pool list.
> - * @rcu: RCU head for swiotlb_dyn_free().
> + * @dyn_free: RCU work item used to free the pool from process context.
> * @transient: %true if transient memory pool.
> */
> struct io_tlb_pool {
> @@ -79,7 +79,7 @@ struct io_tlb_pool {
> struct io_tlb_slot *slots;
> #ifdef CONFIG_SWIOTLB_DYNAMIC
> struct list_head node;
> - struct rcu_head rcu;
> + struct rcu_work dyn_free;
> bool transient;
> bool unencrypted;
> #endif
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index f4e8b241a1c4..4c56f64602ea 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -774,13 +774,10 @@ static void swiotlb_dyn_alloc(struct work_struct *work)
> add_mem_pool(mem, pool);
> }
>
> -/**
> - * swiotlb_dyn_free() - RCU callback to free a memory pool
> - * @rcu: RCU head in the corresponding struct io_tlb_pool.
> - */
> -static void swiotlb_dyn_free(struct rcu_head *rcu)
> +static void swiotlb_dyn_free_work(struct work_struct *work)
> {
> - struct io_tlb_pool *pool = container_of(rcu, struct io_tlb_pool, rcu);
> + struct io_tlb_pool *pool =
> + container_of(to_rcu_work(work), struct io_tlb_pool, dyn_free);
> size_t slots_size = array_size(sizeof(*pool->slots), pool->nslabs);
> size_t tlb_size = pool->end - pool->start;
>
> @@ -789,6 +786,12 @@ static void swiotlb_dyn_free(struct rcu_head *rcu)
> kfree(pool);
> }
>
> +static void swiotlb_schedule_dyn_free(struct io_tlb_pool *pool)
> +{
> + INIT_RCU_WORK(&pool->dyn_free, swiotlb_dyn_free_work);
> + queue_rcu_work(system_wq, &pool->dyn_free);
> +}
> +
> /**
> * __swiotlb_find_pool() - find the IO TLB pool for a physical address
> * @dev: Device which has mapped the DMA buffer.
> @@ -835,7 +838,7 @@ static void swiotlb_del_pool(struct device *dev, struct io_tlb_pool *pool)
> list_del_rcu(&pool->node);
> spin_unlock_irqrestore(&dev->dma_io_tlb_lock, flags);
>
> - call_rcu(&pool->rcu, swiotlb_dyn_free);
> + swiotlb_schedule_dyn_free(pool);
> }
>
> #endif /* CONFIG_SWIOTLB_DYNAMIC */
> @@ -1276,7 +1279,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
> index = swiotlb_search_pool_area(dev, pool, 0, orig_addr, tbl_dma_addr,
> alloc_size, alloc_align_mask);
> if (index < 0) {
> - swiotlb_dyn_free(&pool->rcu);
> + swiotlb_schedule_dyn_free(pool);
> return -1;
> }
>
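Illustration (not part of the patch): the idea of deferring teardown from an
atomic caller to a context that may sleep can be sketched in user space as
below. This is a minimal sketch under stated assumptions. The pending list and
run_pending_work() are hypothetical stand-ins for a workqueue,
in_atomic_context is a made-up flag, and the sketch does not model the RCU
grace period that queue_rcu_work() waits for. What it does show is the
container_of() step from an embedded work item back to its pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical work item; a stand-in for struct rcu_work / work_struct. */
struct work {
	struct work *next;
	void (*fn)(struct work *w);
};

static struct work *pending;		/* simple LIFO of deferred work */
static int in_atomic_context;		/* nonzero while simulating an atomic caller */
static unsigned int freed_count;

struct pool {
	size_t nslabs;
	struct work free_work;		/* embedded, like io_tlb_pool.dyn_free */
};

/* Runs later from "process context", where sleeping teardown is allowed. */
static void pool_free_work(struct work *w)
{
	struct pool *pool = container_of(w, struct pool, free_work);

	assert(!in_atomic_context);
	free(pool);
	freed_count++;
}

/* Safe to call from the simulated atomic context: it only queues. */
static void pool_schedule_free(struct pool *pool)
{
	pool->free_work.fn = pool_free_work;
	pool->free_work.next = pending;
	pending = &pool->free_work;
}

/* Drain the queue; read ->next before calling fn, since fn frees the item. */
static void run_pending_work(void)
{
	while (pending) {
		struct work *w = pending;

		pending = w->next;
		w->fn(w);
	}
}
```

Using one scheduling helper for both the normal removal path and the error
path keeps a single rule for callers: queue the pool, never free it inline.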
* Re: [PATCH v6 15/20] iommu/dma: Check atomic pool allocation result directly
From: Petr Tesarik @ 2026-06-09 13:13 UTC
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Mostafa Saleh, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86, Michael Kelley
In-Reply-To: <20260604083959.1265923-16-aneesh.kumar@kernel.org>
On Thu, 4 Jun 2026 14:09:54 +0530
"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:
> The non-blocking, non-coherent allocation path uses dma_alloc_from_pool(),
> which returns the allocated page and fills cpu_addr only on success.
>
> Do not rely on cpu_addr to detect allocation failure in this path. Check
> the returned page directly before using it for the IOMMU mapping.
>
> Fixes: 9420139f516d ("dma-pool: fix coherent pool allocations for IOMMU mappings")
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Reviewed-by: Petr Tesarik <ptesarik@suse.com>
Petr T
> ---
> drivers/iommu/dma-iommu.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 725c7adb0a8d..52c599f4472c 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1671,13 +1671,16 @@ void *iommu_dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
> }
>
> if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
> - !gfpflags_allow_blocking(gfp) && !coherent)
> + !gfpflags_allow_blocking(gfp) && !coherent) {
> page = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &cpu_addr,
> gfp, attrs, NULL);
> - else
> + if (!page)
> + return NULL;
> + } else {
> cpu_addr = iommu_dma_alloc_pages(dev, size, &page, gfp, attrs);
> - if (!cpu_addr)
> - return NULL;
> + if (!cpu_addr)
> + return NULL;
> + }
>
> *handle = __iommu_dma_map(dev, page_to_phys(page), size, ioprot,
> dev->coherent_dma_mask);
* Re: [PATCH v6 17/20] dma: swiotlb: handle set_memory_decrypted() failures
From: Petr Tesarik @ 2026-06-09 13:32 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Mostafa Saleh, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86, Michael Kelley
In-Reply-To: <20260604083959.1265923-18-aneesh.kumar@kernel.org>
On Thu, 4 Jun 2026 14:09:56 +0530
"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:
> Check the return value when converting swiotlb pools between encrypted and
> decrypted mappings. If the default pool cannot be decrypted after early
> initialization, mark the pool fully used so it cannot satisfy future bounce
> allocations.
>
> For late initialization, return the `set_memory_decrypted()` failure. For
> restricted DMA pools, fail device initialization if the reserved pool
> cannot be decrypted.
>
> This prevents swiotlb from using pools whose encryption attributes do not
> match their metadata, and avoids returning pages with uncertain encryption
> state back to the allocator.
This works fine, but instead of effectively leaking the memory, we
could return it to the buddy allocator and reset nslabs to zero as if
SWIOTLB was not even initialized.
OTOH I don't want to overthink this, because the system is probably not
too useful after such a boot-time failure, so unless you _want_ to
improve the error path, you can simply add:
Reviewed-by: Petr Tesarik <ptesarik@suse.com>
Petr T
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
> kernel/dma/swiotlb.c | 80 +++++++++++++++++++++++++++++++++++---------
> 1 file changed, 65 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 4c56f64602ea..14d834ca298b 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -248,6 +248,23 @@ static inline unsigned long nr_slots(u64 val)
> return DIV_ROUND_UP(val, IO_TLB_SIZE);
> }
>
> +static void swiotlb_mark_pool_used(struct io_tlb_pool *pool)
> +{
> + unsigned long i;
> +
> + for (i = 0; i < pool->nareas; i++) {
> + pool->areas[i].index = 0;
> + pool->areas[i].used = pool->area_nslabs;
> + }
> +
> + for (i = 0; i < pool->nslabs; i++) {
> + pool->slots[i].list = 0;
> + pool->slots[i].orig_addr = INVALID_PHYS_ADDR;
> + pool->slots[i].alloc_size = 0;
> + pool->slots[i].pad_slots = 0;
> + }
> +}
> +
> /*
> * Early SWIOTLB allocation may be too early to allow an architecture to
> * perform the desired operations. This function allows the architecture to
> @@ -272,8 +289,16 @@ void __init swiotlb_update_mem_attributes(void)
> return;
> bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
>
> - if (io_tlb_default_mem.unencrypted)
> - set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
> + if (io_tlb_default_mem.unencrypted) {
> + int ret;
> +
> + ret = set_memory_decrypted((unsigned long)mem->vaddr,
> + bytes >> PAGE_SHIFT);
> + if (ret) {
> + pr_warn("Failed to decrypt default memory pool, disabling it\n");
> + swiotlb_mark_pool_used(mem);
> + }
> + }
> }
>
> static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
> @@ -442,9 +467,10 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
> {
> struct io_tlb_pool *mem = &io_tlb_default_mem.defpool;
> unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
> + unsigned int order, area_order, slot_order;
> + bool leak_pages = false;
> unsigned int nareas;
> unsigned char *vstart = NULL;
> - unsigned int order, area_order;
> bool retried = false;
> int rc = 0;
>
> @@ -504,6 +530,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
> (PAGE_SIZE << order) >> 20);
> }
>
> + rc = -ENOMEM;
> nareas = limit_nareas(default_nareas, nslabs);
> area_order = get_order(array_size(sizeof(*mem->areas), nareas));
> mem->areas = (struct io_tlb_area *)
> @@ -511,14 +538,20 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
> if (!mem->areas)
> goto error_area;
>
> + slot_order = get_order(array_size(sizeof(*mem->slots), nslabs));
> mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> - get_order(array_size(sizeof(*mem->slots), nslabs)));
> + slot_order);
> if (!mem->slots)
> goto error_slots;
>
> - if (io_tlb_default_mem.unencrypted)
> - set_memory_decrypted((unsigned long)vstart,
> - (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
> + if (io_tlb_default_mem.unencrypted) {
> + rc = set_memory_decrypted((unsigned long)vstart,
> + (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
> + if (rc) {
> + leak_pages = true;
> + goto error_decrypt;
> + }
> + }
>
> swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
> nareas);
> @@ -527,16 +560,20 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
> swiotlb_print_info();
> return 0;
>
> +error_decrypt:
> + free_pages((unsigned long)mem->slots, slot_order);
> error_slots:
> free_pages((unsigned long)mem->areas, area_order);
> error_area:
> - free_pages((unsigned long)vstart, order);
> - return -ENOMEM;
> + if (!leak_pages)
> + free_pages((unsigned long)vstart, order);
> + return rc;
> }
>
> void __init swiotlb_exit(void)
> {
> struct io_tlb_pool *mem = &io_tlb_default_mem.defpool;
> + bool leak_pages = false;
> unsigned long tbl_vaddr;
> size_t tbl_size, slots_size;
> unsigned int area_order;
> @@ -552,19 +589,23 @@ void __init swiotlb_exit(void)
> tbl_size = PAGE_ALIGN(mem->end - mem->start);
> slots_size = PAGE_ALIGN(array_size(sizeof(*mem->slots), mem->nslabs));
>
> - if (io_tlb_default_mem.unencrypted)
> - set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
> + if (io_tlb_default_mem.unencrypted) {
> + if (set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT))
> + leak_pages = true;
> + }
>
> if (mem->late_alloc) {
> area_order = get_order(array_size(sizeof(*mem->areas),
> mem->nareas));
> free_pages((unsigned long)mem->areas, area_order);
> - free_pages(tbl_vaddr, get_order(tbl_size));
> + if (!leak_pages)
> + free_pages(tbl_vaddr, get_order(tbl_size));
> free_pages((unsigned long)mem->slots, get_order(slots_size));
> } else {
> memblock_free(mem->areas,
> array_size(sizeof(*mem->areas), mem->nareas));
> - memblock_phys_free(mem->start, tbl_size);
> + if (!leak_pages)
> + memblock_phys_free(mem->start, tbl_size);
> memblock_free(mem->slots, slots_size);
> }
>
> @@ -1938,9 +1979,18 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
> * restricted mem pool is decrypted by default
> */
> if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> + int ret;
> +
> mem->unencrypted = true;
> - set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
> - rmem->size >> PAGE_SHIFT);
> + ret = set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
> + rmem->size >> PAGE_SHIFT);
> + if (ret) {
> + dev_err(dev, "Failed to decrypt restricted DMA pool\n");
> + kfree(pool->areas);
> + kfree(pool->slots);
> + kfree(mem);
> + return ret;
> + }
> } else {
> mem->unencrypted = false;
> }
* Re: [PATCH v6 19/20] swiotlb: Preserve allocation virtual address for dynamic pools
From: Petr Tesarik @ 2026-06-09 13:40 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Mostafa Saleh, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86, Michael Kelley
In-Reply-To: <20260604083959.1265923-20-aneesh.kumar@kernel.org>
On Thu, 4 Jun 2026 14:09:58 +0530
"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:
> swiotlb_alloc_tlb() can allocate from the DMA atomic pool when a decrypted
> pool is needed from atomic context. With CONFIG_DMA_DIRECT_REMAP, the
> atomic pool is backed by remapped virtual addresses, which are not the same
> as the direct-map addresses returned by phys_to_virt().
>
> swiotlb_init_io_tlb_pool() currently reconstructs the pool virtual address
> from the physical start address. For atomic-pool backed allocations this
> stores the wrong address in pool->vaddr. Later, swiotlb_free_tlb() passes
> that address to dma_free_from_pool(), which will fail to recognize the
> chunk.
>
> Pass the virtual address returned by the allocation path into
> swiotlb_init_io_tlb_pool(), and store that address in pool->vaddr. This
> keeps the pool free path using the same virtual address as the allocator.
>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Hm, so the old code was broken; you may want to add:
Fixes: 79636caad361 ("swiotlb: if swiotlb is full, fall back to a transient memory pool")
And of course:
Reviewed-by: Petr Tesarik <ptesarik@suse.com>
Thank you!
Petr T
> ---
> kernel/dma/swiotlb.c | 32 +++++++++++++++++++-------------
> 1 file changed, 19 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 14d834ca298b..e4bd8c9eaeda 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -302,9 +302,9 @@ void __init swiotlb_update_mem_attributes(void)
> }
>
> static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
> - unsigned long nslabs, bool late_alloc, unsigned int nareas)
> + void *vaddr, unsigned long nslabs, bool late_alloc,
> + unsigned int nareas)
> {
> - void *vaddr = phys_to_virt(start);
> unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
>
> mem->nslabs = nslabs;
> @@ -445,7 +445,7 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
> return;
> }
>
> - swiotlb_init_io_tlb_pool(mem, __pa(tlb), nslabs, false, nareas);
> + swiotlb_init_io_tlb_pool(mem, __pa(tlb), tlb, nslabs, false, nareas);
> add_mem_pool(&io_tlb_default_mem, mem);
>
> if (flags & SWIOTLB_VERBOSE)
> @@ -553,7 +553,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
> }
> }
>
> - swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
> + swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), vstart, nslabs, true,
> nareas);
> add_mem_pool(&io_tlb_default_mem, mem);
>
> @@ -664,25 +664,26 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes,
> * @phys_limit: Maximum allowed physical address of the buffer.
> * @attrs: DMA attributes for the allocation.
> * @gfp: GFP flags for the allocation.
> + * @vaddr: Receives the virtual address for the allocated buffer.
> *
> * Return: Allocated pages, or %NULL on allocation failure.
> */
> static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
> - u64 phys_limit, unsigned long attrs, gfp_t gfp)
> + u64 phys_limit, unsigned long attrs, gfp_t gfp, void **vaddr)
> {
> struct page *page;
>
> + *vaddr = NULL;
> +
> /*
> * Allocate from the atomic pools if memory is encrypted and
> * the allocation is atomic, because decrypting may block.
> */
> if (!gfpflags_allow_blocking(gfp) && (attrs & DMA_ATTR_CC_SHARED)) {
> - void *vaddr;
> -
> if (!IS_ENABLED(CONFIG_DMA_COHERENT_POOL))
> return NULL;
>
> - return dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
> + return dma_alloc_from_pool(dev, bytes, vaddr, gfp,
> attrs, dma_coherent_ok);
> }
>
> @@ -705,6 +706,8 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
> return NULL;
> }
>
> + if (page)
> + *vaddr = phys_to_virt(page_to_phys(page));
> return page;
> }
>
> @@ -750,6 +753,7 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
> {
> struct io_tlb_pool *pool;
> unsigned int slot_order;
> + void *tlb_vaddr;
> struct page *tlb;
> size_t pool_size;
> size_t tlb_size;
> @@ -767,7 +771,8 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
> pool->unencrypted = !!(attrs & DMA_ATTR_CC_SHARED);
>
> tlb_size = nslabs << IO_TLB_SHIFT;
> - while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, attrs, gfp))) {
> + while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, attrs, gfp,
> + &tlb_vaddr))) {
> if (nslabs <= minslabs)
> goto error_tlb;
> nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
> @@ -781,12 +786,12 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
> if (!pool->slots)
> goto error_slots;
>
> - swiotlb_init_io_tlb_pool(pool, page_to_phys(tlb), nslabs, true, nareas);
> + swiotlb_init_io_tlb_pool(pool, page_to_phys(tlb), tlb_vaddr, nslabs,
> + true, nareas);
> return pool;
>
> error_slots:
> - swiotlb_free_tlb(page_address(tlb), tlb_size,
> - !!(attrs & DMA_ATTR_CC_SHARED));
> + swiotlb_free_tlb(tlb_vaddr, tlb_size, !!(attrs & DMA_ATTR_CC_SHARED));
> error_tlb:
> kfree(pool);
> error:
> @@ -1995,7 +2000,8 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
> mem->unencrypted = false;
> }
>
> - swiotlb_init_io_tlb_pool(pool, rmem->base, nslabs,
> + swiotlb_init_io_tlb_pool(pool, rmem->base, phys_to_virt(rmem->base),
> + nslabs,
> false, nareas);
> mem->force_bounce = true;
> mem->for_alloc = true;
* Re: [PATCH v6 00/20] dma-mapping: Use DMA_ATTR_CC_SHARED through direct, pool and swiotlb paths
From: Catalin Marinas @ 2026-06-09 13:43 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Jiri Pirko, Jason Gunthorpe, Mostafa Saleh,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260604083959.1265923-1-aneesh.kumar@kernel.org>
On Thu, Jun 04, 2026 at 02:09:39PM +0530, Aneesh Kumar K.V (Arm) wrote:
> This series propagates DMA_ATTR_CC_SHARED through the dma-direct,
> dma-pool, and swiotlb paths so that encrypted and decrypted DMA buffers
> are handled consistently.
>
> Today, the direct DMA path mostly relies on force_dma_unencrypted() for
> shared/decrypted buffer handling. This series consolidates the
> force_dma_unencrypted() checks in the top-level functions and ensures
> that the remaining DMA interfaces use DMA attributes to make the correct
> decisions.
Please check Sashiko's reports, it has some good points:
https://sashiko.dev/#/patchset/20260604083959.1265923-1-aneesh.kumar@kernel.org
I think the main one is the swiotlb_tbl_map_single() changes which break
AMD SME host support. There cc_platform_has(CC_ATTR_MEM_ENCRYPT) is true
but force_dma_unencrypted() is false. Normally you'd not end up on this
path but you can have swiotlb=force.
> Aneesh Kumar K.V (Arm) (20):
> s390: Expose protected virtualization through cc_platform_has()
> dma-direct: swiotlb: handle swiotlb alloc/free outside
> __dma_direct_alloc_pages
> dma-direct: use DMA_ATTR_CC_SHARED in alloc/free paths
> dma-pool: track decrypted atomic pools and select them via attrs
> dma: swiotlb: pass mapping attributes by reference
> dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
> dma-mapping: make dma_pgprot() honor DMA_ATTR_CC_SHARED
> dma-direct: pass attrs to dma_capable() for DMA_ATTR_CC_SHARED checks
> dma-direct: make dma_direct_map_phys() honor DMA_ATTR_CC_SHARED
> dma-direct: set decrypted flag for remapped DMA allocations
Patch 10 above...
> dma-direct: select DMA address encoding from DMA_ATTR_CC_SHARED
> dma-pool: fix page leak in atomic_pool_expand() cleanup
Patch 12...
> dma-direct: rename ret to cpu_addr in alloc helpers
> dma-direct: return struct page from dma_direct_alloc_from_pool()
> iommu/dma: Check atomic pool allocation result directly
and I think patches 14, 15 are independent fixes. Some of them even have
Fixes: tags and Cc: stable. Please move them to the beginning of the
series to avoid inadvertent dependencies that would make them harder to
backport. It's also easier to follow the series without random fixes for
mainline in the middle.
> dma: swiotlb: free dynamic pools from process context
> dma: swiotlb: handle set_memory_decrypted() failures
> dma: free atomic pool pages by physical address
> swiotlb: Preserve allocation virtual address for dynamic pools
> swiotlb: remove unused SWIOTLB_FORCE flag
--
Catalin
* [PATCH V5 1/2] tools/perf: Fix the check for parameterized field in event term
From: Athira Rajeev @ 2026-06-09 13:43 UTC (permalink / raw)
To: acme, jolsa, adrian.hunter, mpetlan, tmricht, maddy, irogers,
namhyung
Cc: linux-perf-users, linuxppc-dev, atrajeev, hbathini, Tejas.Manhas1,
Tanushree.Shah, shivani, venkat88
The format_alias() function in util/pmu.c has a check to
detect whether the event has a parameterized field ("=?").
The string alias->terms contains the event terms; if the event
has a user-configurable parameter, alias->terms contains the
substring "=?".
Snippet of code:
/* Paramemterized events have the parameters shown. */
if (strstr(alias->terms, "=?")) {
/* No parameters. */
snprintf(buf, len, "%.*s/%s/", (int)pmu_name_len, pmu->name, alias->name);
If strstr() finds the substring, it returns a non-NULL pointer,
so the code enters the "No parameters" branch above, which is
the opposite of the intended check. As a result, "perf list"
does not show the parameterized fields in its output.
Fix this check to use:
if (!strstr(alias->terms, "=?")) {
With this change, perf list shows the events correctly with
the strings showing parameters.
Before the fix:
# ./perf list|grep -w PM_PAU_CYC
hv_24x7/PM_PAU_CYC/ [Kernel PMU event]
With this fix:
# ./perf list|grep -w PM_PAU_CYC
hv_24x7/PM_PAU_CYC,chip=?/ [Kernel PMU event]
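To make the inverted condition concrete, here is a minimal userspace
sketch (not the perf code itself). The helper name
has_param_placeholder() is hypothetical; only strstr() and the "=?"
marker come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: reports whether an
 * alias terms string carries a user-configurable parameter ("=?").
 */
bool has_param_placeholder(const char *terms)
{
	assert(terms != NULL);

	/* strstr() returns NULL only when "=?" is absent. */
	return strstr(terms, "=?") != NULL;
}
```

The "No parameters" early return therefore belongs under the negated
test, which is what the one-line fix does.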
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com>
---
Changelog:
v4 -> v5:
Added Reviewed-by from Ian and Acked-by from Namhyung.
Added Tested-by from Venkat
v3 -> v4:
Updated commit message to show real example
addressing review comment from Namhyung.
v2 -> v3:
Split the strstr correction in a single patch
tools/perf/util/pmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 9994709ef12b..e765a7ffb0d6 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -2134,7 +2134,7 @@ static char *format_alias(char *buf, int len, const struct perf_pmu *pmu,
skip_duplicate_pmus);
/* Paramemterized events have the parameters shown. */
- if (strstr(alias->terms, "=?")) {
+ if (!strstr(alias->terms, "=?")) {
/* No parameters. */
snprintf(buf, len, "%.*s/%s/", (int)pmu_name_len, pmu->name, alias->name);
return buf;
--
2.52.0
* [PATCH V5 2/2] tools/perf: Use scnprintf in buffer offset calculations
From: Athira Rajeev @ 2026-06-09 13:43 UTC (permalink / raw)
To: acme, jolsa, adrian.hunter, mpetlan, tmricht, maddy, irogers,
namhyung
Cc: linux-perf-users, linuxppc-dev, atrajeev, hbathini, Tejas.Manhas1,
Tanushree.Shah, shivani, venkat88
In-Reply-To: <20260609134332.97954-1-atrajeev@linux.ibm.com>
Replace snprintf with scnprintf in buffer offset calculations to
ensure the 'used' count will not exceed the "len".
The current logic in perf_pmu__for_each_event uses an unconditional
+ 1 increment to buf_used to account for null terminators. This can
cause a stack buffer overflow in the subsequent scnprintf call.
When the local stack buffer buf (1024 bytes) is full, buf_used can
reach 1025. This causes the subsequent remaining space calculation
sizeof(buf) - buf_used to underflow.
Use sub_non_neg() to see if space actually existed, and only
increment the offset if remaining space is present.
Changes include:
- Use sub_non_neg to check if space exists
- Replace snprintf with scnprintf to ensure the return value
reflects the actual bytes written into the buffer.
- Only increment buf_used by 1 if space exists
- If a parameterized event uses a built-in perf keyword for its
parameter name (e.g. config=?), the lexer parses it as a predefined
term token, which sets term->config to NULL. Add a check to use
parse_events__term_type_str() if term->config is NULL.
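The offset handling described above can be sketched in userspace as
follows. This is an illustration under assumptions, not the perf
implementation: demo_sub_non_neg() and demo_scnprintf() are stand-ins
written here to mimic the semantics the commit message relies on
(clamped subtraction, and a return value that counts only the bytes
actually stored), and demo_append_field() is a hypothetical helper.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for sub_non_neg(): a - b, clamped at zero. */
size_t demo_sub_non_neg(size_t a, size_t b)
{
	return b > a ? 0 : a - b;
}

/*
 * Stand-in for scnprintf(): like snprintf(), but returns the number of
 * characters actually stored (excluding the NUL), at most size - 1.
 */
int demo_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(buf != NULL);
	if (size == 0)
		return 0;

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	if (n < 0)
		return 0;
	return (size_t)n >= size ? (int)(size - 1) : n;
}

/*
 * Append a NUL-terminated field at buf + used and return the new offset.
 * The offset only advances when at least one byte of space was left,
 * so it can never move past len.
 */
size_t demo_append_field(char *buf, size_t len, size_t used, const char *s)
{
	size_t rem = demo_sub_non_neg(len, used);

	if (rem > 0)
		used += (size_t)demo_scnprintf(buf + used, rem, "%s", s) + 1;
	return used;
}
```

With plain snprintf() the return value is the length that would have
been written, so adding it to the offset can push the offset past the
end of the buffer and make a later "len - used" wrap around.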
Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com>
---
Changelog:
v4 -> v5:
Addressed review comment from Namhyung: buf_used now uses the
return value of scnprintf, which cannot be greater than or
equal to its size argument
v2 -> v3:
- Split the scnprintf related changes in separate patch
- Handle the overflow issues and unconditional increment
wrapped around sub_non_neg addressing review comment from Sashiko
tools/perf/util/pmu.c | 40 +++++++++++++++++++++++++++++-----------
1 file changed, 29 insertions(+), 11 deletions(-)
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index e765a7ffb0d6..1539960ba23b 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -2146,15 +2146,19 @@ static char *format_alias(char *buf, int len, const struct perf_pmu *pmu,
pr_err("Failure to parse '%s' terms '%s': %d\n",
alias->name, alias->terms, ret);
parse_events_terms__exit(&terms);
- snprintf(buf, len, "%.*s/%s/", (int)pmu_name_len, pmu->name, alias->name);
+ scnprintf(buf, len, "%.*s/%s/", (int)pmu_name_len, pmu->name, alias->name);
return buf;
}
- used = snprintf(buf, len, "%.*s/%s", (int)pmu_name_len, pmu->name, alias->name);
+ used = scnprintf(buf, len, "%.*s/%s", (int)pmu_name_len, pmu->name, alias->name);
list_for_each_entry(term, &terms.terms, list) {
+ const char *name = term->config;
+
+ if (!name)
+ name = parse_events__term_type_str(term->type_term);
if (term->type_val == PARSE_EVENTS__TERM_TYPE_STR)
- used += snprintf(buf + used, sub_non_neg(len, used),
- ",%s=%s", term->config,
+ used += scnprintf(buf + used, sub_non_neg(len, used),
+ ",%s=%s", name,
term->val.str);
}
parse_events_terms__exit(&terms);
@@ -2218,6 +2222,7 @@ int perf_pmu__for_each_event(struct perf_pmu *pmu, bool skip_duplicate_pmus,
int ret = 0;
struct hashmap_entry *entry;
size_t bkt;
+ size_t size_rem;
if (perf_pmu__is_tracepoint(pmu))
return tp_pmu__for_each_event(pmu, state, cb);
@@ -2251,17 +2256,30 @@ int perf_pmu__for_each_event(struct perf_pmu *pmu, bool skip_duplicate_pmus,
}
buf_used = strlen(buf) + 1;
}
+
info.scale_unit = NULL;
if (strlen(event->unit) || event->scale != 1.0) {
- info.scale_unit = buf + buf_used;
- buf_used += snprintf(buf + buf_used, sizeof(buf) - buf_used,
- "%G%s", event->scale, event->unit) + 1;
+ /* Check the remaining space */
+ size_rem = sub_non_neg(sizeof(buf), buf_used);
+
+ if (size_rem > 0) {
+ info.scale_unit = buf + buf_used;
+ buf_used += scnprintf(buf + buf_used, size_rem, "%G%s",
+ event->scale, event->unit) + 1;
+ }
}
info.desc = event->desc;
info.long_desc = event->long_desc;
- info.encoding_desc = buf + buf_used;
- buf_used += snprintf(buf + buf_used, sizeof(buf) - buf_used,
- "%.*s/%s/", (int)pmu_name_len, info.pmu_name, event->terms) + 1;
+ info.encoding_desc = NULL;
+
+ /* Check the remaining space */
+ size_rem = sub_non_neg(sizeof(buf), buf_used);
+ if (size_rem > 0) {
+ info.encoding_desc = buf + buf_used;
+ buf_used += scnprintf(buf + buf_used, size_rem, "%.*s/%s/",
+ (int)pmu_name_len, info.pmu_name, event->terms) + 1;
+ }
+
info.str = event->terms;
info.topic = event->topic;
info.deprecated = perf_pmu_alias__check_deprecated(pmu, event);
@@ -2271,7 +2289,7 @@ int perf_pmu__for_each_event(struct perf_pmu *pmu, bool skip_duplicate_pmus,
}
if (pmu->selectable) {
info.name = buf;
- snprintf(buf, sizeof(buf), "%s//", pmu->name);
+ scnprintf(buf, sizeof(buf), "%s//", pmu->name);
info.alias = NULL;
info.scale_unit = NULL;
info.desc = NULL;
--
2.52.0
* Re: [PATCH v6 20/20] swiotlb: remove unused SWIOTLB_FORCE flag
From: Petr Tesarik @ 2026-06-09 13:44 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Mostafa Saleh, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260604083959.1265923-21-aneesh.kumar@kernel.org>
On Thu, 4 Jun 2026 14:09:59 +0530
"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:
> SWIOTLB_FORCE has no remaining in-tree users. Forced bouncing is now
> controlled through the swiotlb=force command line option via
> swiotlb_force_bounce.
>
> Remove the unused flag and simplify the force_bounce initialization.
>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
> include/linux/swiotlb.h | 1 -
> kernel/dma/swiotlb.c | 3 +--
> 2 files changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 526f82e9da45..af88ca7182f4 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -15,7 +15,6 @@ struct page;
> struct scatterlist;
>
> #define SWIOTLB_VERBOSE (1 << 0) /* verbose initialization */
> -#define SWIOTLB_FORCE (1 << 1) /* force bounce buffering */
> #define SWIOTLB_ANY (1 << 2) /* allow any memory for the buffer */
These constants are kernel-internal, so let's not leave a hole in the
bitmask... I mean, what about changing SWIOTLB_ANY to (1 << 1) after
you remove SWIOTLB_FORCE?
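For illustration, the renumbering could look roughly like the sketch
below. It assumes nothing depends on the numeric value of SWIOTLB_ANY,
and demo_flags_are_dense() is a hypothetical helper added only to show
what "no hole in the bitmask" means.

```c
#include <assert.h>
#include <stdbool.h>

#define SWIOTLB_VERBOSE	(1 << 0) /* verbose initialization */
#define SWIOTLB_ANY	(1 << 1) /* allow any memory for the buffer */

/* Hypothetical check: the defined flags occupy contiguous low bits. */
bool demo_flags_are_dense(unsigned int all_flags)
{
	assert(all_flags != 0);

	/* A mask of n contiguous low bits, plus one, is a power of two. */
	return (all_flags & (all_flags + 1)) == 0;
}
```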
Other than that, LGTM.
I consider this whole series a big step towards saner handling of
encrypted/decrypted memory for DMA buffers. Thank you for your effort!
Petr T
>
> /*
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index e4bd8c9eaeda..81cc4928e949 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -400,8 +400,7 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
> if (swiotlb_force_disable)
> return;
>
> - io_tlb_default_mem.force_bounce =
> - swiotlb_force_bounce || (flags & SWIOTLB_FORCE);
> + io_tlb_default_mem.force_bounce = swiotlb_force_bounce;
>
> #ifdef CONFIG_SWIOTLB_DYNAMIC
> if (!remap)
* Re: [PATCH v6 01/20] s390: Expose protected virtualization through cc_platform_has()
From: Catalin Marinas @ 2026-06-09 13:44 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Jiri Pirko, Jason Gunthorpe, Mostafa Saleh,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86, Halil Pasic,
Matthew Rosato, Jaehoon Kim
In-Reply-To: <20260604083959.1265923-2-aneesh.kumar@kernel.org>
On Thu, Jun 04, 2026 at 02:09:40PM +0530, Aneesh Kumar K.V (Arm) wrote:
> Protected virtualization guests use memory encryption, so advertise that to
> the rest of the kernel through cc_platform_has(CC_ATTR_MEM_ENCRYPT).
>
> s390 already forces DMA mappings to be unencrypted for protected
> virtualization guests through force_dma_unencrypted(). Add
> ARCH_HAS_CC_PLATFORM and provide the matching cc_platform_has()
> implementation.
>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
Nit: just drop the --- line if you did intend to cc those people.
Nothing wrong for them to end up in the commit log (proof that they've
been cc'ed if they did not reply ;)).
--
Catalin
* Re: [PATCH v6 14/20] dma-direct: return struct page from dma_direct_alloc_from_pool()
From: Catalin Marinas @ 2026-06-09 13:45 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Jiri Pirko, Jason Gunthorpe, Mostafa Saleh,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86, stable, Michael Kelley
In-Reply-To: <20260604083959.1265923-15-aneesh.kumar@kernel.org>
On Thu, Jun 04, 2026 at 02:09:53PM +0530, Aneesh Kumar K.V (Arm) wrote:
> Commit 5b138c534fda ("dma-direct: factor out a dma_direct_alloc_from_pool
> helper") changed dma_direct_alloc_from_pool() to return the CPU address
> from dma_alloc_from_pool(). That fits dma_direct_alloc(), but
> dma_direct_alloc_pages() also uses the helper and expects a struct page *.
>
> Fix this by making dma_direct_alloc_from_pool() return the struct page *
> again, and pass the CPU address back through an out-parameter for the
> dma_direct_alloc() caller.
>
> Fixes: 5b138c534fda ("dma-direct: factor out a dma_direct_alloc_from_pool helper")
> Cc: stable@vger.kernel.org
>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Nit: remove the empty line after Cc: stable. It may confuse tooling.
--
Catalin
* Re: [PATCH v3 2/4] scsi: host: allocate struct Scsi_Host on the NUMA node of the host adapter
From: John Garry @ 2026-06-09 13:03 UTC (permalink / raw)
To: Sumit Saxena, Martin K . Petersen, Jens Axboe
Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
Hannes Reinecke, Juergen E . Fischer, Russell King,
linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
storagedev, HighPoint Linux Team, Tyrel Datwyler,
Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
Kashyap Desai, Shivasharan S, Chandrakanth Patil,
megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
Eugenio Perez, virtualization, Vishal Bhakta,
bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, xen-devel
In-Reply-To: <20260609121806.2121755-3-sumit.saxena@broadcom.com>
On 09/06/2026 13:18, Sumit Saxena wrote:
> scsi_host_alloc() used kzalloc(), which always picks an arbitrary node.
> Extend the function to accept a 'struct device *dev' parameter and use
> kzalloc_node() with dev_to_node(dev) so the Scsi_Host struct lands on
> the same NUMA node as the HBA, mirroring the treatment already applied
> to struct scsi_device, struct scsi_target, and shost_data.
>
> When dev is NULL (legacy ISA/platform drivers without a dma_dev) the
> allocation falls back to NUMA_NO_NODE, preserving existing behaviour.
>
> Update all in-tree callers:
> - PCI-based HBA drivers pass &pdev->dev (or the equivalent struct
> member such as &phba->pcidev->dev, &h->pdev->dev, &ha->pdev->dev)
> so their host struct is placed on the adapter's node.
> - Non-PCI drivers (ISA, Amiga, ARM PCMCIA, virtio, Hyper-V, PS3, …)
> pass NULL.
> - libfc's libfc_host_alloc() inline helper passes NULL; FC drivers
> that want NUMA awareness can open-code the call with their pdev.
>
> Suggested-by: John Garry <john.g.garry@oracle.com>
> Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
Wow ... I was not expecting such a large change, but admittedly I did
not consider the implementation.
I did mention that pci-based adapters should already be effectively
doing kzalloc_node() since the adapter driver is probed on the local
NUMA node (and kmalloc first tries local NUMA allocations).
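To make the point about node selection concrete, here is a minimal userspace sketch, not kernel code. struct fake_device, pick_alloc_node() and effective_node() are hypothetical names invented for illustration; only the dev ? node : NUMA_NO_NODE fallback is taken from the patch, and the "no preference means local node first" rule is a simplification of allocator behaviour.

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)	/* same value the kernel uses for "no preference" */

/* Hypothetical stand-in for struct device; only the node matters here. */
struct fake_device {
	int numa_node;
};

/* Mirrors the patch: use the device's node, or no preference if dev is NULL. */
static int pick_alloc_node(const struct fake_device *dev)
{
	return dev ? dev->numa_node : NUMA_NO_NODE;
}

/*
 * Simplified model of where the allocator looks first: with no preference
 * it starts on the node of the CPU doing the allocation.
 */
static int effective_node(int requested, int local_node)
{
	assert(local_node >= 0);
	return requested == NUMA_NO_NODE ? local_node : requested;
}
```

If the probe routine already runs on the adapter's node, effective_node(NUMA_NO_NODE, adapter_node) and effective_node(pick_alloc_node(&hba), adapter_node) give the same answer, so under that assumption the explicit node only matters when the allocation happens on a different node.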
> ---
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index e047747d4ecf..e1f42be79729 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -403,12 +403,14 @@ static const struct device_type scsi_host_type = {
> * Return value:
> * Pointer to a new Scsi_Host
> **/
> -struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht, int privsize)
> +struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht, int privsize,
> + struct device *dev)
> {
> struct Scsi_Host *shost;
> int index;
>
> - shost = kzalloc(sizeof(struct Scsi_Host) + privsize, GFP_KERNEL);
> + shost = kzalloc_node(sizeof(struct Scsi_Host) + privsize, GFP_KERNEL,
> + dev ? dev_to_node(dev) : NUMA_NO_NODE);
> if (!shost)
> return NULL;
>
> -extern struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *, int);
> +extern struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht,
> + int privsize, struct device *dev);
> extern int __must_check scsi_add_host_with_dma(struct Scsi_Host *,
> struct device *,
> struct device *);
scsi_add_host_with_dma() and scsi_add_host() do assignment of
shost->dma_dev, so I think that could be moved to scsi_host_alloc().
I can imagine that we always know dev and dma_dev at Scsi_Host alloc
time (and not just at scsi_add_host() time). However, those would be very
intrusive changes.
Let me consider this more. Maybe we can have a platform device version
of shost alloc, as I can't imagine that we care about much more. Thanks!
* Re: [PATCH v6 14/20] dma-direct: return struct page from dma_direct_alloc_from_pool()
From: Jason Gunthorpe @ 2026-06-09 14:15 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Mostafa Saleh,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86, stable, Michael Kelley
In-Reply-To: <20260604083959.1265923-15-aneesh.kumar@kernel.org>
On Thu, Jun 04, 2026 at 02:09:53PM +0530, Aneesh Kumar K.V (Arm) wrote:
> @@ -270,9 +270,12 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> * the atomic pools instead if we aren't allowed block.
> */
> if ((remap || (attrs & DMA_ATTR_CC_SHARED)) &&
> - dma_direct_use_pool(dev, gfp))
> - return dma_direct_alloc_from_pool(dev, size, dma_handle,
> - gfp, attrs);
> + dma_direct_use_pool(dev, gfp)) {
> + page = dma_direct_alloc_from_pool(dev, size,
> + dma_handle, &cpu_addr,
> + gfp, attrs);
> + return page ? cpu_addr : NULL;
> + }
You should probably put this at the start of the series so it can be
backported
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
To Petr's question I think this just shows nobody is really stressing
the PCI dma paths on CC VMs today.
if (force_dma_unencrypted(dev) && dma_direct_use_pool(dev, gfp))
return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
For instance the places even calling dma_alloc_pages() don't look like
things people would use in a CC VM.
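For readers following the fix itself, a minimal userspace sketch of the out-parameter pattern the patch restores. Every name here (struct fake_page, alloc_from_pool(), alloc_cpu_addr(), alloc_page_ptr()) is hypothetical and only models the shape of the kernel helpers; it is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; not the kernel type. */
struct fake_page {
	int flags;
	char data[64];	/* the memory a CPU address would point at */
};

static struct fake_page pool_page;	/* one-entry stand-in for an atomic pool */
static int pool_in_use;

/*
 * Models the fixed helper: the return value is always the page, and the
 * CPU address is passed back through an out-parameter.
 */
static struct fake_page *alloc_from_pool(size_t size, void **cpu_addr)
{
	assert(cpu_addr != NULL);

	if (pool_in_use || size > sizeof(pool_page.data))
		return NULL;

	pool_in_use = 1;
	*cpu_addr = pool_page.data;
	return &pool_page;
}

static void free_to_pool(void)
{
	pool_in_use = 0;
}

/* Models the dma_direct_alloc() caller: it wants the CPU address. */
static void *alloc_cpu_addr(size_t size)
{
	void *cpu_addr = NULL;
	struct fake_page *page = alloc_from_pool(size, &cpu_addr);

	return page ? cpu_addr : NULL;
}

/* Models the dma_direct_alloc_pages() caller: it wants the page itself. */
static struct fake_page *alloc_page_ptr(size_t size)
{
	void *cpu_addr = NULL;

	return alloc_from_pool(size, &cpu_addr);
}
```

In this model the two callers receive different pointers for the same allocation. A helper that returned the CPU address to both would hand the second caller a pointer of the wrong kind, which is the mismatch the commit message describes.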
Jason
* Re: [PATCH V3] tools/perf/tests: Update test_adding_kernel.sh to handle proper debuginfo check
From: Athira Rajeev @ 2026-06-09 14:25 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers
Cc: jolsa, adrian.hunter, mpetlan, tmricht, maddy, linux-perf-users,
linuxppc-dev, hbathini, Tejas.Manhas1, Tanushree.Shah,
Shivani.Nittor, Venkat
In-Reply-To: <D7E4148E-DC3F-4FCC-8DD8-AA4A085DAF1D@linux.ibm.com>
> On 21 May 2026, at 2:01 PM, Athira Rajeev <atrajeev@linux.ibm.com> wrote:
>
>
>
>> On 29 Apr 2026, at 7:01 PM, Venkat <venkat88@linux.ibm.com> wrote:
>>
>>
>>
>>> On 24 Apr 2026, at 10:54 PM, Athira Rajeev <atrajeev@linux.ibm.com> wrote:
>>>
>>> Perf test perftool-testsuite_probe fails as below:
>>>
>>> Regexp not found: "\s*probe:inode_permission(?:_\d+)?\s+\(on inode_permission(?:[:\+][0-9A-Fa-f]+)?@.+\)"
>>> -- [ FAIL ] -- perf_probe :: test_adding_kernel :: listing added probe :: perf probe -l (output regexp parsing)
>>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: removing multiple probes
>>> Regexp not found: "probe:vfs_mknod"
>>> Regexp not found: "probe:vfs_create"
>>> Regexp not found: "probe:vfs_rmdir"
>>> Regexp not found: "probe:vfs_link"
>>> Regexp not found: "probe:vfs_write"
>>> -- [ FAIL ] -- perf_probe :: test_adding_kernel :: wildcard adding support (command exitcode + output regexp parsing)
>>> Regexp not found: "somenonexistingrandomstuffwhichisalsoprettylongorevenlongertoexceed64"
>>> Regexp not found: "in this function|at this address"
>>> -- [ FAIL ] -- perf_probe :: test_adding_kernel :: non-existing variable (output regexp parsing)
>>> ## [ FAIL ] ## perf_probe :: test_adding_kernel SUMMARY :: 3 failures found
>>>
>>> Further analysing, the failed testcase is for "test_adding_kernel".
>>> If the kernel debuginfo is missing, perf probe fails as below:
>>>
>>> perf probe -nf --max-probes=512 -a 'vfs_* $params'
>>> Failed to find the path for the kernel: No such file or directory
>>> Error: Failed to add events.
>>>
>>> skip_if_no_debuginfo has check to handle whether debuginfo is present
>>> and the testcase checks for debuginfo since this :
>>> commit 90d32e92011e ("tools/perf: Handle perftool-testsuite_probe
>>> testcases fail when kernel debuginfo is not present")
>>>
>>> Recently a change got added in "tests/shell/lib/probe_vfs_getname.sh"
>>> via this another fix:
>>> commit 92b664dcefab ("perf test probe_vfs_getname: Skip if no suitable
>>> line detected")
>>> Since this commit, first add_probe_vfs_getname is used to prevent false
>>> failures. And based on return code of add_probe_vfs_getname, skip_if_no_debuginfo
>>> is used to skip testcase if debuginfo is present. And this modified other
>>> testcases to call add_probe_vfs_getname first and invoke
>>> skip_if_no_debuginfo based on return value.
>>>
>>> The tests in test_adding_kernel.sh which depends on presence of
>>> debuginfo are:
>>> 1. probe add for inode_permission
>>> 2. probe max-probes option using 'vfs_* $params'
>>> 3. non-existing variable probing
>>>
>>> For these tests, probe check for specific line is not required.
>>> So call skip_if_no_debuginfo with argument to say if line check is
>>> needed. This is to convey to skip_if_no_debuginfo() function
>>> that test only needs to check for debuginfo, and not specifically
>>> line number. Update skip_if_no_debuginfo to use simple "perf probe"
>>> check if test only needs to check for debuginfo. And for other
>>> tests which rely on line number, use add_probe_vfs_getname()
>>> Update other places which uses skip_if_no_debuginfo to use argument
>>> as zero.
>>>
>>> With the change, verified that only three which required debuginfo only
>>> is skipped and others ran successfully. Also tested with debuginfo
>>> to make sure tests are not skipped.
>>>
>>> Reported-by: Tejas Manhas <Tejas.Manhas1@ibm.com>
>>> Reviewed-by: Ian Rogers <irogers@google.com>
>>> Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com>
>>> ---
>>
>> Tested this patch, by applying on top of mainline, and it fixes the reported issue.
>>
>> Without this patch:
>>
>> # ./perf test -v perftool-testsuite_probe
>> --- start ---
>> test child forked, pid 15772
>> Probing start_text
>> -- [ PASS ] -- perf_probe :: test_adding_blacklisted :: adding blacklisted function start_text
>> -- [ PASS ] -- perf_probe :: test_adding_blacklisted :: listing blacklisted probe (should NOT be listed)
>> ## [ PASS ] ## perf_probe :: test_adding_blacklisted SUMMARY
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: adding probe inode_permission ::
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: adding probe inode_permission :: -a
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: adding probe inode_permission :: --add
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: listing added probe :: perf list
>> Regexp not found: "\s*probe:inode_permission(?:_\d+)?\s+\(on inode_permission(?:[:\+][0-9A-Fa-f]+)?@.+\)"
>> -- [ FAIL ] -- perf_probe :: test_adding_kernel :: listing added probe :: perf probe -l (output regexp parsing)
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: using added probe
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: deleting added probe
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: listing removed probe (should NOT be listed)
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: dry run :: adding probe
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: force-adding probes :: first probe adding
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: force-adding probes :: second probe adding (without force)
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: force-adding probes :: second probe adding (with force)
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: using doubled probe
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: removing multiple probes
>> Regexp not found: "probe:vfs_mknod"
>> Regexp not found: "probe:vfs_create"
>> Regexp not found: "probe:vfs_rmdir"
>> Regexp not found: "probe:vfs_link"
>> Regexp not found: "probe:vfs_write"
>> -- [ FAIL ] -- perf_probe :: test_adding_kernel :: wildcard adding support (command exitcode + output regexp parsing)
>> Regexp not found: "Failed to find"
>> Regexp not found: "somenonexistingrandomstuffwhichisalsoprettylongorevenlongertoexceed64"
>> Regexp not found: "in this function|at this address"
>> Line did not match any pattern: "The /lib/modules/7.1.0-rc1+/build/vmlinux file has no debug information."
>> Line did not match any pattern: "Rebuild with CONFIG_DEBUG_INFO=y, or install an appropriate debuginfo package."
>> -- [ FAIL ] -- perf_probe :: test_adding_kernel :: non-existing variable (output regexp parsing)
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: function with retval :: add
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: function with retval :: record
>> -- [ PASS ] -- perf_probe :: test_adding_kernel :: function argument probing :: script
>> ## [ FAIL ] ## perf_probe :: test_adding_kernel SUMMARY :: 3 failures found
>> -- [ SKIP ] -- perf_probe :: test_basic :: help message :: testcase skipped
>> -- [ PASS ] -- perf_probe :: test_basic :: usage message
>> -- [ PASS ] -- perf_probe :: test_basic :: quiet switch
>> ## [ PASS ] ## perf_probe :: test_basic SUMMARY
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: missing argument for -a
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: missing argument for -d
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: missing argument for -L
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: missing argument for -V
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: unnecessary argument for -F
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: unnecessary argument for -l
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -a xxx -d xxx
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -a xxx -L foo
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -a xxx -V foo
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -a xxx -l
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -a xxx -F
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -d xxx -L foo
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -d xxx -V foo
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -d xxx -l
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -d xxx -F
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -L foo -V bar
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -L foo -l
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -L foo -F
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -V foo -l
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -V foo -F
>> -- [ PASS ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -l -F
>> ## [ PASS ] ## perf_probe :: test_invalid_options SUMMARY
>> -- [ PASS ] -- perf_probe :: test_line_semantics :: acceptable descriptions :: func
>> -- [ PASS ] -- perf_probe :: test_line_semantics :: acceptable descriptions :: func:10
>> -- [ PASS ] -- perf_probe :: test_line_semantics :: acceptable descriptions :: func:0-10
>> -- [ PASS ] -- perf_probe :: test_line_semantics :: acceptable descriptions :: func:2+10
>> -- [ PASS ] -- perf_probe :: test_line_semantics :: acceptable descriptions :: func@source.c
>> -- [ PASS ] -- perf_probe :: test_line_semantics :: acceptable descriptions :: func@source.c:1
>> -- [ PASS ] -- perf_probe :: test_line_semantics :: acceptable descriptions :: source.c:1
>> -- [ PASS ] -- perf_probe :: test_line_semantics :: acceptable descriptions :: source.c:1+1
>> -- [ PASS ] -- perf_probe :: test_line_semantics :: acceptable descriptions :: source.c:1-10
>> -- [ PASS ] -- perf_probe :: test_line_semantics :: unacceptable descriptions :: func:foo
>> -- [ PASS ] -- perf_probe :: test_line_semantics :: unacceptable descriptions :: func:1-foo
>> -- [ PASS ] -- perf_probe :: test_line_semantics :: unacceptable descriptions :: func:1+foo
>> -- [ PASS ] -- perf_probe :: test_line_semantics :: unacceptable descriptions :: func;lazy\*pattern
>> ## [ PASS ] ## perf_probe :: test_line_semantics SUMMARY
>> ---- end(-1) ----
>> 137: perftool-testsuite_probe : FAILED!
>>
>> With This patch:
>>
>> # ./perf test -v perftool-testsuite_probe
>> 137: perftool-testsuite_probe : Ok
>>
>> Please add below tag.
>>
>> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
>>
>> Regards,
>> Venkat.
>
> Hi,
>
> Can we please have this pulled in, if the patches looks fine ?
>
> Thanks
> Athira
Hi,
Looking for any further review comments on this patch. Please let me know if any changes need to be addressed.
Thanks
Athira
>>
>>> Changelog:
>>> v2 -> v3:
>>> - Update other callsites to use "skip_if_no_debuginfo 0"
>>> - Use "perf probe -vn --add inode_permission $params"
>>>
>>> v1 -> v2:
>>> - First version used "perf probe -v -L getname_flags" for debuginfo
>>> check. This will not catch fail string "Debuginfo-analysis is not
>>> supported" which is used in cases when perf is built without dwarf.
>>> So use "perf probe -vn add inode_permission" to capture cases when
>>> tools built with NO_LIBDWARF=1. This will capture debuginfo missing as
>>> well as tool built without dwarf case.
>>>
>>> .../tests/shell/base_probe/test_adding_kernel.sh | 15 ++++++++++++++-
>>> tools/perf/tests/shell/lib/probe_vfs_getname.sh | 13 ++++++++++++-
>>> tools/perf/tests/shell/probe_vfs_getname.sh | 7 ++++++-
>>> .../shell/record+script_probe_vfs_getname.sh | 7 ++++++-
>>> tools/perf/tests/shell/trace+probe_vfs_getname.sh | 7 ++++++-
>>> 5 files changed, 44 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/tools/perf/tests/shell/base_probe/test_adding_kernel.sh b/tools/perf/tests/shell/base_probe/test_adding_kernel.sh
>>> index 555a825d55f2..f3db125c8669 100755
>>> --- a/tools/perf/tests/shell/base_probe/test_adding_kernel.sh
>>> +++ b/tools/perf/tests/shell/base_probe/test_adding_kernel.sh
>>> @@ -23,10 +23,23 @@ TEST_RESULT=0
>>> . "$DIR_PATH/../lib/probe_vfs_getname.sh"
>>>
>>> TEST_PROBE=${TEST_PROBE:-"inode_permission"}
>>> +PROBE_NO_LINE_CHECK=1
>>>
>>> # set NO_DEBUGINFO to skip testcase if debuginfo is not present
>>> # skip_if_no_debuginfo returns 2 if debuginfo is not present
>>> -skip_if_no_debuginfo
>>> +#
>>> +# The perf probe checks which depends on presence of debuginfo and
>>> +# used in this testcase are:
>>> +# 1. probe add for inode_permission
>>> +# 2. probe max-probes option using 'vfs_* $params'
>>> +# 3. non-existing variable probing
>>> +#
>>> +# For these tests, probe check for specific line is not
>>> +# required ( add_probe_vfs_getname does that ). So call
>>> +# skip_if_no_debuginfo with argument as 1. This is to convey
>>> +# that test only needs to check for debuginfo, and not specifically
>>> +# line number
>>> +skip_if_no_debuginfo $PROBE_NO_LINE_CHECK
>>> if [ $? -eq 2 ]; then
>>> NO_DEBUGINFO=1
>>> fi
>>> diff --git a/tools/perf/tests/shell/lib/probe_vfs_getname.sh b/tools/perf/tests/shell/lib/probe_vfs_getname.sh
>>> index 88cd0e26d5f6..2c5252a38ea1 100644
>>> --- a/tools/perf/tests/shell/lib/probe_vfs_getname.sh
>>> +++ b/tools/perf/tests/shell/lib/probe_vfs_getname.sh
>>> @@ -39,7 +39,18 @@ add_probe_vfs_getname() {
>>> }
>>>
>>> skip_if_no_debuginfo() {
>>> - add_probe_vfs_getname -v 2>&1 | grep -E -q "^(Failed to find the path for the kernel|Debuginfo-analysis is not supported)|(file has no debug information)" && return 2
>>> + no_line_check=$1
>>> + debug_str="^(Failed to find the path for the kernel|Debuginfo-analysis is not supported)|(file has no debug information)"
>>> +
>>> + # search for debug_str using simple perf probe if the
>>> + # test only needs to check for debuginfo, and not specifically
>>> + # line number.
>>> + if [ $no_line_check -eq 1 ]; then
>>> + perf probe -vn --add 'inode_permission $params' 2>&1 | grep -E -q "$debug_str" && return 2
>>> + else
>>> + add_probe_vfs_getname -v 2>&1 | grep -E -q "$debug_str" && return 2
>>> + fi
>>> +
>>> return 1
>>> }
>>>
>>> diff --git a/tools/perf/tests/shell/probe_vfs_getname.sh b/tools/perf/tests/shell/probe_vfs_getname.sh
>>> index 5fe5682c28ce..b0878f571449 100755
>>> --- a/tools/perf/tests/shell/probe_vfs_getname.sh
>>> +++ b/tools/perf/tests/shell/probe_vfs_getname.sh
>>> @@ -16,8 +16,13 @@ skip_if_no_perf_probe || exit 2
>>> add_probe_vfs_getname
>>> err=$?
>>>
>>> +# Invoke skip_if_no_debuginfo with argument as 0,
>>> +# since the test needs suitable line number for getname
>>> +# along with debuginfo check.
>>> +# Argument "1" is used when to convey that test only needs to
>>> +# check for debuginfo, and not specifically line number.
>>> if [ $err -eq 1 ] ; then
>>> - skip_if_no_debuginfo
>>> + skip_if_no_debuginfo 0
>>> err=$?
>>> fi
>>>
>>> diff --git a/tools/perf/tests/shell/record+script_probe_vfs_getname.sh b/tools/perf/tests/shell/record+script_probe_vfs_getname.sh
>>> index 002f7037f182..48063fc2b221 100755
>>> --- a/tools/perf/tests/shell/record+script_probe_vfs_getname.sh
>>> +++ b/tools/perf/tests/shell/record+script_probe_vfs_getname.sh
>>> @@ -38,8 +38,13 @@ perf_script_filenames() {
>>> add_probe_vfs_getname
>>> err=$?
>>>
>>> +# Invoke skip_if_no_debuginfo with argument as 0,
>>> +# since the test needs suitable line number for getname
>>> +# along with debuginfo check.
>>> +# Argument "1" is used when to convey that test only needs to
>>> +# check for debuginfo, and not specifically line number.
>>> if [ $err -eq 1 ] ; then
>>> - skip_if_no_debuginfo
>>> + skip_if_no_debuginfo 0
>>> err=$?
>>> fi
>>>
>>> diff --git a/tools/perf/tests/shell/trace+probe_vfs_getname.sh b/tools/perf/tests/shell/trace+probe_vfs_getname.sh
>>> index 7a0b1145d0cd..6833fba12086 100755
>>> --- a/tools/perf/tests/shell/trace+probe_vfs_getname.sh
>>> +++ b/tools/perf/tests/shell/trace+probe_vfs_getname.sh
>>> @@ -28,8 +28,13 @@ trace_open_vfs_getname() {
>>> add_probe_vfs_getname
>>> err=$?
>>>
>>> +# Invoke skip_if_no_debuginfo with argument as 0,
>>> +# since the test needs suitable line number for getname
>>> +# along with debuginfo check.
>>> +# Argument "1" is used when to convey that test only needs to
>>> +# check for debuginfo, and not specifically line number.
>>> if [ $err -eq 1 ] ; then
>>> - skip_if_no_debuginfo
>>> + skip_if_no_debuginfo 0
>>> err=$?
>>> fi
>>>
>>> --
>>> 2.47.3
* Re: [PATCH 35/60] kvm: Add VCPU plane-scheduling state and helpers
From: Jörg Rödel @ 2026-06-09 14:27 UTC (permalink / raw)
To: James Bottomley
Cc: Paolo Bonzini, Sean Christopherson, Tom Lendacky, ashish.kalra,
michael.roth, nsaenz, anelkz, Melody Wang, kvm, linux-kernel,
kvmarm, loongarch, linux-mips, linuxppc-dev, kvm-riscv, x86,
coconut-svsm, joerg.roedel
In-Reply-To: <51421426e0d4b154281e80d9f1c6c9a628d21c94.camel@HansenPartnership.com>
Hi James,
On Tue, Jun 09, 2026 at 08:59:02AM -0400, James Bottomley wrote:
> Are the details of this anywhere? The last PUCK information I saw on
> the kvm list was the cancellation of the March and April calls.
Here is the calendar link I use, which has the appointments GMeet links:
https://calendar.google.com/calendar/embed?src=c_61a5b1f644739bf5bed7e5ea5fc3669ce32a2544c5db1c7c891702ca5090c7d5%40group.calendar.google.com
-Joerg
* Re: [PATCH v6 04/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Jason Gunthorpe @ 2026-06-09 14:32 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Mostafa Saleh,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86, Jiri Pirko,
Michael Kelley
In-Reply-To: <20260604083959.1265923-5-aneesh.kumar@kernel.org>
On Thu, Jun 04, 2026 at 02:09:43PM +0530, Aneesh Kumar K.V (Arm) wrote:
> struct page *dma_alloc_from_pool(struct device *dev, size_t size,
> - void **cpu_addr, gfp_t gfp,
> + void **cpu_addr, gfp_t gfp, unsigned long attrs,
> bool (*phys_addr_ok)(struct device *, phys_addr_t, size_t))
> {
> - struct gen_pool *pool = NULL;
> + struct dma_gen_pool *dma_pool = NULL;
> struct page *page;
> bool pool_found = false;
>
> - while ((pool = dma_guess_pool(pool, gfp))) {
> + while ((dma_pool = dma_guess_pool(dma_pool, gfp))) {
> +
> + if (dma_pool->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
> + continue;
I don't think you should be overloading DMA_ATTR_CC_SHARED like this.
/*
* DMA_ATTR_CC_SHARED is not a caller-visible dma_alloc_*()
* attribute. The direct allocator uses it internally after it has
* decided that the backing pages must be shared/decrypted, so the
* rest of the allocation path can consistently select DMA addresses,
* choose compatible pools and restore encryption on free.
*/
if (attrs & DMA_ATTR_CC_SHARED)
return NULL;
if (force_dma_unencrypted(dev)) {
attrs |= DMA_ATTR_CC_SHARED;
mark_mem_decrypt = true;
}
It is fine to have a bit inside the attrs that is only used by the
internal logic, but it needs to have a clearer name
__DMA_ATTR_REQUIRE_CC_SHARED perhaps.
The sashiko note does look legit though:
if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
!gfpflags_allow_blocking(gfp) && !coherent) {
page = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &cpu_addr,
gfp, attrs, NULL);
if (!page)
return NULL;
I don't see anything doing the force_dma_unencrypted test along this
callchain..
I guess it should be done one step up in dma_alloc_attrs() instead of
in dma_direct_alloc()?
Jason
* Re: [PATCH v6 00/20] dma-mapping: Use DMA_ATTR_CC_SHARED through direct, pool and swiotlb paths
From: Jason Gunthorpe @ 2026-06-09 14:47 UTC (permalink / raw)
To: Catalin Marinas, Alexey Kardashevskiy
Cc: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
Marc Zyngier, Steven Price, Suzuki K Poulose, Jiri Pirko,
Mostafa Saleh, Petr Tesarik, Dan Williams, Xu Yilun, linuxppc-dev,
linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <aigYbK12D8uKQvJF@arm.com>
On Tue, Jun 09, 2026 at 02:43:08PM +0100, Catalin Marinas wrote:
> On Thu, Jun 04, 2026 at 02:09:39PM +0530, Aneesh Kumar K.V (Arm) wrote:
> > This series propagates DMA_ATTR_CC_SHARED through the dma-direct,
> > dma-pool, and swiotlb paths so that encrypted and decrypted DMA buffers
> > are handled consistently.
> >
> > Today, the direct DMA path mostly relies on force_dma_unencrypted() for
> > shared/decrypted buffer handling. This series consolidates the
> > force_dma_unencrypted() checks in the top-level functions and ensures
> > that the remaining DMA interfaces use DMA attributes to make the correct
> > decisions.
>
> Please check Sashiko's reports, it has some good points:
>
> https://sashiko.dev/#/patchset/20260604083959.1265923-1-aneesh.kumar@kernel.org
>
> I think the main one is the swiotlb_tbl_map_single() changes which break
> AMD SME host support. There cc_platform_has(CC_ATTR_MEM_ENCRYPT) is true
> but force_dma_unencrypted() is false. Normally you'd not end up on this
> path but you can have swiotlb=force.
IMHO that's an AMD issue, not with the design of this series..
The series is right: a device that is !force_dma_unencrypted() must be
considered to be a trusted device and we must never place any DMA
mappings for a trusted device into shared memory.
That AMD has done something insane:
bool force_dma_unencrypted(struct device *dev)
{
/*
* For SEV, all DMA must be to unencrypted addresses.
*/
if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
return true;
/*
* For SME, all DMA must be to unencrypted addresses if the
* device does not support DMA to addresses that include the
* encryption mask.
*/
if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)) {
u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
dev->bus_dma_limit);
if (dma_dev_mask <= dma_enc_mask)
return true;
}
Is an AMD issue. We already have an address mask limit system built
into the DMA API, arch code should not be co-opting the CC mechanism
to create a special pool for address limited devices.
The correct thing is to ensure the DMA API is checking any address
limits on the actual true dma_addr_t, not on an intermediate like a
phys_addr before it is adjusted with any C bit. Then it is a normal
low address swiotlb bounce like any other.
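A minimal userspace sketch of that idea, under stated assumptions: FAKE_ME_MASK puts the C bit at bit 47 purely for illustration, and bit_mask(), phys_to_dma_encrypted(), dma_range_ok() and needs_bounce() are hypothetical helpers, not kernel functions. The point is only that the limit check is applied to the final DMA address, after the C bit has been added.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical C-bit position, chosen only for this example. */
#define FAKE_ME_MASK (UINT64_C(1) << 47)

/* Mask with the low n bits set, in the spirit of DMA_BIT_MASK(n). */
static uint64_t bit_mask(unsigned int n)
{
	assert(n >= 1 && n <= 64);
	return n == 64 ? ~UINT64_C(0) : (UINT64_C(1) << n) - 1;
}

/* The address the device would actually emit for encrypted memory. */
static uint64_t phys_to_dma_encrypted(uint64_t phys)
{
	return phys | FAKE_ME_MASK;
}

/* Address-limit check on the final DMA range, not on the physical address. */
static bool dma_range_ok(uint64_t dma_addr, uint64_t size, uint64_t dev_mask)
{
	uint64_t last;

	assert(size > 0);
	last = dma_addr + (size - 1);
	if (last < dma_addr)	/* range wrapped around 64 bits */
		return false;
	return last <= dev_mask;
}

/* A device that cannot reach the final address needs an ordinary bounce. */
static bool needs_bounce(uint64_t phys, uint64_t size, uint64_t dev_mask)
{
	return !dma_range_ok(phys_to_dma_encrypted(phys), size, dev_mask);
}
```

With these assumptions a 32-bit device fails the range check for any encrypted page, while a 48-bit or 64-bit device passes, and no CC-specific predicate is involved in the decision. Where the bounce buffer itself has to live is outside this sketch.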
I think we can ignore this Sashiko remark, in real systems the use of
swiotlb for 64 bit devices is very rare. Though it would be good to
remove this code from AMD...
Jason
* Re: [PATCH 35/60] kvm: Add VCPU plane-scheduling state and helpers
From: James Bottomley @ 2026-06-09 15:06 UTC (permalink / raw)
To: Jörg Rödel
Cc: Paolo Bonzini, Sean Christopherson, Tom Lendacky, ashish.kalra,
michael.roth, nsaenz, anelkz, Melody Wang, kvm, linux-kernel,
kvmarm, loongarch, linux-mips, linuxppc-dev, kvm-riscv, x86,
coconut-svsm, joerg.roedel
In-Reply-To: <aigifVmRZA0TXIrK@8bytes.org>
On Tue, 2026-06-09 at 16:27 +0200, Jörg Rödel wrote:
> Hi James,
>
> On Tue, Jun 09, 2026 at 08:59:02AM -0400, James Bottomley wrote:
> > Are the details of this anywhere? The last PUCK information I saw
> > on the kvm list was the cancellation of the March and April calls.
>
> Here is the calendar link I use, which has the appointments GMeet
> links:
>
> https://calendar.google.com/calendar/embed?src=c_61a5b1f644739bf5bed7e5ea5fc3669ce32a2544c5db1c7c891702ca5090c7d5%40group.calendar.google.com
Thanks. For people who don't use Gmail, Google does have a well-hidden
iCal link:
https://calendar.google.com/calendar/ical/c_61a5b1f644739bf5bed7e5ea5fc3669ce32a2544c5db1c7c891702ca5090c7d5%40group.calendar.google.com/public/basic.ics
Regards,
James