LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/4] powerpc/mm/radix: Memory unplug fixes
From: Aneesh Kumar K.V @ 2020-06-25  6:45 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V, Bharata B Rao

This is the next version of the fixes for memory unplug on radix.
The issues and the fix are described in the actual patches.

Changes from v1:
- Added back patch to drop split_kernel_mapping
- Most of the split_kernel_mapping related issues are now described
  in the removal patch
- drop pte fragment change
- use lmb size as the max mapping size.
- Radix baremetal now use memory block size of 1G.


Changes from v0:
==============
- Rebased to latest kernel.
- Took care of p4d changes.
- Addressed Aneesh's review feedback:
 - Added comments.
 - Indentation fixed.
- Dropped the 1st patch (setting DRCONF_MEM_HOTREMOVABLE lmb flags) as
  it is debatable if this flag should be set in the device tree by OS
  and not by platform in case of hotplug. This can be looked at separately.
  (The fixes in this patchset remain valid without the dropped patch)
- Dropped the last patch that removed split_kernel_mapping() to ensure
  that spilitting code is available for any radix guest running on
  platforms that don't set DRCONF_MEM_HOTREMOVABLE.


Aneesh Kumar K.V (2):
  powerpc/mm/radix: Fix PTE/PMD fragment count for early page table
    mappings
  powerpc/mm/radix: Create separate mappings for hot-plugged memory

Bharata B Rao (2):
  powerpc/mm/radix: Free PUD table when freeing pagetable
  powerpc/mm/radix: Remove split_kernel_mapping()

 arch/powerpc/include/asm/book3s/64/pgalloc.h |  16 +-
 arch/powerpc/mm/book3s64/pgtable.c           |   5 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c     | 199 +++++++++++--------
 arch/powerpc/mm/pgtable-frag.c               |   3 +
 arch/powerpc/platforms/powernv/setup.c       |  10 +-
 5 files changed, 144 insertions(+), 89 deletions(-)

-- 
2.26.2


^ permalink raw reply

* [PATCH v2 1/4] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings
From: Aneesh Kumar K.V @ 2020-06-25  6:45 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V, Bharata B Rao
In-Reply-To: <20200625064547.228448-1-aneesh.kumar@linux.ibm.com>

We can hit the following BUG_ON during memory unplug:

kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:342!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
NIP [c000000000093308] pmd_fragment_free+0x48/0xc0
LR [c00000000147bfec] remove_pagetable+0x578/0x60c
Call Trace:
0xc000008050000000 (unreliable)
remove_pagetable+0x384/0x60c
radix__remove_section_mapping+0x18/0x2c
remove_section_mapping+0x1c/0x3c
arch_remove_memory+0x11c/0x180
try_remove_memory+0x120/0x1b0
__remove_memory+0x20/0x40
dlpar_remove_lmb+0xc0/0x114
dlpar_memory+0x8b0/0xb20
handle_dlpar_errorlog+0xc0/0x190
pseries_hp_work_fn+0x2c/0x60
process_one_work+0x30c/0x810
worker_thread+0x98/0x540
kthread+0x1c4/0x1d0
ret_from_kernel_thread+0x5c/0x74

This occurs when unplug is attempted for such memory which has
been mapped using memblock pages as part of early kernel page
table setup. We wouldn't have initialized the PMD or PTE fragment
count for those PMD or PTE pages.

Fixing this includes 3 parts:

- Re-walk the init_mm page tables from mem_init() and initialize
  the PMD and PTE fragment count to 1.
- When freeing PUD, PMD and PTE page table pages, check explicitly
  if they come from memblock and if so free then appropriately.
- When we do early memblock based allocation of PMD and PUD pages,
  allocate in PAGE_SIZE granularity so that we are sure the
  complete page is used as pagetable page.

Since we now do PAGE_SIZE allocations for both PUD table and
PMD table (Note that PTE table allocation is already of PAGE_SIZE),
we end up allocating more memory for the same amount of system RAM.
Here is a comparision of how much more we need for a 64T and 2G
system after this patch:

1. 64T system
-------------
64T RAM would need 64G for vmemmap with struct page size being 64B.

128 PUD tables for 64T memory (1G mappings)
1 PUD table and 64 PMD tables for 64G vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (128+1+64)*4K=772K
With PAGE_SIZE(64K) table allocations, (128+1+64)*64K=12352K

2. 2G system
------------
2G RAM would need 2M for vmemmap with struct page size being 64B.

1 PUD table for 2G memory (1G mapping)
1 PUD table and 1 PMD table for 2M vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (1+1+1)*4K=12K
With new PAGE_SIZE(64K) table allocations, (1+1+1)*64K=192K

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 16 +++++++++++++++-
 arch/powerpc/mm/book3s64/pgtable.c           |  5 ++++-
 arch/powerpc/mm/book3s64/radix_pgtable.c     | 15 +++++++++++----
 arch/powerpc/mm/pgtable-frag.c               |  3 +++
 4 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index 69c5b051734f..e1af0b394ceb 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -107,9 +107,23 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 	return pud;
 }
 
+static inline void __pud_free(pud_t *pud)
+{
+	struct page *page = virt_to_page(pud);
+
+	/*
+	 * Early pud pages allocated via memblock allocator
+	 * can't be directly freed to slab
+	 */
+	if (PageReserved(page))
+		free_reserved_page(page);
+	else
+		kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
+}
+
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
-	kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
+	return __pud_free(pud);
 }
 
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index c58ad1049909..85de5b574dd7 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -339,6 +339,9 @@ void pmd_fragment_free(unsigned long *pmd)
 {
 	struct page *page = virt_to_page(pmd);
 
+	if (PageReserved(page))
+		return free_reserved_page(page);
+
 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
 		pgtable_pmd_page_dtor(page);
@@ -356,7 +359,7 @@ static inline void pgtable_free(void *table, int index)
 		pmd_fragment_free(table);
 		break;
 	case PUD_INDEX:
-		kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), table);
+		__pud_free(table);
 		break;
 #if defined(CONFIG_PPC_4K_PAGES) && defined(CONFIG_HUGETLB_PAGE)
 		/* 16M hugepd directory at pud level */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 8acb96de0e48..80be96d3018f 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -57,6 +57,13 @@ static __ref void *early_alloc_pgtable(unsigned long size, int nid,
 	return ptr;
 }
 
+/*
+ * When allocating pud or pmd pointers, we allocate a complete page
+ * of PAGE_SIZE rather than PUD_TABLE_SIZE or PMD_TABLE_SIZE. This
+ * is to ensure that the page obtained from the memblock allocator
+ * can be completely used as page table page and can be freed
+ * correctly when the page table entries are removed.
+ */
 static int early_map_kernel_page(unsigned long ea, unsigned long pa,
 			  pgprot_t flags,
 			  unsigned int map_page_size,
@@ -73,8 +80,8 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
 	pgdp = pgd_offset_k(ea);
 	p4dp = p4d_offset(pgdp, ea);
 	if (p4d_none(*p4dp)) {
-		pudp = early_alloc_pgtable(PUD_TABLE_SIZE, nid,
-						region_start, region_end);
+		pudp = early_alloc_pgtable(PAGE_SIZE, nid,
+					   region_start, region_end);
 		p4d_populate(&init_mm, p4dp, pudp);
 	}
 	pudp = pud_offset(p4dp, ea);
@@ -83,8 +90,8 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
 		goto set_the_pte;
 	}
 	if (pud_none(*pudp)) {
-		pmdp = early_alloc_pgtable(PMD_TABLE_SIZE, nid,
-						region_start, region_end);
+		pmdp = early_alloc_pgtable(PAGE_SIZE, nid, region_start,
+					   region_end);
 		pud_populate(&init_mm, pudp, pmdp);
 	}
 	pmdp = pmd_offset(pudp, ea);
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index ee4bd6d38602..97ae4935da79 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -110,6 +110,9 @@ void pte_fragment_free(unsigned long *table, int kernel)
 {
 	struct page *page = virt_to_page(table);
 
+	if (PageReserved(page))
+		return free_reserved_page(page);
+
 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
 		if (!kernel)
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 2/4] powerpc/mm/radix: Free PUD table when freeing pagetable
From: Aneesh Kumar K.V @ 2020-06-25  6:45 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K . V, Bharata B Rao
In-Reply-To: <20200625064547.228448-1-aneesh.kumar@linux.ibm.com>

From: Bharata B Rao <bharata@linux.ibm.com>

remove_pagetable() isn't freeing PUD table. This causes memory
leak during memory unplug. Fix this.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 80be96d3018f..af57604f295f 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -708,6 +708,21 @@ static void free_pmd_table(pmd_t *pmd_start, pud_t *pud)
 	pud_clear(pud);
 }
 
+static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+	pud_t *pud;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (!pud_none(*pud))
+			return;
+	}
+
+	pud_free(&init_mm, pud_start);
+	p4d_clear(p4d);
+}
+
 struct change_mapping_params {
 	pte_t *pte;
 	unsigned long start;
@@ -882,6 +897,7 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 
 		pud_base = (pud_t *)p4d_page_vaddr(*p4d);
 		remove_pud_table(pud_base, addr, next);
+		free_pud_table(pud_base, p4d);
 	}
 
 	spin_unlock(&init_mm.page_table_lock);
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 3/4] powerpc/mm/radix: Remove split_kernel_mapping()
From: Aneesh Kumar K.V @ 2020-06-25  6:45 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K . V, Bharata B Rao
In-Reply-To: <20200625064547.228448-1-aneesh.kumar@linux.ibm.com>

From: Bharata B Rao <bharata@linux.ibm.com>

We split the page table mapping on memory unplug if the
linear range was mapped with huge page mapping (for ex: 1G)
The page table splitting code has a few issues:

1. Recursive locking
--------------------
Memory unplug path takes cpu_hotplug_lock and calls stop_machine()
for splitting the mappings. However stop_machine() takes
cpu_hotplug_lock again causing deadlock.

2. BUG: sleeping function called from in_atomic() context
---------------------------------------------------------
Memory unplug path (remove_pagetable) takes init_mm.page_table_lock
spinlock and later calls stop_machine() which does wait_for_completion()

3. Bad unlock unbalance
-----------------------
Memory unplug path takes init_mm.page_table_lock spinlock and calls
stop_machine(). The stop_machine thread function runs in a different
thread context (migration thread) which tries to release and reaquire
ptl. Releasing ptl from a different thread than which acquired it
causes bad unlock unbalance.

These problems can be avoided if we avoid mapping hot-plugged memory
with 1G mapping, thereby removing the need for splitting them during
unplug. The kernel always make sure the minimum unplug request is
SUBSECTION_SIZE for device memory and SECTION_SIZE for regular memory.

In preparation for such a change remove page table splitting support.

This essentially is a revert of
commit 4dd5f8a99e791 ("powerpc/mm/radix: Split linear mapping on hot-unplug")

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 95 +++++-------------------
 1 file changed, 19 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index af57604f295f..78ad11812e0d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -15,7 +15,6 @@
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
 #include <linux/string_helpers.h>
-#include <linux/stop_machine.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -723,32 +722,6 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
 	p4d_clear(p4d);
 }
 
-struct change_mapping_params {
-	pte_t *pte;
-	unsigned long start;
-	unsigned long end;
-	unsigned long aligned_start;
-	unsigned long aligned_end;
-};
-
-static int __meminit stop_machine_change_mapping(void *data)
-{
-	struct change_mapping_params *params =
-			(struct change_mapping_params *)data;
-
-	if (!data)
-		return -1;
-
-	spin_unlock(&init_mm.page_table_lock);
-	pte_clear(&init_mm, params->aligned_start, params->pte);
-	create_physical_mapping(__pa(params->aligned_start),
-				__pa(params->start), -1, PAGE_KERNEL);
-	create_physical_mapping(__pa(params->end), __pa(params->aligned_end),
-				-1, PAGE_KERNEL);
-	spin_lock(&init_mm.page_table_lock);
-	return 0;
-}
-
 static void remove_pte_table(pte_t *pte_start, unsigned long addr,
 			     unsigned long end)
 {
@@ -777,52 +750,6 @@ static void remove_pte_table(pte_t *pte_start, unsigned long addr,
 	}
 }
 
-/*
- * clear the pte and potentially split the mapping helper
- */
-static void __meminit split_kernel_mapping(unsigned long addr, unsigned long end,
-				unsigned long size, pte_t *pte)
-{
-	unsigned long mask = ~(size - 1);
-	unsigned long aligned_start = addr & mask;
-	unsigned long aligned_end = addr + size;
-	struct change_mapping_params params;
-	bool split_region = false;
-
-	if ((end - addr) < size) {
-		/*
-		 * We're going to clear the PTE, but not flushed
-		 * the mapping, time to remap and flush. The
-		 * effects if visible outside the processor or
-		 * if we are running in code close to the
-		 * mapping we cleared, we are in trouble.
-		 */
-		if (overlaps_kernel_text(aligned_start, addr) ||
-			overlaps_kernel_text(end, aligned_end)) {
-			/*
-			 * Hack, just return, don't pte_clear
-			 */
-			WARN_ONCE(1, "Linear mapping %lx->%lx overlaps kernel "
-				  "text, not splitting\n", addr, end);
-			return;
-		}
-		split_region = true;
-	}
-
-	if (split_region) {
-		params.pte = pte;
-		params.start = addr;
-		params.end = end;
-		params.aligned_start = addr & ~(size - 1);
-		params.aligned_end = min_t(unsigned long, aligned_end,
-				(unsigned long)__va(memblock_end_of_DRAM()));
-		stop_machine(stop_machine_change_mapping, &params, NULL);
-		return;
-	}
-
-	pte_clear(&init_mm, addr, pte);
-}
-
 static void remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 			     unsigned long end)
 {
@@ -838,7 +765,12 @@ static void remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 			continue;
 
 		if (pmd_is_leaf(*pmd)) {
-			split_kernel_mapping(addr, end, PMD_SIZE, (pte_t *)pmd);
+			if (!IS_ALIGNED(addr, PMD_SIZE) ||
+			    !IS_ALIGNED(next, PMD_SIZE)) {
+				WARN_ONCE(1, "%s: unaligned range\n", __func__);
+				continue;
+			}
+			pte_clear(&init_mm, addr, (pte_t *)pmd);
 			continue;
 		}
 
@@ -863,7 +795,12 @@ static void remove_pud_table(pud_t *pud_start, unsigned long addr,
 			continue;
 
 		if (pud_is_leaf(*pud)) {
-			split_kernel_mapping(addr, end, PUD_SIZE, (pte_t *)pud);
+			if (!IS_ALIGNED(addr, PUD_SIZE) ||
+			    !IS_ALIGNED(next, PUD_SIZE)) {
+				WARN_ONCE(1, "%s: unaligned range\n", __func__);
+				continue;
+			}
+			pte_clear(&init_mm, addr, (pte_t *)pud);
 			continue;
 		}
 
@@ -891,7 +828,13 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 			continue;
 
 		if (p4d_is_leaf(*p4d)) {
-			split_kernel_mapping(addr, end, P4D_SIZE, (pte_t *)p4d);
+			if (!IS_ALIGNED(addr, P4D_SIZE) ||
+			    !IS_ALIGNED(next, P4D_SIZE)) {
+				WARN_ONCE(1, "%s: unaligned range\n", __func__);
+				continue;
+			}
+
+			pte_clear(&init_mm, addr, (pte_t *)pgd);
 			continue;
 		}
 
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory
From: Aneesh Kumar K.V @ 2020-06-25  6:45 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V, Bharata B Rao
In-Reply-To: <20200625064547.228448-1-aneesh.kumar@linux.ibm.com>

To enable memory unplug without splitting kernel page table
mapping, we force the max mapping size to the LMB size. LMB
size is the unit in which hypervisor will do memory add/remove
operation.

This implies on pseries system, we now end up mapping
memory with 2M page size instead of 1G. To improve
that we want hypervisor to hint the kernel about the hotplug
memory range.  This was added that as part of

commit b6eca183e23e ("powerpc/kernel: Enables memory
hot-remove after reboot on pseries guests")

But we still don't do that on PowerVM. Once we get PowerVM
updated, we can then force the 2M mapping only to hot-pluggable
memory region using memblock_is_hotpluggable(). Till then
let's depend on LMB size for finding the mapping page size
for linear range.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 83 ++++++++++++++++++++----
 arch/powerpc/platforms/powernv/setup.c   | 10 ++-
 2 files changed, 81 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 78ad11812e0d..a4179e4da49d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -15,6 +15,7 @@
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
 #include <linux/string_helpers.h>
+#include <linux/memory.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -34,6 +35,7 @@
 
 unsigned int mmu_pid_bits;
 unsigned int mmu_base_pid;
+unsigned int radix_mem_block_size;
 
 static __ref void *early_alloc_pgtable(unsigned long size, int nid,
 			unsigned long region_start, unsigned long region_end)
@@ -266,6 +268,7 @@ static unsigned long next_boundary(unsigned long addr, unsigned long end)
 
 static int __meminit create_physical_mapping(unsigned long start,
 					     unsigned long end,
+					     unsigned long max_mapping_size,
 					     int nid, pgprot_t _prot)
 {
 	unsigned long vaddr, addr, mapping_size = 0;
@@ -279,6 +282,8 @@ static int __meminit create_physical_mapping(unsigned long start,
 		int rc;
 
 		gap = next_boundary(addr, end) - addr;
+		if (gap > max_mapping_size)
+			gap = max_mapping_size;
 		previous_size = mapping_size;
 		prev_exec = exec;
 
@@ -329,8 +334,9 @@ static void __init radix_init_pgtable(void)
 
 	/* We don't support slb for radix */
 	mmu_slb_size = 0;
+
 	/*
-	 * Create the linear mapping, using standard page size for now
+	 * Create the linear mapping
 	 */
 	for_each_memblock(memory, reg) {
 		/*
@@ -346,6 +352,7 @@ static void __init radix_init_pgtable(void)
 
 		WARN_ON(create_physical_mapping(reg->base,
 						reg->base + reg->size,
+						radix_mem_block_size,
 						-1, PAGE_KERNEL));
 	}
 
@@ -486,6 +493,49 @@ static int __init radix_dt_scan_page_sizes(unsigned long node,
 	return 1;
 }
 
+static int __init probe_memory_block_size(unsigned long node, const char *uname, int
+					  depth, void *data)
+{
+	const __be32 *block_size;
+	int len;
+
+	if (depth != 1)
+		return 0;
+
+	if (!strcmp(uname, "ibm,dynamic-reconfiguration-memory")) {
+
+		block_size = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
+		if (!block_size || len < dt_root_size_cells * sizeof(__be32))
+			/*
+			 * Nothing in the device tree
+			 */
+			return MIN_MEMORY_BLOCK_SIZE;
+
+		return dt_mem_next_cell(dt_root_size_cells, &block_size);
+
+	}
+
+	return 0;
+}
+
+static unsigned long radix_memory_block_size(void)
+{
+	unsigned long mem_block_size = MIN_MEMORY_BLOCK_SIZE;
+
+	if (firmware_has_feature(FW_FEATURE_OPAL)) {
+
+		mem_block_size = 1UL * 1024 * 1024 * 1024;
+
+	} else if (firmware_has_feature(FW_FEATURE_LPAR)) {
+		mem_block_size = of_scan_flat_dt(probe_memory_block_size, NULL);
+		if (!mem_block_size)
+			mem_block_size = MIN_MEMORY_BLOCK_SIZE;
+	}
+
+	return mem_block_size;
+}
+
+
 void __init radix__early_init_devtree(void)
 {
 	int rc;
@@ -494,17 +544,27 @@ void __init radix__early_init_devtree(void)
 	 * Try to find the available page sizes in the device-tree
 	 */
 	rc = of_scan_flat_dt(radix_dt_scan_page_sizes, NULL);
-	if (rc != 0)  /* Found */
-		goto found;
+	if (rc == 0) {
+		/*
+		 * no page size details found in device tree
+		 * let's assume we have page 4k and 64k support
+		 */
+		mmu_psize_defs[MMU_PAGE_4K].shift = 12;
+		mmu_psize_defs[MMU_PAGE_4K].ap = 0x0;
+
+		mmu_psize_defs[MMU_PAGE_64K].shift = 16;
+		mmu_psize_defs[MMU_PAGE_64K].ap = 0x5;
+	}
+
 	/*
-	 * let's assume we have page 4k and 64k support
+	 * Max mapping size used when mapping pages. We don't use
+	 * ppc_md.memory_block_size() here because this get called
+	 * early and we don't have machine probe called yet. Also
+	 * the pseries implementation only check for ibm,lmb-size.
+	 * All hypervisor supporting radix do expose that device
+	 * tree node.
 	 */
-	mmu_psize_defs[MMU_PAGE_4K].shift = 12;
-	mmu_psize_defs[MMU_PAGE_4K].ap = 0x0;
-
-	mmu_psize_defs[MMU_PAGE_64K].shift = 16;
-	mmu_psize_defs[MMU_PAGE_64K].ap = 0x5;
-found:
+	radix_mem_block_size = radix_memory_block_size();
 	return;
 }
 
@@ -856,7 +916,8 @@ int __meminit radix__create_section_mapping(unsigned long start,
 		return -1;
 	}
 
-	return create_physical_mapping(__pa(start), __pa(end), nid, prot);
+	return create_physical_mapping(__pa(start), __pa(end),
+				       radix_mem_block_size, nid, prot);
 }
 
 int __meminit radix__remove_section_mapping(unsigned long start, unsigned long end)
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 3bc188da82ba..6e2f614590a3 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -399,7 +399,15 @@ static void pnv_kexec_cpu_down(int crash_shutdown, int secondary)
 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
 static unsigned long pnv_memory_block_size(void)
 {
-	return 256UL * 1024 * 1024;
+	/*
+	 * We map the kernel linear region with 1GB large pages on radix. For
+	 * memory hot unplug to work our memory block size must be at least
+	 * this size.
+	 */
+	if (radix_enabled())
+		return 1UL * 1024 * 1024 * 1024;
+	else
+		return 256UL * 1024 * 1024;
 }
 #endif
 
-- 
2.26.2


^ permalink raw reply related

* Re: [PATCH v4 6/8] arm: Break cyclic percpu include
From: Will Deacon @ 2020-06-25  7:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-s390, bigeasy, x86, heiko.carstens, linux-kernel, rostedt,
	davem, a.darwish, sparclinux, linux, tglx, linuxppc-dev, mingo
In-Reply-To: <20200624175320.GN4781@hirez.programming.kicks-ass.net>

On Wed, Jun 24, 2020 at 07:53:20PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 10:02:57AM +0100, Will Deacon wrote:
> > On Tue, Jun 23, 2020 at 10:36:51AM +0200, Peter Zijlstra wrote:
> > > In order to use <asm/percpu.h> in irqflags.h, we need to make sure
> > > asm/percpu.h does not itself depend on irqflags.h.
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  arch/arm/include/asm/percpu.h |    2 ++
> > >  1 file changed, 2 insertions(+)
> > > 
> > > --- a/arch/arm/include/asm/percpu.h
> > > +++ b/arch/arm/include/asm/percpu.h
> > > @@ -10,6 +10,8 @@
> > >   * in the TPIDRPRW. TPIDRPRW only exists on V6K and V7
> > >   */
> > >  #if defined(CONFIG_SMP) && !defined(CONFIG_CPU_V6)
> > > +register unsigned long current_stack_pointer asm ("sp");
> > 
> > If you define this unconditionally, then we can probably get rid of the
> > copy in asm/thread_info.h, rather than duplicate the same #define.
> 
> The below delta seems to build arm-allnoconfig, arm-defconfig and
> arm-allmodconfig.
> 
> Although please don't ask me how asm/thread_info.h includes asm/percpu.h
> 
> Does that work for you?

Yes, thanks! I can't believe you removed the helpful comment.

> -/*
> - * how to get the current stack pointer in C
> - */

Will

^ permalink raw reply

* Re: [PATCH v5 01/13] powerpc: Remove Xilinx PPC405/PPC440 support
From: Joel Stanley @ 2020-06-25  8:39 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Arnd Bergmann, Nick Desaulniers, Michal Simek, LKML,
	clang-built-linux, Paul Mackerras, Nathan Chancellor,
	linuxppc-dev
In-Reply-To: <87eeqbco82.fsf@mpe.ellerman.id.au>

On Fri, 19 Jun 2020 at 11:02, Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> Nathan Chancellor <natechancellor@gmail.com> writes:
> >> It's kind of nuts that the zImage points to some arbitrary image
> >> depending on what's configured and the order of things in the Makefile.
> >> But I'm not sure how we make it less nuts without risking breaking
> >> people's existing setups.
> >
> > Hi Michael,
> >
> > For what it's worth, this is squared this away in terms of our CI by
> > just building and booting the uImage directly, rather than implicitly
> > using the zImage:
> >
> > https://github.com/ClangBuiltLinux/continuous-integration/pull/282
> > https://github.com/ClangBuiltLinux/boot-utils/pull/22
>
> Great.
>
> > We were only using the zImage because that is what Joel Stanley intially
> > set us up with when PowerPC 32-bit was added to our CI:
> >
> > https://github.com/ClangBuiltLinux/continuous-integration/pull/100
>
> Ah, so Joel owes us all beers then ;)

Hey, you owe me beers for finding broken machines!

This machine was picked from a vague discussion on an internal chat.
The two requirements were that it would build, and boot in qemu.

If there's a better supported 32 bit machine then we should switch the
CI over. We don't want the Clang CI to be the only user and give the
false impression that someone out there is still booting upstream
kernels on it.

> > Admittedly, we really do not have many PowerPC experts in our
> > organization so we are supporting it on a "best effort" basis, which
> > often involves using whatever knowledge is floating around or can be
> > gained from interactions such as this :) so thank you for that!
>
> No worries. I definitely don't expect you folks to invest much effort in
> powerpc, especially the old 32-bit stuff, so always happy to help debug
> things, and really appreciate the testing you do.

+1

Cheers,

Joel

^ permalink raw reply

* Re: [PATCH v2 2/2] cpufreq: Specify default governor on command line
From: Viresh Kumar @ 2020-06-25  8:50 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Juri Lelli, Cc: Android Kernel, Vincent Guittot, Arnd Bergmann,
	Rafael J. Wysocki, Peter Zijlstra, adharmap, Linux PM,
	Rafael J. Wysocki, Linux Kernel Mailing List, Ingo Molnar,
	Paul Mackerras, linuxppc-dev, Todd Kjos
In-Reply-To: <20200624153259.GA2844@google.com>

On 24-06-20, 16:32, Quentin Perret wrote:
> Right, but I must admit that, looking at this more, I'm getting a bit
> confused with the overall locking for governors :/
> 
> When in cpufreq_init_policy() we find a governor using
> find_governor(policy->last_governor), what guarantees this governor is
> not concurrently unregistered? That is, what guarantees this governor
> doesn't go away between that find_governor() call, and the subsequent
> call to try_module_get() in cpufreq_set_policy() down the line?
> 
> Can we somewhat assume that whatever governor is referred to by
> policy->last_governor will have a non-null refcount? Or are the
> cpufreq_online() and cpufreq_unregister_governor() path mutually
> exclusive? Or is there something else?

This should be sufficient to fix pending issues I believe. Based over your
patches.

-- 
viresh

-------------------------8<-------------------------
From: Viresh Kumar <viresh.kumar@linaro.org>
Date: Thu, 25 Jun 2020 13:15:23 +0530
Subject: [PATCH] cpufreq: Fix locking issues with governors

The locking around governors handling isn't adequate currently. The list
of governors should never be traversed without locking in place. Also we
must make sure the governor isn't removed while it is still referenced
by code.

Reported-by: Quentin Perret <qperret@google.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq.c | 59 ++++++++++++++++++++++++---------------
 1 file changed, 36 insertions(+), 23 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 4b1a5c0173cf..dad6b85f4c89 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -624,6 +624,24 @@ static struct cpufreq_governor *find_governor(const char *str_governor)
 	return NULL;
 }
 
+static struct cpufreq_governor *get_governor(const char *str_governor)
+{
+	struct cpufreq_governor *t;
+
+	mutex_lock(&cpufreq_governor_mutex);
+	t = find_governor(str_governor);
+	if (!t)
+		goto unlock;
+
+	if (!try_module_get(t->owner))
+		t = NULL;
+
+unlock:
+	mutex_unlock(&cpufreq_governor_mutex);
+
+	return t;
+}
+
 static unsigned int cpufreq_parse_policy(char *str_governor)
 {
 	if (!strncasecmp(str_governor, "performance", CPUFREQ_NAME_LEN))
@@ -643,28 +661,14 @@ static struct cpufreq_governor *cpufreq_parse_governor(char *str_governor)
 {
 	struct cpufreq_governor *t;
 
-	mutex_lock(&cpufreq_governor_mutex);
-
-	t = find_governor(str_governor);
-	if (!t) {
-		int ret;
-
-		mutex_unlock(&cpufreq_governor_mutex);
-
-		ret = request_module("cpufreq_%s", str_governor);
-		if (ret)
-			return NULL;
-
-		mutex_lock(&cpufreq_governor_mutex);
+	t = get_governor(str_governor);
+	if (t)
+		return t;
 
-		t = find_governor(str_governor);
-	}
-	if (t && !try_module_get(t->owner))
-		t = NULL;
-
-	mutex_unlock(&cpufreq_governor_mutex);
+	if (request_module("cpufreq_%s", str_governor))
+		return NULL;
 
-	return t;
+	return get_governor(str_governor);
 }
 
 /**
@@ -818,12 +822,14 @@ static ssize_t show_scaling_available_governors(struct cpufreq_policy *policy,
 		goto out;
 	}
 
+	mutex_lock(&cpufreq_governor_mutex);
 	for_each_governor(t) {
 		if (i >= (ssize_t) ((PAGE_SIZE / sizeof(char))
 		    - (CPUFREQ_NAME_LEN + 2)))
-			goto out;
+			break;
 		i += scnprintf(&buf[i], CPUFREQ_NAME_PLEN, "%s ", t->name);
 	}
+	mutex_unlock(&cpufreq_governor_mutex);
 out:
 	i += sprintf(&buf[i], "\n");
 	return i;
@@ -1060,11 +1066,14 @@ static int cpufreq_init_policy(struct cpufreq_policy *policy)
 {
 	struct cpufreq_governor *gov = NULL;
 	unsigned int pol = CPUFREQ_POLICY_UNKNOWN;
+	bool put_governor = false;
+	int ret;
 
 	if (has_target()) {
 		/* Update policy governor to the one used before hotplug. */
-		gov = find_governor(policy->last_governor);
+		gov = get_governor(policy->last_governor);
 		if (gov) {
+			put_governor = true;
 			pr_debug("Restoring governor %s for cpu %d\n",
 				 policy->governor->name, policy->cpu);
 		} else if (default_governor) {
@@ -1091,7 +1100,11 @@ static int cpufreq_init_policy(struct cpufreq_policy *policy)
 			return -ENODATA;
 	}
 
-	return cpufreq_set_policy(policy, gov, pol);
+	ret = cpufreq_set_policy(policy, gov, pol);
+	if (put_governor)
+		module_put(gov->owner);
+
+	return ret;
 }
 
 static int cpufreq_add_policy_cpu(struct cpufreq_policy *policy, unsigned int cpu)

^ permalink raw reply related

* Re: [PATCH 17/17] arch: rename copy_thread_tls() back to copy_thread()
From: Thomas Bogendoerfer @ 2020-06-25  8:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Rich Felker, linux-sh, Peter Zijlstra (Intel), Catalin Marinas,
	linux-mips, James E.J. Bottomley, Max Filippov, Guo Ren,
	Matthew Wilcox (Oracle), H. Peter Anvin, sparclinux,
	linux-hexagon, linux-riscv, Vincent Chen, Will Deacon,
	Thomas Gleixner, Anton Ivanov, Jonas Bonn, linux-s390, linux-ia64,
	linux-c6x-dev, Brian Cain, linux-xtensa, Helge Deller, x86,
	Russell King, Ley Foon Tan, Christian Borntraeger, Ingo Molnar,
	Geert Uytterhoeven, linux-parisc, Mark Salter, Matt Turner,
	linux-snps-arc, uclinux-h8-devel, Fenghua Yu, Albert Ou,
	Kees Cook, Jeff Dike, linux-alpha, linux-um, linuxppc-dev,
	Aurelien Jacquiot, linux-m68k, linux-csky, Ivan Kokshaysky,
	Greentime Hu, Paul Walmsley, Stafford Horne, Stefan Kristiansson,
	Guan Xuetao, linux-arm-kernel, Richard Henderson, Chris Zankel,
	Michal Simek, Tony Luck, Yoshinori Sato, Nick Hu, Vineet Gupta,
	linux-kernel, openrisc, Palmer Dabbelt, Richard Weinberger,
	Paul Mackerras, Linus Torvalds, David S. Miller, Al Viro
In-Reply-To: <20200622234326.906346-18-christian.brauner@ubuntu.com>

On Tue, Jun 23, 2020 at 01:43:26AM +0200, Christian Brauner wrote:
> Now that HAVE_COPY_THREAD_TLS has been removed, rename copy_thread_tls()
> back simply copy_thread(). It's a simpler name, and doesn't imply that only
> tls is copied here. This finishes an outstanding chunk of internal process
> creation work since we've added clone3().
> [..]
>  arch/mips/kernel/process.c       | 2 +-

Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

Thomas.

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.                                                [ RFC1925, 2.3 ]

^ permalink raw reply

* Re: [PATCH 16/17] arch: remove HAVE_COPY_THREAD_TLS
From: Thomas Bogendoerfer @ 2020-06-25  8:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Rich Felker, linux-sh, Peter Zijlstra, Catalin Marinas,
	Heiko Carstens, linux-mips, James E.J. Bottomley, Guo Ren,
	linux-csky, sparclinux, linux-hexagon, linux-riscv, Vincent Chen,
	Will Deacon, Thomas Gleixner, Anton Ivanov, Jonas Bonn,
	linux-s390, linux-ia64, linux-c6x-dev, Brian Cain, linux-xtensa,
	Helge Deller, x86, Russell King, Ley Foon Tan, Mike Rapoport,
	Christian Borntraeger, Ingo Molnar, Geert Uytterhoeven,
	linux-parisc, Mark Salter, Matt Turner, linux-snps-arc,
	uclinux-h8-devel, Fenghua Yu, Albert Ou, Kees Cook, Vasily Gorbik,
	Jeff Dike, linux-alpha, linux-um, linuxppc-dev, Aurelien Jacquiot,
	linux-m68k, openrisc, Ivan Kokshaysky, Greentime Hu,
	Paul Walmsley, Stafford Horne, Stefan Kristiansson, Guan Xuetao,
	linux-arm-kernel, Richard Henderson, Michal Simek, Tony Luck,
	Yoshinori Sato, Nick Hu, Vineet Gupta, linux-kernel,
	Palmer Dabbelt, Richard Weinberger, Paul Mackerras,
	Linus Torvalds, David S. Miller, Al Viro
In-Reply-To: <20200622234326.906346-17-christian.brauner@ubuntu.com>

On Tue, Jun 23, 2020 at 01:43:25AM +0200, Christian Brauner wrote:
> All architectures support copy_thread_tls() now, so remove the legacy
> copy_thread() function and the HAVE_COPY_THREAD_TLS config option. Everyone
> uses the same process creation calling convention based on
> copy_thread_tls() and struct kernel_clone_args. This will make it easier to
> maintain the core process creation code under kernel/, simplifies the
> callpaths and makes the identical for all architectures.
> [..] 
>  arch/mips/Kconfig          |  1 -

Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

Thomas.

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.                                                [ RFC1925, 2.3 ]

^ permalink raw reply

* Re: PowerPC KVM-PR issue
From: Christian Zigotzky @ 2020-06-25  9:38 UTC (permalink / raw)
  To: Nicholas Piggin, kvm-ppc@vger.kernel.org, linuxppc-dev
  Cc: Darren Stevens, mad skateman, R.T.Dickinson, Christian Zigotzky
In-Reply-To: <292cba7f-ca2b-efb0-db3d-ecd7ee5f1fad@xenosoft.de>

On 15 June 2020 at 01:39 pm, Christian Zigotzky wrote:
> On 14 June 2020 at 04:52 pm, Christian Zigotzky wrote:
>> On 14 June 2020 at 02:53 pm, Nicholas Piggin wrote:
>>> Excerpts from Christian Zigotzky's message of June 12, 2020 11:01 pm:
>>>> On 11 June 2020 at 04:47 pm, Christian Zigotzky wrote:
>>>>> On 10 June 2020 at 01:23 pm, Christian Zigotzky wrote:
>>>>>> On 10 June 2020 at 11:06 am, Christian Zigotzky wrote:
>>>>>>> On 10 June 2020 at 00:18 am, Christian Zigotzky wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> KVM-PR doesn't work anymore on my Nemo board [1]. I figured out
>>>>>>>> that the Git kernels and the kernel 5.7 are affected.
>>>>>>>>
>>>>>>>> Error message: Fienix kernel: kvmppc_exit_pr_progint: emulation at
>>>>>>>> 700 failed (00000000)
>>>>>>>>
>>>>>>>> I can boot virtual QEMU PowerPC machines with KVM-PR with the
>>>>>>>> kernel 5.6 without any problems on my Nemo board.
>>>>>>>>
>>>>>>>> I tested it with QEMU 2.5.0 and QEMU 5.0.0 today.
>>>>>>>>
>>>>>>>> Could you please check KVM-PR on your PowerPC machine?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Christian
>>>>>>>>
>>>>>>>> [1] https://en.wikipedia.org/wiki/AmigaOne_X1000
>>>>>>> I figured out that the PowerPC updates 5.7-1 [1] are responsible 
>>>>>>> for
>>>>>>> the KVM-PR issue. Please test KVM-PR on your PowerPC machines and
>>>>>>> check the PowerPC updates 5.7-1 [1].
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> [1]
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d38c07afc356ddebaa3ed8ecb3f553340e05c969 
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I tested the latest Git kernel with Mac-on-Linux/KVM-PR today.
>>>>>> Unfortunately I can't use KVM-PR with MoL anymore because of this
>>>>>> issue (see screenshots [1]). Please check the PowerPC updates 5.7-1.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> [1]
>>>>>> -
>>>>>> https://i.pinimg.com/originals/0c/b3/64/0cb364a40241fa2b7f297d4272bbb8b7.png 
>>>>>>
>>>>>> -
>>>>>> https://i.pinimg.com/originals/9a/61/d1/9a61d170b1c9f514f7a78a3014ffd18f.png 
>>>>>>
>>>>>>
>>>>> Hi All,
>>>>>
>>>>> I bisected today because of the KVM-PR issue.
>>>>>
>>>>> Result:
>>>>>
>>>>> 9600f261acaaabd476d7833cec2dd20f2919f1a0 is the first bad commit
>>>>> commit 9600f261acaaabd476d7833cec2dd20f2919f1a0
>>>>> Author: Nicholas Piggin <npiggin@gmail.com>
>>>>> Date:   Wed Feb 26 03:35:21 2020 +1000
>>>>>
>>>>>      powerpc/64s/exception: Move KVM test to common code
>>>>>
>>>>>      This allows more code to be moved out of unrelocated regions. 
>>>>> The
>>>>>      system call KVMTEST is changed to be open-coded and remain in 
>>>>> the
>>>>>      tramp area to avoid having to move it to entry_64.S. The custom
>>>>> nature
>>>>>      of the system call entry code means the hcall case can be 
>>>>> made more
>>>>>      streamlined than regular interrupt handlers.
>>>>>
>>>>>      mpe: Incorporate fix from Nick:
>>>>>
>>>>>      Moving KVM test to the common entry code missed the case of 
>>>>> HMI and
>>>>>      MCE, which do not do __GEN_COMMON_ENTRY (because they don't 
>>>>> want to
>>>>>      switch to virt mode).
>>>>>
>>>>>      This means a MCE or HMI exception that is taken while KVM is
>>>>> running a
>>>>>      guest context will not be switched out of that context, and 
>>>>> KVM won't
>>>>>      be notified. Found by running sigfuz in guest with patched 
>>>>> host on
>>>>>      POWER9 DD2.3, which causes some TM related HMI interrupts 
>>>>> (which are
>>>>>      expected and supposed to be handled by KVM).
>>>>>
>>>>>      This fix adds a __GEN_REALMODE_COMMON_ENTRY for those 
>>>>> handlers to add
>>>>>      the KVM test. This makes them look a little more like other 
>>>>> handlers
>>>>>      that all use __GEN_COMMON_ENTRY.
>>>>>
>>>>>      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>>>>>      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>>>>>      Link:
>>>>> https://lore.kernel.org/r/20200225173541.1549955-13-npiggin@gmail.com
>>>>>
>>>>> :040000 040000 ec21cec22d165f8696d69532734cb2985d532cb0
>>>>> 87dd49a9cd7202ec79350e8ca26cea01f1dbd93d M    arch
>>>>>
>>>>> -----
>>>>>
>>>>> The following commit is the problem: powerpc/64s/exception: Move KVM
>>>>> test to common code [1]
>>>>>
>>>>> These changes were included in the PowerPC updates 5.7-1. [2]
>>>>>
>>>>> Another test:
>>>>>
>>>>> git checkout d38c07afc356ddebaa3ed8ecb3f553340e05c969 (PowerPC 
>>>>> updates
>>>>> 5.7-1 [2] ) -> KVM-PR doesn't work.
>>>>>
>>>>> After that: git revert d38c07afc356ddebaa3ed8ecb3f553340e05c969 -m 1
>>>>> -> KVM-PR works.
>>>>>
>>>>> Could you please check the first bad commit? [1]
>>>>>
>>>>> Thanks,
>>>>> Christian
>>>>>
>>>>>
>>>>> [1]
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9600f261acaaabd476d7833cec2dd20f2919f1a0 
>>>>>
>>>>> [2]
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d38c07afc356ddebaa3ed8ecb3f553340e05c969 
>>>>>
>>>> Hi All,
>>>>
>>>> I tried to revert the __GEN_REALMODE_COMMON_ENTRY fix for the 
>>>> latest Git
>>>> kernel and for the stable kernel 5.7.2 but without any success. There
>>>> was lot of restructuring work during the kernel 5.7 development 
>>>> time in
>>>> the PowerPC area so it isn't possible reactivate the old code. That
>>>> means we have lost the whole KVM-PR support. I also reported this 
>>>> issue
>>>> to Alexander Graf two days ago. He wrote: "Howdy :). It looks pretty
>>>> broken. Have you ever made a bisect to see where the problem comes 
>>>> from?"
>>>>
>>>> Please check the KVM-PR code.
>>> Does this patch fix it for you?
>>>
>>> The CTR register reload in the KVM interrupt path used the wrong save
>>> area for SLB (and NMI) interrupts.
>>>
>>> Fixes: 9600f261acaaa ("powerpc/64s/exception: Move KVM test to 
>>> common code")
>>> Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
>>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>>> ---
>>>   arch/powerpc/kernel/exceptions-64s.S | 4 ++--
>>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
>>> b/arch/powerpc/kernel/exceptions-64s.S
>>> index e70ebb5c318c..fa080694e581 100644
>>> --- a/arch/powerpc/kernel/exceptions-64s.S
>>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>>> @@ -270,7 +270,7 @@ BEGIN_FTR_SECTION
>>>   END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
>>>       .endif
>>>   -    ld    r10,PACA_EXGEN+EX_CTR(r13)
>>> +    ld    r10,IAREA+EX_CTR(r13)
>>>       mtctr    r10
>>>   BEGIN_FTR_SECTION
>>>       ld    r10,IAREA+EX_PPR(r13)
>>> @@ -298,7 +298,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
>>>         .if IKVM_SKIP
>>>   89:    mtocrf    0x80,r9
>>> -    ld    r10,PACA_EXGEN+EX_CTR(r13)
>>> +    ld    r10,IAREA+EX_CTR(r13)
>>>       mtctr    r10
>>>       ld    r9,IAREA+EX_R9(r13)
>>>       ld    r10,IAREA+EX_R10(r13)
>> Many thanks for the fix! I will test it with the RC1 tomorrow.
>>
>> -- Christian
>
> It works! :-) Thanks a lot! Screenshot: 
> https://i.pinimg.com/originals/5d/5f/e5/5d5fe584db474dc88bcc641450b2a7e0.png
>
> -- Christian

Just for info: I successfully tested KVM-PR with the stable kernel 5.7.6 
and with the RC2 of kernel 5.8 today. Thanks a lot for fixing the issue.

-- Christian

^ permalink raw reply

* [PATCH] powerpc: Warn about use of smt_snooze_delay
From: Joel Stanley @ 2020-06-25 10:03 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: ego

It's not done anything for a long time. Save the percpu variable, and
emit a warning to remind users to not expect it to do anything.

Signed-off-by: Joel Stanley <joel@jms.id.au>
---
 arch/powerpc/kernel/sysfs.c | 41 +++++++++++++------------------------
 1 file changed, 14 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 571b3259697e..530ae92bc46d 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -32,29 +32,25 @@
 
 static DEFINE_PER_CPU(struct cpu, cpu_devices);
 
-/*
- * SMT snooze delay stuff, 64-bit only for now
- */
-
 #ifdef CONFIG_PPC64
 
-/* Time in microseconds we delay before sleeping in the idle loop */
-static DEFINE_PER_CPU(long, smt_snooze_delay) = { 100 };
+/*
+ * Snooze delay has not been hooked up since 3fa8cad82b94 ("powerpc/pseries/cpuidle:
+ * smt-snooze-delay cleanup.") and has been broken even longer. As was foretold in
+ * 2014:
+ *
+ *  "ppc64_util currently utilises it. Once we fix ppc64_util, propose to clean
+ *  up the kernel code."
+ *
+ * At some point in the future this code should be removed.
+ */
 
 static ssize_t store_smt_snooze_delay(struct device *dev,
 				      struct device_attribute *attr,
 				      const char *buf,
 				      size_t count)
 {
-	struct cpu *cpu = container_of(dev, struct cpu, dev);
-	ssize_t ret;
-	long snooze;
-
-	ret = sscanf(buf, "%ld", &snooze);
-	if (ret != 1)
-		return -EINVAL;
-
-	per_cpu(smt_snooze_delay, cpu->dev.id) = snooze;
+	WARN_ON_ONCE("smt_snooze_delay sysfs file has no effect\n");
 	return count;
 }
 
@@ -62,9 +58,9 @@ static ssize_t show_smt_snooze_delay(struct device *dev,
 				     struct device_attribute *attr,
 				     char *buf)
 {
-	struct cpu *cpu = container_of(dev, struct cpu, dev);
+	WARN_ON_ONCE("smt_snooze_delay sysfs file has no effect\n");
 
-	return sprintf(buf, "%ld\n", per_cpu(smt_snooze_delay, cpu->dev.id));
+	return sprintf(buf, "100\n");
 }
 
 static DEVICE_ATTR(smt_snooze_delay, 0644, show_smt_snooze_delay,
@@ -72,16 +68,7 @@ static DEVICE_ATTR(smt_snooze_delay, 0644, show_smt_snooze_delay,
 
 static int __init setup_smt_snooze_delay(char *str)
 {
-	unsigned int cpu;
-	long snooze;
-
-	if (!cpu_has_feature(CPU_FTR_SMT))
-		return 1;
-
-	snooze = simple_strtol(str, NULL, 10);
-	for_each_possible_cpu(cpu)
-		per_cpu(smt_snooze_delay, cpu) = snooze;
-
+	WARN_ON_ONCE("smt-snooze-delay command line option has no effect\n");
 	return 1;
 }
 __setup("smt-snooze-delay=", setup_smt_snooze_delay);
-- 
2.27.0


^ permalink raw reply related

* Re: FSL P5020/P5040: DPAA Ethernet issue with the latest Git kernel
From: Alexander Gordeev @ 2020-06-25 10:22 UTC (permalink / raw)
  To: Christian Zigotzky
  Cc: Darren Stevens, mad skateman, Madalin Bucur, Sascha Hauer,
	R.T.Dickinson, linuxppc-dev, Christian Zigotzky
In-Reply-To: <004794fb-370c-c165-38e6-a451dc50c294@xenosoft.de>

On Thu, Jun 25, 2020 at 01:01:52AM +0200, Christian Zigotzky wrote:
[...]
> I compiled a test kernel with the option "CONFIG_TEST_BITMAP=y"
> yesterday. After that Skateman and I booted it and looked for the
> bitmap tests with "dmesg | grep -i bitmap".
> 
> Results:
> 
> FSL P5020:
> 
> [    0.297756] test_bitmap: loaded.
> [    0.298113] test_bitmap: parselist: 14: input is '0-2047:128/256'
> OK, Time: 562
> [    0.298142] test_bitmap: parselist_user: 14: input is
> '0-2047:128/256' OK, Time: 761
> [    0.301553] test_bitmap: all 1663 tests passed
> 
> FSL P5040:
> 
> [    0.296563] test_bitmap: loaded.
> [    0.296894] test_bitmap: parselist: 14: input is '0-2047:128/256'
> OK, Time: 540
> [    0.296920] test_bitmap: parselist_user: 14: input is
> '0-2047:128/256' OK, Time: 680
> [    0.299994] test_bitmap: all 1663 tests passed

Thanks for the test! So it works as expected.

I would suggest to compare what is going on on the device probing
with and without the bisected commit.

There seems to be MAC and PHY mode initialization issue that might
resulted from the bitmap format change.

I put Madalin and Sascha on CC as they have done some works on
this part recently.

Thanks!


> Thanks,
> Christian

^ permalink raw reply

* Re: [PATCH] powerpc: Warn about use of smt_snooze_delay
From: Christophe Leroy @ 2020-06-25 10:29 UTC (permalink / raw)
  To: Joel Stanley, linuxppc-dev; +Cc: ego
In-Reply-To: <20200625100349.2408899-1-joel@jms.id.au>



Le 25/06/2020 à 12:03, Joel Stanley a écrit :
> It's not done anything for a long time. Save the percpu variable, and
> emit a warning to remind users to not expect it to do anything.

Why not just drop the file entirely  if it is useless ?

Christophe

> 
> Signed-off-by: Joel Stanley <joel@jms.id.au>
> ---
>   arch/powerpc/kernel/sysfs.c | 41 +++++++++++++------------------------
>   1 file changed, 14 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> index 571b3259697e..530ae92bc46d 100644
> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -32,29 +32,25 @@
>   
>   static DEFINE_PER_CPU(struct cpu, cpu_devices);
>   
> -/*
> - * SMT snooze delay stuff, 64-bit only for now
> - */
> -
>   #ifdef CONFIG_PPC64
>   
> -/* Time in microseconds we delay before sleeping in the idle loop */
> -static DEFINE_PER_CPU(long, smt_snooze_delay) = { 100 };
> +/*
> + * Snooze delay has not been hooked up since 3fa8cad82b94 ("powerpc/pseries/cpuidle:
> + * smt-snooze-delay cleanup.") and has been broken even longer. As was foretold in
> + * 2014:
> + *
> + *  "ppc64_util currently utilises it. Once we fix ppc64_util, propose to clean
> + *  up the kernel code."
> + *
> + * At some point in the future this code should be removed.
> + */
>   
>   static ssize_t store_smt_snooze_delay(struct device *dev,
>   				      struct device_attribute *attr,
>   				      const char *buf,
>   				      size_t count)
>   {
> -	struct cpu *cpu = container_of(dev, struct cpu, dev);
> -	ssize_t ret;
> -	long snooze;
> -
> -	ret = sscanf(buf, "%ld", &snooze);
> -	if (ret != 1)
> -		return -EINVAL;
> -
> -	per_cpu(smt_snooze_delay, cpu->dev.id) = snooze;
> +	WARN_ON_ONCE("smt_snooze_delay sysfs file has no effect\n");
>   	return count;
>   }
>   
> @@ -62,9 +58,9 @@ static ssize_t show_smt_snooze_delay(struct device *dev,
>   				     struct device_attribute *attr,
>   				     char *buf)
>   {
> -	struct cpu *cpu = container_of(dev, struct cpu, dev);
> +	WARN_ON_ONCE("smt_snooze_delay sysfs file has no effect\n");
>   
> -	return sprintf(buf, "%ld\n", per_cpu(smt_snooze_delay, cpu->dev.id));
> +	return sprintf(buf, "100\n");
>   }
>   
>   static DEVICE_ATTR(smt_snooze_delay, 0644, show_smt_snooze_delay,
> @@ -72,16 +68,7 @@ static DEVICE_ATTR(smt_snooze_delay, 0644, show_smt_snooze_delay,
>   
>   static int __init setup_smt_snooze_delay(char *str)
>   {
> -	unsigned int cpu;
> -	long snooze;
> -
> -	if (!cpu_has_feature(CPU_FTR_SMT))
> -		return 1;
> -
> -	snooze = simple_strtol(str, NULL, 10);
> -	for_each_possible_cpu(cpu)
> -		per_cpu(smt_snooze_delay, cpu) = snooze;
> -
> +	WARN_ON_ONCE("smt-snooze-delay command line option has no effect\n");
>   	return 1;
>   }
>   __setup("smt-snooze-delay=", setup_smt_snooze_delay);
> 

^ permalink raw reply

* Re: [PATCH v2 2/2] cpufreq: Specify default governor on command line
From: Rafael J. Wysocki @ 2020-06-25 10:52 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Juri Lelli, Cc: Android Kernel, Vincent Guittot, Arnd Bergmann,
	Rafael J. Wysocki, Peter Zijlstra, Linux PM, Quentin Perret,
	Rafael J. Wysocki, Linux Kernel Mailing List, Ingo Molnar,
	Paul Mackerras, linuxppc-dev, adharmap, Todd Kjos
In-Reply-To: <20200625085052.4ah4wbog3guj74v4@vireshk-i7>

On Thu, Jun 25, 2020 at 10:50 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 24-06-20, 16:32, Quentin Perret wrote:
> > Right, but I must admit that, looking at this more, I'm getting a bit
> > confused with the overall locking for governors :/
> >
> > When in cpufreq_init_policy() we find a governor using
> > find_governor(policy->last_governor), what guarantees this governor is
> > not concurrently unregistered? That is, what guarantees this governor
> > doesn't go away between that find_governor() call, and the subsequent
> > call to try_module_get() in cpufreq_set_policy() down the line?
> >
> > Can we somewhat assume that whatever governor is referred to by
> > policy->last_governor will have a non-null refcount? Or are the
> > cpufreq_online() and cpufreq_unregister_governor() path mutually
> > exclusive? Or is there something else?
>
> This should be sufficient to fix pending issues I believe. Based over your
> patches.

LGTM, but can you post it in a new thread to let Patchwork pick it up?

> -------------------------8<-------------------------
> From: Viresh Kumar <viresh.kumar@linaro.org>
> Date: Thu, 25 Jun 2020 13:15:23 +0530
> Subject: [PATCH] cpufreq: Fix locking issues with governors
>
> The locking around governors handling isn't adequate currently. The list
> of governors should never be traversed without locking in place. Also we
> must make sure the governor isn't removed while it is still referenced
> by code.
>
> Reported-by: Quentin Perret <qperret@google.com>
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  drivers/cpufreq/cpufreq.c | 59 ++++++++++++++++++++++++---------------
>  1 file changed, 36 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 4b1a5c0173cf..dad6b85f4c89 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -624,6 +624,24 @@ static struct cpufreq_governor *find_governor(const char *str_governor)
>         return NULL;
>  }
>
> +static struct cpufreq_governor *get_governor(const char *str_governor)
> +{
> +       struct cpufreq_governor *t;
> +
> +       mutex_lock(&cpufreq_governor_mutex);
> +       t = find_governor(str_governor);
> +       if (!t)
> +               goto unlock;
> +
> +       if (!try_module_get(t->owner))
> +               t = NULL;
> +
> +unlock:
> +       mutex_unlock(&cpufreq_governor_mutex);
> +
> +       return t;
> +}
> +
>  static unsigned int cpufreq_parse_policy(char *str_governor)
>  {
>         if (!strncasecmp(str_governor, "performance", CPUFREQ_NAME_LEN))
> @@ -643,28 +661,14 @@ static struct cpufreq_governor *cpufreq_parse_governor(char *str_governor)
>  {
>         struct cpufreq_governor *t;
>
> -       mutex_lock(&cpufreq_governor_mutex);
> -
> -       t = find_governor(str_governor);
> -       if (!t) {
> -               int ret;
> -
> -               mutex_unlock(&cpufreq_governor_mutex);
> -
> -               ret = request_module("cpufreq_%s", str_governor);
> -               if (ret)
> -                       return NULL;
> -
> -               mutex_lock(&cpufreq_governor_mutex);
> +       t = get_governor(str_governor);
> +       if (t)
> +               return t;
>
> -               t = find_governor(str_governor);
> -       }
> -       if (t && !try_module_get(t->owner))
> -               t = NULL;
> -
> -       mutex_unlock(&cpufreq_governor_mutex);
> +       if (request_module("cpufreq_%s", str_governor))
> +               return NULL;
>
> -       return t;
> +       return get_governor(str_governor);
>  }
>
>  /**
> @@ -818,12 +822,14 @@ static ssize_t show_scaling_available_governors(struct cpufreq_policy *policy,
>                 goto out;
>         }
>
> +       mutex_lock(&cpufreq_governor_mutex);
>         for_each_governor(t) {
>                 if (i >= (ssize_t) ((PAGE_SIZE / sizeof(char))
>                     - (CPUFREQ_NAME_LEN + 2)))
> -                       goto out;
> +                       break;
>                 i += scnprintf(&buf[i], CPUFREQ_NAME_PLEN, "%s ", t->name);
>         }
> +       mutex_unlock(&cpufreq_governor_mutex);
>  out:
>         i += sprintf(&buf[i], "\n");
>         return i;
> @@ -1060,11 +1066,14 @@ static int cpufreq_init_policy(struct cpufreq_policy *policy)
>  {
>         struct cpufreq_governor *gov = NULL;
>         unsigned int pol = CPUFREQ_POLICY_UNKNOWN;
> +       bool put_governor = false;
> +       int ret;
>
>         if (has_target()) {
>                 /* Update policy governor to the one used before hotplug. */
> -               gov = find_governor(policy->last_governor);
> +               gov = get_governor(policy->last_governor);
>                 if (gov) {
> +                       put_governor = true;
>                         pr_debug("Restoring governor %s for cpu %d\n",
>                                  policy->governor->name, policy->cpu);
>                 } else if (default_governor) {
> @@ -1091,7 +1100,11 @@ static int cpufreq_init_policy(struct cpufreq_policy *policy)
>                         return -ENODATA;
>         }
>
> -       return cpufreq_set_policy(policy, gov, pol);
> +       ret = cpufreq_set_policy(policy, gov, pol);
> +       if (put_governor)
> +               module_put(gov->owner);
> +
> +       return ret;
>  }
>
>  static int cpufreq_add_policy_cpu(struct cpufreq_policy *policy, unsigned int cpu)

^ permalink raw reply

* Re: [PATCH v2 1/4] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings
From: Aneesh Kumar K.V @ 2020-06-25 11:30 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Bharata B Rao
In-Reply-To: <20200625064547.228448-2-aneesh.kumar@linux.ibm.com>

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:


> Fixing this includes 3 parts:
>
> - Re-walk the init_mm page tables from mem_init() and initialize
>   the PMD and PTE fragment count to 1.
> - When freeing PUD, PMD and PTE page table pages, check explicitly
>   if they come from memblock and if so free then appropriately.
> - When we do early memblock based allocation of PMD and PUD pages,
>   allocate in PAGE_SIZE granularity so that we are sure the
>   complete page is used as pagetable page.
>
> Since we now do PAGE_SIZE allocations for both PUD table and
> PMD table (Note that PTE table allocation is already of PAGE_SIZE),
> we end up allocating more memory for the same amount of system RAM.
> Here is a comparision of how much more we need for a 64T and 2G
> system after this patch:
>

Missed updating the commit message w.r.t page table fragments.  Updated
one below.

powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings

We can hit the following BUG_ON during memory unplug:

kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:342!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
NIP [c000000000093308] pmd_fragment_free+0x48/0xc0
LR [c00000000147bfec] remove_pagetable+0x578/0x60c
Call Trace:
0xc000008050000000 (unreliable)
remove_pagetable+0x384/0x60c
radix__remove_section_mapping+0x18/0x2c
remove_section_mapping+0x1c/0x3c
arch_remove_memory+0x11c/0x180
try_remove_memory+0x120/0x1b0
__remove_memory+0x20/0x40
dlpar_remove_lmb+0xc0/0x114
dlpar_memory+0x8b0/0xb20
handle_dlpar_errorlog+0xc0/0x190
pseries_hp_work_fn+0x2c/0x60
process_one_work+0x30c/0x810
worker_thread+0x98/0x540
kthread+0x1c4/0x1d0
ret_from_kernel_thread+0x5c/0x74

This occurs when unplug is attempted for such memory which has
been mapped using memblock pages as part of early kernel page
table setup. We wouldn't have initialized the PMD or PTE fragment
count for those PMD or PTE pages.

This can be fixed by allocating memory in PAGE_SIZE granularity
during early page table allocation. This makes sure a specific
page is not shared for another memblock allocation and we can
free them correctly on removing page-table pages.

Since we now do PAGE_SIZE allocations for both PUD table and
PMD table (Note that PTE table allocation is already of PAGE_SIZE),
we end up allocating more memory for the same amount of system RAM.
Here is a comparision of how much more we need for a 64T and 2G
system after this patch:

1. 64T system
-------------
64T RAM would need 64G for vmemmap with struct page size being 64B.

128 PUD tables for 64T memory (1G mappings)
1 PUD table and 64 PMD tables for 64G vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (128+1+64)*4K=772K
With PAGE_SIZE(64K) table allocations, (128+1+64)*64K=12352K

2. 2G system
------------
2G RAM would need 2M for vmemmap with struct page size being 64B.

1 PUD table for 2G memory (1G mapping)
1 PUD table and 1 PMD table for 2M vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (1+1+1)*4K=12K
With new PAGE_SIZE(64K) table allocations, (1+1+1)*64K=192K

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

-aneesh

^ permalink raw reply

* Re: [PATCH v2 2/2] cpufreq: Specify default governor on command line
From: Viresh Kumar @ 2020-06-25 11:36 UTC (permalink / raw)
  To: Quentin Perret
  Cc: juri.lelli, kernel-team, vincent.guittot, arnd, rafael, peterz,
	adharmap, linux-pm, rjw, linux-kernel, mingo, paulus,
	linuxppc-dev, tkjos
In-Reply-To: <20200623142138.209513-3-qperret@google.com>

After your last email (reply to my patch), I noticed a change which
isn't required. :)

On 23-06-20, 15:21, Quentin Perret wrote:
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 0128de3603df..4b1a5c0173cf 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -50,6 +50,9 @@ static LIST_HEAD(cpufreq_governor_list);
>  #define for_each_governor(__governor)				\
>  	list_for_each_entry(__governor, &cpufreq_governor_list, governor_list)
>  
> +static char cpufreq_param_governor[CPUFREQ_NAME_LEN];
> +static struct cpufreq_governor *default_governor;
> +
>  /**
>   * The "cpufreq driver" - the arch- or hardware-dependent low
>   * level driver of CPUFreq support, and its spinlock. This lock
> @@ -1055,7 +1058,6 @@ __weak struct cpufreq_governor *cpufreq_default_governor(void)
>  
>  static int cpufreq_init_policy(struct cpufreq_policy *policy)
>  {
> -	struct cpufreq_governor *def_gov = cpufreq_default_governor();
>  	struct cpufreq_governor *gov = NULL;
>  	unsigned int pol = CPUFREQ_POLICY_UNKNOWN;
>  
> @@ -1065,8 +1067,8 @@ static int cpufreq_init_policy(struct cpufreq_policy *policy)
>  		if (gov) {
>  			pr_debug("Restoring governor %s for cpu %d\n",
>  				 policy->governor->name, policy->cpu);
> -		} else if (def_gov) {
> -			gov = def_gov;
> +		} else if (default_governor) {
> +			gov = default_governor;
>  		} else {
>  			return -ENODATA;
>  		}


> @@ -1074,8 +1076,8 @@ static int cpufreq_init_policy(struct cpufreq_policy *policy)
>  		/* Use the default policy if there is no last_policy. */
>  		if (policy->last_policy) {
>  			pol = policy->last_policy;
> -		} else if (def_gov) {
> -			pol = cpufreq_parse_policy(def_gov->name);
> +		} else if (default_governor) {
> +			pol = cpufreq_parse_policy(default_governor->name);

This change is not right IMO. This part handles the set-policy case,
where there are no governors. Right now this code, for some reasons
unknown to me, forcefully uses the default governor set to indicate
the policy, which is not a great idea in my opinion TBH. This doesn't
and shouldn't care about governor modules and should only be looking
at strings instead of governor pointer.

Rafael, I even think we should remove this code completely and just
rely on what the driver has sent to us. Using the selected governor
for set policy drivers is very confusing and also we shouldn't be
forced to compiling any governor for the set-policy case.

-- 
viresh

^ permalink raw reply

* Re: [PATCH v2 2/2] cpufreq: Specify default governor on command line
From: Rafael J. Wysocki @ 2020-06-25 11:44 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Juri Lelli, Cc: Android Kernel, Vincent Guittot, Arnd Bergmann,
	Rafael J. Wysocki, Peter Zijlstra, Linux PM, Quentin Perret,
	Rafael J. Wysocki, Linux Kernel Mailing List, Ingo Molnar,
	Paul Mackerras, linuxppc-dev, adharmap, Todd Kjos
In-Reply-To: <20200625113602.z2xrwebd2gngbww3@vireshk-i7>

On Thu, Jun 25, 2020 at 1:36 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> After your last email (reply to my patch), I noticed a change which
> isn't required. :)
>
> On 23-06-20, 15:21, Quentin Perret wrote:
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 0128de3603df..4b1a5c0173cf 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -50,6 +50,9 @@ static LIST_HEAD(cpufreq_governor_list);
> >  #define for_each_governor(__governor)                                \
> >       list_for_each_entry(__governor, &cpufreq_governor_list, governor_list)
> >
> > +static char cpufreq_param_governor[CPUFREQ_NAME_LEN];
> > +static struct cpufreq_governor *default_governor;
> > +
> >  /**
> >   * The "cpufreq driver" - the arch- or hardware-dependent low
> >   * level driver of CPUFreq support, and its spinlock. This lock
> > @@ -1055,7 +1058,6 @@ __weak struct cpufreq_governor *cpufreq_default_governor(void)
> >
> >  static int cpufreq_init_policy(struct cpufreq_policy *policy)
> >  {
> > -     struct cpufreq_governor *def_gov = cpufreq_default_governor();
> >       struct cpufreq_governor *gov = NULL;
> >       unsigned int pol = CPUFREQ_POLICY_UNKNOWN;
> >
> > @@ -1065,8 +1067,8 @@ static int cpufreq_init_policy(struct cpufreq_policy *policy)
> >               if (gov) {
> >                       pr_debug("Restoring governor %s for cpu %d\n",
> >                                policy->governor->name, policy->cpu);
> > -             } else if (def_gov) {
> > -                     gov = def_gov;
> > +             } else if (default_governor) {
> > +                     gov = default_governor;
> >               } else {
> >                       return -ENODATA;
> >               }
>
>
> > @@ -1074,8 +1076,8 @@ static int cpufreq_init_policy(struct cpufreq_policy *policy)
> >               /* Use the default policy if there is no last_policy. */
> >               if (policy->last_policy) {
> >                       pol = policy->last_policy;
> > -             } else if (def_gov) {
> > -                     pol = cpufreq_parse_policy(def_gov->name);
> > +             } else if (default_governor) {
> > +                     pol = cpufreq_parse_policy(default_governor->name);
>
> This change is not right IMO. This part handles the set-policy case,
> where there are no governors. Right now this code, for some reasons
> unknown to me, forcefully uses the default governor set to indicate
> the policy, which is not a great idea in my opinion TBH. This doesn't
> and shouldn't care about governor modules and should only be looking
> at strings instead of governor pointer.

Sounds right.

> Rafael, I even think we should remove this code completely and just
> rely on what the driver has sent to us. Using the selected governor
> for set policy drivers is very confusing and also we shouldn't be
> forced to compiling any governor for the set-policy case.

Well, AFAICS the idea was to use the default governor as a kind of
default policy proxy, but I agree that strings should be sufficient
for that.

I'll have a look at what to do with that code.

^ permalink raw reply

* Re: [PATCH v2 2/2] cpufreq: Specify default governor on command line
From: Quentin Perret @ 2020-06-25 11:53 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Juri Lelli, Cc: Android Kernel, Vincent Guittot, Arnd Bergmann,
	Linux PM, Peter Zijlstra, Viresh Kumar, adharmap,
	Rafael J. Wysocki, Linux Kernel Mailing List, Ingo Molnar,
	Paul Mackerras, linuxppc-dev, Todd Kjos
In-Reply-To: <CAJZ5v0g=+2OFKVk2ZnmK-33knUwqcaOOQ+q9ZWnmeoBD9KOX9g@mail.gmail.com>

On Thursday 25 Jun 2020 at 13:44:34 (+0200), Rafael J. Wysocki wrote:
> On Thu, Jun 25, 2020 at 1:36 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > This change is not right IMO. This part handles the set-policy case,
> > where there are no governors. Right now this code, for some reasons
> > unknown to me, forcefully uses the default governor set to indicate
> > the policy, which is not a great idea in my opinion TBH. This doesn't
> > and shouldn't care about governor modules and should only be looking
> > at strings instead of governor pointer.
> 
> Sounds right.
> 
> > Rafael, I even think we should remove this code completely and just
> > rely on what the driver has sent to us. Using the selected governor
> > for set policy drivers is very confusing and also we shouldn't be
> > forced to compiling any governor for the set-policy case.
> 
> Well, AFAICS the idea was to use the default governor as a kind of
> default policy proxy, but I agree that strings should be sufficient
> for that.

I agree with all the above. I'd much rather not rely on the default
governor name to populate the default policy, too, so +1 from me.

Thanks,
Quentin

^ permalink raw reply

* [PATCH 03/13] iommu/msm: Use dev_iommu_priv_get/set()
From: Joerg Roedel @ 2020-06-25 13:08 UTC (permalink / raw)
  To: iommu
  Cc: linux-ia64, Heiko Stuebner, David Airlie, Joonas Lahtinen,
	Thierry Reding, Paul Mackerras, Will Deacon, Marek Szyprowski,
	Joerg Roedel, x86, Russell King, Catalin Marinas, Fenghua Yu,
	Joerg Roedel, intel-gfx, Jani Nikula, Rodrigo Vivi,
	Matthias Brugger, linux-arm-kernel, Tony Luck, linuxppc-dev,
	linux-kernel, Daniel Vetter, David Woodhouse, Lu Baolu
In-Reply-To: <20200625130836.1916-1-joro@8bytes.org>

From: Joerg Roedel <jroedel@suse.de>

Remove the use of dev->archdata.iommu and use the private per-device
pointer provided by IOMMU core code instead.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/msm_iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 3d8a63555c25..f773cc85f311 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -593,14 +593,14 @@ static void insert_iommu_master(struct device *dev,
 				struct msm_iommu_dev **iommu,
 				struct of_phandle_args *spec)
 {
-	struct msm_iommu_ctx_dev *master = dev->archdata.iommu;
+	struct msm_iommu_ctx_dev *master = dev_iommu_priv_get(dev);
 	int sid;
 
 	if (list_empty(&(*iommu)->ctx_list)) {
 		master = kzalloc(sizeof(*master), GFP_ATOMIC);
 		master->of_node = dev->of_node;
 		list_add(&master->list, &(*iommu)->ctx_list);
-		dev->archdata.iommu = master;
+		dev_iommu_priv_set(dev, master);
 	}
 
 	for (sid = 0; sid < master->num_mids; sid++)
-- 
2.27.0


^ permalink raw reply related

* [PATCH 00/13] iommu: Remove usage of dev->archdata.iommu
From: Joerg Roedel @ 2020-06-25 13:08 UTC (permalink / raw)
  To: iommu
  Cc: linux-ia64, Heiko Stuebner, David Airlie, Joonas Lahtinen,
	Thierry Reding, Paul Mackerras, Will Deacon, Marek Szyprowski,
	Joerg Roedel, x86, Russell King, Catalin Marinas, Fenghua Yu,
	Joerg Roedel, intel-gfx, Jani Nikula, Rodrigo Vivi,
	Matthias Brugger, linux-arm-kernel, Tony Luck, linuxppc-dev,
	linux-kernel, Daniel Vetter, David Woodhouse, Lu Baolu

From: Joerg Roedel <jroedel@suse.de>

Hi,

here is a patch-set to remove the usage of dev->archdata.iommu from
the IOMMU code in the kernel and replace its uses by the iommu per-device
private data field. The changes also remove the field entirely from
the architectures which no longer need it.

On PowerPC the field is called dev->archdata.iommu_domain and was only
used by the PAMU IOMMU driver. It gets removed as well.

The patches have been runtime tested on Intel VT-d and compile tested
with allyesconfig for:

	* x86 (32 and 64 bit)
	* arm and arm64
	* ia64 (only drivers/ because build failed for me in
	        arch/ia64)
	* PPC64

Besides that the changes also survived my IOMMU tree compile tests.

Please review.

Regards,

	Joerg

Joerg Roedel (13):
  iommu/exynos: Use dev_iommu_priv_get/set()
  iommu/vt-d: Use dev_iommu_priv_get/set()
  iommu/msm: Use dev_iommu_priv_get/set()
  iommu/omap: Use dev_iommu_priv_get/set()
  iommu/rockchip: Use dev_iommu_priv_get/set()
  iommu/tegra: Use dev_iommu_priv_get/set()
  iommu/pamu: Use dev_iommu_priv_get/set()
  iommu/mediatek: Do no use dev->archdata.iommu
  x86: Remove dev->archdata.iommu pointer
  ia64: Remove dev->archdata.iommu pointer
  arm: Remove dev->archdata.iommu pointer
  arm64: Remove dev->archdata.iommu pointer
  powerpc/dma: Remove dev->archdata.iommu_domain

 arch/arm/include/asm/device.h                 |  3 ---
 arch/arm64/include/asm/device.h               |  3 ---
 arch/ia64/include/asm/device.h                |  3 ---
 arch/powerpc/include/asm/device.h             |  3 ---
 arch/x86/include/asm/device.h                 |  3 ---
 .../gpu/drm/i915/selftests/mock_gem_device.c  | 10 ++++++++--
 drivers/iommu/exynos-iommu.c                  | 20 +++++++++----------
 drivers/iommu/fsl_pamu_domain.c               |  8 ++++----
 drivers/iommu/intel/iommu.c                   | 18 ++++++++---------
 drivers/iommu/msm_iommu.c                     |  4 ++--
 drivers/iommu/mtk_iommu.h                     |  2 ++
 drivers/iommu/mtk_iommu_v1.c                  | 10 ++++------
 drivers/iommu/omap-iommu.c                    | 20 +++++++++----------
 drivers/iommu/rockchip-iommu.c                |  8 ++++----
 drivers/iommu/tegra-gart.c                    |  8 ++++----
 drivers/iommu/tegra-smmu.c                    |  8 ++++----
 .../media/platform/s5p-mfc/s5p_mfc_iommu.h    |  4 +++-
 17 files changed, 64 insertions(+), 71 deletions(-)

-- 
2.27.0


^ permalink raw reply

* [PATCH 02/13] iommu/vt-d: Use dev_iommu_priv_get/set()
From: Joerg Roedel @ 2020-06-25 13:08 UTC (permalink / raw)
  To: iommu
  Cc: linux-ia64, Heiko Stuebner, David Airlie, Joonas Lahtinen,
	Thierry Reding, Paul Mackerras, Will Deacon, Marek Szyprowski,
	Joerg Roedel, x86, Russell King, Catalin Marinas, Fenghua Yu,
	Joerg Roedel, intel-gfx, Jani Nikula, Rodrigo Vivi,
	Matthias Brugger, linux-arm-kernel, Tony Luck, linuxppc-dev,
	linux-kernel, Daniel Vetter, David Woodhouse, Lu Baolu
In-Reply-To: <20200625130836.1916-1-joro@8bytes.org>

From: Joerg Roedel <jroedel@suse.de>

Remove the use of dev->archdata.iommu and use the private per-device
pointer provided by IOMMU core code instead.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 .../gpu/drm/i915/selftests/mock_gem_device.c   | 10 ++++++++--
 drivers/iommu/intel/iommu.c                    | 18 +++++++++---------
 2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/selftests/mock_gem_device.c b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
index 9b105b811f1f..e08601905a64 100644
--- a/drivers/gpu/drm/i915/selftests/mock_gem_device.c
+++ b/drivers/gpu/drm/i915/selftests/mock_gem_device.c
@@ -24,6 +24,7 @@
 
 #include <linux/pm_domain.h>
 #include <linux/pm_runtime.h>
+#include <linux/iommu.h>
 
 #include <drm/drm_managed.h>
 
@@ -118,6 +119,9 @@ struct drm_i915_private *mock_gem_device(void)
 {
 	struct drm_i915_private *i915;
 	struct pci_dev *pdev;
+#if IS_ENABLED(CONFIG_IOMMU_API) && defined(CONFIG_INTEL_IOMMU)
+	struct dev_iommu iommu;
+#endif
 	int err;
 
 	pdev = kzalloc(sizeof(*pdev), GFP_KERNEL);
@@ -136,8 +140,10 @@ struct drm_i915_private *mock_gem_device(void)
 	dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
 
 #if IS_ENABLED(CONFIG_IOMMU_API) && defined(CONFIG_INTEL_IOMMU)
-	/* hack to disable iommu for the fake device; force identity mapping */
-	pdev->dev.archdata.iommu = (void *)-1;
+	/* HACK HACK HACK to disable iommu for the fake device; force identity mapping */
+	memset(&iommu, 0, sizeof(iommu));
+	iommu.priv = (void *)-1;
+	pdev->dev.iommu = &iommu;
 #endif
 
 	pci_set_drvdata(pdev, i915);
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index d759e7234e98..2ce490c2eab8 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -372,7 +372,7 @@ struct device_domain_info *get_domain_info(struct device *dev)
 	if (!dev)
 		return NULL;
 
-	info = dev->archdata.iommu;
+	info = dev_iommu_priv_get(dev);
 	if (unlikely(info == DUMMY_DEVICE_DOMAIN_INFO ||
 		     info == DEFER_DEVICE_DOMAIN_INFO))
 		return NULL;
@@ -743,12 +743,12 @@ struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8 bus,
 
 static int iommu_dummy(struct device *dev)
 {
-	return dev->archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO;
+	return dev_iommu_priv_get(dev) == DUMMY_DEVICE_DOMAIN_INFO;
 }
 
 static bool attach_deferred(struct device *dev)
 {
-	return dev->archdata.iommu == DEFER_DEVICE_DOMAIN_INFO;
+	return dev_iommu_priv_get(dev) == DEFER_DEVICE_DOMAIN_INFO;
 }
 
 /**
@@ -2420,7 +2420,7 @@ static inline void unlink_domain_info(struct device_domain_info *info)
 	list_del(&info->link);
 	list_del(&info->global);
 	if (info->dev)
-		info->dev->archdata.iommu = NULL;
+		dev_iommu_priv_set(info->dev, NULL);
 }
 
 static void domain_remove_dev_info(struct dmar_domain *domain)
@@ -2453,7 +2453,7 @@ static void do_deferred_attach(struct device *dev)
 {
 	struct iommu_domain *domain;
 
-	dev->archdata.iommu = NULL;
+	dev_iommu_priv_set(dev, NULL);
 	domain = iommu_get_domain_for_dev(dev);
 	if (domain)
 		intel_iommu_attach_device(domain, dev);
@@ -2599,7 +2599,7 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu,
 	list_add(&info->link, &domain->devices);
 	list_add(&info->global, &device_domain_list);
 	if (dev)
-		dev->archdata.iommu = info;
+		dev_iommu_priv_set(dev, info);
 	spin_unlock_irqrestore(&device_domain_lock, flags);
 
 	/* PASID table is mandatory for a PCI device in scalable mode. */
@@ -4004,7 +4004,7 @@ static void quirk_ioat_snb_local_iommu(struct pci_dev *pdev)
 	if (!drhd || drhd->reg_base_addr - vtbar != 0xa000) {
 		pr_warn_once(FW_BUG "BIOS assigned incorrect VT-d unit for Intel(R) QuickData Technology device\n");
 		add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK);
-		pdev->dev.archdata.iommu = DUMMY_DEVICE_DOMAIN_INFO;
+		dev_iommu_priv_set(&pdev->dev, DUMMY_DEVICE_DOMAIN_INFO);
 	}
 }
 DECLARE_PCI_FIXUP_ENABLE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IOAT_SNB, quirk_ioat_snb_local_iommu);
@@ -4043,7 +4043,7 @@ static void __init init_no_remapping_devices(void)
 			drhd->ignored = 1;
 			for_each_active_dev_scope(drhd->devices,
 						  drhd->devices_cnt, i, dev)
-				dev->archdata.iommu = DUMMY_DEVICE_DOMAIN_INFO;
+				dev_iommu_priv_set(dev, DUMMY_DEVICE_DOMAIN_INFO);
 		}
 	}
 }
@@ -5665,7 +5665,7 @@ static struct iommu_device *intel_iommu_probe_device(struct device *dev)
 		return ERR_PTR(-ENODEV);
 
 	if (translation_pre_enabled(iommu))
-		dev->archdata.iommu = DEFER_DEVICE_DOMAIN_INFO;
+		dev_iommu_priv_set(dev, DEFER_DEVICE_DOMAIN_INFO);
 
 	return &iommu->iommu;
 }
-- 
2.27.0


^ permalink raw reply related

* [PATCH 01/13] iommu/exynos: Use dev_iommu_priv_get/set()
From: Joerg Roedel @ 2020-06-25 13:08 UTC (permalink / raw)
  To: iommu
  Cc: linux-ia64, Heiko Stuebner, David Airlie, Joonas Lahtinen,
	Thierry Reding, Paul Mackerras, Will Deacon, Marek Szyprowski,
	Joerg Roedel, x86, Russell King, Catalin Marinas, Fenghua Yu,
	Joerg Roedel, intel-gfx, Jani Nikula, Rodrigo Vivi,
	Matthias Brugger, linux-arm-kernel, Tony Luck, linuxppc-dev,
	linux-kernel, Daniel Vetter, David Woodhouse, Lu Baolu
In-Reply-To: <20200625130836.1916-1-joro@8bytes.org>

From: Joerg Roedel <jroedel@suse.de>

Remove the use of dev->archdata.iommu and use the private per-device
pointer provided by IOMMU core code instead.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/exynos-iommu.c                  | 20 +++++++++----------
 .../media/platform/s5p-mfc/s5p_mfc_iommu.h    |  4 +++-
 2 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 60c8a56e4a3f..6a9b67302369 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -173,7 +173,7 @@ static u32 lv2ent_offset(sysmmu_iova_t iova)
 #define REG_V5_FAULT_AR_VA	0x070
 #define REG_V5_FAULT_AW_VA	0x080
 
-#define has_sysmmu(dev)		(dev->archdata.iommu != NULL)
+#define has_sysmmu(dev)		(dev_iommu_priv_get(dev) != NULL)
 
 static struct device *dma_dev;
 static struct kmem_cache *lv2table_kmem_cache;
@@ -226,7 +226,7 @@ static const struct sysmmu_fault_info sysmmu_v5_faults[] = {
 };
 
 /*
- * This structure is attached to dev.archdata.iommu of the master device
+ * This structure is attached to dev->iommu->priv of the master device
  * on device add, contains a list of SYSMMU controllers defined by device tree,
  * which are bound to given master device. It is usually referenced by 'owner'
  * pointer.
@@ -670,7 +670,7 @@ static int __maybe_unused exynos_sysmmu_suspend(struct device *dev)
 	struct device *master = data->master;
 
 	if (master) {
-		struct exynos_iommu_owner *owner = master->archdata.iommu;
+		struct exynos_iommu_owner *owner = dev_iommu_priv_get(master);
 
 		mutex_lock(&owner->rpm_lock);
 		if (data->domain) {
@@ -688,7 +688,7 @@ static int __maybe_unused exynos_sysmmu_resume(struct device *dev)
 	struct device *master = data->master;
 
 	if (master) {
-		struct exynos_iommu_owner *owner = master->archdata.iommu;
+		struct exynos_iommu_owner *owner = dev_iommu_priv_get(master);
 
 		mutex_lock(&owner->rpm_lock);
 		if (data->domain) {
@@ -837,8 +837,8 @@ static void exynos_iommu_domain_free(struct iommu_domain *iommu_domain)
 static void exynos_iommu_detach_device(struct iommu_domain *iommu_domain,
 				    struct device *dev)
 {
-	struct exynos_iommu_owner *owner = dev->archdata.iommu;
 	struct exynos_iommu_domain *domain = to_exynos_domain(iommu_domain);
+	struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev);
 	phys_addr_t pagetable = virt_to_phys(domain->pgtable);
 	struct sysmmu_drvdata *data, *next;
 	unsigned long flags;
@@ -875,8 +875,8 @@ static void exynos_iommu_detach_device(struct iommu_domain *iommu_domain,
 static int exynos_iommu_attach_device(struct iommu_domain *iommu_domain,
 				   struct device *dev)
 {
-	struct exynos_iommu_owner *owner = dev->archdata.iommu;
 	struct exynos_iommu_domain *domain = to_exynos_domain(iommu_domain);
+	struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev);
 	struct sysmmu_drvdata *data;
 	phys_addr_t pagetable = virt_to_phys(domain->pgtable);
 	unsigned long flags;
@@ -1237,7 +1237,7 @@ static phys_addr_t exynos_iommu_iova_to_phys(struct iommu_domain *iommu_domain,
 
 static struct iommu_device *exynos_iommu_probe_device(struct device *dev)
 {
-	struct exynos_iommu_owner *owner = dev->archdata.iommu;
+	struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev);
 	struct sysmmu_drvdata *data;
 
 	if (!has_sysmmu(dev))
@@ -1263,7 +1263,7 @@ static struct iommu_device *exynos_iommu_probe_device(struct device *dev)
 
 static void exynos_iommu_release_device(struct device *dev)
 {
-	struct exynos_iommu_owner *owner = dev->archdata.iommu;
+	struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev);
 	struct sysmmu_drvdata *data;
 
 	if (!has_sysmmu(dev))
@@ -1287,8 +1287,8 @@ static void exynos_iommu_release_device(struct device *dev)
 static int exynos_iommu_of_xlate(struct device *dev,
 				 struct of_phandle_args *spec)
 {
-	struct exynos_iommu_owner *owner = dev->archdata.iommu;
 	struct platform_device *sysmmu = of_find_device_by_node(spec->np);
+	struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev);
 	struct sysmmu_drvdata *data, *entry;
 
 	if (!sysmmu)
@@ -1305,7 +1305,7 @@ static int exynos_iommu_of_xlate(struct device *dev,
 
 		INIT_LIST_HEAD(&owner->controllers);
 		mutex_init(&owner->rpm_lock);
-		dev->archdata.iommu = owner;
+		dev_iommu_priv_set(dev, owner);
 	}
 
 	list_for_each_entry(entry, &owner->controllers, owner_node)
diff --git a/drivers/media/platform/s5p-mfc/s5p_mfc_iommu.h b/drivers/media/platform/s5p-mfc/s5p_mfc_iommu.h
index 152a713fff78..1a32266b7ddc 100644
--- a/drivers/media/platform/s5p-mfc/s5p_mfc_iommu.h
+++ b/drivers/media/platform/s5p-mfc/s5p_mfc_iommu.h
@@ -9,9 +9,11 @@
 
 #if defined(CONFIG_EXYNOS_IOMMU)
 
+#include <linux/iommu.h>
+
 static inline bool exynos_is_iommu_available(struct device *dev)
 {
-	return dev->archdata.iommu != NULL;
+	return dev_iommu_priv_get(dev) != NULL;
 }
 
 #else
-- 
2.27.0


^ permalink raw reply related

* [PATCH 04/13] iommu/omap: Use dev_iommu_priv_get/set()
From: Joerg Roedel @ 2020-06-25 13:08 UTC (permalink / raw)
  To: iommu
  Cc: linux-ia64, Heiko Stuebner, David Airlie, Joonas Lahtinen,
	Thierry Reding, Paul Mackerras, Will Deacon, Marek Szyprowski,
	Joerg Roedel, x86, Russell King, Catalin Marinas, Fenghua Yu,
	Joerg Roedel, intel-gfx, Jani Nikula, Rodrigo Vivi,
	Matthias Brugger, linux-arm-kernel, Tony Luck, linuxppc-dev,
	linux-kernel, Daniel Vetter, David Woodhouse, Lu Baolu
In-Reply-To: <20200625130836.1916-1-joro@8bytes.org>

From: Joerg Roedel <jroedel@suse.de>

Remove the use of dev->archdata.iommu and use the private per-device
pointer provided by IOMMU core code instead.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/omap-iommu.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index c8282cc212cb..e84ead6fb234 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -71,7 +71,7 @@ static struct omap_iommu_domain *to_omap_domain(struct iommu_domain *dom)
  **/
 void omap_iommu_save_ctx(struct device *dev)
 {
-	struct omap_iommu_arch_data *arch_data = dev->archdata.iommu;
+	struct omap_iommu_arch_data *arch_data = dev_iommu_priv_get(dev);
 	struct omap_iommu *obj;
 	u32 *p;
 	int i;
@@ -101,7 +101,7 @@ EXPORT_SYMBOL_GPL(omap_iommu_save_ctx);
  **/
 void omap_iommu_restore_ctx(struct device *dev)
 {
-	struct omap_iommu_arch_data *arch_data = dev->archdata.iommu;
+	struct omap_iommu_arch_data *arch_data = dev_iommu_priv_get(dev);
 	struct omap_iommu *obj;
 	u32 *p;
 	int i;
@@ -1398,7 +1398,7 @@ static size_t omap_iommu_unmap(struct iommu_domain *domain, unsigned long da,
 
 static int omap_iommu_count(struct device *dev)
 {
-	struct omap_iommu_arch_data *arch_data = dev->archdata.iommu;
+	struct omap_iommu_arch_data *arch_data = dev_iommu_priv_get(dev);
 	int count = 0;
 
 	while (arch_data->iommu_dev) {
@@ -1459,8 +1459,8 @@ static void omap_iommu_detach_fini(struct omap_iommu_domain *odomain)
 static int
 omap_iommu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
+	struct omap_iommu_arch_data *arch_data = dev_iommu_priv_get(dev);
 	struct omap_iommu_domain *omap_domain = to_omap_domain(domain);
-	struct omap_iommu_arch_data *arch_data = dev->archdata.iommu;
 	struct omap_iommu_device *iommu;
 	struct omap_iommu *oiommu;
 	int ret = 0;
@@ -1524,7 +1524,7 @@ omap_iommu_attach_dev(struct iommu_domain *domain, struct device *dev)
 static void _omap_iommu_detach_dev(struct omap_iommu_domain *omap_domain,
 				   struct device *dev)
 {
-	struct omap_iommu_arch_data *arch_data = dev->archdata.iommu;
+	struct omap_iommu_arch_data *arch_data = dev_iommu_priv_get(dev);
 	struct omap_iommu_device *iommu = omap_domain->iommus;
 	struct omap_iommu *oiommu;
 	int i;
@@ -1650,7 +1650,7 @@ static struct iommu_device *omap_iommu_probe_device(struct device *dev)
 	int num_iommus, i;
 
 	/*
-	 * Allocate the archdata iommu structure for DT-based devices.
+	 * Allocate the per-device iommu structure for DT-based devices.
 	 *
 	 * TODO: Simplify this when removing non-DT support completely from the
 	 * IOMMU users.
@@ -1698,7 +1698,7 @@ static struct iommu_device *omap_iommu_probe_device(struct device *dev)
 		of_node_put(np);
 	}
 
-	dev->archdata.iommu = arch_data;
+	dev_iommu_priv_set(dev, arch_data);
 
 	/*
 	 * use the first IOMMU alone for the sysfs device linking.
@@ -1712,19 +1712,19 @@ static struct iommu_device *omap_iommu_probe_device(struct device *dev)
 
 static void omap_iommu_release_device(struct device *dev)
 {
-	struct omap_iommu_arch_data *arch_data = dev->archdata.iommu;
+	struct omap_iommu_arch_data *arch_data = dev_iommu_priv_get(dev);
 
 	if (!dev->of_node || !arch_data)
 		return;
 
-	dev->archdata.iommu = NULL;
+	dev_iommu_priv_set(dev, NULL);
 	kfree(arch_data);
 
 }
 
 static struct iommu_group *omap_iommu_device_group(struct device *dev)
 {
-	struct omap_iommu_arch_data *arch_data = dev->archdata.iommu;
+	struct omap_iommu_arch_data *arch_data = dev_iommu_priv_get(dev);
 	struct iommu_group *group = ERR_PTR(-EINVAL);
 
 	if (!arch_data)
-- 
2.27.0


^ permalink raw reply related

* [PATCH 07/13] iommu/pamu: Use dev_iommu_priv_get/set()
From: Joerg Roedel @ 2020-06-25 13:08 UTC (permalink / raw)
  To: iommu
  Cc: linux-ia64, Heiko Stuebner, David Airlie, Joonas Lahtinen,
	Thierry Reding, Paul Mackerras, Will Deacon, Marek Szyprowski,
	Joerg Roedel, x86, Russell King, Catalin Marinas, Fenghua Yu,
	Joerg Roedel, intel-gfx, Jani Nikula, Rodrigo Vivi,
	Matthias Brugger, linux-arm-kernel, Tony Luck, linuxppc-dev,
	linux-kernel, Daniel Vetter, David Woodhouse, Lu Baolu
In-Reply-To: <20200625130836.1916-1-joro@8bytes.org>

From: Joerg Roedel <jroedel@suse.de>

Remove the use of dev->archdata.iommu_domain and use the private
per-device pointer provided by IOMMU core code instead.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/fsl_pamu_domain.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index 928d37771ece..b2110767caf4 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -323,7 +323,7 @@ static void remove_device_ref(struct device_domain_info *info, u32 win_cnt)
 	pamu_disable_liodn(info->liodn);
 	spin_unlock_irqrestore(&iommu_lock, flags);
 	spin_lock_irqsave(&device_domain_lock, flags);
-	info->dev->archdata.iommu_domain = NULL;
+	dev_iommu_priv_set(info->dev, NULL);
 	kmem_cache_free(iommu_devinfo_cache, info);
 	spin_unlock_irqrestore(&device_domain_lock, flags);
 }
@@ -352,7 +352,7 @@ static void attach_device(struct fsl_dma_domain *dma_domain, int liodn, struct d
 	 * Check here if the device is already attached to domain or not.
 	 * If the device is already attached to a domain detach it.
 	 */
-	old_domain_info = dev->archdata.iommu_domain;
+	old_domain_info = dev_iommu_priv_get(dev);
 	if (old_domain_info && old_domain_info->domain != dma_domain) {
 		spin_unlock_irqrestore(&device_domain_lock, flags);
 		detach_device(dev, old_domain_info->domain);
@@ -371,8 +371,8 @@ static void attach_device(struct fsl_dma_domain *dma_domain, int liodn, struct d
 	 * the info for the first LIODN as all
 	 * LIODNs share the same domain
 	 */
-	if (!dev->archdata.iommu_domain)
-		dev->archdata.iommu_domain = info;
+	if (!dev_iommu_priv_get(dev))
+		dev_iommu_priv_set(dev, info);
 	spin_unlock_irqrestore(&device_domain_lock, flags);
 }
 
-- 
2.27.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox