LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 0/2] fix & prevent the missing preemption disabling
From: 王贇 @ 2021-10-26  3:14 UTC (permalink / raw)
  To: Guo Ren, Steven Rostedt, Ingo Molnar, James E.J. Bottomley,
	Helge Deller, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Borislav Petkov, x86, H. Peter Anvin,
	Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
	Joe Lawrence, Masami Hiramatsu, Peter Zijlstra (Intel),
	Michael Wang, Nicholas Piggin, Jisheng Zhang, linux-csky,
	linux-kernel, linux-parisc, linuxppc-dev, linux-riscv,
	live-patching

The testing show that perf_ftrace_function_call() are using smp_processor_id()
with preemption enabled, all the checking on CPU could be wrong after preemption.

As Peter point out, the section between ftrace_test_recursion_trylock/unlock()
pair require the preemption to be disabled as 'Documentation/trace/ftrace-uses.rst'
explained, but currently the work is done outside of the helpers.

And since the internal using of trace_test_and_set_recursion()
and trace_clear_recursion() also require preemption to be disabled, we
can just merge the logical together.

Patch 1/2 will make sure preemption disabled when recursion lock succeed,
patch 2/2 will do smp_processor_id() checking after trylock() to address the
issue.

v1: https://lore.kernel.org/all/8c7de46d-9869-aa5e-2bb9-5dbc2eda395e@linux.alibaba.com/
v2: https://lore.kernel.org/all/b1d7fe43-ce84-0ed7-32f7-ea1d12d0b716@linux.alibaba.com/
v3: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/
V4: https://lore.kernel.org/all/32a36348-69ee-6464-390c-3a8d6e9d2b53@linux.alibaba.com/

Michael Wang (2):
  ftrace: disable preemption when recursion locked
  ftrace: do CPU checking after preemption disabled

 arch/csky/kernel/probes/ftrace.c     |  2 --
 arch/parisc/kernel/ftrace.c          |  2 --
 arch/powerpc/kernel/kprobes-ftrace.c |  2 --
 arch/riscv/kernel/probes/ftrace.c    |  2 --
 arch/x86/kernel/kprobes/ftrace.c     |  2 --
 include/linux/trace_recursion.h      | 11 ++++++++++-
 kernel/livepatch/patch.c             | 13 +++++++------
 kernel/trace/ftrace.c                | 15 +++++----------
 kernel/trace/trace_event_perf.c      |  6 +++---
 kernel/trace/trace_functions.c       |  5 -----
 10 files changed, 25 insertions(+), 35 deletions(-)

-- 
1.8.3.1


^ permalink raw reply

* Re: [PATCH v4 0/2] fix & prevent the missing preemption disabling
From: 王贇 @ 2021-10-26  2:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra (Intel), Paul Walmsley, James E.J. Bottomley,
	Guo Ren, Jisheng Zhang, H. Peter Anvin, live-patching,
	linux-riscv, Miroslav Benes, Paul Mackerras, Joe Lawrence,
	Helge Deller, x86, linux-csky, Ingo Molnar, Petr Mladek,
	Albert Ou, Jiri Kosina, Nicholas Piggin, Borislav Petkov,
	Josh Poimboeuf, Thomas Gleixner, linux-parisc, linux-kernel,
	Palmer Dabbelt, Masami Hiramatsu, Colin Ian King, linuxppc-dev
In-Reply-To: <20211025224233.61b8e088@rorschach.local.home>



On 2021/10/26 上午10:42, Steven Rostedt wrote:
> On Tue, 26 Oct 2021 10:09:12 +0800
> 王贇 <yun.wang@linux.alibaba.com> wrote:
> 
>> Just a ping, to see if there are any more comments :-P
> 
> I guess you missed what went into mainline (and your name found a bug
> in my perl script for importing patches ;-)
> 
>   https://lore.kernel.org/all/20211019091344.65629198@gandalf.local.home/

Cool~ Missing some chinese font maybe, that's fine :-)

> 
> Which means patch 1 needs to change:
>> +	/*
>> +	 * Disable preemption to fulfill the promise.
>> +	 *
>> +	 * Don't worry about the bit 0 cases, they indicate
>> +	 * the disabling behaviour has already been done by
>> +	 * internal call previously.
>> +	 */
>> +	preempt_disable_notrace();
>> +
>>  	return bit + 1;
>>  }
>>
>> +/*
>> + * Preemption will be enabled (if it was previously enabled).
>> + */
>>  static __always_inline void trace_clear_recursion(int bit)
>>  {
>>  	if (!bit)
>>  		return;
>>
>> +	if (bit > 0)
>> +		preempt_enable_notrace();
>> +
> 
> Where this wont work anymore.
> 
> Need to preempt disable and enable always.

Yup, will rebase on the latest changes~

Regards,
Michael Wang

> 
> -- Steve
> 
> 
>>  	barrier();
>>  	bit--;
>>  	trace_recursion_clear(bit);
>> @@ -209,7 +227,7 @@ static __always_inline void trace_clear_recursion(int bit)
>>   * tracing recursed in the same context (normal vs interrupt),
>>   *

^ permalink raw reply

* Re: [PATCH v4 0/2] fix & prevent the missing preemption disabling
From: Steven Rostedt @ 2021-10-26  2:42 UTC (permalink / raw)
  To: 王贇
  Cc: Peter Zijlstra (Intel), Paul Walmsley, James E.J. Bottomley,
	Guo Ren, Jisheng Zhang, H. Peter Anvin, live-patching,
	linux-riscv, Miroslav Benes, Paul Mackerras, Joe Lawrence,
	Helge Deller, x86, linux-csky, Ingo Molnar, Petr Mladek,
	Albert Ou, Jiri Kosina, Nicholas Piggin, Borislav Petkov,
	Josh Poimboeuf, Thomas Gleixner, linux-parisc, linux-kernel,
	Palmer Dabbelt, Masami Hiramatsu, Colin Ian King, linuxppc-dev
In-Reply-To: <71c21f78-9c44-fdb2-f8e2-d8544b3421bd@linux.alibaba.com>

On Tue, 26 Oct 2021 10:09:12 +0800
王贇 <yun.wang@linux.alibaba.com> wrote:

> Just a ping, to see if there are any more comments :-P

I guess you missed what went into mainline (and your name found a bug
in my perl script for importing patches ;-)

  https://lore.kernel.org/all/20211019091344.65629198@gandalf.local.home/

Which means patch 1 needs to change:

> +	/*
> +	 * Disable preemption to fulfill the promise.
> +	 *
> +	 * Don't worry about the bit 0 cases, they indicate
> +	 * the disabling behaviour has already been done by
> +	 * internal call previously.
> +	 */
> +	preempt_disable_notrace();
> +
>  	return bit + 1;
>  }
> 
> +/*
> + * Preemption will be enabled (if it was previously enabled).
> + */
>  static __always_inline void trace_clear_recursion(int bit)
>  {
>  	if (!bit)
>  		return;
> 
> +	if (bit > 0)
> +		preempt_enable_notrace();
> +

Where this wont work anymore.

Need to preempt disable and enable always.

-- Steve


>  	barrier();
>  	bit--;
>  	trace_recursion_clear(bit);
> @@ -209,7 +227,7 @@ static __always_inline void trace_clear_recursion(int bit)
>   * tracing recursed in the same context (normal vs interrupt),
>   *

^ permalink raw reply

* Re: [PATCH] mm/migrate.c: Remove MIGRATE_PFN_LOCKED
From: Ralph Campbell @ 2021-10-26  0:57 UTC (permalink / raw)
  To: Alistair Popple, akpm
  Cc: amd-gfx, nouveau, Felix.Kuehling, linux-kernel, kvm-ppc, linux-mm,
	jglisse, dri-devel, ziy, jhubbard, alexander.deucher,
	linuxppc-dev, hch, bskeggs
In-Reply-To: <20211025041608.289017-1-apopple@nvidia.com>


On 10/24/21 21:16, Alistair Popple wrote:
> MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a
> source page was already locked during migrate_vma_collect(). If it
> wasn't then the a second attempt is made to lock the page. However if
> the first attempt failed it's unlikely a second attempt will succeed,
> and the retry adds complexity. So clean this up by removing the retry
> and MIGRATE_PFN_LOCKED flag.
>
> Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag
> set, but nothing actually checks that.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>

You can add:
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
>   Documentation/vm/hmm.rst                 |   2 +-
>   arch/powerpc/kvm/book3s_hv_uvmem.c       |   4 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |   2 -
>   drivers/gpu/drm/nouveau/nouveau_dmem.c   |   4 +-
>   include/linux/migrate.h                  |   1 -
>   lib/test_hmm.c                           |   5 +-
>   mm/migrate.c                             | 145 +++++------------------
>   7 files changed, 35 insertions(+), 128 deletions(-)
>
> diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
> index a14c2938e7af..f2a59ed82ed3 100644
> --- a/Documentation/vm/hmm.rst
> +++ b/Documentation/vm/hmm.rst
> @@ -360,7 +360,7 @@ between device driver specific code and shared common code:
>      system memory page, locks the page with ``lock_page()``, and fills in the
>      ``dst`` array entry with::
>   
> -     dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
> +     dst[i] = migrate_pfn(page_to_pfn(dpage));
>   
>      Now that the driver knows that this page is being migrated, it can
>      invalidate device private MMU mappings and copy device private memory
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index a7061ee3b157..28c436df9935 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -560,7 +560,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
>   				  gpa, 0, page_shift);
>   
>   	if (ret == U_SUCCESS)
> -		*mig.dst = migrate_pfn(pfn) | MIGRATE_PFN_LOCKED;
> +		*mig.dst = migrate_pfn(pfn);
>   	else {
>   		unlock_page(dpage);
>   		__free_page(dpage);
> @@ -774,7 +774,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
>   		}
>   	}
>   
> -	*mig.dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
> +	*mig.dst = migrate_pfn(page_to_pfn(dpage));
>   	migrate_vma_pages(&mig);
>   out_finalize:
>   	migrate_vma_finalize(&mig);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> index 4a16e3c257b9..41d9417f182b 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -300,7 +300,6 @@ svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
>   			migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]);
>   			svm_migrate_get_vram_page(prange, migrate->dst[i]);
>   			migrate->dst[i] = migrate_pfn(migrate->dst[i]);
> -			migrate->dst[i] |= MIGRATE_PFN_LOCKED;
>   			src[i] = dma_map_page(dev, spage, 0, PAGE_SIZE,
>   					      DMA_TO_DEVICE);
>   			r = dma_mapping_error(dev, src[i]);
> @@ -580,7 +579,6 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>   			      dst[i] >> PAGE_SHIFT, page_to_pfn(dpage));
>   
>   		migrate->dst[i] = migrate_pfn(page_to_pfn(dpage));
> -		migrate->dst[i] |= MIGRATE_PFN_LOCKED;
>   		j++;
>   	}
>   
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index 92987daa5e17..3828aafd3ac4 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -166,7 +166,7 @@ static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
>   		goto error_dma_unmap;
>   	mutex_unlock(&svmm->mutex);
>   
> -	args->dst[0] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
> +	args->dst[0] = migrate_pfn(page_to_pfn(dpage));
>   	return 0;
>   
>   error_dma_unmap:
> @@ -602,7 +602,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>   		((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
>   	if (src & MIGRATE_PFN_WRITE)
>   		*pfn |= NVIF_VMM_PFNMAP_V0_W;
> -	return migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
> +	return migrate_pfn(page_to_pfn(dpage));
>   
>   out_dma_unmap:
>   	dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index c8077e936691..479b861ae490 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -119,7 +119,6 @@ static inline int migrate_misplaced_page(struct page *page,
>    */
>   #define MIGRATE_PFN_VALID	(1UL << 0)
>   #define MIGRATE_PFN_MIGRATE	(1UL << 1)
> -#define MIGRATE_PFN_LOCKED	(1UL << 2)
>   #define MIGRATE_PFN_WRITE	(1UL << 3)
>   #define MIGRATE_PFN_SHIFT	6
>   
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index c259842f6d44..e2ce8f9b7605 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -613,8 +613,7 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
>   		 */
>   		rpage->zone_device_data = dmirror;
>   
> -		*dst = migrate_pfn(page_to_pfn(dpage)) |
> -			    MIGRATE_PFN_LOCKED;
> +		*dst = migrate_pfn(page_to_pfn(dpage));
>   		if ((*src & MIGRATE_PFN_WRITE) ||
>   		    (!spage && args->vma->vm_flags & VM_WRITE))
>   			*dst |= MIGRATE_PFN_WRITE;
> @@ -1137,7 +1136,7 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
>   		lock_page(dpage);
>   		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
>   		copy_highpage(dpage, spage);
> -		*dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
> +		*dst = migrate_pfn(page_to_pfn(dpage));
>   		if (*src & MIGRATE_PFN_WRITE)
>   			*dst |= MIGRATE_PFN_WRITE;
>   	}
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a6a7743ee98f..915e969811d0 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2369,7 +2369,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   		 * can't be dropped from it).
>   		 */
>   		get_page(page);
> -		migrate->cpages++;

Why not move the get_page() into the "if (trylock_page())" instead
of calling put_page() in the else case.

>   
>   		/*
>   		 * Optimize for the common case where page is only mapped once
> @@ -2379,7 +2378,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   		if (trylock_page(page)) {
>   			pte_t swp_pte;
>   
> -			mpfn |= MIGRATE_PFN_LOCKED;
> +			migrate->cpages++;
>   			ptep_get_and_clear(mm, addr, ptep);

I was looking at try_to_migrate_one() and looking at the differences with
the code here to insert the migration PTE and noticed that instead of
ptet_get_and_clear() it has:
	pteval = ptep_clear_flush(vma, address, pvmw.pte);

	/* Move the dirty bit to the page. Now the pte is gone. */
	if (pte_dirty(pteval))
		set_page_dirty(page);
	update_hiwater_rss(mm);

I know that is pre-existing, probably a separate patch if it is an issue.

>   
>   			/* Setup special migration page table entry */
> @@ -2413,6 +2412,9 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   
>   			if (pte_present(pte))
>   				unmapped++;
> +		} else {
> +			put_page(page);
> +			mpfn = 0;
>   		}
>   
>   next:
> @@ -2517,15 +2519,17 @@ static bool migrate_vma_check_page(struct page *page)
>   }
>   
>   /*
> - * migrate_vma_prepare() - lock pages and isolate them from the lru
> + * migrate_vma_unmap() - replace page mapping with special migration pte entry
>    * @migrate: migrate struct containing all migration information
>    *
> - * This locks pages that have been collected by migrate_vma_collect(). Once each
> - * page is locked it is isolated from the lru (for non-device pages). Finally,
> - * the ref taken by migrate_vma_collect() is dropped, as locked pages cannot be
> - * migrated by concurrent kernel threads.
> + * Isolate pages from the LRU and replace mappings (CPU page table pte) with a
> + * special migration pte entry and check if it has been pinned. Pinned pages are
> + * restored because we cannot migrate them.
> + *
> + * This is the last step before we call the device driver callback to allocate
> + * destination memory and copy contents of original page over to new page.
>    */
> -static void migrate_vma_prepare(struct migrate_vma *migrate)
> +static void migrate_vma_unmap(struct migrate_vma *migrate)
>   {
>   	const unsigned long npages = migrate->npages;
>   	const unsigned long start = migrate->start;
> @@ -2534,32 +2538,12 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
>   
>   	lru_add_drain();
>   
> -	for (i = 0; (i < npages) && migrate->cpages; i++) {
> +	for (i = 0; i < npages; i++) {
>   		struct page *page = migrate_pfn_to_page(migrate->src[i]);
> -		bool remap = true;
>   
>   		if (!page)
>   			continue;
>   
> -		if (!(migrate->src[i] & MIGRATE_PFN_LOCKED)) {
> -			/*
> -			 * Because we are migrating several pages there can be
> -			 * a deadlock between 2 concurrent migration where each
> -			 * are waiting on each other page lock.
> -			 *
> -			 * Make migrate_vma() a best effort thing and backoff
> -			 * for any page we can not lock right away.
> -			 */
> -			if (!trylock_page(page)) {
> -				migrate->src[i] = 0;
> -				migrate->cpages--;
> -				put_page(page);
> -				continue;
> -			}
> -			remap = false;
> -			migrate->src[i] |= MIGRATE_PFN_LOCKED;
> -		}
> -
>   		/* ZONE_DEVICE pages are not on LRU */
>   		if (!is_zone_device_page(page)) {
>   			if (!PageLRU(page) && allow_drain) {
> @@ -2569,16 +2553,9 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
>   			}
>   
>   			if (isolate_lru_page(page)) {
> -				if (remap) {
> -					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> -					migrate->cpages--;
> -					restore++;
> -				} else {
> -					migrate->src[i] = 0;
> -					unlock_page(page);
> -					migrate->cpages--;
> -					put_page(page);
> -				}
> +				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> +				migrate->cpages--;
> +				restore++;
>   				continue;
>   			}
>   
> @@ -2586,80 +2563,20 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
>   			put_page(page);
>   		}
>   
> -		if (!migrate_vma_check_page(page)) {
> -			if (remap) {
> -				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> -				migrate->cpages--;
> -				restore++;
> -
> -				if (!is_zone_device_page(page)) {
> -					get_page(page);
> -					putback_lru_page(page);
> -				}
> -			} else {
> -				migrate->src[i] = 0;
> -				unlock_page(page);
> -				migrate->cpages--;
> +		if (page_mapped(page))
> +			try_to_migrate(page, 0);
>   
> -				if (!is_zone_device_page(page))
> -					putback_lru_page(page);
> -				else
> -					put_page(page);
> +		if (page_mapped(page) || !migrate_vma_check_page(page)) {
> +			if (!is_zone_device_page(page)) {
> +				get_page(page);
> +				putback_lru_page(page);
>   			}
> -		}
> -	}
> -
> -	for (i = 0, addr = start; i < npages && restore; i++, addr += PAGE_SIZE) {
> -		struct page *page = migrate_pfn_to_page(migrate->src[i]);
> -
> -		if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
> -			continue;
>   
> -		remove_migration_pte(page, migrate->vma, addr, page);
> -
> -		migrate->src[i] = 0;
> -		unlock_page(page);
> -		put_page(page);
> -		restore--;
> -	}
> -}
> -
> -/*
> - * migrate_vma_unmap() - replace page mapping with special migration pte entry
> - * @migrate: migrate struct containing all migration information
> - *
> - * Replace page mapping (CPU page table pte) with a special migration pte entry
> - * and check again if it has been pinned. Pinned pages are restored because we
> - * cannot migrate them.
> - *
> - * This is the last step before we call the device driver callback to allocate
> - * destination memory and copy contents of original page over to new page.
> - */
> -static void migrate_vma_unmap(struct migrate_vma *migrate)
> -{
> -	const unsigned long npages = migrate->npages;
> -	const unsigned long start = migrate->start;
> -	unsigned long addr, i, restore = 0;
> -
> -	for (i = 0; i < npages; i++) {
> -		struct page *page = migrate_pfn_to_page(migrate->src[i]);
> -
> -		if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE))
> +			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> +			migrate->cpages--;
> +			restore++;
>   			continue;
> -
> -		if (page_mapped(page)) {
> -			try_to_migrate(page, 0);
> -			if (page_mapped(page))
> -				goto restore;
>   		}
> -
> -		if (migrate_vma_check_page(page))
> -			continue;
> -
> -restore:
> -		migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> -		migrate->cpages--;
> -		restore++;
>   	}
>   
>   	for (addr = start, i = 0; i < npages && restore; addr += PAGE_SIZE, i++) {
> @@ -2672,12 +2589,8 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
>   
>   		migrate->src[i] = 0;
>   		unlock_page(page);
> +		put_page(page);
>   		restore--;
> -
> -		if (is_zone_device_page(page))
> -			put_page(page);
> -		else
> -			putback_lru_page(page);
>   	}
>   }
>   
> @@ -2700,8 +2613,8 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
>    * it for all those entries (ie with MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE
>    * flag set).  Once these are allocated and copied, the caller must update each
>    * corresponding entry in the dst array with the pfn value of the destination
> - * page and with the MIGRATE_PFN_VALID and MIGRATE_PFN_LOCKED flags set
> - * (destination pages must have their struct pages locked, via lock_page()).
> + * page and with MIGRATE_PFN_VALID. Destination pages must be locked via
> + * lock_page().
>    *
>    * Note that the caller does not have to migrate all the pages that are marked
>    * with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration from
> @@ -2770,8 +2683,6 @@ int migrate_vma_setup(struct migrate_vma *args)
>   
>   	migrate_vma_collect(args);
>   
> -	if (args->cpages)
> -		migrate_vma_prepare(args);
>   	if (args->cpages)
>   		migrate_vma_unmap(args);
>   

^ permalink raw reply

* linux-next: manual merge of the audit tree with the powerpc tree
From: Stephen Rothwell @ 2021-10-26  2:31 UTC (permalink / raw)
  To: Paul Moore, Michael Ellerman, PowerPC
  Cc: Richard Guy Briggs, Linux Next Mailing List,
	Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 906 bytes --]

Hi all,

Today's linux-next merge of the audit tree got conflicts in:

  arch/powerpc/kernel/audit.c
  arch/powerpc/kernel/compat_audit.c

between commit:

  566af8cda399 ("powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC")

from the powerpc tree and commits:

  42f355ef59a2 ("audit: replace magic audit syscall class numbers with macros")
  1c30e3af8a79 ("audit: add support for the openat2 syscall")

from the audit tree.

I fixed it up (I just removed the files like the former commit) and can
carry the fix as necessary. This is now fixed as far as linux-next is
concerned, but any non trivial conflicts should be mentioned to your
upstream maintainer when your tree is submitted for merging.  You may
also want to consider cooperating with the maintainer of the conflicting
tree to minimise any particularly complex conflicts.



-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v4 0/2] fix & prevent the missing preemption disabling
From: 王贇 @ 2021-10-26  2:09 UTC (permalink / raw)
  To: Guo Ren, Steven Rostedt, Ingo Molnar, James E.J. Bottomley,
	Helge Deller, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Thomas Gleixner, Borislav Petkov, x86, H. Peter Anvin,
	Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek,
	Joe Lawrence, Colin Ian King, Masami Hiramatsu,
	Peter Zijlstra (Intel), Nicholas Piggin, Jisheng Zhang,
	linux-csky, linux-kernel, linux-parisc, linuxppc-dev, linux-riscv,
	live-patching
In-Reply-To: <32a36348-69ee-6464-390c-3a8d6e9d2b53@linux.alibaba.com>

Just a ping, to see if there are any more comments :-P

Regards,
Michael Wang

On 2021/10/18 上午11:38, 王贇 wrote:
> The testing show that perf_ftrace_function_call() are using smp_processor_id()
> with preemption enabled, all the checking on CPU could be wrong after preemption.
> 
> As Peter point out, the section between ftrace_test_recursion_trylock/unlock()
> pair require the preemption to be disabled as 'Documentation/trace/ftrace-uses.rst'
> explained, but currently the work is done outside of the helpers.
> 
> And since the internal using of trace_test_and_set_recursion()
> and trace_clear_recursion() also require preemption to be disabled, we
> can just merge the logical together.
> 
> Patch 1/2 will make sure preemption disabled when recursion lock succeed,
> patch 2/2 will do smp_processor_id() checking after trylock() to address the
> issue.
> 
> v1: https://lore.kernel.org/all/8c7de46d-9869-aa5e-2bb9-5dbc2eda395e@linux.alibaba.com/
> v2: https://lore.kernel.org/all/b1d7fe43-ce84-0ed7-32f7-ea1d12d0b716@linux.alibaba.com/
> v3: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/
> 
> Michael Wang (2):
>   ftrace: disable preemption when recursion locked
>   ftrace: do CPU checking after preemption disabled
> 
>  arch/csky/kernel/probes/ftrace.c     |  2 --
>  arch/parisc/kernel/ftrace.c          |  2 --
>  arch/powerpc/kernel/kprobes-ftrace.c |  2 --
>  arch/riscv/kernel/probes/ftrace.c    |  2 --
>  arch/x86/kernel/kprobes/ftrace.c     |  2 --
>  include/linux/trace_recursion.h      | 20 +++++++++++++++++++-
>  kernel/livepatch/patch.c             | 13 +++++++------
>  kernel/trace/ftrace.c                | 15 +++++----------
>  kernel/trace/trace_event_perf.c      |  6 +++---
>  kernel/trace/trace_functions.c       |  5 -----
>  10 files changed, 34 insertions(+), 35 deletions(-)
> 

^ permalink raw reply

* Re: [PATCH] macintosh/via-pmu-led: make disk activity usage a parameter.
From: kernel test robot @ 2021-10-26  0:35 UTC (permalink / raw)
  To: Hill Ma, benh, linuxppc-dev; +Cc: Hill Ma, kbuild-all, linux-kernel, linux-doc
In-Reply-To: <20211025072508.8667-1-maahiuzeon@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4320 bytes --]

Hi Hill,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.15-rc7 next-20211025]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Hill-Ma/macintosh-via-pmu-led-make-disk-activity-usage-a-parameter/20211025-152845
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 87066fdd2e30fe9dd531125d95257c118a74617e
config: powerpc-pmac32_defconfig (attached as .config)
compiler: powerpc-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/9f9763c0836766055225087cee4126f8d2974252
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Hill-Ma/macintosh-via-pmu-led-make-disk-activity-usage-a-parameter/20211025-152845
        git checkout 9f9763c0836766055225087cee4126f8d2974252
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from drivers/macintosh/via-pmu-led.c:28:
   drivers/macintosh/via-pmu-led.c: In function '__check_adb_pmu_led_disk':
>> include/linux/moduleparam.h:329:34: error: returning 'int *' from a function with incompatible return type 'bool *' {aka '_Bool *'} [-Werror=incompatible-pointer-types]
     329 |         param_check_##type(name, &(var));                               \
         |                                  ^
   include/linux/moduleparam.h:409:75: note: in definition of macro '__param_check'
     409 |         static inline type __always_unused *__check_##name(void) { return(p); }
         |                                                                           ^
   include/linux/moduleparam.h:329:9: note: in expansion of macro 'param_check_bool'
     329 |         param_check_##type(name, &(var));                               \
         |         ^~~~~~~~~~~~
   drivers/macintosh/via-pmu-led.c:120:1: note: in expansion of macro 'core_param'
     120 | core_param(adb_pmu_led_disk, adb_pmu_led_disk, bool, 0644);
         | ^~~~~~~~~~
   cc1: some warnings being treated as errors


vim +329 include/linux/moduleparam.h

907b29eb41aa604 Rusty Russell   2010-08-11  314  
67e67ceaac5bf55 Rusty Russell   2008-10-22  315  #ifndef MODULE
67e67ceaac5bf55 Rusty Russell   2008-10-22  316  /**
67e67ceaac5bf55 Rusty Russell   2008-10-22  317   * core_param - define a historical core kernel parameter.
67e67ceaac5bf55 Rusty Russell   2008-10-22  318   * @name: the name of the cmdline and sysfs parameter (often the same as var)
67e67ceaac5bf55 Rusty Russell   2008-10-22  319   * @var: the variable
546970bc6afc7fb Rusty Russell   2010-08-11  320   * @type: the type of the parameter
67e67ceaac5bf55 Rusty Russell   2008-10-22  321   * @perm: visibility in sysfs
67e67ceaac5bf55 Rusty Russell   2008-10-22  322   *
67e67ceaac5bf55 Rusty Russell   2008-10-22  323   * core_param is just like module_param(), but cannot be modular and
67e67ceaac5bf55 Rusty Russell   2008-10-22  324   * doesn't add a prefix (such as "printk.").  This is for compatibility
67e67ceaac5bf55 Rusty Russell   2008-10-22  325   * with __setup(), and it makes sense as truly core parameters aren't
67e67ceaac5bf55 Rusty Russell   2008-10-22  326   * tied to the particular file they're in.
67e67ceaac5bf55 Rusty Russell   2008-10-22  327   */
67e67ceaac5bf55 Rusty Russell   2008-10-22  328  #define core_param(name, var, type, perm)				\
67e67ceaac5bf55 Rusty Russell   2008-10-22 @329  	param_check_##type(name, &(var));				\
91f9d330cc14932 Jani Nikula     2014-08-27  330  	__module_param_call("", name, &param_ops_##type, &var, perm, -1, 0)
ec0ccc16a09fc32 Dmitry Torokhov 2015-03-30  331  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 25952 bytes --]

^ permalink raw reply

* [powerpc:fixes-test] BUILD SUCCESS d853adc7adf601d7d6823afe3ed396065a3e08d1
From: kernel test robot @ 2021-10-26  0:09 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes-test
branch HEAD: d853adc7adf601d7d6823afe3ed396065a3e08d1  powerpc/pseries/iommu: Create huge DMA window if no MMIO32 is present

elapsed time: 735m

configs tested: 38
configs skipped: 100

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm                                 defconfig
arm64                            allyesconfig
arm64                               defconfig
i386                 randconfig-c001-20211025
mips                           ip28_defconfig
sh                           se7206_defconfig
sh                                  defconfig
sh                             espt_defconfig
powerpc                         wii_defconfig
ia64                             allmodconfig
ia64                                defconfig
ia64                             allyesconfig
m68k                                defconfig
nios2                               defconfig
nds32                             allnoconfig
nds32                               defconfig
csky                                defconfig
alpha                               defconfig
alpha                            allyesconfig
h8300                            allyesconfig
arc                                 defconfig
sh                               allmodconfig
parisc                              defconfig
s390                             allyesconfig
parisc                           allyesconfig
s390                                defconfig
sparc                            allyesconfig
sparc                               defconfig
i386                                defconfig
i386                              debian-10.3
mips                             allyesconfig
mips                             allmodconfig
powerpc                          allyesconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
x86_64                    rhel-8.3-kselftests
um                           x86_64_defconfig
um                             i386_defconfig

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* Re: [PATCH] powerpc/bpf: fix write protecting JIT code
From: Daniel Borkmann @ 2021-10-25 22:54 UTC (permalink / raw)
  To: Naveen N. Rao, ast, christophe.leroy, Hari Bathini, jniethe5, mpe
  Cc: songliubraving, netdev, john.fastabend, kpsingh, stable, andrii,
	paulus, yhs, bpf, linuxppc-dev, kafai
In-Reply-To: <1635142354.46h6w5c2rx.naveen@linux.ibm.com>

On 10/25/21 8:15 AM, Naveen N. Rao wrote:
> Hari Bathini wrote:
>> Running program with bpf-to-bpf function calls results in data access
>> exception (0x300) with the below call trace:
>>
>>     [c000000000113f28] bpf_int_jit_compile+0x238/0x750 (unreliable)
>>     [c00000000037d2f8] bpf_check+0x2008/0x2710
>>     [c000000000360050] bpf_prog_load+0xb00/0x13a0
>>     [c000000000361d94] __sys_bpf+0x6f4/0x27c0
>>     [c000000000363f0c] sys_bpf+0x2c/0x40
>>     [c000000000032434] system_call_exception+0x164/0x330
>>     [c00000000000c1e8] system_call_vectored_common+0xe8/0x278
>>
>> as bpf_int_jit_compile() tries writing to write protected JIT code
>> location during the extra pass.
>>
>> Fix it by holding off write protection of JIT code until the extra
>> pass, where branch target addresses fixup happens.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 62e3d4210ac9 ("powerpc/bpf: Write protect JIT code")
>> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
>> ---
>>  arch/powerpc/net/bpf_jit_comp.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Thanks for the fix!
> 
> Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>

LGTM, I presume this fix will be routed via Michael.

BPF selftests have plenty of BPF-to-BPF calls in there, too bad this was
caught so late. :/

^ permalink raw reply

* Re: [PATCH v1 3/8] powerpc/fsl_booke: Take exec flag into account when setting TLBCAMs
From: Christophe Leroy @ 2021-10-25 21:57 UTC (permalink / raw)
  To: kernel test robot, linuxppc-dev@lists.ozlabs.org; +Cc: kbuild-all
In-Reply-To: <202110221445.ZMrc3M2c-lkp@intel.com>



On 22/10/2021 08:36, kernel test robot wrote:
> Hi Christophe,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on powerpc/next]
> [also build test ERROR on v5.15-rc6 next-20211021]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
> 
> url:    https://github.com/0day-ci/linux/commits/Christophe-Leroy/powerpc-booke-Disable-STRICT_KERNEL_RWX-DEBUG_PAGEALLOC-and-KFENCE/20211015-180535
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> config: powerpc-tqm8541_defconfig (attached as .config)
> compiler: powerpc-linux-gcc (GCC) 11.2.0
> reproduce (this is a W=1 build):
>          wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>          chmod +x ~/bin/make.cross
>          # https://github.com/0day-ci/linux/commit/159ed9a0b39712475dfebed64d1bb9387a0b9ad5
>          git remote add linux-review https://github.com/0day-ci/linux
>          git fetch --no-tags linux-review Christophe-Leroy/powerpc-booke-Disable-STRICT_KERNEL_RWX-DEBUG_PAGEALLOC-and-KFENCE/20211015-180535
>          git checkout 159ed9a0b39712475dfebed64d1bb9387a0b9ad5
>          # save the attached .config to linux build tree
>          mkdir build_dir
>          COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=powerpc SHELL=/bin/bash arch/powerpc/mm/nohash/
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All errors (new ones prefixed by >>):
> 
>     arch/powerpc/mm/nohash/fsl_book3e.c:63:15: error: no previous prototype for 'tlbcam_sz' [-Werror=missing-prototypes]
>        63 | unsigned long tlbcam_sz(int idx)
>           |               ^~~~~~~~~
>     arch/powerpc/mm/nohash/fsl_book3e.c: In function 'settlbcam':
>>> arch/powerpc/mm/nohash/fsl_book3e.c:126:40: error: '_PAGE_BAP_SX' undeclared (first use in this function)
>       126 |         TLBCAM[index].MAS3 |= (flags & _PAGE_BAP_SX) ? MAS3_SX : 0;
>           |                                        ^~~~~~~~~~~~
>     arch/powerpc/mm/nohash/fsl_book3e.c:126:40: note: each undeclared identifier is reported only once for each function it appears in
>     cc1: all warnings being treated as errors
> 

Thanks Robot for reporting that.

The problem is not trivial and is in fact deeper, we have a 
misdefinition of _PAGE_EXEC on book3e.

I sent a v2 which adds two patches at the begining of the series to 
clear that problem, then I fixed this patch 3 (which has become patch 5) 
to use _PAGE_EXEC instead of _PAGE_BAP_SX.

Christophe

> 
> vim +/_PAGE_BAP_SX +126 arch/powerpc/mm/nohash/fsl_book3e.c
> 
>     114	
>     115		TLBCAM[index].MAS0 = MAS0_TLBSEL(1) | MAS0_ESEL(index) | MAS0_NV(index+1);
>     116		TLBCAM[index].MAS1 = MAS1_VALID | MAS1_IPROT | MAS1_TSIZE(tsize) | MAS1_TID(pid);
>     117		TLBCAM[index].MAS2 = virt & PAGE_MASK;
>     118	
>     119		TLBCAM[index].MAS2 |= (flags & _PAGE_WRITETHRU) ? MAS2_W : 0;
>     120		TLBCAM[index].MAS2 |= (flags & _PAGE_NO_CACHE) ? MAS2_I : 0;
>     121		TLBCAM[index].MAS2 |= (flags & _PAGE_COHERENT) ? MAS2_M : 0;
>     122		TLBCAM[index].MAS2 |= (flags & _PAGE_GUARDED) ? MAS2_G : 0;
>     123		TLBCAM[index].MAS2 |= (flags & _PAGE_ENDIAN) ? MAS2_E : 0;
>     124	
>     125		TLBCAM[index].MAS3 = (phys & MAS3_RPN) | MAS3_SR;
>   > 126		TLBCAM[index].MAS3 |= (flags & _PAGE_BAP_SX) ? MAS3_SX : 0;
>     127		TLBCAM[index].MAS3 |= (flags & _PAGE_RW) ? MAS3_SW : 0;
>     128		if (mmu_has_feature(MMU_FTR_BIG_PHYS))
>     129			TLBCAM[index].MAS7 = (u64)phys >> 32;
>     130	
>     131		/* Below is unlikely -- only for large user pages or similar */
>     132		if (pte_user(__pte(flags))) {
>     133			TLBCAM[index].MAS3 |= MAS3_UR;
>     134			TLBCAM[index].MAS3 |= (flags & _PAGE_EXEC) ? MAS3_UX : 0;
>     135			TLBCAM[index].MAS3 |= (flags & _PAGE_RW) ? MAS3_UW : 0;
>     136		}
>     137	
>     138		tlbcam_addrs[index].start = virt;
>     139		tlbcam_addrs[index].limit = virt + size - 1;
>     140		tlbcam_addrs[index].phys = phys;
>     141	}
>     142	
> 
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
> 

^ permalink raw reply

* Re: [PATCH v2 02/10] powerpc/book3e: Fix set_memory_x() and set_memory_nx()
From: Christophe Leroy @ 2021-10-25 21:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <7e7b0688c907e54f3b11ddfb9a8f44511d475fd7.1634983809.git.christophe.leroy@csgroup.eu>



On 23/10/2021 13:47, Christophe Leroy wrote:
> set_memory_x() calls pte_mkexec() which sets _PAGE_EXEC.
> set_memory_nx() calls pte_exprotec() which clears _PAGE_EXEC.
> 
> Book3e has 2 bits, UX and SX, which defines the exec rights
> resp. for user (PR=1) and for kernel (PR=0).
> 
> _PAGE_EXEC is defined as UX only.
> 
> An executable kernel page is set with either _PAGE_KERNEL_RWX
> or _PAGE_KERNEL_ROX, which both have SX set and UX cleared.
> 
> So set_memory_nx() call for an executable kernel page does
> nothing because UX is already cleared.
> 
> And set_memory_x() on a non-executable kernel page makes it
> executable for the user and keeps it non-executable for kernel.
> 
> Also, pte_exec() always returns 'false' on kernel pages, because
> it checks _PAGE_EXEC which doesn't include SX, so for instance
> the W+X check doesn't work.
> 
> To fix this:
> - change tlb_low_64e.S to use _PAGE_BAP_UX instead of _PAGE_USER
> - sets both UX and SX in _PAGE_EXEC so that pte_user() returns
> true whenever one of the two bits is set and pte_exprotect()
> clears both bits.
> - Define a book3e specific version of pte_mkexec() which sets
> either SX or UX based on UR.
> 
> Fixes: 1f9ad21c3b38 ("powerpc/mm: Implement set_memory() routines")
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
> v2: New

pte_mkexec() in nohash/64/pgtable.h conflicts with the one in 
nohash/pte_book3e.h

Should guard it with  #ifndef pte_mkexec(), but as pte_book3e is the 
only user in 64 bits, then just remove it from there.

Send v3 with only that change compared to v2.

Christophe

> ---
>   arch/powerpc/include/asm/nohash/32/pgtable.h |  2 ++
>   arch/powerpc/include/asm/nohash/pte-book3e.h | 18 ++++++++++++++----
>   arch/powerpc/mm/nohash/tlb_low_64e.S         |  8 ++++----
>   3 files changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
> index ac0a5ff48c3a..d6ba821a56ce 100644
> --- a/arch/powerpc/include/asm/nohash/32/pgtable.h
> +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
> @@ -193,10 +193,12 @@ static inline pte_t pte_wrprotect(pte_t pte)
>   }
>   #endif
>   
> +#ifndef pte_mkexec
>   static inline pte_t pte_mkexec(pte_t pte)
>   {
>   	return __pte(pte_val(pte) | _PAGE_EXEC);
>   }
> +#endif
>   
>   #define pmd_none(pmd)		(!pmd_val(pmd))
>   #define	pmd_bad(pmd)		(pmd_val(pmd) & _PMD_BAD)
> diff --git a/arch/powerpc/include/asm/nohash/pte-book3e.h b/arch/powerpc/include/asm/nohash/pte-book3e.h
> index 813918f40765..f798640422c2 100644
> --- a/arch/powerpc/include/asm/nohash/pte-book3e.h
> +++ b/arch/powerpc/include/asm/nohash/pte-book3e.h
> @@ -48,7 +48,7 @@
>   #define _PAGE_WRITETHRU	0x800000 /* W: cache write-through */
>   
>   /* "Higher level" linux bit combinations */
> -#define _PAGE_EXEC		_PAGE_BAP_UX /* .. and was cache cleaned */
> +#define _PAGE_EXEC		(_PAGE_BAP_SX | _PAGE_BAP_UX) /* .. and was cache cleaned */
>   #define _PAGE_RW		(_PAGE_BAP_SW | _PAGE_BAP_UW) /* User write permission */
>   #define _PAGE_KERNEL_RW		(_PAGE_BAP_SW | _PAGE_BAP_SR | _PAGE_DIRTY)
>   #define _PAGE_KERNEL_RO		(_PAGE_BAP_SR)
> @@ -93,11 +93,11 @@
>   /* Permission masks used to generate the __P and __S table */
>   #define PAGE_NONE	__pgprot(_PAGE_BASE)
>   #define PAGE_SHARED	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_RW)
> -#define PAGE_SHARED_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_RW | _PAGE_EXEC)
> +#define PAGE_SHARED_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_RW | _PAGE_BAP_UX)
>   #define PAGE_COPY	__pgprot(_PAGE_BASE | _PAGE_USER)
> -#define PAGE_COPY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC)
> +#define PAGE_COPY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_BAP_UX)
>   #define PAGE_READONLY	__pgprot(_PAGE_BASE | _PAGE_USER)
> -#define PAGE_READONLY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC)
> +#define PAGE_READONLY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_BAP_UX)
>   
>   #ifndef __ASSEMBLY__
>   static inline pte_t pte_mkprivileged(pte_t pte)
> @@ -113,6 +113,16 @@ static inline pte_t pte_mkuser(pte_t pte)
>   }
>   
>   #define pte_mkuser pte_mkuser
> +
> +static inline pte_t pte_mkexec(pte_t pte)
> +{
> +	if (pte_val(pte) & _PAGE_BAP_UR)
> +		return __pte((pte_val(pte) & ~_PAGE_BAP_SX) | _PAGE_BAP_UX);
> +	else
> +		return __pte((pte_val(pte) & ~_PAGE_BAP_UX) | _PAGE_BAP_SX);
> +}
> +#define pte_mkexec pte_mkexec
> +
>   #endif /* __ASSEMBLY__ */
>   
>   #endif /* __KERNEL__ */
> diff --git a/arch/powerpc/mm/nohash/tlb_low_64e.S b/arch/powerpc/mm/nohash/tlb_low_64e.S
> index bf24451f3e71..9235e720e357 100644
> --- a/arch/powerpc/mm/nohash/tlb_low_64e.S
> +++ b/arch/powerpc/mm/nohash/tlb_low_64e.S
> @@ -222,7 +222,7 @@ tlb_miss_kernel_bolted:
>   
>   tlb_miss_fault_bolted:
>   	/* We need to check if it was an instruction miss */
> -	andi.	r10,r11,_PAGE_EXEC|_PAGE_BAP_SX
> +	andi.	r10,r11,_PAGE_BAP_UX|_PAGE_BAP_SX
>   	bne	itlb_miss_fault_bolted
>   dtlb_miss_fault_bolted:
>   	tlb_epilog_bolted
> @@ -239,7 +239,7 @@ itlb_miss_fault_bolted:
>   	srdi	r15,r16,60		/* get region */
>   	bne-	itlb_miss_fault_bolted
>   
> -	li	r11,_PAGE_PRESENT|_PAGE_EXEC	/* Base perm */
> +	li	r11,_PAGE_PRESENT|_PAGE_BAP_UX	/* Base perm */
>   
>   	/* We do the user/kernel test for the PID here along with the RW test
>   	 */
> @@ -614,7 +614,7 @@ itlb_miss_fault_e6500:
>   
>   	/* We do the user/kernel test for the PID here along with the RW test
>   	 */
> -	li	r11,_PAGE_PRESENT|_PAGE_EXEC	/* Base perm */
> +	li	r11,_PAGE_PRESENT|_PAGE_BAP_UX	/* Base perm */
>   	oris	r11,r11,_PAGE_ACCESSED@h
>   
>   	cmpldi	cr0,r15,0			/* Check for user region */
> @@ -734,7 +734,7 @@ normal_tlb_miss_done:
>   
>   normal_tlb_miss_access_fault:
>   	/* We need to check if it was an instruction miss */
> -	andi.	r10,r11,_PAGE_EXEC
> +	andi.	r10,r11,_PAGE_BAP_UX
>   	bne	1f
>   	ld	r14,EX_TLB_DEAR(r12)
>   	ld	r15,EX_TLB_ESR(r12)
> 

^ permalink raw reply

* [PATCH v3 02/10] powerpc/book3e: Fix set_memory_x() and set_memory_nx()
From: Christophe Leroy @ 2021-10-25 21:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <f34898e2edb21db1bcb1c9a96ac7433a141d50c2.1635198209.git.christophe.leroy@csgroup.eu>

set_memory_x() calls pte_mkexec() which sets _PAGE_EXEC.
set_memory_nx() calls pte_exprotec() which clears _PAGE_EXEC.

Book3e has 2 bits, UX and SX, which defines the exec rights
resp. for user (PR=1) and for kernel (PR=0).

_PAGE_EXEC is defined as UX only.

An executable kernel page is set with either _PAGE_KERNEL_RWX
or _PAGE_KERNEL_ROX, which both have SX set and UX cleared.

So set_memory_nx() call for an executable kernel page does
nothing because UX is already cleared.

And set_memory_x() on a non-executable kernel page makes it
executable for the user and keeps it non-executable for kernel.

Also, pte_exec() always returns 'false' on kernel pages, because
it checks _PAGE_EXEC which doesn't include SX, so for instance
the W+X check doesn't work.

To fix this:
- change tlb_low_64e.S to use _PAGE_BAP_UX instead of _PAGE_USER
- sets both UX and SX in _PAGE_EXEC so that pte_user() returns
true whenever one of the two bits is set and pte_exprotect()
clears both bits.
- Define a book3e specific version of pte_mkexec() which sets
either SX or UX based on UR.

Fixes: 1f9ad21c3b38 ("powerpc/mm: Implement set_memory() routines")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3: Removed pte_mkexec() from nohash/64/pgtable.h
v2: New
---
 arch/powerpc/include/asm/nohash/32/pgtable.h |  2 ++
 arch/powerpc/include/asm/nohash/64/pgtable.h |  5 -----
 arch/powerpc/include/asm/nohash/pte-book3e.h | 18 ++++++++++++++----
 arch/powerpc/mm/nohash/tlb_low_64e.S         |  8 ++++----
 4 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index ac0a5ff48c3a..d6ba821a56ce 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -193,10 +193,12 @@ static inline pte_t pte_wrprotect(pte_t pte)
 }
 #endif
 
+#ifndef pte_mkexec
 static inline pte_t pte_mkexec(pte_t pte)
 {
 	return __pte(pte_val(pte) | _PAGE_EXEC);
 }
+#endif
 
 #define pmd_none(pmd)		(!pmd_val(pmd))
 #define	pmd_bad(pmd)		(pmd_val(pmd) & _PMD_BAD)
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h
index d081704b13fb..9d2905a47410 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -118,11 +118,6 @@ static inline pte_t pte_wrprotect(pte_t pte)
 	return __pte(pte_val(pte) & ~_PAGE_RW);
 }
 
-static inline pte_t pte_mkexec(pte_t pte)
-{
-	return __pte(pte_val(pte) | _PAGE_EXEC);
-}
-
 #define PMD_BAD_BITS		(PTE_TABLE_SIZE-1)
 #define PUD_BAD_BITS		(PMD_TABLE_SIZE-1)
 
diff --git a/arch/powerpc/include/asm/nohash/pte-book3e.h b/arch/powerpc/include/asm/nohash/pte-book3e.h
index 813918f40765..f798640422c2 100644
--- a/arch/powerpc/include/asm/nohash/pte-book3e.h
+++ b/arch/powerpc/include/asm/nohash/pte-book3e.h
@@ -48,7 +48,7 @@
 #define _PAGE_WRITETHRU	0x800000 /* W: cache write-through */
 
 /* "Higher level" linux bit combinations */
-#define _PAGE_EXEC		_PAGE_BAP_UX /* .. and was cache cleaned */
+#define _PAGE_EXEC		(_PAGE_BAP_SX | _PAGE_BAP_UX) /* .. and was cache cleaned */
 #define _PAGE_RW		(_PAGE_BAP_SW | _PAGE_BAP_UW) /* User write permission */
 #define _PAGE_KERNEL_RW		(_PAGE_BAP_SW | _PAGE_BAP_SR | _PAGE_DIRTY)
 #define _PAGE_KERNEL_RO		(_PAGE_BAP_SR)
@@ -93,11 +93,11 @@
 /* Permission masks used to generate the __P and __S table */
 #define PAGE_NONE	__pgprot(_PAGE_BASE)
 #define PAGE_SHARED	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_RW)
-#define PAGE_SHARED_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_RW | _PAGE_EXEC)
+#define PAGE_SHARED_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_RW | _PAGE_BAP_UX)
 #define PAGE_COPY	__pgprot(_PAGE_BASE | _PAGE_USER)
-#define PAGE_COPY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC)
+#define PAGE_COPY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_BAP_UX)
 #define PAGE_READONLY	__pgprot(_PAGE_BASE | _PAGE_USER)
-#define PAGE_READONLY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC)
+#define PAGE_READONLY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_BAP_UX)
 
 #ifndef __ASSEMBLY__
 static inline pte_t pte_mkprivileged(pte_t pte)
@@ -113,6 +113,16 @@ static inline pte_t pte_mkuser(pte_t pte)
 }
 
 #define pte_mkuser pte_mkuser
+
+static inline pte_t pte_mkexec(pte_t pte)
+{
+	if (pte_val(pte) & _PAGE_BAP_UR)
+		return __pte((pte_val(pte) & ~_PAGE_BAP_SX) | _PAGE_BAP_UX);
+	else
+		return __pte((pte_val(pte) & ~_PAGE_BAP_UX) | _PAGE_BAP_SX);
+}
+#define pte_mkexec pte_mkexec
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/mm/nohash/tlb_low_64e.S b/arch/powerpc/mm/nohash/tlb_low_64e.S
index bf24451f3e71..9235e720e357 100644
--- a/arch/powerpc/mm/nohash/tlb_low_64e.S
+++ b/arch/powerpc/mm/nohash/tlb_low_64e.S
@@ -222,7 +222,7 @@ tlb_miss_kernel_bolted:
 
 tlb_miss_fault_bolted:
 	/* We need to check if it was an instruction miss */
-	andi.	r10,r11,_PAGE_EXEC|_PAGE_BAP_SX
+	andi.	r10,r11,_PAGE_BAP_UX|_PAGE_BAP_SX
 	bne	itlb_miss_fault_bolted
 dtlb_miss_fault_bolted:
 	tlb_epilog_bolted
@@ -239,7 +239,7 @@ itlb_miss_fault_bolted:
 	srdi	r15,r16,60		/* get region */
 	bne-	itlb_miss_fault_bolted
 
-	li	r11,_PAGE_PRESENT|_PAGE_EXEC	/* Base perm */
+	li	r11,_PAGE_PRESENT|_PAGE_BAP_UX	/* Base perm */
 
 	/* We do the user/kernel test for the PID here along with the RW test
 	 */
@@ -614,7 +614,7 @@ itlb_miss_fault_e6500:
 
 	/* We do the user/kernel test for the PID here along with the RW test
 	 */
-	li	r11,_PAGE_PRESENT|_PAGE_EXEC	/* Base perm */
+	li	r11,_PAGE_PRESENT|_PAGE_BAP_UX	/* Base perm */
 	oris	r11,r11,_PAGE_ACCESSED@h
 
 	cmpldi	cr0,r15,0			/* Check for user region */
@@ -734,7 +734,7 @@ normal_tlb_miss_done:
 
 normal_tlb_miss_access_fault:
 	/* We need to check if it was an instruction miss */
-	andi.	r10,r11,_PAGE_EXEC
+	andi.	r10,r11,_PAGE_BAP_UX
 	bne	1f
 	ld	r14,EX_TLB_DEAR(r12)
 	ld	r15,EX_TLB_ESR(r12)
-- 
2.31.1


^ permalink raw reply related

* [PATCH v3 07/10] powerpc/fsl_booke: Tell map_mem_in_cams() if init is done
From: Christophe Leroy @ 2021-10-25 21:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <f34898e2edb21db1bcb1c9a96ac7433a141d50c2.1635198209.git.christophe.leroy@csgroup.eu>

In order to be able to call map_mem_in_cams() once more
after init for STRICT_KERNEL_RWX, add an argument.

For now, map_mem_in_cams() is always called only during init.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3: No change
v2: No change
---
 arch/powerpc/mm/mmu_decl.h           |  2 +-
 arch/powerpc/mm/nohash/fsl_book3e.c  | 12 ++++++------
 arch/powerpc/mm/nohash/kaslr_booke.c |  2 +-
 arch/powerpc/mm/nohash/tlb.c         |  4 ++--
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index dd1cabc2ea0f..e13a3c0caa02 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -126,7 +126,7 @@ unsigned long mmu_mapin_ram(unsigned long base, unsigned long top);
 
 #ifdef CONFIG_PPC_FSL_BOOK3E
 extern unsigned long map_mem_in_cams(unsigned long ram, int max_cam_idx,
-				     bool dryrun);
+				     bool dryrun, bool init);
 extern unsigned long calc_cam_sz(unsigned long ram, unsigned long virt,
 				 phys_addr_t phys);
 #ifdef CONFIG_PPC32
diff --git a/arch/powerpc/mm/nohash/fsl_book3e.c b/arch/powerpc/mm/nohash/fsl_book3e.c
index 2668bb06e4fa..8ae1ba7985df 100644
--- a/arch/powerpc/mm/nohash/fsl_book3e.c
+++ b/arch/powerpc/mm/nohash/fsl_book3e.c
@@ -168,7 +168,7 @@ unsigned long calc_cam_sz(unsigned long ram, unsigned long virt,
 
 static unsigned long map_mem_in_cams_addr(phys_addr_t phys, unsigned long virt,
 					unsigned long ram, int max_cam_idx,
-					bool dryrun)
+					bool dryrun, bool init)
 {
 	int i;
 	unsigned long amount_mapped = 0;
@@ -203,12 +203,12 @@ static unsigned long map_mem_in_cams_addr(phys_addr_t phys, unsigned long virt,
 	return amount_mapped;
 }
 
-unsigned long map_mem_in_cams(unsigned long ram, int max_cam_idx, bool dryrun)
+unsigned long map_mem_in_cams(unsigned long ram, int max_cam_idx, bool dryrun, bool init)
 {
 	unsigned long virt = PAGE_OFFSET;
 	phys_addr_t phys = memstart_addr;
 
-	return map_mem_in_cams_addr(phys, virt, ram, max_cam_idx, dryrun);
+	return map_mem_in_cams_addr(phys, virt, ram, max_cam_idx, dryrun, init);
 }
 
 #ifdef CONFIG_PPC32
@@ -249,7 +249,7 @@ void __init adjust_total_lowmem(void)
 	ram = min((phys_addr_t)__max_low_memory, (phys_addr_t)total_lowmem);
 
 	i = switch_to_as1();
-	__max_low_memory = map_mem_in_cams(ram, CONFIG_LOWMEM_CAM_NUM, false);
+	__max_low_memory = map_mem_in_cams(ram, CONFIG_LOWMEM_CAM_NUM, false, true);
 	restore_to_as0(i, 0, 0, 1);
 
 	pr_info("Memory CAM mapping: ");
@@ -320,11 +320,11 @@ notrace void __init relocate_init(u64 dt_ptr, phys_addr_t start)
 		/* map a 64M area for the second relocation */
 		if (memstart_addr > start)
 			map_mem_in_cams(0x4000000, CONFIG_LOWMEM_CAM_NUM,
-					false);
+					false, true);
 		else
 			map_mem_in_cams_addr(start, PAGE_OFFSET + offset,
 					0x4000000, CONFIG_LOWMEM_CAM_NUM,
-					false);
+					false, true);
 		restore_to_as0(n, offset, __va(dt_ptr), 1);
 		/* We should never reach here */
 		panic("Relocation error");
diff --git a/arch/powerpc/mm/nohash/kaslr_booke.c b/arch/powerpc/mm/nohash/kaslr_booke.c
index 4c74e8a5482b..8fc49b1b4a91 100644
--- a/arch/powerpc/mm/nohash/kaslr_booke.c
+++ b/arch/powerpc/mm/nohash/kaslr_booke.c
@@ -314,7 +314,7 @@ static unsigned long __init kaslr_choose_location(void *dt_ptr, phys_addr_t size
 		pr_warn("KASLR: No safe seed for randomizing the kernel base.\n");
 
 	ram = min_t(phys_addr_t, __max_low_memory, size);
-	ram = map_mem_in_cams(ram, CONFIG_LOWMEM_CAM_NUM, true);
+	ram = map_mem_in_cams(ram, CONFIG_LOWMEM_CAM_NUM, true, false);
 	linear_sz = min_t(unsigned long, ram, SZ_512M);
 
 	/* If the linear size is smaller than 64M, do not randmize */
diff --git a/arch/powerpc/mm/nohash/tlb.c b/arch/powerpc/mm/nohash/tlb.c
index 5872f69141d5..fc195b9f524b 100644
--- a/arch/powerpc/mm/nohash/tlb.c
+++ b/arch/powerpc/mm/nohash/tlb.c
@@ -643,7 +643,7 @@ static void early_init_this_mmu(void)
 
 		if (map)
 			linear_map_top = map_mem_in_cams(linear_map_top,
-							 num_cams, false);
+							 num_cams, true, true);
 	}
 #endif
 
@@ -764,7 +764,7 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
 		num_cams = (mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY) / 4;
 
 		linear_sz = map_mem_in_cams(first_memblock_size, num_cams,
-					    true);
+					    false, true);
 
 		ppc64_rma_size = min_t(u64, linear_sz, 0x40000000);
 	} else
-- 
2.31.1


^ permalink raw reply related

* [PATCH v3 01/10] powerpc/nohash: Fix __ptep_set_access_flags() and ptep_set_wrprotect()
From: Christophe Leroy @ 2021-10-25 21:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

Commit 26973fa5ac0e ("powerpc/mm: use pte helpers in generic code")
changed those two functions to use pte helpers to determine which
bits to clear and which bits to set.

This change was based on the assumption that bits to be set/cleared
are always the same and can be determined by applying the pte
manipulation helpers on __pte(0).

But on platforms like book3e, the bits depend on whether the page
is a user page or not.

For the time being it more or less works because of _PAGE_EXEC being
used for user pages only and exec right being set at all time on
kernel page. But following patch will clean that and output of
pte_mkexec() will depend on the page being a user or kernel page.

Instead of trying to make an even more complicated helper where bits
would become dependent on the final pte value, come back to a more
static situation like before commit 26973fa5ac0e ("powerpc/mm: use
pte helpers in generic code"), by introducing an 8xx specific
version of __ptep_set_access_flags() and ptep_set_wrprotect().

Fixes: 26973fa5ac0e ("powerpc/mm: use pte helpers in generic code")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3: No change
v2: New
---
 arch/powerpc/include/asm/nohash/32/pgtable.h | 17 +++++++--------
 arch/powerpc/include/asm/nohash/32/pte-8xx.h | 22 ++++++++++++++++++++
 2 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index f06ae00f2a65..ac0a5ff48c3a 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -306,30 +306,29 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 }
 
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
+#ifndef ptep_set_wrprotect
 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 				      pte_t *ptep)
 {
-	unsigned long clr = ~pte_val(pte_wrprotect(__pte(~0)));
-	unsigned long set = pte_val(pte_wrprotect(__pte(0)));
-
-	pte_update(mm, addr, ptep, clr, set, 0);
+	pte_update(mm, addr, ptep, _PAGE_RW, 0, 0);
 }
+#endif
 
+#ifndef __ptep_set_access_flags
 static inline void __ptep_set_access_flags(struct vm_area_struct *vma,
 					   pte_t *ptep, pte_t entry,
 					   unsigned long address,
 					   int psize)
 {
-	pte_t pte_set = pte_mkyoung(pte_mkdirty(pte_mkwrite(pte_mkexec(__pte(0)))));
-	pte_t pte_clr = pte_mkyoung(pte_mkdirty(pte_mkwrite(pte_mkexec(__pte(~0)))));
-	unsigned long set = pte_val(entry) & pte_val(pte_set);
-	unsigned long clr = ~pte_val(entry) & ~pte_val(pte_clr);
+	unsigned long set = pte_val(entry) &
+			    (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW | _PAGE_EXEC);
 	int huge = psize > mmu_virtual_psize ? 1 : 0;
 
-	pte_update(vma->vm_mm, address, ptep, clr, set, huge);
+	pte_update(vma->vm_mm, address, ptep, 0, set, huge);
 
 	flush_tlb_page(vma, address);
 }
+#endif
 
 static inline int pte_young(pte_t pte)
 {
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index fcc48d590d88..1a89ebdc3acc 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -136,6 +136,28 @@ static inline pte_t pte_mkhuge(pte_t pte)
 
 #define pte_mkhuge pte_mkhuge
 
+static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *p,
+				     unsigned long clr, unsigned long set, int huge);
+
+static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+	pte_update(mm, addr, ptep, 0, _PAGE_RO, 0);
+}
+#define ptep_set_wrprotect ptep_set_wrprotect
+
+static inline void __ptep_set_access_flags(struct vm_area_struct *vma, pte_t *ptep,
+					   pte_t entry, unsigned long address, int psize)
+{
+	unsigned long set = pte_val(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_EXEC);
+	unsigned long clr = ~pte_val(entry) & _PAGE_RO;
+	int huge = psize > mmu_virtual_psize ? 1 : 0;
+
+	pte_update(vma->vm_mm, address, ptep, clr, set, huge);
+
+	flush_tlb_page(vma, address);
+}
+#define __ptep_set_access_flags __ptep_set_access_flags
+
 static inline unsigned long pgd_leaf_size(pgd_t pgd)
 {
 	if (pgd_val(pgd) & _PMD_PAGE_8M)
-- 
2.31.1


^ permalink raw reply related

* [PATCH v3 10/10] powerpc/fsl_booke: Enable STRICT_KERNEL_RWX
From: Christophe Leroy @ 2021-10-25 21:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <f34898e2edb21db1bcb1c9a96ac7433a141d50c2.1635198209.git.christophe.leroy@csgroup.eu>

Enable STRICT_KERNEL_RWX on fsl_booke.

For that, we need additional TLBCAMs dedicated to linear mapping,
based on the alignment of _sinittext.

By default, up to 768 Mbytes of memory are mapped.
It uses 3 TLBCAMs of size 256 Mbytes.

With a data alignment of 16, we need up to 9 TLBCAMs:
  16/16/16/16/64/64/64/256/256

With a data alignment of 4, we need up to 12 TLBCAMs:
  4/4/4/4/16/16/16/64/64/64/256/256

With a data alignment of 1, we need up to 15 TLBCAMs:
  1/1/1/1/4/4/4/16/16/16/64/64/64/256/256

By default, set a 16 Mbytes alignment as a compromise between memory
usage and number of TLBCAMs. This can be adjusted manually when needed.

For the time being, it doens't work when the base is randomised.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3: No change
v2: No change
---
 arch/powerpc/Kconfig | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 6b9f523882c5..939a47642a9c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -139,6 +139,7 @@ config PPC
 	select ARCH_HAS_SCALED_CPUTIME		if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_STRICT_KERNEL_RWX	if (PPC_BOOK3S || PPC_8xx || 40x) && !HIBERNATION
+	select ARCH_HAS_STRICT_KERNEL_RWX	if FSL_BOOKE && !HIBERNATION && !RANDOMIZE_BASE
 	select ARCH_HAS_STRICT_MODULE_RWX	if ARCH_HAS_STRICT_KERNEL_RWX && !PPC_BOOK3S_32
 	select ARCH_HAS_TICK_BROADCAST		if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAS_UACCESS_FLUSHCACHE
@@ -778,7 +779,8 @@ config DATA_SHIFT_BOOL
 	bool "Set custom data alignment"
 	depends on ADVANCED_OPTIONS
 	depends on STRICT_KERNEL_RWX || DEBUG_PAGEALLOC || KFENCE
-	depends on PPC_BOOK3S_32 || (PPC_8xx && !PIN_TLB_DATA && !STRICT_KERNEL_RWX)
+	depends on PPC_BOOK3S_32 || (PPC_8xx && !PIN_TLB_DATA && !STRICT_KERNEL_RWX) || \
+		   FSL_BOOKE
 	help
 	  This option allows you to set the kernel data alignment. When
 	  RAM is mapped by blocks, the alignment needs to fit the size and
@@ -791,11 +793,13 @@ config DATA_SHIFT
 	default 24 if STRICT_KERNEL_RWX && PPC64
 	range 17 28 if (STRICT_KERNEL_RWX || DEBUG_PAGEALLOC || KFENCE) && PPC_BOOK3S_32
 	range 19 23 if (STRICT_KERNEL_RWX || DEBUG_PAGEALLOC || KFENCE) && PPC_8xx
+	range 20 24 if (STRICT_KERNEL_RWX || DEBUG_PAGEALLOC || KFENCE) && PPC_FSL_BOOKE
 	default 22 if STRICT_KERNEL_RWX && PPC_BOOK3S_32
 	default 18 if (DEBUG_PAGEALLOC || KFENCE) && PPC_BOOK3S_32
 	default 23 if STRICT_KERNEL_RWX && PPC_8xx
 	default 23 if (DEBUG_PAGEALLOC || KFENCE) && PPC_8xx && PIN_TLB_DATA
 	default 19 if (DEBUG_PAGEALLOC || KFENCE) && PPC_8xx
+	default 24 if STRICT_KERNEL_RWX && FSL_BOOKE
 	default PPC_PAGE_SHIFT
 	help
 	  On Book3S 32 (603+), DBATs are used to map kernel text and rodata RO.
@@ -1123,7 +1127,10 @@ config LOWMEM_CAM_NUM_BOOL
 config LOWMEM_CAM_NUM
 	depends on FSL_BOOKE
 	int "Number of CAMs to use to map low memory" if LOWMEM_CAM_NUM_BOOL
-	default 3
+	default 3 if !STRICT_KERNEL_RWX
+	default 9 if DATA_SHIFT >= 24
+	default 12 if DATA_SHIFT >= 22
+	default 15
 
 config DYNAMIC_MEMSTART
 	bool "Enable page aligned dynamic load address for kernel"
-- 
2.31.1


^ permalink raw reply related

* [PATCH v3 09/10] powerpc/fsl_booke: Update of TLBCAMs after init
From: Christophe Leroy @ 2021-10-25 21:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <f34898e2edb21db1bcb1c9a96ac7433a141d50c2.1635198209.git.christophe.leroy@csgroup.eu>

After init, set readonly memory as ROX and set readwrite
memory as RWX, if STRICT_KERNEL_RWX is enabled.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3: No change
v2: No change
---
 arch/powerpc/mm/mmu_decl.h          |  2 +-
 arch/powerpc/mm/nohash/fsl_book3e.c | 32 +++++++++++++++++++++++++----
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index e13a3c0caa02..0dd4c18f8363 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -168,7 +168,7 @@ static inline phys_addr_t v_block_mapped(unsigned long va) { return 0; }
 static inline unsigned long p_block_mapped(phys_addr_t pa) { return 0; }
 #endif
 
-#if defined(CONFIG_PPC_BOOK3S_32) || defined(CONFIG_PPC_8xx)
+#if defined(CONFIG_PPC_BOOK3S_32) || defined(CONFIG_PPC_8xx) || defined(CONFIG_PPC_FSL_BOOK3E)
 void mmu_mark_initmem_nx(void);
 void mmu_mark_rodata_ro(void);
 #else
diff --git a/arch/powerpc/mm/nohash/fsl_book3e.c b/arch/powerpc/mm/nohash/fsl_book3e.c
index 88132cab3442..b231a54f540c 100644
--- a/arch/powerpc/mm/nohash/fsl_book3e.c
+++ b/arch/powerpc/mm/nohash/fsl_book3e.c
@@ -182,7 +182,7 @@ static unsigned long map_mem_in_cams_addr(phys_addr_t phys, unsigned long virt,
 	/* Calculate CAM values */
 	for (i = 0; boundary && i < max_cam_idx; i++) {
 		unsigned long cam_sz;
-		pgprot_t prot = PAGE_KERNEL_X;
+		pgprot_t prot = init ? PAGE_KERNEL_X : PAGE_KERNEL_ROX;
 
 		cam_sz = calc_cam_sz(boundary, virt, phys);
 		if (!dryrun)
@@ -195,7 +195,7 @@ static unsigned long map_mem_in_cams_addr(phys_addr_t phys, unsigned long virt,
 	}
 	for (ram -= amount_mapped; ram && i < max_cam_idx; i++) {
 		unsigned long cam_sz;
-		pgprot_t prot = PAGE_KERNEL_X;
+		pgprot_t prot = init ? PAGE_KERNEL_X : PAGE_KERNEL;
 
 		cam_sz = calc_cam_sz(ram, virt, phys);
 		if (!dryrun)
@@ -210,8 +210,13 @@ static unsigned long map_mem_in_cams_addr(phys_addr_t phys, unsigned long virt,
 	if (dryrun)
 		return amount_mapped;
 
-	loadcam_multi(0, i, max_cam_idx);
-	tlbcam_index = i;
+	if (init) {
+		loadcam_multi(0, i, max_cam_idx);
+		tlbcam_index = i;
+	} else {
+		loadcam_multi(0, i, 0);
+		WARN_ON(i > tlbcam_index);
+	}
 
 #ifdef CONFIG_PPC64
 	get_paca()->tcd.esel_next = i;
@@ -280,6 +285,25 @@ void __init adjust_total_lowmem(void)
 	memblock_set_current_limit(memstart_addr + __max_low_memory);
 }
 
+#ifdef CONFIG_STRICT_KERNEL_RWX
+void mmu_mark_rodata_ro(void)
+{
+	/* Everything is done in mmu_mark_initmem_nx() */
+}
+#endif
+
+void mmu_mark_initmem_nx(void)
+{
+	unsigned long remapped;
+
+	if (!strict_kernel_rwx_enabled())
+		return;
+
+	remapped = map_mem_in_cams(__max_low_memory, CONFIG_LOWMEM_CAM_NUM, false, false);
+
+	WARN_ON(__max_low_memory != remapped);
+}
+
 void setup_initial_memory_limit(phys_addr_t first_memblock_base,
 				phys_addr_t first_memblock_size)
 {
-- 
2.31.1


^ permalink raw reply related

* [PATCH v3 03/10] powerpc/booke: Disable STRICT_KERNEL_RWX, DEBUG_PAGEALLOC and KFENCE
From: Christophe Leroy @ 2021-10-25 21:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <f34898e2edb21db1bcb1c9a96ac7433a141d50c2.1635198209.git.christophe.leroy@csgroup.eu>

fsl_booke and 44x are not able to map kernel linear memory with
pages, so they can't support DEBUG_PAGEALLOC and KFENCE, and
STRICT_KERNEL_RWX is also a problem for now.

Enable those only on book3s (both 32 and 64 except KFENCE), 8xx and 40x.

Fixes: 88df6e90fa97 ("[POWERPC] DEBUG_PAGEALLOC for 32-bit")
Fixes: 95902e6c8864 ("powerpc/mm: Implement STRICT_KERNEL_RWX on PPC32")
Fixes: 90cbac0e995d ("powerpc: Enable KFENCE for PPC32")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3: No change
v2: No change
---
 arch/powerpc/Kconfig | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index ba5b66189358..6b9f523882c5 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -138,7 +138,7 @@ config PPC
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_SCALED_CPUTIME		if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
 	select ARCH_HAS_SET_MEMORY
-	select ARCH_HAS_STRICT_KERNEL_RWX	if ((PPC_BOOK3S_64 || PPC32) && !HIBERNATION)
+	select ARCH_HAS_STRICT_KERNEL_RWX	if (PPC_BOOK3S || PPC_8xx || 40x) && !HIBERNATION
 	select ARCH_HAS_STRICT_MODULE_RWX	if ARCH_HAS_STRICT_KERNEL_RWX && !PPC_BOOK3S_32
 	select ARCH_HAS_TICK_BROADCAST		if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAS_UACCESS_FLUSHCACHE
@@ -150,7 +150,7 @@ config PPC
 	select ARCH_OPTIONAL_KERNEL_RWX		if ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_STACKWALK
 	select ARCH_SUPPORTS_ATOMIC_RMW
-	select ARCH_SUPPORTS_DEBUG_PAGEALLOC	if PPC32 || PPC_BOOK3S_64
+	select ARCH_SUPPORTS_DEBUG_PAGEALLOC	if PPC_BOOK3S || PPC_8xx || 40x
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
 	select ARCH_USE_MEMTEST
@@ -190,7 +190,7 @@ config PPC
 	select HAVE_ARCH_JUMP_LABEL_RELATIVE
 	select HAVE_ARCH_KASAN			if PPC32 && PPC_PAGE_SHIFT <= 14
 	select HAVE_ARCH_KASAN_VMALLOC		if PPC32 && PPC_PAGE_SHIFT <= 14
-	select HAVE_ARCH_KFENCE			if PPC32
+	select HAVE_ARCH_KFENCE			if PPC_BOOK3S_32 || PPC_8xx || 40x
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if COMPAT
-- 
2.31.1


^ permalink raw reply related

* [PATCH v3 08/10] powerpc/fsl_booke: Allocate separate TLBCAMs for readonly memory
From: Christophe Leroy @ 2021-10-25 21:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <f34898e2edb21db1bcb1c9a96ac7433a141d50c2.1635198209.git.christophe.leroy@csgroup.eu>

Reorganise TLBCAM allocation so that when STRICT_KERNEL_RWX is
enabled, TLBCAMs are allocated such that readonly memory uses
different TLBCAMs.

This results in an allocation looking like:

Memory CAM mapping: 4/4/4/1/1/1/1/16/16/16/64/64/64/256/256 Mb, residual: 256Mb

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3: No change
v2: No change
---
 arch/powerpc/mm/nohash/fsl_book3e.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/nohash/fsl_book3e.c b/arch/powerpc/mm/nohash/fsl_book3e.c
index 8ae1ba7985df..88132cab3442 100644
--- a/arch/powerpc/mm/nohash/fsl_book3e.c
+++ b/arch/powerpc/mm/nohash/fsl_book3e.c
@@ -172,15 +172,34 @@ static unsigned long map_mem_in_cams_addr(phys_addr_t phys, unsigned long virt,
 {
 	int i;
 	unsigned long amount_mapped = 0;
+	unsigned long boundary;
+
+	if (strict_kernel_rwx_enabled())
+		boundary = (unsigned long)(_sinittext - _stext);
+	else
+		boundary = ram;
 
 	/* Calculate CAM values */
-	for (i = 0; ram && i < max_cam_idx; i++) {
+	for (i = 0; boundary && i < max_cam_idx; i++) {
+		unsigned long cam_sz;
+		pgprot_t prot = PAGE_KERNEL_X;
+
+		cam_sz = calc_cam_sz(boundary, virt, phys);
+		if (!dryrun)
+			settlbcam(i, virt, phys, cam_sz, pgprot_val(prot), 0);
+
+		boundary -= cam_sz;
+		amount_mapped += cam_sz;
+		virt += cam_sz;
+		phys += cam_sz;
+	}
+	for (ram -= amount_mapped; ram && i < max_cam_idx; i++) {
 		unsigned long cam_sz;
+		pgprot_t prot = PAGE_KERNEL_X;
 
 		cam_sz = calc_cam_sz(ram, virt, phys);
 		if (!dryrun)
-			settlbcam(i, virt, phys, cam_sz,
-				  pgprot_val(PAGE_KERNEL_X), 0);
+			settlbcam(i, virt, phys, cam_sz, pgprot_val(prot), 0);
 
 		ram -= cam_sz;
 		amount_mapped += cam_sz;
-- 
2.31.1


^ permalink raw reply related

* [PATCH v3 05/10] powerpc/fsl_booke: Take exec flag into account when setting TLBCAMs
From: Christophe Leroy @ 2021-10-25 21:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <f34898e2edb21db1bcb1c9a96ac7433a141d50c2.1635198209.git.christophe.leroy@csgroup.eu>

Don't force MAS3_SX and MAS3_UX at all time. Take into account the
exec flag.

While at it, fix a couple of closeby style problems (indent with space
and unnecessary parenthesis), it keeps more readability.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3: No change
v2: Use the new _PAGE_EXEC to check executability of flags instead of _PAGE_BAP_SX (Error reported by robot with tqm8541_defconfig)
---
 arch/powerpc/mm/nohash/fsl_book3e.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/nohash/fsl_book3e.c b/arch/powerpc/mm/nohash/fsl_book3e.c
index 03dacbe940e5..2668bb06e4fa 100644
--- a/arch/powerpc/mm/nohash/fsl_book3e.c
+++ b/arch/powerpc/mm/nohash/fsl_book3e.c
@@ -122,15 +122,18 @@ static void settlbcam(int index, unsigned long virt, phys_addr_t phys,
 	TLBCAM[index].MAS2 |= (flags & _PAGE_GUARDED) ? MAS2_G : 0;
 	TLBCAM[index].MAS2 |= (flags & _PAGE_ENDIAN) ? MAS2_E : 0;
 
-	TLBCAM[index].MAS3 = (phys & MAS3_RPN) | MAS3_SX | MAS3_SR;
-	TLBCAM[index].MAS3 |= ((flags & _PAGE_RW) ? MAS3_SW : 0);
+	TLBCAM[index].MAS3 = (phys & MAS3_RPN) | MAS3_SR;
+	TLBCAM[index].MAS3 |= (flags & _PAGE_RW) ? MAS3_SW : 0;
 	if (mmu_has_feature(MMU_FTR_BIG_PHYS))
 		TLBCAM[index].MAS7 = (u64)phys >> 32;
 
 	/* Below is unlikely -- only for large user pages or similar */
 	if (pte_user(__pte(flags))) {
-	   TLBCAM[index].MAS3 |= MAS3_UX | MAS3_UR;
-	   TLBCAM[index].MAS3 |= ((flags & _PAGE_RW) ? MAS3_UW : 0);
+		TLBCAM[index].MAS3 |= MAS3_UR;
+		TLBCAM[index].MAS3 |= (flags & _PAGE_EXEC) ? MAS3_UX : 0;
+		TLBCAM[index].MAS3 |= (flags & _PAGE_RW) ? MAS3_UW : 0;
+	} else {
+		TLBCAM[index].MAS3 |= (flags & _PAGE_EXEC) ? MAS3_SX : 0;
 	}
 
 	tlbcam_addrs[index].start = virt;
-- 
2.31.1


^ permalink raw reply related

* [PATCH v3 06/10] powerpc/fsl_booke: Enable reloading of TLBCAM without switching to AS1
From: Christophe Leroy @ 2021-10-25 21:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <f34898e2edb21db1bcb1c9a96ac7433a141d50c2.1635198209.git.christophe.leroy@csgroup.eu>

Avoid switching to AS1 when reloading TLBCAM after init for
STRICT_KERNEL_RWX.

When we setup AS1 we expect the entire accessible memory to be mapped
through one entry, this is not the case anymore at the end of init.

We are not changing the size of TLBCAMs, only flags, so no need to
switch to AS1.

So change loadcam_multi() to not switch to AS1 when the given
temporary tlb entry in 0.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3: No change
v2: No change
---
 arch/powerpc/mm/nohash/tlb_low.S | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/nohash/tlb_low.S b/arch/powerpc/mm/nohash/tlb_low.S
index 5add4a51e51f..dd39074de9af 100644
--- a/arch/powerpc/mm/nohash/tlb_low.S
+++ b/arch/powerpc/mm/nohash/tlb_low.S
@@ -369,7 +369,7 @@ _GLOBAL(_tlbivax_bcast)
  * extern void loadcam_entry(unsigned int index)
  *
  * Load TLBCAM[index] entry in to the L2 CAM MMU
- * Must preserve r7, r8, r9, r10 and r11
+ * Must preserve r7, r8, r9, r10, r11, r12
  */
 _GLOBAL(loadcam_entry)
 	mflr	r5
@@ -401,7 +401,7 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_BIG_PHYS)
  *
  * r3 = first entry to write
  * r4 = number of entries to write
- * r5 = temporary tlb entry
+ * r5 = temporary tlb entry (0 means no switch to AS1)
  */
 _GLOBAL(loadcam_multi)
 	mflr	r8
@@ -409,6 +409,8 @@ _GLOBAL(loadcam_multi)
 	mfmsr	r11
 	andi.	r11,r11,MSR_IS
 	bne	10f
+	mr.	r12, r5
+	beq	10f
 
 	/*
 	 * Set up temporary TLB entry that is the same as what we're
@@ -446,6 +448,8 @@ _GLOBAL(loadcam_multi)
 	/* Don't return to AS=0 if we were in AS=1 at function start */
 	andi.	r11,r11,MSR_IS
 	bne	3f
+	cmpwi	r12, 0
+	beq	3f
 
 	/* Return to AS=0 and clear the temporary entry */
 	mfmsr	r6
-- 
2.31.1


^ permalink raw reply related

* [PATCH v3 04/10] powerpc/fsl_booke: Rename fsl_booke.c to fsl_book3e.c
From: Christophe Leroy @ 2021-10-25 21:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <f34898e2edb21db1bcb1c9a96ac7433a141d50c2.1635198209.git.christophe.leroy@csgroup.eu>

We have a myriad of CONFIG symbols around different variants
of BOOKEs, which would be worth tidying up one day.

But at least, make file names and CONFIG option match:

We have CONFIG_FSL_BOOKE and CONFIG_PPC_FSL_BOOK3E.

fsl_booke.c is selected by and only by CONFIG_PPC_FSL_BOOK3E.
So rename it fsl_book3e to reduce confusion.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3: No change
v2: No change
---
 arch/powerpc/mm/nohash/Makefile                      | 4 ++--
 arch/powerpc/mm/nohash/{fsl_booke.c => fsl_book3e.c} | 0
 2 files changed, 2 insertions(+), 2 deletions(-)
 rename arch/powerpc/mm/nohash/{fsl_booke.c => fsl_book3e.c} (100%)

diff --git a/arch/powerpc/mm/nohash/Makefile b/arch/powerpc/mm/nohash/Makefile
index 0424f6ce5bd8..b1f630d423d8 100644
--- a/arch/powerpc/mm/nohash/Makefile
+++ b/arch/powerpc/mm/nohash/Makefile
@@ -7,7 +7,7 @@ obj-$(CONFIG_PPC_BOOK3E_64)  	+= tlb_low_64e.o book3e_pgtable.o
 obj-$(CONFIG_40x)		+= 40x.o
 obj-$(CONFIG_44x)		+= 44x.o
 obj-$(CONFIG_PPC_8xx)		+= 8xx.o
-obj-$(CONFIG_PPC_FSL_BOOK3E)	+= fsl_booke.o
+obj-$(CONFIG_PPC_FSL_BOOK3E)	+= fsl_book3e.o
 obj-$(CONFIG_RANDOMIZE_BASE)	+= kaslr_booke.o
 ifdef CONFIG_HUGETLB_PAGE
 obj-$(CONFIG_PPC_FSL_BOOK3E)	+= book3e_hugetlbpage.o
@@ -16,4 +16,4 @@ endif
 # Disable kcov instrumentation on sensitive code
 # This is necessary for booting with kcov enabled on book3e machines
 KCOV_INSTRUMENT_tlb.o := n
-KCOV_INSTRUMENT_fsl_booke.o := n
+KCOV_INSTRUMENT_fsl_book3e.o := n
diff --git a/arch/powerpc/mm/nohash/fsl_booke.c b/arch/powerpc/mm/nohash/fsl_book3e.c
similarity index 100%
rename from arch/powerpc/mm/nohash/fsl_booke.c
rename to arch/powerpc/mm/nohash/fsl_book3e.c
-- 
2.31.1


^ permalink raw reply related

* Re: [PATCH V2] powerpc/perf: Enable PMU counters post partition migration if PMU is active
From: Athira Rajeev @ 2021-10-25 17:10 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: maddy, rnsastry, Nicholas Piggin, kjain, linuxppc-dev
In-Reply-To: <87ilxqxoxy.fsf@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 3822 bytes --]



> On 21-Oct-2021, at 11:03 PM, Nathan Lynch <nathanl@linux.ibm.com> wrote:
> 
> Nicholas Piggin <npiggin@gmail.com <mailto:npiggin@gmail.com>> writes:
>> Excerpts from Athira Rajeev's message of July 11, 2021 10:25 pm:
>>> During Live Partition Migration (LPM), it is observed that perf
>>> counter values reports zero post migration completion. However
>>> 'perf stat' with workload continues to show counts post migration
>>> since PMU gets disabled/enabled during sched switches. But incase
>>> of system/cpu wide monitoring, zero counts were reported with 'perf
>>> stat' after migration completion.
>>> 
>>> Example:
>>> ./perf stat -e r1001e -I 1000
>>>           time             counts unit events
>>>     1.001010437         22,137,414      r1001e
>>>     2.002495447         15,455,821      r1001e
>>> <<>> As seen in next below logs, the counter values shows zero
>>>        after migration is completed.
>>> <<>>
>>>    86.142535370    129,392,333,440      r1001e
>>>    87.144714617                  0      r1001e
>>>    88.146526636                  0      r1001e
>>>    89.148085029                  0      r1001e
>>> 
>>> Here PMU is enabled during start of perf session and counter
>>> values are read at intervals. Counters are only disabled at the
>>> end of session. The powerpc mobility code presently does not handle
>>> disabling and enabling back of PMU counters during partition
>>> migration. Also since the PMU register values are not saved/restored
>>> during migration, PMU registers like Monitor Mode Control Register 0
>>> (MMCR0), Monitor Mode Control Register 1 (MMCR1) will not contain
>>> the value it was programmed with. Hence PMU counters will not be
>>> enabled correctly post migration.
>>> 
>>> Fix this in mobility code by handling disabling and enabling of
>>> PMU in all cpu's before and after migration. Patch introduces two
>>> functions 'mobility_pmu_disable' and 'mobility_pmu_enable'.
>>> mobility_pmu_disable() is called before the processor threads goes
>>> to suspend state so as to disable the PMU counters. And disable is
>>> done only if there are any active events running on that cpu.
>>> mobility_pmu_enable() is called after the processor threads are
>>> back online to enable back the PMU counters.
>>> 
>>> Since the performance Monitor counters ( PMCs) are not
>>> saved/restored during LPM, results in PMC value being zero and the
>>> 'event->hw.prev_count' being non-zero value. This causes problem
>> 
>> Interesting. Are they defined to not be migrated, or may not be 
>> migrated?
> 
> PAPR may be silent on this... at least I haven't found anything yet. But
> I'm not very familiar with perf counters.
> 
> How much assurance do we have that hardware events we've programmed on
> the source can be reliably re-enabled on the destination, with the same
> semantics? Aren't there some model-specific counters that don't make
> sense to handle this way?
> 
> 
>>> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
>>> index 9dc97d2..cea72d7 100644
>>> --- a/arch/powerpc/include/asm/rtas.h
>>> +++ b/arch/powerpc/include/asm/rtas.h
>>> @@ -380,5 +380,13 @@ static inline void rtas_initialize(void) { }
>>> static inline void read_24x7_sys_info(void) { }
>>> #endif
>>> 
>>> +#ifdef CONFIG_PPC_PERF_CTRS
>>> +void mobility_pmu_disable(void);
>>> +void mobility_pmu_enable(void);
>>> +#else
>>> +static inline void mobility_pmu_disable(void) { }
>>> +static inline void mobility_pmu_enable(void) { }
>>> +#endif
>>> +
>>> #endif /* __KERNEL__ */
>>> #endif /* _POWERPC_RTAS_H */
>> 
>> It's not implemented in rtas, maybe consider putting this into a perf 
>> header?
> 
> +1

Sure, I will move this to perf_event header file

Thanks
Athira

[-- Attachment #2: Type: text/html, Size: 14714 bytes --]

^ permalink raw reply

* Re: [PATCH V2] powerpc/perf: Enable PMU counters post partition migration if PMU is active
From: Athira Rajeev @ 2021-10-25 17:09 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: kjain, maddy, linuxppc-dev, Nicholas Piggin, rnsastry
In-Reply-To: <87lf2mxpov.fsf@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 2757 bytes --]



> On 21-Oct-2021, at 10:47 PM, Nathan Lynch <nathanl@linux.ibm.com> wrote:
> 
> Athira Rajeev <atrajeev@linux.vnet.ibm.com <mailto:atrajeev@linux.vnet.ibm.com>> writes:
>> During Live Partition Migration (LPM), it is observed that perf
>> counter values reports zero post migration completion. However
>> 'perf stat' with workload continues to show counts post migration
>> since PMU gets disabled/enabled during sched switches. But incase
>> of system/cpu wide monitoring, zero counts were reported with 'perf
>> stat' after migration completion.
>> 
>> Example:
>> ./perf stat -e r1001e -I 1000
>>           time             counts unit events
>>     1.001010437         22,137,414      r1001e
>>     2.002495447         15,455,821      r1001e
>> <<>> As seen in next below logs, the counter values shows zero
>>        after migration is completed.
>> <<>>
>>    86.142535370    129,392,333,440      r1001e
>>    87.144714617                  0      r1001e
>>    88.146526636                  0      r1001e
>>    89.148085029                  0      r1001e
> 
> Confirmed in my environment:
> 
>    51.099987985            300,338      cache-misses
>    52.101839374            296,586      cache-misses
>    53.116089796            263,150      cache-misses
>    54.117949249            232,290      cache-misses
>    55.602029375     68,700,421,711      cache-misses
>    56.610073969                  0      cache-misses
>    57.614732000                  0      cache-misses
> 
> I wonder what it means that there is a very unlikely huge value before
> the counter stops working -- I believe your example has this phenomenon
> too.
> 
> 
>> diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
>> index e83e089..ff7a77c 100644
>> --- a/arch/powerpc/platforms/pseries/mobility.c
>> +++ b/arch/powerpc/platforms/pseries/mobility.c
>> @@ -476,6 +476,8 @@ static int do_join(void *arg)
>> retry:
>> 	/* Must ensure MSR.EE off for H_JOIN. */
>> 	hard_irq_disable();
>> +	/* Disable PMU before suspend */
>> +	mobility_pmu_disable();
>> 	hvrc = plpar_hcall_norets(H_JOIN);
>> 
>> 	switch (hvrc) {
>> @@ -530,6 +532,8 @@ static int do_join(void *arg)
>> 	 * reset the watchdog.
>> 	 */
>> 	touch_nmi_watchdog();
>> +	/* Enable PMU after resuming */
>> +	mobility_pmu_enable();
>> 	return ret;
>> }
> 
> We should minimize calls into other subsystems from this context (the
> callback function we've passed to stop_machine); it's fairly sensitive.
> Can this be moved out to pseries_migrate_partition() or similar?

Hi Nathan

Thanks for the review.
I will move the callbacks to “pseries_migrate_partition” in next version

Athira.


[-- Attachment #2: Type: text/html, Size: 19859 bytes --]

^ permalink raw reply

* Re: [PATCH V2] powerpc/perf: Enable PMU counters post partition migration if PMU is active
From: Athira Rajeev @ 2021-10-25 17:07 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: maddy, linuxppc-dev, rnsastry, kjain
In-Reply-To: <1634812863.5e9oss88pa.astroid@bobo.none>

[-- Attachment #1: Type: text/plain, Size: 10238 bytes --]



> On 21-Oct-2021, at 4:24 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> Excerpts from Athira Rajeev's message of July 11, 2021 10:25 pm:
>> During Live Partition Migration (LPM), it is observed that perf
>> counter values reports zero post migration completion. However
>> 'perf stat' with workload continues to show counts post migration
>> since PMU gets disabled/enabled during sched switches. But incase
>> of system/cpu wide monitoring, zero counts were reported with 'perf
>> stat' after migration completion.
>> 
>> Example:
>> ./perf stat -e r1001e -I 1000
>>           time             counts unit events
>>     1.001010437         22,137,414      r1001e
>>     2.002495447         15,455,821      r1001e
>> <<>> As seen in next below logs, the counter values shows zero
>>        after migration is completed.
>> <<>>
>>    86.142535370    129,392,333,440      r1001e
>>    87.144714617                  0      r1001e
>>    88.146526636                  0      r1001e
>>    89.148085029                  0      r1001e
>> 
>> Here PMU is enabled during start of perf session and counter
>> values are read at intervals. Counters are only disabled at the
>> end of session. The powerpc mobility code presently does not handle
>> disabling and enabling back of PMU counters during partition
>> migration. Also since the PMU register values are not saved/restored
>> during migration, PMU registers like Monitor Mode Control Register 0
>> (MMCR0), Monitor Mode Control Register 1 (MMCR1) will not contain
>> the value it was programmed with. Hence PMU counters will not be
>> enabled correctly post migration.
>> 
>> Fix this in mobility code by handling disabling and enabling of
>> PMU in all cpu's before and after migration. Patch introduces two
>> functions 'mobility_pmu_disable' and 'mobility_pmu_enable'.
>> mobility_pmu_disable() is called before the processor threads goes
>> to suspend state so as to disable the PMU counters. And disable is
>> done only if there are any active events running on that cpu.
>> mobility_pmu_enable() is called after the processor threads are
>> back online to enable back the PMU counters.
>> 
>> Since the performance Monitor counters ( PMCs) are not
>> saved/restored during LPM, results in PMC value being zero and the
>> 'event->hw.prev_count' being non-zero value. This causes problem
> 
> Interesting. Are they defined to not be migrated, or may not be 
> migrated?
> 
> I wonder what QEMU migration does with PMU registers.
> 
>> during updation of event->count since we always accumulate
>> (event->hw.prev_count - PMC value) in event->count.  If
>> event->hw.prev_count is greater PMC value, event->count becomes
>> negative. Fix this by re-initialising 'prev_count' also for all
>> events while enabling back the events. A new variable 'migrate' is
>> introduced in 'struct cpu_hw_event' to achieve this for LPM cases
>> in power_pmu_enable. Use the 'migrate' value to clear the PMC
>> index (stored in event->hw.idx) for all events so that event
>> count settings will get re-initialised correctly.
>> 
>> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
>> [ Fixed compilation error reported by kernel test robot ]
>> Reported-by: kernel test robot <lkp@intel.com>
>> ---
>> Change from v1 -> v2:
>> - Moved the mobility_pmu_enable and mobility_pmu_disable
>>   declarations under CONFIG_PPC_PERF_CTRS in rtas.h.
>>   Also included 'asm/rtas.h' in core-book3s to fix the
>>   compilation warning reported by kernel test robot.
>> 
>> arch/powerpc/include/asm/rtas.h           |  8 ++++++
>> arch/powerpc/perf/core-book3s.c           | 44 ++++++++++++++++++++++++++++---
>> arch/powerpc/platforms/pseries/mobility.c |  4 +++
>> 3 files changed, 53 insertions(+), 3 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
>> index 9dc97d2..cea72d7 100644
>> --- a/arch/powerpc/include/asm/rtas.h
>> +++ b/arch/powerpc/include/asm/rtas.h
>> @@ -380,5 +380,13 @@ static inline void rtas_initialize(void) { }
>> static inline void read_24x7_sys_info(void) { }
>> #endif
>> 
>> +#ifdef CONFIG_PPC_PERF_CTRS
>> +void mobility_pmu_disable(void);
>> +void mobility_pmu_enable(void);
>> +#else
>> +static inline void mobility_pmu_disable(void) { }
>> +static inline void mobility_pmu_enable(void) { }
>> +#endif
>> +
>> #endif /* __KERNEL__ */
>> #endif /* _POWERPC_RTAS_H */
> 
> It's not implemented in rtas, maybe consider putting this into a perf 
> header?

Hi Nick,
Thanks for the review comments.

Sure, I will move this to perf_event header file

> 
>> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
>> index bb0ee71..90da7fa 100644
>> --- a/arch/powerpc/perf/core-book3s.c
>> +++ b/arch/powerpc/perf/core-book3s.c
>> @@ -18,6 +18,7 @@
>> #include <asm/ptrace.h>
>> #include <asm/code-patching.h>
>> #include <asm/interrupt.h>
>> +#include <asm/rtas.h>
>> 
>> #ifdef CONFIG_PPC64
>> #include "internal.h"
>> @@ -58,6 +59,7 @@ struct cpu_hw_events {
>> 
>> 	/* Store the PMC values */
>> 	unsigned long pmcs[MAX_HWEVENTS];
>> +	int migrate;
>> };
>> 
>> static DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
>> @@ -1335,6 +1337,22 @@ static void power_pmu_disable(struct pmu *pmu)
>> }
>> 
>> /*
>> + * Called from powerpc mobility code
>> + * before migration to disable counters
>> + * if the PMU is active.
>> + */
>> +void mobility_pmu_disable(void)
>> +{
>> +	struct cpu_hw_events *cpuhw;
>> +
>> +	cpuhw = this_cpu_ptr(&cpu_hw_events);
>> +	if (cpuhw->n_events != 0) {
>> +		power_pmu_disable(NULL);
>> +		cpuhw->migrate = 1;
>> +	}
> 
> Rather than add this migrate logic into power_pmu_enable(), would you be 
> able to do the work in the mobility callbacks instead? Something like
> this:
> 
> void mobility_pmu_disable(void)
> {
>        struct cpu_hw_events *cpuhw;
> 
>        cpuhw = this_cpu_ptr(&cpu_hw_events);
>        if (cpuhw->n_events != 0) {
>                int i;
> 
>                power_pmu_disable(NULL);
>                /*
>                 * Read off any pre-existing events because the register
>                 * values may not be migrated.
>                 */
>                for (i = 0; i < cpuhw->n_events; ++i) {
>                        event = cpuhw->event[i];
>                        if (event->hw.idx) {
>                                power_pmu_read(event);
>                                event->hw.idx = 0;
>                        }
>                }
>        }
> }
> 
> void mobility_pmu_enable(void)
> {
>        struct cpu_hw_events *cpuhw;
> 
>        cpuhw = this_cpu_ptr(&cpu_hw_events);
>        cpuhw->n_added = cpuhw->n_events;
>        power_pmu_enable(NULL);
> }
> 

Ok Nick, Will address these changes in next version.
With this approach, I won’t need the “migrate” field.
>> +}
>> +
>> +/*
>>  * Re-enable all events if disable == 0.
>>  * If we were previously disabled and events were added, then
>>  * put the new config on the PMU.
>> @@ -1379,8 +1397,10 @@ static void power_pmu_enable(struct pmu *pmu)
>> 	 * no need to recalculate MMCR* settings and reset the PMCs.
>> 	 * Just reenable the PMU with the current MMCR* settings
>> 	 * (possibly updated for removal of events).
>> +	 * While reenabling PMU during partition migration, continue
>> +	 * with normal flow.
>> 	 */
>> -	if (!cpuhw->n_added) {
>> +	if (!cpuhw->n_added && !cpuhw->migrate) {
>> 		mtspr(SPRN_MMCRA, cpuhw->mmcr.mmcra & ~MMCRA_SAMPLE_ENABLE);
>> 		mtspr(SPRN_MMCR1, cpuhw->mmcr.mmcr1);
>> 		if (ppmu->flags & PPMU_ARCH_31)
>> @@ -1434,11 +1454,15 @@ static void power_pmu_enable(struct pmu *pmu)
>> 	/*
>> 	 * Read off any pre-existing events that need to move
>> 	 * to another PMC.
>> +	 * While enabling PMU during partition migration,
>> +	 * skip power_pmu_read since all event count settings
>> +	 * needs to be re-initialised after migration.
>> 	 */
>> 	for (i = 0; i < cpuhw->n_events; ++i) {
>> 		event = cpuhw->event[i];
>> -		if (event->hw.idx && event->hw.idx != hwc_index[i] + 1) {
>> -			power_pmu_read(event);
>> +		if ((event->hw.idx && event->hw.idx != hwc_index[i] + 1) || (cpuhw->migrate)) {
>> +			if (!cpuhw->migrate)
>> +				power_pmu_read(event);
>> 			write_pmc(event->hw.idx, 0);
>> 			event->hw.idx = 0;
>> 		}
>> @@ -1506,6 +1530,20 @@ static void power_pmu_enable(struct pmu *pmu)
>> 	local_irq_restore(flags);
>> }
>> 
>> +/*
>> + * Called from powerpc mobility code
>> + * during migration completion to
>> + * enable back PMU counters.
>> + */
>> +void mobility_pmu_enable(void)
>> +{
>> +	struct cpu_hw_events *cpuhw;
>> +
>> +	cpuhw = this_cpu_ptr(&cpu_hw_events);
>> +	power_pmu_enable(NULL);
>> +	cpuhw->migrate = 0;
>> +}
>> +
>> static int collect_events(struct perf_event *group, int max_count,
>> 			  struct perf_event *ctrs[], u64 *events,
>> 			  unsigned int *flags)
>> diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
>> index e83e089..ff7a77c 100644
>> --- a/arch/powerpc/platforms/pseries/mobility.c
>> +++ b/arch/powerpc/platforms/pseries/mobility.c
>> @@ -476,6 +476,8 @@ static int do_join(void *arg)
>> retry:
>> 	/* Must ensure MSR.EE off for H_JOIN. */
>> 	hard_irq_disable();
>> +	/* Disable PMU before suspend */
>> +	mobility_pmu_disable();
>> 	hvrc = plpar_hcall_norets(H_JOIN);
>> 
>> 	switch (hvrc) {
> 
> Does the retry case cause mobility_pmu_disable to be called twice 
> without mobility_pmu_enable called? Would be neater if it was
> balanced.
> 

I will be moving these mobility callbacks to “pseries_migrate_partition” as Nathan suggested.
Will send a V3 with new changes. 

Thanks
Athira

> 
>> @@ -530,6 +532,8 @@ static int do_join(void *arg)
>> 	 * reset the watchdog.
>> 	 */
>> 	touch_nmi_watchdog();
>> +	/* Enable PMU after resuming */
>> +	mobility_pmu_enable();
>> 	return ret;
>> }
> 
> Not really a big deal but while you're updating it, you could leave 
> touch_nmi_watchdog() at the very end.
> 
> Thanks,
> Nick


[-- Attachment #2: Type: text/html, Size: 59543 bytes --]

^ permalink raw reply

* Re: [PATCH] locking: remove spin_lock_flags() etc
From: Waiman Long @ 2021-10-25 18:25 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-ia64, Peter Zijlstra, James E.J. Bottomley, Paul Mackerras,
	Alexander Gordeev, Will Deacon, Jonas Bonn, linux-s390,
	Arnd Bergmann, Helge Deller, Christian Borntraeger, Ingo Molnar,
	Heiko Carstens, Vasily Gorbik, Boqun Feng, Stefan Kristiansson,
	Openrisc, Stafford Horne, Parisc List, Linux Kernel Mailing List,
	linuxppc-dev
In-Reply-To: <CAK8P3a3JEBF-dEg0iVThMMRNK3CDxY+mRtTeStMusycnethO_g@mail.gmail.com>

On 10/25/21 11:44 AM, Arnd Bergmann wrote:
> On Mon, Oct 25, 2021 at 5:28 PM Waiman Long <longman@redhat.com> wrote:
>> On 10/25/21 9:06 AM, Arnd Bergmann wrote:
>>> On s390, we pick between the cmpxchg() based directed-yield when
>>> running on virtualized CPUs, and a normal qspinlock when running on a
>>> dedicated CPU.
>> I am not aware that s390 is using qspinlocks at all as I don't see
>> ARCH_USE_QUEUED_SPINLOCKS being set anywhere under arch/s390. I only see
>> that it uses a cmpxchg based spinlock.
> Sorry, I should not have said "normal" here. See arch/s390/lib/spinlock.c
> for their custom queued spinlocks as implemented in arch_spin_lock_queued().
> I don't know if that code actually does the same thing as the generic qspinlock,
> but it seems at least similar.

Yes, you are right. Their queued lock code looks like a custom version 
of the pvqspinlock code.

Cheers,
Longman


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox