LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v12 21/31] mm: Introduce find_vma_rcu()
From: Laurent Dufour @ 2019-04-24  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: jack, sergey.senozhatsky.work, Will Deacon, mhocko, linux-mm,
	paulus, Punit Agrawal, hpa, Michel Lespinasse, Alexei Starovoitov,
	Andrea Arcangeli, ak, Minchan Kim, aneesh.kumar, x86,
	Matthew Wilcox, Daniel Jordan, Ingo Molnar, David Rientjes,
	paulmck, Haiyan Song, npiggin, sj38.park, Jerome Glisse, dave,
	kemi.wang, kirill, Thomas Gleixner, zhong jiang, Ganesh Mahendran,
	Yang Shi, Mike Rapoport, linuxppc-dev, linux-kernel,
	Sergey Senozhatsky, vinayak menon, akpm, Tim Chen, haren
In-Reply-To: <20190423092710.GI11158@hirez.programming.kicks-ass.net>

Le 23/04/2019 à 11:27, Peter Zijlstra a écrit :
> On Tue, Apr 16, 2019 at 03:45:12PM +0200, Laurent Dufour wrote:
>> This allows to search for a VMA structure without holding the mmap_sem.
>>
>> The search is repeated while the mm seqlock is changing and until we found
>> a valid VMA.
>>
>> While under the RCU protection, a reference is taken on the VMA, so the
>> caller must call put_vma() once it not more need the VMA structure.
>>
>> At the time a VMA is inserted in the MM RB tree, in vma_rb_insert(), a
>> reference is taken to the VMA by calling get_vma().
>>
>> When removing a VMA from the MM RB tree, the VMA is not release immediately
>> but at the end of the RCU grace period through vm_rcu_put(). This ensures
>> that the VMA remains allocated until the end the RCU grace period.
>>
>> Since the vm_file pointer, if valid, is released in put_vma(), there is no
>> guarantee that the file pointer will be valid on the returned VMA.
> 
> What I'm missing here, and in the previous patch introducing the
> refcount (also see refcount_t), is _why_ we need the refcount thing at
> all.

The need for the VMA's refcount is to ensure that the VMA will remain 
until the end of the SPF handler. This is a consequence of the use of 
RCU instead of SRCU to protect the RB tree.

I was not aware of the refcount_t type, it would be better here to avoid 
wrapping.

> My original plan was to use SRCU, which at the time was not complete
> enough so I abused/hacked preemptible RCU, but that is no longer the
> case, SRCU has all the required bits and pieces.

When I did test using SRCU it was involving a lot a scheduling to run 
the SRCU callback mechanism. In some workload the impact on the 
perfomance was significant [1].

I can't see this overhead using RCU.

> 
> Also; the initial motivation was prefaulting large VMAs and the
> contention on mmap was killing things; but similarly, the contention on
> the refcount (I did try that) killed things just the same.

Doing prefaulting should be doable, I'll try to think further about that.

Regarding the refcount, I should I missed something, this is an atomic 
counter, so there should not be contention on it but cache exclusivity, 
not ideal I agree but I can't see what else to use here.

> So I'm really sad to see the refcount return; and without any apparent
> justification.

I'm not opposed to use another mechanism here, but SRCU didn't show good 
performance with some workload, and I can't see how to use RCU without a 
reference counter here. So please, advise.

Thanks,
Laurent.

[1] 
https://lore.kernel.org/linux-mm/7ca80231-fe02-a3a7-84bc-ce81690ea051@intel.com/


^ permalink raw reply

* Re: [PATCHv2] kernel/crash: make parse_crashkernel()'s return value more indicant
From: Matthias Brugger @ 2019-04-24  8:31 UTC (permalink / raw)
  To: Pingfan Liu, linux-kernel
  Cc: Rich Felker, linux-ia64, Julien Thierry, Yangtao Li,
	Palmer Dabbelt, Heiko Carstens, Stefan Agner, Paul Mackerras,
	H. Peter Anvin, Catalin Marinas, Thomas Gleixner, Logan Gunthorpe,
	linux-s390, Florian Fainelli, Yoshinori Sato, linux-sh, x86,
	Russell King, Ingo Molnar, Hari Bathini, James Hogan, Dave Young,
	Fenghua Yu, Will Deacon, Johannes Weiner, Borislav Petkov,
	David Hildenbrand, linux-arm-kernel, Jens Axboe, Tony Luck,
	Baoquan He, Ard Biesheuvel, Robin Murphy, linux-mips,
	Ralf Baechle, Thomas Bogendoerfer, Paul Burton,
	Greg Kroah-Hartman, Martin Schwidefsky, Andrew Morton,
	linuxppc-dev, Greg Hackmann
In-Reply-To: <1556087581-14513-1-git-send-email-kernelfans@gmail.com>



On 24/04/2019 08:33, Pingfan Liu wrote:
> At present, both return and crash_size should be checked to guarantee the
> success of parse_crashkernel().
> 
> Take a close look at the cases, which causes crash_size=0. Beside syntax
> error, three cases cause parsing to get crash_size=0.
> -1st. in parse_crashkernel_mem(), the demanded crash size is bigger than
>  system ram.
> -2nd. in parse_crashkernel_mem(), the system ram size does not match any
>  item in the range list.
> -3rd. "crashkernel=0MB", which is impractical.
> 
> All these cases can be treated as invalid argument.
> 
> By this way, only need a simple check on return value of
> parse_crashkernel().
> 
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> Cc: Russell King <linux@armlinux.org.uk>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Fenghua Yu <fenghua.yu@intel.com>
> Cc: Ralf Baechle <ralf@linux-mips.org>
> Cc: Paul Burton <paul.burton@mips.com>
> Cc: James Hogan <jhogan@kernel.org>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
> Cc: Rich Felker <dalias@libc.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Julien Thierry <julien.thierry@arm.com>
> Cc: Palmer Dabbelt <palmer@sifive.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Cc: Florian Fainelli <f.fainelli@gmail.com>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Greg Hackmann <ghackmann@android.com>
> Cc: Stefan Agner <stefan@agner.ch>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Thomas Bogendoerfer <tbogendoerfer@suse.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Hari Bathini <hbathini@linux.ibm.com>
> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
> Cc: Yangtao Li <tiny.windzz@gmail.com>
> Cc: Dave Young <dyoung@redhat.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: x86@kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-ia64@vger.kernel.org
> Cc: linux-mips@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s390@vger.kernel.org
> Cc: linux-sh@vger.kernel.org
> ---
> v1 -> v2: On error, return -EINVAL for all failure cases
> 
>  arch/arm/kernel/setup.c             |  2 +-
>  arch/arm64/mm/init.c                |  2 +-
>  arch/ia64/kernel/setup.c            |  2 +-
>  arch/mips/kernel/setup.c            |  2 +-
>  arch/powerpc/kernel/fadump.c        |  2 +-
>  arch/powerpc/kernel/machine_kexec.c |  2 +-
>  arch/s390/kernel/setup.c            |  2 +-
>  arch/sh/kernel/machine_kexec.c      |  2 +-
>  arch/x86/kernel/setup.c             |  4 ++--
>  kernel/crash_core.c                 | 10 +++++++++-
>  10 files changed, 19 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
> index 5d78b6a..2feab13 100644
> --- a/arch/arm/kernel/setup.c
> +++ b/arch/arm/kernel/setup.c
> @@ -997,7 +997,7 @@ static void __init reserve_crashkernel(void)
>  	total_mem = get_total_mem();
>  	ret = parse_crashkernel(boot_command_line, total_mem,
>  				&crash_size, &crash_base);
> -	if (ret)
> +	if (ret < 0)
>  		return;
>  
>  	if (crash_base <= 0) {
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 6bc1350..240918c 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -79,7 +79,7 @@ static void __init reserve_crashkernel(void)
>  	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
>  				&crash_size, &crash_base);
>  	/* no crashkernel= or invalid value specified */
> -	if (ret || !crash_size)
> +	if (ret < 0)
>  		return;
>  
>  	crash_size = PAGE_ALIGN(crash_size);
> diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
> index 583a374..3bbb58b 100644
> --- a/arch/ia64/kernel/setup.c
> +++ b/arch/ia64/kernel/setup.c
> @@ -277,7 +277,7 @@ static void __init setup_crashkernel(unsigned long total, int *n)
>  
>  	ret = parse_crashkernel(boot_command_line, total,
>  			&size, &base);
> -	if (ret == 0 && size > 0) {
> +	if (!ret) {
>  		if (!base) {
>  			sort_regions(rsvd_region, *n);
>  			*n = merge_regions(rsvd_region, *n);
> diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c
> index 8d1dc6c..168571b 100644
> --- a/arch/mips/kernel/setup.c
> +++ b/arch/mips/kernel/setup.c
> @@ -715,7 +715,7 @@ static void __init mips_parse_crashkernel(void)
>  	total_mem = get_total_mem();
>  	ret = parse_crashkernel(boot_command_line, total_mem,
>  				&crash_size, &crash_base);
> -	if (ret != 0 || crash_size <= 0)
> +	if (ret < 0)
>  		return;
>  
>  	if (!memory_region_available(crash_base, crash_size)) {
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 45a8d0b..3571504 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -376,7 +376,7 @@ static inline unsigned long fadump_calculate_reserve_size(void)
>  	 */
>  	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
>  				&size, &base);
> -	if (ret == 0 && size > 0) {
> +	if (!ret) {
>  		unsigned long max_size;
>  
>  		if (fw_dump.reserve_bootvar)
> diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c
> index 63f5a93..1697ad2 100644
> --- a/arch/powerpc/kernel/machine_kexec.c
> +++ b/arch/powerpc/kernel/machine_kexec.c
> @@ -122,7 +122,7 @@ void __init reserve_crashkernel(void)
>  	/* use common parsing */
>  	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
>  			&crash_size, &crash_base);
> -	if (ret == 0 && crash_size > 0) {
> +	if (!ret) {
>  		crashk_res.start = crash_base;
>  		crashk_res.end = crash_base + crash_size - 1;
>  	}
> diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
> index 2c642af..d4bd61b 100644
> --- a/arch/s390/kernel/setup.c
> +++ b/arch/s390/kernel/setup.c
> @@ -671,7 +671,7 @@ static void __init reserve_crashkernel(void)
>  
>  	crash_base = ALIGN(crash_base, KEXEC_CRASH_MEM_ALIGN);
>  	crash_size = ALIGN(crash_size, KEXEC_CRASH_MEM_ALIGN);
> -	if (rc || crash_size == 0)
> +	if (rc < 0)
>  		return;
>  
>  	if (memblock.memory.regions[0].size < crash_size) {
> diff --git a/arch/sh/kernel/machine_kexec.c b/arch/sh/kernel/machine_kexec.c
> index 63d63a3..3c03240 100644
> --- a/arch/sh/kernel/machine_kexec.c
> +++ b/arch/sh/kernel/machine_kexec.c
> @@ -157,7 +157,7 @@ void __init reserve_crashkernel(void)
>  
>  	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
>  			&crash_size, &crash_base);
> -	if (ret == 0 && crash_size > 0) {
> +	if (!ret) {
>  		crashk_res.start = crash_base;
>  		crashk_res.end = crash_base + crash_size - 1;
>  	}
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 3d872a5..592d5ad 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -526,11 +526,11 @@ static void __init reserve_crashkernel(void)
>  
>  	/* crashkernel=XM */
>  	ret = parse_crashkernel(boot_command_line, total_mem, &crash_size, &crash_base);
> -	if (ret != 0 || crash_size <= 0) {
> +	if (ret < 0) {
>  		/* crashkernel=X,high */
>  		ret = parse_crashkernel_high(boot_command_line, total_mem,
>  					     &crash_size, &crash_base);
> -		if (ret != 0 || crash_size <= 0)
> +		if (ret < 0)
>  			return;
>  		high = true;
>  	}
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 093c9f9..83ee4a9 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -108,8 +108,10 @@ static int __init parse_crashkernel_mem(char *cmdline,
>  				return -EINVAL;
>  			}
>  		}
> -	} else
> +	} else {
>  		pr_info("crashkernel size resulted in zero bytes\n");
> +		return -EINVAL;
> +	}
>  
>  	return 0;
>  }
> @@ -139,6 +141,8 @@ static int __init parse_crashkernel_simple(char *cmdline,
>  		pr_warn("crashkernel: unrecognized char: %c\n", *cur);
>  		return -EINVAL;
>  	}
> +	if (*crash_size == 0)
> +		return -EINVAL;

This covers the case where I pass an argument like "crashkernel=0M" ?
Can't we fix that by using kstrtoull() in memparse and check if the return value
is < 0? In that case we could return without updating the retptr and we will be
fine.

>  
>  	return 0;
>  }
> @@ -181,6 +185,8 @@ static int __init parse_crashkernel_suffix(char *cmdline,
>  		pr_warn("crashkernel: unrecognized char: %c\n", *cur);
>  		return -EINVAL;
>  	}
> +	if (*crash_size == 0)
> +		return -EINVAL;

Same here.

>  
>  	return 0;
>  }
> @@ -266,6 +272,8 @@ static int __init __parse_crashkernel(char *cmdline,
>  /*
>   * That function is the entry point for command line parsing and should be
>   * called from the arch-specific code.
> + * On success 0. On error for either syntax error or crash_size=0, -EINVAL is
> + * returned.
>   */
>  int __init parse_crashkernel(char *cmdline,
>  			     unsigned long long system_ram,
> 

^ permalink raw reply

* Re: [PATCH v3 3/6] powerpc/mm: Add helpers for accessing hash translation related variables
From: Christophe Leroy @ 2019-04-24 10:16 UTC (permalink / raw)
  To: Aneesh Kumar K.V, npiggin, paulus, mpe; +Cc: linuxppc-dev
In-Reply-To: <20190417130351.3805-4-aneesh.kumar@linux.ibm.com>



Le 17/04/2019 à 15:03, Aneesh Kumar K.V a écrit :
> We want to switch to allocating them runtime only when hash translation is
> enabled. Add helpers so that both book3s and nohash can be adapted to
> upcoming change easily.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>   arch/powerpc/include/asm/book3s/64/mmu-hash.h |  4 +-
>   arch/powerpc/include/asm/book3s/64/mmu.h      | 63 ++++++++++++++++++-
>   arch/powerpc/include/asm/nohash/32/mmu-8xx.h  | 50 +++++++++++++++
>   arch/powerpc/kernel/paca.c                    | 12 ++--
>   arch/powerpc/mm/hash_utils_64.c               | 10 +--
>   arch/powerpc/mm/slb.c                         |  2 +-
>   arch/powerpc/mm/slice.c                       | 49 +++++++--------
>   arch/powerpc/mm/subpage-prot.c                |  8 +--
>   8 files changed, 154 insertions(+), 44 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> index a28a28079edb..eb36fbfe4ef5 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> @@ -657,8 +657,8 @@ extern void slb_set_size(u16 size);
>   
>   /* 4 bits per slice and we have one slice per 1TB */
>   #define SLICE_ARRAY_SIZE	(H_PGTABLE_RANGE >> 41)
> -#define TASK_SLICE_ARRAY_SZ(x)	((x)->context.slb_addr_limit >> 41)
> -
> +#define LOW_SLICE_ARRAY_SZ	(BITS_PER_LONG / BITS_PER_BYTE)
> +#define TASK_SLICE_ARRAY_SZ(x)	((x)->slb_addr_limit >> 41)
>   #ifndef __ASSEMBLY__
>   
>   #ifdef CONFIG_PPC_SUBPAGE_PROT
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
> index 484a8ff9b338..28213a36fef7 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> @@ -124,7 +124,7 @@ typedef struct {
>   	struct npu_context *npu_context;
>   
>   	 /* SLB page size encodings*/
> -	unsigned char low_slices_psize[BITS_PER_LONG / BITS_PER_BYTE];
> +	unsigned char low_slices_psize[LOW_SLICE_ARRAY_SZ];
>   	unsigned char high_slices_psize[SLICE_ARRAY_SIZE];
>   	unsigned long slb_addr_limit;
>   # ifdef CONFIG_PPC_64K_PAGES
> @@ -159,6 +159,67 @@ typedef struct {
>   #endif
>   } mm_context_t;
>   
> +static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
> +{
> +	return ctx->user_psize;
> +}
> +
> +static inline void mm_ctx_set_user_psize(mm_context_t *ctx, u16 user_psize)
> +{
> +	ctx->user_psize = user_psize;
> +}
> +
> +static inline unsigned char *mm_ctx_low_slices(mm_context_t *ctx)
> +{
> +	return ctx->low_slices_psize;
> +}
> +
> +static inline unsigned char *mm_ctx_high_slices(mm_context_t *ctx)
> +{
> +	return ctx->high_slices_psize;
> +}
> +
> +static inline unsigned long mm_ctx_slb_addr_limit(mm_context_t *ctx)
> +{
> +	return ctx->slb_addr_limit;
> +}
> +
> +static inline void mm_ctx_set_slb_addr_limit(mm_context_t *ctx, unsigned long limit)
> +{
> +	ctx->slb_addr_limit = limit;
> +}
> +
> +#ifdef CONFIG_PPC_64K_PAGES
> +static inline struct slice_mask *mm_ctx_slice_mask_64k(mm_context_t *ctx)
> +{
> +	return &ctx->mask_64k;
> +}
> +#endif
> +
> +static inline struct slice_mask *mm_ctx_slice_mask_4k(mm_context_t *ctx)
> +{
> +	return &ctx->mask_4k;
> +}
> +
> +#ifdef CONFIG_HUGETLB_PAGE
> +static inline struct slice_mask *mm_ctx_slice_mask_16m(mm_context_t *ctx)
> +{
> +	return &ctx->mask_16m;
> +}
> +
> +static inline struct slice_mask *mm_ctx_slice_mask_16g(mm_context_t *ctx)
> +{
> +	return &ctx->mask_16g;
> +}
> +#endif

I think it would be better to move slice_mask_for_size() into mmu.h 
instead of defining those helpers.

> +
> +#ifdef CONFIG_PPC_SUBPAGE_PROT
> +static inline struct subpage_prot_table *mm_ctx_subpage_prot(mm_context_t *ctx)
> +{
> +	return &ctx->spt;
> +}
> +#endif
> +
>   /*
>    * The current system page and segment sizes
>    */
> diff --git a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
> index 0a1a3fc54e54..0f4b0b50e5ad 100644
> --- a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
> +++ b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
> @@ -167,6 +167,7 @@
>   #ifdef CONFIG_PPC_MM_SLICES
>   #include <asm/nohash/32/slice.h>
>   #define SLICE_ARRAY_SIZE	(1 << (32 - SLICE_LOW_SHIFT - 1))
> +#define LOW_SLICE_ARRAY_SZ	SLICE_ARRAY_SIZE
>   #endif
>   
>   #ifndef __ASSEMBLY__
> @@ -193,6 +194,55 @@ typedef struct {
>   	void *pte_frag;
>   } mm_context_t;
>   
> +#ifdef CONFIG_PPC_MM_SLICES
> +static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
> +{
> +	return ctx->user_psize;
> +}
> +
> +static inline void mm_ctx_set_user_psize(mm_context_t *ctx, u16 user_psize)
> +{
> +	ctx->user_psize = user_psize;
> +}
> +
> +static inline unsigned char *mm_ctx_low_slices(mm_context_t *ctx)
> +{
> +	return ctx->low_slices_psize;
> +}
> +
> +static inline unsigned char *mm_ctx_high_slices(mm_context_t *ctx)
> +{
> +	return ctx->high_slices_psize;
> +}
> +
> +static inline unsigned long mm_ctx_slb_addr_limit(mm_context_t *ctx)
> +{
> +	return ctx->slb_addr_limit;
> +}
> +
> +static inline void mm_ctx_set_slb_addr_limit(mm_context_t *ctx, unsigned long limit)
> +{
> +	ctx->slb_addr_limit = limit;
> +}
> +
> +static inline struct slice_mask *mm_ctx_slice_mask_base(mm_context_t *ctx)
> +{
> +	return &ctx->mask_base_psize;
> +}
> +
> +#ifdef CONFIG_HUGETLB_PAGE
> +static inline struct slice_mask *mm_ctx_slice_mask_512k(mm_context_t *ctx)
> +{
> +	return &ctx->mask_512k;
> +}
> +
> +static inline struct slice_mask *mm_ctx_slice_mask_8m(mm_context_t *ctx)
> +{
> +	return &ctx->mask_8m;
> +}
> +#endif

The 3 helpers above are never used, I think we will never need them.

What would be good is to move slice_mask_for_size() here in mmu-8xx.h

I'll rebase my series on top of yours since Michael has already merged it.

Christophe

> +#endif /* CONFIG_PPC_MM_SLICE */
> +
>   #define PHYS_IMMR_BASE (mfspr(SPRN_IMMR) & 0xfff80000)
>   #define VIRT_IMMR_BASE (__fix_to_virt(FIX_IMMR_BASE))
>   
> diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
> index e7382abee868..9cc91d03ab62 100644
> --- a/arch/powerpc/kernel/paca.c
> +++ b/arch/powerpc/kernel/paca.c
> @@ -267,12 +267,12 @@ void copy_mm_to_paca(struct mm_struct *mm)
>   
>   	get_paca()->mm_ctx_id = context->id;
>   #ifdef CONFIG_PPC_MM_SLICES
> -	VM_BUG_ON(!mm->context.slb_addr_limit);
> -	get_paca()->mm_ctx_slb_addr_limit = mm->context.slb_addr_limit;
> -	memcpy(&get_paca()->mm_ctx_low_slices_psize,
> -	       &context->low_slices_psize, sizeof(context->low_slices_psize));
> -	memcpy(&get_paca()->mm_ctx_high_slices_psize,
> -	       &context->high_slices_psize, TASK_SLICE_ARRAY_SZ(mm));
> +	VM_BUG_ON(!mm_ctx_slb_addr_limit(context));
> +	get_paca()->mm_ctx_slb_addr_limit = mm_ctx_slb_addr_limit(context);
> +	memcpy(&get_paca()->mm_ctx_low_slices_psize, mm_ctx_low_slices(context),
> +	       LOW_SLICE_ARRAY_SZ);
> +	memcpy(&get_paca()->mm_ctx_high_slices_psize, mm_ctx_high_slices(context),
> +	       TASK_SLICE_ARRAY_SZ(context));
>   #else /* CONFIG_PPC_MM_SLICES */
>   	get_paca()->mm_ctx_user_psize = context->user_psize;
>   	get_paca()->mm_ctx_sllp = context->sllp;
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index 0a4f939a8161..5a2bd132f92e 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -1147,7 +1147,7 @@ void demote_segment_4k(struct mm_struct *mm, unsigned long addr)
>    */
>   static int subpage_protection(struct mm_struct *mm, unsigned long ea)
>   {
> -	struct subpage_prot_table *spt = &mm->context.spt;
> +	struct subpage_prot_table *spt = mm_ctx_subpage_prot(&mm->context);
>   	u32 spp = 0;
>   	u32 **sbpm, *sbpp;
>   
> @@ -1470,7 +1470,7 @@ static bool should_hash_preload(struct mm_struct *mm, unsigned long ea)
>   	int psize = get_slice_psize(mm, ea);
>   
>   	/* We only prefault standard pages for now */
> -	if (unlikely(psize != mm->context.user_psize))
> +	if (unlikely(psize != mm_ctx_user_psize(&mm->context)))
>   		return false;
>   
>   	/*
> @@ -1549,7 +1549,7 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
>   
>   	/* Hash it in */
>   #ifdef CONFIG_PPC_64K_PAGES
> -	if (mm->context.user_psize == MMU_PAGE_64K)
> +	if (mm_ctx_user_psize(&mm->context) == MMU_PAGE_64K)
>   		rc = __hash_page_64K(ea, access, vsid, ptep, trap,
>   				     update_flags, ssize);
>   	else
> @@ -1562,8 +1562,8 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
>   	 */
>   	if (rc == -1)
>   		hash_failure_debug(ea, access, vsid, trap, ssize,
> -				   mm->context.user_psize,
> -				   mm->context.user_psize,
> +				   mm_ctx_user_psize(&mm->context),
> +				   mm_ctx_user_psize(&mm->context),
>   				   pte_val(*ptep));
>   out_exit:
>   	local_irq_restore(flags);
> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
> index 5986df48359b..78c0c0a0e355 100644
> --- a/arch/powerpc/mm/slb.c
> +++ b/arch/powerpc/mm/slb.c
> @@ -739,7 +739,7 @@ static long slb_allocate_user(struct mm_struct *mm, unsigned long ea)
>   	 * consider this as bad access if we take a SLB miss
>   	 * on an address above addr limit.
>   	 */
> -	if (ea >= mm->context.slb_addr_limit)
> +	if (ea >= mm_ctx_slb_addr_limit(&mm->context))
>   		return -EFAULT;
>   
>   	context = get_user_context(&mm->context, ea);
> diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
> index aec91dbcdc0b..35b278082391 100644
> --- a/arch/powerpc/mm/slice.c
> +++ b/arch/powerpc/mm/slice.c
> @@ -101,7 +101,7 @@ static int slice_area_is_free(struct mm_struct *mm, unsigned long addr,
>   {
>   	struct vm_area_struct *vma;
>   
> -	if ((mm->context.slb_addr_limit - len) < addr)
> +	if ((mm_ctx_slb_addr_limit(&mm->context) - len) < addr)
>   		return 0;
>   	vma = find_vma(mm, addr);
>   	return (!vma || (addr + len) <= vm_start_gap(vma));
> @@ -155,15 +155,15 @@ static struct slice_mask *slice_mask_for_size(struct mm_struct *mm, int psize)
>   {
>   #ifdef CONFIG_PPC_64K_PAGES
>   	if (psize == MMU_PAGE_64K)
> -		return &mm->context.mask_64k;
> +		return mm_ctx_slice_mask_64k(&mm->context);
>   #endif
>   	if (psize == MMU_PAGE_4K)
> -		return &mm->context.mask_4k;
> +		return mm_ctx_slice_mask_4k(&mm->context);
>   #ifdef CONFIG_HUGETLB_PAGE
>   	if (psize == MMU_PAGE_16M)
> -		return &mm->context.mask_16m;
> +		return mm_ctx_slice_mask_16m(&mm->context);
>   	if (psize == MMU_PAGE_16G)
> -		return &mm->context.mask_16g;
> +		return mm_ctx_slice_mask_16g(&mm->context);
>   #endif
>   	BUG();
>   }
> @@ -253,7 +253,7 @@ static void slice_convert(struct mm_struct *mm,
>   	 */
>   	spin_lock_irqsave(&slice_convert_lock, flags);
>   
> -	lpsizes = mm->context.low_slices_psize;
> +	lpsizes = mm_ctx_low_slices(&mm->context);
>   	for (i = 0; i < SLICE_NUM_LOW; i++) {
>   		if (!(mask->low_slices & (1u << i)))
>   			continue;
> @@ -272,8 +272,8 @@ static void slice_convert(struct mm_struct *mm,
>   				(((unsigned long)psize) << (mask_index * 4));
>   	}
>   
> -	hpsizes = mm->context.high_slices_psize;
> -	for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->context.slb_addr_limit); i++) {
> +	hpsizes = mm_ctx_high_slices(&mm->context);
> +	for (i = 0; i < GET_HIGH_SLICE_INDEX(mm_ctx_slb_addr_limit(&mm->context)); i++) {
>   		if (!test_bit(i, mask->high_slices))
>   			continue;
>   
> @@ -292,8 +292,8 @@ static void slice_convert(struct mm_struct *mm,
>   	}
>   
>   	slice_dbg(" lsps=%lx, hsps=%lx\n",
> -		  (unsigned long)mm->context.low_slices_psize,
> -		  (unsigned long)mm->context.high_slices_psize);
> +		  (unsigned long)mm_ctx_low_slices(&mm->context),
> +		  (unsigned long)mm_ctx_high_slices(&mm->context));
>   
>   	spin_unlock_irqrestore(&slice_convert_lock, flags);
>   
> @@ -393,7 +393,7 @@ static unsigned long slice_find_area_topdown(struct mm_struct *mm,
>   	 * DEFAULT_MAP_WINDOW we should apply this.
>   	 */
>   	if (high_limit > DEFAULT_MAP_WINDOW)
> -		addr += mm->context.slb_addr_limit - DEFAULT_MAP_WINDOW;
> +		addr += mm_ctx_slb_addr_limit(&mm->context) - DEFAULT_MAP_WINDOW;
>   
>   	while (addr > min_addr) {
>   		info.high_limit = addr;
> @@ -505,20 +505,20 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
>   			return -ENOMEM;
>   	}
>   
> -	if (high_limit > mm->context.slb_addr_limit) {
> +	if (high_limit > mm_ctx_slb_addr_limit(&mm->context)) {
>   		/*
>   		 * Increasing the slb_addr_limit does not require
>   		 * slice mask cache to be recalculated because it should
>   		 * be already initialised beyond the old address limit.
>   		 */
> -		mm->context.slb_addr_limit = high_limit;
> +		mm_ctx_set_slb_addr_limit(&mm->context, high_limit);
>   
>   		on_each_cpu(slice_flush_segments, mm, 1);
>   	}
>   
>   	/* Sanity checks */
>   	BUG_ON(mm->task_size == 0);
> -	BUG_ON(mm->context.slb_addr_limit == 0);
> +	BUG_ON(mm_ctx_slb_addr_limit(&mm->context) == 0);
>   	VM_BUG_ON(radix_enabled());
>   
>   	slice_dbg("slice_get_unmapped_area(mm=%p, psize=%d...\n", mm, psize);
> @@ -696,7 +696,7 @@ unsigned long arch_get_unmapped_area(struct file *filp,
>   				     unsigned long flags)
>   {
>   	return slice_get_unmapped_area(addr, len, flags,
> -				       current->mm->context.user_psize, 0);
> +				       mm_ctx_user_psize(&current->mm->context), 0);
>   }
>   
>   unsigned long arch_get_unmapped_area_topdown(struct file *filp,
> @@ -706,7 +706,7 @@ unsigned long arch_get_unmapped_area_topdown(struct file *filp,
>   					     const unsigned long flags)
>   {
>   	return slice_get_unmapped_area(addr0, len, flags,
> -				       current->mm->context.user_psize, 1);
> +				       mm_ctx_user_psize(&current->mm->context), 1);
>   }
>   
>   unsigned int get_slice_psize(struct mm_struct *mm, unsigned long addr)
> @@ -717,10 +717,10 @@ unsigned int get_slice_psize(struct mm_struct *mm, unsigned long addr)
>   	VM_BUG_ON(radix_enabled());
>   
>   	if (slice_addr_is_low(addr)) {
> -		psizes = mm->context.low_slices_psize;
> +		psizes = mm_ctx_low_slices(&mm->context);
>   		index = GET_LOW_SLICE_INDEX(addr);
>   	} else {
> -		psizes = mm->context.high_slices_psize;
> +		psizes = mm_ctx_high_slices(&mm->context);
>   		index = GET_HIGH_SLICE_INDEX(addr);
>   	}
>   	mask_index = index & 0x1;
> @@ -742,20 +742,19 @@ void slice_init_new_context_exec(struct mm_struct *mm)
>   	 * duplicated.
>   	 */
>   #ifdef CONFIG_PPC64
> -	mm->context.slb_addr_limit = DEFAULT_MAP_WINDOW_USER64;
> +	mm_ctx_set_slb_addr_limit(&mm->context, DEFAULT_MAP_WINDOW_USER64);
>   #else
>   	mm->context.slb_addr_limit = DEFAULT_MAP_WINDOW;
>   #endif
> -
> -	mm->context.user_psize = psize;
> +	mm_ctx_set_user_psize(&mm->context, psize);
>   
>   	/*
>   	 * Set all slice psizes to the default.
>   	 */
> -	lpsizes = mm->context.low_slices_psize;
> +	lpsizes = mm_ctx_low_slices(&mm->context);
>   	memset(lpsizes, (psize << 4) | psize, SLICE_NUM_LOW >> 1);
>   
> -	hpsizes = mm->context.high_slices_psize;
> +	hpsizes = mm_ctx_high_slices(&mm->context);
>   	memset(hpsizes, (psize << 4) | psize, SLICE_NUM_HIGH >> 1);
>   
>   	/*
> @@ -777,7 +776,7 @@ void slice_setup_new_exec(void)
>   	if (!is_32bit_task())
>   		return;
>   
> -	mm->context.slb_addr_limit = DEFAULT_MAP_WINDOW;
> +	mm_ctx_set_slb_addr_limit(&mm->context, DEFAULT_MAP_WINDOW);
>   }
>   #endif
>   
> @@ -816,7 +815,7 @@ int slice_is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
>   			   unsigned long len)
>   {
>   	const struct slice_mask *maskp;
> -	unsigned int psize = mm->context.user_psize;
> +	unsigned int psize = mm_ctx_user_psize(&mm->context);
>   
>   	VM_BUG_ON(radix_enabled());
>   
> diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c
> index 5e4178790dee..c72252542210 100644
> --- a/arch/powerpc/mm/subpage-prot.c
> +++ b/arch/powerpc/mm/subpage-prot.c
> @@ -25,7 +25,7 @@
>    */
>   void subpage_prot_free(struct mm_struct *mm)
>   {
> -	struct subpage_prot_table *spt = &mm->context.spt;
> +	struct subpage_prot_table *spt = mm_ctx_subpage_prot(&mm->context);
>   	unsigned long i, j, addr;
>   	u32 **p;
>   
> @@ -52,7 +52,7 @@ void subpage_prot_free(struct mm_struct *mm)
>   
>   void subpage_prot_init_new_context(struct mm_struct *mm)
>   {
> -	struct subpage_prot_table *spt = &mm->context.spt;
> +	struct subpage_prot_table *spt = mm_ctx_subpage_prot(&mm->context);
>   
>   	memset(spt, 0, sizeof(*spt));
>   }
> @@ -93,7 +93,7 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr,
>   static void subpage_prot_clear(unsigned long addr, unsigned long len)
>   {
>   	struct mm_struct *mm = current->mm;
> -	struct subpage_prot_table *spt = &mm->context.spt;
> +	struct subpage_prot_table *spt = mm_ctx_subpage_prot(&mm->context);
>   	u32 **spm, *spp;
>   	unsigned long i;
>   	size_t nw;
> @@ -189,7 +189,7 @@ SYSCALL_DEFINE3(subpage_prot, unsigned long, addr,
>   		unsigned long, len, u32 __user *, map)
>   {
>   	struct mm_struct *mm = current->mm;
> -	struct subpage_prot_table *spt = &mm->context.spt;
> +	struct subpage_prot_table *spt = mm_ctx_subpage_prot(&mm->context);
>   	u32 **spm, *spp;
>   	unsigned long i;
>   	size_t nw;
> 

^ permalink raw reply

* [PATCH v1 1/7] mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()
From: David Hildenbrand @ 2019-04-24 10:25 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-s390, Michal Hocko, linux-ia64, Pavel Tatashin, linux-sh,
	Mathieu Malaterre, David Hildenbrand, linux-kernel, Wei Yang,
	Qian Cai, Arun KS, akpm, linuxppc-dev, Dan Williams,
	Oscar Salvador
In-Reply-To: <20190424102511.29318-1-david@redhat.com>

By converting start and size to page granularity, we actually ignore
unaligned parts within a page instead of properly bailing out with an
error.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 328878b6799d..202febe88b58 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1050,16 +1050,11 @@ int try_online_node(int nid)
 
 static int check_hotplug_memory_range(u64 start, u64 size)
 {
-	unsigned long block_sz = memory_block_size_bytes();
-	u64 block_nr_pages = block_sz >> PAGE_SHIFT;
-	u64 nr_pages = size >> PAGE_SHIFT;
-	u64 start_pfn = PFN_DOWN(start);
-
 	/* memory range must be block size aligned */
-	if (!nr_pages || !IS_ALIGNED(start_pfn, block_nr_pages) ||
-	    !IS_ALIGNED(nr_pages, block_nr_pages)) {
+	if (!size || !IS_ALIGNED(start, memory_block_size_bytes()) ||
+	    !IS_ALIGNED(size, memory_block_size_bytes())) {
 		pr_err("Block size [%#lx] unaligned hotplug range: start %#llx, size %#llx",
-		       block_sz, start, size);
+		       memory_block_size_bytes(), start, size);
 		return -EINVAL;
 	}
 
-- 
2.20.1


^ permalink raw reply related

* [PATCH v1 0/7] mm/memory_hotplug: Factor out memory block device handling
From: David Hildenbrand @ 2019-04-24 10:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Oscar Salvador, Michal Hocko, linux-ia64, linux-sh,
	Peter Zijlstra, Dave Hansen, Heiko Carstens, Chris Wilson,
	Masahiro Yamada, Pavel Tatashin, Rich Felker, Arun KS,
	H. Peter Anvin, Ingo Molnar, Rafael J. Wysocki, Qian Cai,
	linux-s390, Baoquan He, Logan Gunthorpe, David Hildenbrand,
	Mike Rapoport, Ingo Molnar, Fenghua Yu, Pavel Tatashin,
	Vasily Gorbik, Rob Herring, mike.travis@hpe.com, Nicholas Piggin,
	Martin Schwidefsky, Mark Brown, Borislav Petkov, Andy Lutomirski,
	Jonathan Cameron, Dan Williams, Wei Yang, Joonsoo Kim,
	Oscar Salvador, Tony Luck, Yoshinori Sato, Andrew Banman,
	Mathieu Malaterre, Greg Kroah-Hartman, linux-kernel,
	Mike Rapoport, Thomas Gleixner, Wei Yang, Alex Deucher,
	Paul Mackerras, akpm, linuxppc-dev, David S. Miller,
	Kirill A. Shutemov

We only want memory block devices for memory to be onlined/offlined
(add/remove from the buddy). This is required so user space can
online/offline memory and kdump gets notified about newly onlined memory.

Only such memory has the requirement of having to span whole memory blocks.
Let's factor out creation/removal of memory block devices. This helps
to further cleanup arch_add_memory/arch_remove_memory() and to make
implementation of new features (e.g. sub-section hot-add) easier.

Patch 1 makes sure the memory block size granularity is always respected.
Patch 2 implements arch_remove_memory() on s390x. Patch 3 prepares
arch_remove_memory() to be also called without CONFIG_MEMORY_HOTREMOVE.
Patch 4,5 and 6 factor out creation/removal of memory block devices.
Patch 7 gets rid of some unlikely errors that could have happened, not
removinf links between memory block devices and nodes, previously brought
up by Oscar.

Did a quick sanity test with DIMM plug/unplug, making sure all devices
and sysfs links properly get added/removed. Compile tested on s390x and
x86-64.

Based on git://git.cmpxchg.org/linux-mmots.git

David Hildenbrand (7):
  mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()
  s390x/mm: Implement arch_remove_memory()
  mm/memory_hotplug: arch_remove_memory() and __remove_pages() with
    CONFIG_MEMORY_HOTPLUG
  mm/memory_hotplug: Create memory block devices after arch_add_memory()
  mm/memory_hotplug: Drop MHP_MEMBLOCK_API
  mm/memory_hotplug: Remove memory block devices before
    arch_remove_memory()
  mm/memory_hotplug: Make unregister_memory_block_under_nodes() never
    fail

 arch/ia64/mm/init.c            |   2 -
 arch/powerpc/mm/mem.c          |   2 -
 arch/s390/mm/init.c            |  15 +++--
 arch/sh/mm/init.c              |   2 -
 arch/x86/mm/init_32.c          |   2 -
 arch/x86/mm/init_64.c          |   2 -
 drivers/base/memory.c          | 109 +++++++++++++++++++--------------
 drivers/base/node.c            |  27 +++-----
 include/linux/memory.h         |   6 +-
 include/linux/memory_hotplug.h |  10 ---
 include/linux/node.h           |   7 +--
 mm/memory_hotplug.c            |  42 +++++--------
 mm/sparse.c                    |   6 --
 13 files changed, 100 insertions(+), 132 deletions(-)

-- 
2.20.1

^ permalink raw reply

* [PATCH v1 2/7] s390x/mm: Implement arch_remove_memory()
From: David Hildenbrand @ 2019-04-24 10:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Oscar Salvador, linux-s390, Michal Hocko, linux-ia64,
	Vasily Gorbik, linux-sh, David Hildenbrand, Heiko Carstens,
	linux-kernel, Mike Rapoport, Martin Schwidefsky, akpm,
	linuxppc-dev, Dan Williams
In-Reply-To: <20190424102511.29318-1-david@redhat.com>

Will come in handy when wanting to handle errors after
arch_add_memory().

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/mm/init.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 31b1071315d7..2636d62df04e 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -237,12 +237,13 @@ int arch_add_memory(int nid, u64 start, u64 size,
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
-	/*
-	 * There is no hardware or firmware interface which could trigger a
-	 * hot memory remove on s390. So there is nothing that needs to be
-	 * implemented.
-	 */
-	BUG();
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+
+	vmem_remove_mapping(start, size);
+	zone = page_zone(pfn_to_page(start_pfn));
+	__remove_pages(zone, start_pfn, nr_pages, altmap);
 }
 #endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
-- 
2.20.1


^ permalink raw reply related

* [PATCH v1 3/7] mm/memory_hotplug: arch_remove_memory() and __remove_pages() with CONFIG_MEMORY_HOTPLUG
From: David Hildenbrand @ 2019-04-24 10:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Oscar Salvador, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, Chris Wilson, Masahiro Yamada,
	Michal Hocko, Paul Mackerras, H. Peter Anvin, Dan Williams,
	Rafael J. Wysocki, Qian Cai, linux-s390, Yoshinori Sato,
	David Hildenbrand, Mike Rapoport, Ingo Molnar, Fenghua Yu,
	Pavel Tatashin, Vasily Gorbik, Rob Herring, mike.travis@hpe.com,
	Nicholas Piggin, Alex Deucher, Mark Brown, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Tony Luck, Baoquan He,
	Andrew Banman, Mathieu Malaterre, Greg Kroah-Hartman,
	linux-kernel, Logan Gunthorpe, Wei Yang, Martin Schwidefsky,
	Arun KS, akpm, linuxppc-dev, David S. Miller, Kirill A. Shutemov
In-Reply-To: <20190424102511.29318-1-david@redhat.com>

Let's prepare for better error handling while adding memory by allowing
to use arch_remove_memory() and __remove_pages() even if
CONFIG_MEMORY_HOTREMOVE is not set. CONFIG_MEMORY_HOTREMOVE effectively
covers
- Offlining of system ram (memory block devices) - offline_pages()
- Unplug of system ram - remove_memory()
- Unplug/remap of device memory - devm_memremap()

This allows e.g. for handling like

arch_add_memory()
rc = do_something();
if (rc) {
	arch_remove_memory();
}

Whereby do_something() will for example be memory block device creation
after it has been factored out.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/ia64/mm/init.c            | 2 --
 arch/powerpc/mm/mem.c          | 2 --
 arch/s390/mm/init.c            | 2 --
 arch/sh/mm/init.c              | 2 --
 arch/x86/mm/init_32.c          | 2 --
 arch/x86/mm/init_64.c          | 2 --
 drivers/base/memory.c          | 2 --
 include/linux/memory.h         | 2 --
 include/linux/memory_hotplug.h | 2 --
 mm/memory_hotplug.c            | 2 --
 mm/sparse.c                    | 6 ------
 11 files changed, 26 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index d28e29103bdb..aae75fd7b810 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -681,7 +681,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	return ret;
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
@@ -693,4 +692,3 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 	__remove_pages(zone, start_pfn, nr_pages, altmap);
 }
 #endif
-#endif
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index c14fc0bd422d..28fea4c56cd4 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -131,7 +131,6 @@ int __ref arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altm
 	return __add_pages(nid, start_pfn, nr_pages, restrictions);
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 int __ref arch_remove_memory(int nid, u64 start, u64 size,
 			     struct vmem_altmap *altmap)
 {
@@ -164,7 +163,6 @@ int __ref arch_remove_memory(int nid, u64 start, u64 size,
 	resize_hpt_for_hotplug(memblock_phys_mem_size());
 }
 #endif
-#endif /* CONFIG_MEMORY_HOTPLUG */
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 void __init mem_topology_setup(void)
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 2636d62df04e..5874cfb392da 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -233,7 +233,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	return rc;
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
@@ -245,5 +244,4 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 	zone = page_zone(pfn_to_page(start_pfn));
 	__remove_pages(zone, start_pfn, nr_pages, altmap);
 }
-#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 5aeb4d7099a1..59c5fe511f25 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -428,7 +428,6 @@ int memory_add_physaddr_to_nid(u64 addr)
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
@@ -439,5 +438,4 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 	zone = page_zone(pfn_to_page(start_pfn));
 	__remove_pages(zone, start_pfn, nr_pages, altmap);
 }
-#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 075e568098f2..8d4bf2d97d50 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -859,7 +859,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	return __add_pages(nid, start_pfn, nr_pages, restrictions);
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
@@ -871,7 +870,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 	__remove_pages(zone, start_pfn, nr_pages, altmap);
 }
 #endif
-#endif
 
 int kernel_set_to_readonly __read_mostly;
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 20d14254b686..f1b55ddea23f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1131,7 +1131,6 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
 	remove_pagetable(start, end, false, altmap);
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 static void __meminit
 kernel_physical_mapping_remove(unsigned long start, unsigned long end)
 {
@@ -1156,7 +1155,6 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 	__remove_pages(zone, start_pfn, nr_pages, altmap);
 	kernel_physical_mapping_remove(start, start + size);
 }
-#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 static struct kcore_list kcore_vsyscall;
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f180427e48f4..6e0cb4fda179 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -728,7 +728,6 @@ int hotplug_memory_register(int nid, struct mem_section *section)
 	return ret;
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 static void
 unregister_memory(struct memory_block *memory)
 {
@@ -767,7 +766,6 @@ void unregister_memory_section(struct mem_section *section)
 out_unlock:
 	mutex_unlock(&mem_sysfs_mutex);
 }
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 
 /* return true if the memory block is offlined, otherwise, return false */
 bool is_memblock_offlined(struct memory_block *mem)
diff --git a/include/linux/memory.h b/include/linux/memory.h
index e1dc1bb2b787..474c7c60c8f2 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -112,9 +112,7 @@ extern void unregister_memory_notifier(struct notifier_block *nb);
 extern int register_memory_isolate_notifier(struct notifier_block *nb);
 extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
 int hotplug_memory_register(int nid, struct mem_section *section);
-#ifdef CONFIG_MEMORY_HOTREMOVE
 extern void unregister_memory_section(struct mem_section *);
-#endif
 extern int memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
 extern int memory_isolate_notify(unsigned long val, void *v);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ae892eef8b82..2d4de313926d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -123,12 +123,10 @@ static inline bool movable_node_is_enabled(void)
 	return movable_node_enabled;
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 extern void arch_remove_memory(int nid, u64 start, u64 size,
 			       struct vmem_altmap *altmap);
 extern void __remove_pages(struct zone *zone, unsigned long start_pfn,
 			   unsigned long nr_pages, struct vmem_altmap *altmap);
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 
 /*
  * Do we want sysfs memblock files created. This will allow userspace to online
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 202febe88b58..7b5439839d67 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -317,7 +317,6 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 	return err;
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 /* find the smallest valid pfn in the range [start_pfn, end_pfn) */
 static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 				     unsigned long start_pfn,
@@ -581,7 +580,6 @@ void __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
 
 	set_zone_contiguous(zone);
 }
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 
 int set_online_page_callback(online_page_callback_t callback)
 {
diff --git a/mm/sparse.c b/mm/sparse.c
index fd13166949b5..d1d5e05f5b8d 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -604,7 +604,6 @@ static void __kfree_section_memmap(struct page *memmap,
 
 	vmemmap_free(start, end, altmap);
 }
-#ifdef CONFIG_MEMORY_HOTREMOVE
 static void free_map_bootmem(struct page *memmap)
 {
 	unsigned long start = (unsigned long)memmap;
@@ -612,7 +611,6 @@ static void free_map_bootmem(struct page *memmap)
 
 	vmemmap_free(start, end, NULL);
 }
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 #else
 static struct page *__kmalloc_section_memmap(void)
 {
@@ -651,7 +649,6 @@ static void __kfree_section_memmap(struct page *memmap,
 			   get_order(sizeof(struct page) * PAGES_PER_SECTION));
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 static void free_map_bootmem(struct page *memmap)
 {
 	unsigned long maps_section_nr, removing_section_nr, i;
@@ -681,7 +678,6 @@ static void free_map_bootmem(struct page *memmap)
 			put_page_bootmem(page);
 	}
 }
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
 /**
@@ -746,7 +742,6 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 	return ret;
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 #ifdef CONFIG_MEMORY_FAILURE
 static void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 {
@@ -823,5 +818,4 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
 			PAGES_PER_SECTION - map_offset);
 	free_section_usemap(memmap, usemap, altmap);
 }
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_MEMORY_HOTPLUG */
-- 
2.20.1


^ permalink raw reply related

* [PATCH v1 4/7] mm/memory_hotplug: Create memory block devices after arch_add_memory()
From: David Hildenbrand @ 2019-04-24 10:25 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-s390, Michal Hocko, linux-ia64, Pavel Tatashin, linux-sh,
	mike.travis@hpe.com, Greg Kroah-Hartman, David Hildenbrand,
	linux-kernel, Ingo Molnar, Mathieu Malaterre, Andrew Banman,
	Qian Cai, Rafael J. Wysocki, Arun KS, akpm, Wei Yang,
	linuxppc-dev, Dan Williams, Oscar Salvador
In-Reply-To: <20190424102511.29318-1-david@redhat.com>

Only memory to be added to the buddy and to be onlined/offlined by
user space using memory block devices needs (and should have!) memory
block devices.

Factor out creation of memory block devices Create all devices after
arch_add_memory() succeeded. We can later drop the want_memblock parameter,
because it is now effectively stale.

Only after memory block devices have been added, memory can be onlined
by user space. This implies, that memory is not visible to user space at
all before arch_add_memory() succeeded.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/base/memory.c  | 70 ++++++++++++++++++++++++++----------------
 include/linux/memory.h |  2 +-
 mm/memory_hotplug.c    | 15 ++++-----
 3 files changed, 53 insertions(+), 34 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 6e0cb4fda179..862c202a18ca 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -701,44 +701,62 @@ static int add_memory_block(int base_section_nr)
 	return 0;
 }
 
+static void unregister_memory(struct memory_block *memory)
+{
+	BUG_ON(memory->dev.bus != &memory_subsys);
+
+	/* drop the ref. we got via find_memory_block() */
+	put_device(&memory->dev);
+	device_unregister(&memory->dev);
+}
+
 /*
- * need an interface for the VM to add new memory regions,
- * but without onlining it.
+ * Create memory block devices for the given memory area. Start and size
+ * have to be aligned to memory block granularity. Memory block devices
+ * will be initialized as offline.
  */
-int hotplug_memory_register(int nid, struct mem_section *section)
+int hotplug_memory_register(unsigned long start, unsigned long size)
 {
-	int ret = 0;
+	unsigned long block_nr_pages = memory_block_size_bytes() >> PAGE_SHIFT;
+	unsigned long start_pfn = PFN_DOWN(start);
+	unsigned long end_pfn = start_pfn + (size >> PAGE_SHIFT);
+	unsigned long pfn;
 	struct memory_block *mem;
+	int ret = 0;
 
-	mutex_lock(&mem_sysfs_mutex);
+	BUG_ON(!IS_ALIGNED(start, memory_block_size_bytes()));
+	BUG_ON(!IS_ALIGNED(size, memory_block_size_bytes()));
 
-	mem = find_memory_block(section);
-	if (mem) {
-		mem->section_count++;
-		put_device(&mem->dev);
-	} else {
-		ret = init_memory_block(&mem, section, MEM_OFFLINE);
+	mutex_lock(&mem_sysfs_mutex);
+	for (pfn = start_pfn; pfn != end_pfn; pfn += block_nr_pages) {
+		mem = find_memory_block(__pfn_to_section(pfn));
+		if (mem) {
+			WARN_ON_ONCE(false);
+			put_device(&mem->dev);
+			continue;
+		}
+		ret = init_memory_block(&mem, __pfn_to_section(pfn),
+					MEM_OFFLINE);
 		if (ret)
-			goto out;
-		mem->section_count++;
+			break;
+		mem->section_count = memory_block_size_bytes() /
+				     MIN_MEMORY_BLOCK_SIZE;
+	}
+	if (ret) {
+		end_pfn = pfn;
+		for (pfn = start_pfn; pfn != end_pfn; pfn += block_nr_pages) {
+			mem = find_memory_block(__pfn_to_section(pfn));
+			if (!mem)
+				continue;
+			mem->section_count = 0;
+			unregister_memory(mem);
+		}
 	}
-
-out:
 	mutex_unlock(&mem_sysfs_mutex);
 	return ret;
 }
 
-static void
-unregister_memory(struct memory_block *memory)
-{
-	BUG_ON(memory->dev.bus != &memory_subsys);
-
-	/* drop the ref. we got via find_memory_block() */
-	put_device(&memory->dev);
-	device_unregister(&memory->dev);
-}
-
-void unregister_memory_section(struct mem_section *section)
+static int remove_memory_section(struct mem_section *section)
 {
 	struct memory_block *mem;
 
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 474c7c60c8f2..95505fbb5f85 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -111,7 +111,7 @@ extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
 extern int register_memory_isolate_notifier(struct notifier_block *nb);
 extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
-int hotplug_memory_register(int nid, struct mem_section *section);
+int hotplug_memory_register(unsigned long start, unsigned long size);
 extern void unregister_memory_section(struct mem_section *);
 extern int memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7b5439839d67..e1637c8a0723 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -258,13 +258,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 		return -EEXIST;
 
 	ret = sparse_add_one_section(nid, phys_start_pfn, altmap);
-	if (ret < 0)
-		return ret;
-
-	if (!want_memblock)
-		return 0;
-
-	return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn));
+	return ret < 0 ? ret : 0;
 }
 
 /*
@@ -1106,6 +1100,13 @@ int __ref add_memory_resource(int nid, struct resource *res)
 	if (ret < 0)
 		goto error;
 
+	/* create memory block devices after memory was added */
+	ret = hotplug_memory_register(start, size);
+	if (ret) {
+		arch_remove_memory(nid, start, size, NULL);
+		goto error;
+	}
+
 	if (new_node) {
 		/* If sysfs file of new node can't be created, cpu on the node
 		 * can't be hot-added. There is no rollback way now.
-- 
2.20.1


^ permalink raw reply related

* [PATCH v1 5/7] mm/memory_hotplug: Drop MHP_MEMBLOCK_API
From: David Hildenbrand @ 2019-04-24 10:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Oscar Salvador, linux-s390, Michal Hocko, linux-ia64,
	Pavel Tatashin, linux-sh, Mathieu Malaterre, David Hildenbrand,
	linux-kernel, Wei Yang, Qian Cai, Joonsoo Kim, akpm, linuxppc-dev,
	Dan Williams, Arun KS
In-Reply-To: <20190424102511.29318-1-david@redhat.com>

No longer needed, the callers of arch_add_memory() can handle this
manually.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/memory_hotplug.h | 8 --------
 mm/memory_hotplug.c            | 9 +++------
 2 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 2d4de313926d..2f1f87e13baa 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -128,14 +128,6 @@ extern void arch_remove_memory(int nid, u64 start, u64 size,
 extern void __remove_pages(struct zone *zone, unsigned long start_pfn,
 			   unsigned long nr_pages, struct vmem_altmap *altmap);
 
-/*
- * Do we want sysfs memblock files created. This will allow userspace to online
- * and offline memory explicitly. Lack of this bit means that the caller has to
- * call move_pfn_range_to_zone to finish the initialization.
- */
-
-#define MHP_MEMBLOCK_API               (1<<0)
-
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 		       struct mhp_restrictions *restrictions);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e1637c8a0723..107f72952347 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -250,7 +250,7 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
 #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
 
 static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
-		struct vmem_altmap *altmap, bool want_memblock)
+				   struct vmem_altmap *altmap)
 {
 	int ret;
 
@@ -293,8 +293,7 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 	}
 
 	for (i = start_sec; i <= end_sec; i++) {
-		err = __add_section(nid, section_nr_to_pfn(i), altmap,
-				restrictions->flags & MHP_MEMBLOCK_API);
+		err = __add_section(nid, section_nr_to_pfn(i), altmap);
 
 		/*
 		 * EEXIST is finally dealt with by ioresource collision
@@ -1066,9 +1065,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
  */
 int __ref add_memory_resource(int nid, struct resource *res)
 {
-	struct mhp_restrictions restrictions = {
-		.flags = MHP_MEMBLOCK_API,
-	};
+	struct mhp_restrictions restrictions = {};
 	u64 start, size;
 	bool new_node = false;
 	int ret;
-- 
2.20.1


^ permalink raw reply related

* [PATCH v1 6/7] mm/memory_hotplug: Remove memory block devices before arch_remove_memory()
From: David Hildenbrand @ 2019-04-24 10:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, linux-ia64, linux-sh, Chris Wilson, Arun KS,
	Ingo Molnar, Rafael J. Wysocki, linux-s390, David Hildenbrand,
	Pavel Tatashin, mike.travis@hpe.com, Mark Brown, Jonathan Cameron,
	Dan Williams, Oscar Salvador, Andrew Banman, Mathieu Malaterre,
	Greg Kroah-Hartman, linux-kernel, Alex Deucher, akpm,
	linuxppc-dev, David S. Miller
In-Reply-To: <20190424102511.29318-1-david@redhat.com>

Let's factor out removing of memory block devices, which is only
necessary for memory added via add_memory() and friends that created
memory block devices. Remove the devices before calling
arch_remove_memory().

This finishes factoring out memory block device handling from
arch_add_memory() and arch_remove_memory().

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/base/memory.c  | 39 +++++++++++++++++++--------------------
 drivers/base/node.c    | 11 ++++++-----
 include/linux/memory.h |  2 +-
 include/linux/node.h   |  6 ++----
 mm/memory_hotplug.c    |  5 +++--
 5 files changed, 31 insertions(+), 32 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 862c202a18ca..47ff49058d1f 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -756,32 +756,31 @@ int hotplug_memory_register(unsigned long start, unsigned long size)
 	return ret;
 }
 
-static int remove_memory_section(struct mem_section *section)
+/*
+ * Remove memory block devices for the given memory area. Start and size
+ * have to be aligned to memory block granularity. Memory block devices
+ * have to be offline.
+ */
+void hotplug_memory_unregister(unsigned long start, unsigned long size)
 {
+	unsigned long block_nr_pages = memory_block_size_bytes() >> PAGE_SHIFT;
+	unsigned long start_pfn = PFN_DOWN(start);
+	unsigned long end_pfn = start_pfn + (size >> PAGE_SHIFT);
 	struct memory_block *mem;
+	unsigned long pfn;
 
-	if (WARN_ON_ONCE(!present_section(section)))
-		return;
+	BUG_ON(!IS_ALIGNED(start, memory_block_size_bytes()));
+	BUG_ON(!IS_ALIGNED(size, memory_block_size_bytes()));
 
 	mutex_lock(&mem_sysfs_mutex);
-
-	/*
-	 * Some users of the memory hotplug do not want/need memblock to
-	 * track all sections. Skip over those.
-	 */
-	mem = find_memory_block(section);
-	if (!mem)
-		goto out_unlock;
-
-	unregister_mem_sect_under_nodes(mem, __section_nr(section));
-
-	mem->section_count--;
-	if (mem->section_count == 0)
+	for (pfn = start_pfn; pfn != end_pfn; pfn += block_nr_pages) {
+		mem = find_memory_block(__pfn_to_section(pfn));
+		if (!mem)
+			continue;
+		mem->section_count = 0;
+		unregister_memory_block_under_nodes(mem);
 		unregister_memory(mem);
-	else
-		put_device(&mem->dev);
-
-out_unlock:
+	}
 	mutex_unlock(&mem_sysfs_mutex);
 }
 
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 8598fcbd2a17..04fdfa99b8bc 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -801,9 +801,10 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
 	return 0;
 }
 
-/* unregister memory section under all nodes that it spans */
-int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
-				    unsigned long phys_index)
+/*
+ * Unregister memory block device under all nodes that it spans.
+ */
+int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 {
 	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
@@ -816,8 +817,8 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 		return -ENOMEM;
 	nodes_clear(*unlinked_nodes);
 
-	sect_start_pfn = section_nr_to_pfn(phys_index);
-	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
+	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
 		int nid;
 
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 95505fbb5f85..aa236c2a0466 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -112,7 +112,7 @@ extern void unregister_memory_notifier(struct notifier_block *nb);
 extern int register_memory_isolate_notifier(struct notifier_block *nb);
 extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
 int hotplug_memory_register(unsigned long start, unsigned long size);
-extern void unregister_memory_section(struct mem_section *);
+void hotplug_memory_unregister(unsigned long start, unsigned long size);
 extern int memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
 extern int memory_isolate_notify(unsigned long val, void *v);
diff --git a/include/linux/node.h b/include/linux/node.h
index 1a557c589ecb..02a29e71b175 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -139,8 +139,7 @@ extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 						void *arg);
-extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
-					   unsigned long phys_index);
+extern int unregister_memory_block_under_nodes(struct memory_block *mem_blk);
 
 extern int register_memory_node_under_compute_node(unsigned int mem_nid,
 						   unsigned int cpu_nid,
@@ -176,8 +175,7 @@ static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
 {
 	return 0;
 }
-static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
-						  unsigned long phys_index)
+static inline int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 {
 	return 0;
 }
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 107f72952347..527fe4f9c620 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -519,8 +519,6 @@ static void __remove_section(struct zone *zone, struct mem_section *ms,
 	if (WARN_ON_ONCE(!valid_section(ms)))
 		return;
 
-	unregister_memory_section(ms);
-
 	scn_nr = __section_nr(ms);
 	start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
 	__remove_zone(zone, start_pfn);
@@ -1844,6 +1842,9 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
 	memblock_free(start, size);
 	memblock_remove(start, size);
 
+	/* remove memory block devices before removing memory */
+	hotplug_memory_unregister(start, size);
+
 	arch_remove_memory(nid, start, size, NULL);
 	__release_memory_resource(start, size);
 
-- 
2.20.1


^ permalink raw reply related

* [PATCH v1 7/7] mm/memory_hotplug: Make unregister_memory_block_under_nodes() never fail
From: David Hildenbrand @ 2019-04-24 10:25 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-s390, linux-ia64, linux-sh, Greg Kroah-Hartman,
	David Hildenbrand, linux-kernel, David S. Miller, Alex Deucher,
	Mark Brown, Jonathan Cameron, Rafael J. Wysocki, akpm,
	Chris Wilson, linuxppc-dev, Dan Williams, Oscar Salvador
In-Reply-To: <20190424102511.29318-1-david@redhat.com>

We really don't want anything during memory hotunplug to fail.
We always pass a valid memory block device, that check can go. Avoid
allocating memory and eventually failing. As we are always called under
lock, we can use a static piece of memory. This avoids having to put
the structure onto the stack, having to guess about the stack size
of callers.

Patch inspired by a patch from Oscar Salvador.

In the future, there might be no need to iterate over nodes at all.
mem->nid should tell us exactly what to remove. Memory block devices
with mixed nodes (added during boot) should properly fenced off and never
removed.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/base/node.c  | 18 +++++-------------
 include/linux/node.h |  5 ++---
 2 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 04fdfa99b8bc..9be88fd05147 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -803,20 +803,14 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
 
 /*
  * Unregister memory block device under all nodes that it spans.
+ * Has to be called with mem_sysfs_mutex held (due to unlinked_nodes).
  */
-int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
+void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 {
-	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
+	static nodemask_t unlinked_nodes;
 
-	if (!mem_blk) {
-		NODEMASK_FREE(unlinked_nodes);
-		return -EFAULT;
-	}
-	if (!unlinked_nodes)
-		return -ENOMEM;
-	nodes_clear(*unlinked_nodes);
-
+	nodes_clear(unlinked_nodes);
 	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
 	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
@@ -827,15 +821,13 @@ int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 			continue;
 		if (!node_online(nid))
 			continue;
-		if (node_test_and_set(nid, *unlinked_nodes))
+		if (node_test_and_set(nid, unlinked_nodes))
 			continue;
 		sysfs_remove_link(&node_devices[nid]->dev.kobj,
 			 kobject_name(&mem_blk->dev.kobj));
 		sysfs_remove_link(&mem_blk->dev.kobj,
 			 kobject_name(&node_devices[nid]->dev.kobj));
 	}
-	NODEMASK_FREE(unlinked_nodes);
-	return 0;
 }
 
 int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn)
diff --git a/include/linux/node.h b/include/linux/node.h
index 02a29e71b175..548c226966a2 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -139,7 +139,7 @@ extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 						void *arg);
-extern int unregister_memory_block_under_nodes(struct memory_block *mem_blk);
+extern void unregister_memory_block_under_nodes(struct memory_block *mem_blk);
 
 extern int register_memory_node_under_compute_node(unsigned int mem_nid,
 						   unsigned int cpu_nid,
@@ -175,9 +175,8 @@ static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
 {
 	return 0;
 }
-static inline int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
+static inline void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 {
-	return 0;
 }
 
 static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH v12 18/31] mm: protect against PTE changes done by dup_mmap()
From: Laurent Dufour @ 2019-04-24 10:33 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: jack, sergey.senozhatsky.work, peterz, Will Deacon, mhocko,
	linux-mm, paulus, Punit Agrawal, hpa, Michel Lespinasse,
	Alexei Starovoitov, Andrea Arcangeli, ak, Minchan Kim,
	aneesh.kumar, x86, Matthew Wilcox, Daniel Jordan, Ingo Molnar,
	David Rientjes, paulmck, Haiyan Song, npiggin, sj38.park, dave,
	kemi.wang, kirill, Thomas Gleixner, zhong jiang, Ganesh Mahendran,
	Yang Shi, Mike Rapoport, linuxppc-dev, linux-kernel,
	Sergey Senozhatsky, Vinayak Menon, vinayak menon, akpm, Tim Chen,
	haren
In-Reply-To: <20190422203217.GI14666@redhat.com>

Le 22/04/2019 à 22:32, Jerome Glisse a écrit :
> On Tue, Apr 16, 2019 at 03:45:09PM +0200, Laurent Dufour wrote:
>> Vinayak Menon and Ganesh Mahendran reported that the following scenario may
>> lead to thread being blocked due to data corruption:
>>
>>      CPU 1                   CPU 2                    CPU 3
>>      Process 1,              Process 1,               Process 1,
>>      Thread A                Thread B                 Thread C
>>
>>      while (1) {             while (1) {              while(1) {
>>      pthread_mutex_lock(l)   pthread_mutex_lock(l)    fork
>>      pthread_mutex_unlock(l) pthread_mutex_unlock(l)  }
>>      }                       }
>>
>> In the details this happens because :
>>
>>      CPU 1                CPU 2                       CPU 3
>>      fork()
>>      copy_pte_range()
>>        set PTE rdonly
>>      got to next VMA...
>>       .                   PTE is seen rdonly          PTE still writable
>>       .                   thread is writing to page
>>       .                   -> page fault
>>       .                     copy the page             Thread writes to page
>>       .                      .                        -> no page fault
>>       .                     update the PTE
>>       .                     flush TLB for that PTE
>>     flush TLB                                        PTE are now rdonly
> 
> Should the fork be on CPU3 to be consistant with the top thing (just to
> make it easier to read and go from one to the other as thread can move
> from one CPU to another).

Sure, this is quite confusing this way ;)

>>
>> So the write done by the CPU 3 is interfering with the page copy operation
>> done by CPU 2, leading to the data corruption.
>>
>> To avoid this we mark all the VMA involved in the COW mechanism as changing
>> by calling vm_write_begin(). This ensures that the speculative page fault
>> handler will not try to handle a fault on these pages.
>> The marker is set until the TLB is flushed, ensuring that all the CPUs will
>> now see the PTE as not writable.
>> Once the TLB is flush, the marker is removed by calling vm_write_end().
>>
>> The variable last is used to keep tracked of the latest VMA marked to
>> handle the error path where part of the VMA may have been marked.
>>
>> Since multiple VMA from the same mm may have the sequence count increased
>> during this process, the use of the vm_raw_write_begin/end() is required to
>> avoid lockdep false warning messages.
>>
>> Reported-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
>> Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
>> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
> 
> A minor comment (see below)
> 
> Reviewed-by: Jérome Glisse <jglisse@redhat.com>

Thanks for the review Jérôme.

>> ---
>>   kernel/fork.c | 30 ++++++++++++++++++++++++++++--
>>   1 file changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index f8dae021c2e5..2992d2c95256 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -462,7 +462,7 @@ EXPORT_SYMBOL(free_task);
>>   static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   					struct mm_struct *oldmm)
>>   {
>> -	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
>> +	struct vm_area_struct *mpnt, *tmp, *prev, **pprev, *last = NULL;
>>   	struct rb_node **rb_link, *rb_parent;
>>   	int retval;
>>   	unsigned long charge;
>> @@ -581,8 +581,18 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   		rb_parent = &tmp->vm_rb;
>>   
>>   		mm->map_count++;
>> -		if (!(tmp->vm_flags & VM_WIPEONFORK))
>> +		if (!(tmp->vm_flags & VM_WIPEONFORK)) {
>> +			if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
>> +				/*
>> +				 * Mark this VMA as changing to prevent the
>> +				 * speculative page fault hanlder to process
>> +				 * it until the TLB are flushed below.
>> +				 */
>> +				last = mpnt;
>> +				vm_raw_write_begin(mpnt);
>> +			}
>>   			retval = copy_page_range(mm, oldmm, mpnt);
>> +		}
>>   
>>   		if (tmp->vm_ops && tmp->vm_ops->open)
>>   			tmp->vm_ops->open(tmp);
>> @@ -595,6 +605,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   out:
>>   	up_write(&mm->mmap_sem);
>>   	flush_tlb_mm(oldmm);
>> +
>> +	if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
> 
> You do not need to check for CONFIG_SPECULATIVE_PAGE_FAULT as last
> will always be NULL if it is not enabled but maybe the compiler will
> miss the optimization opportunity if you only have the for() loop
> below.

I didn't check for the generated code, perhaps the compiler will be 
optimize that correctly.
This being said, I think the if block is better for the code 
readability, highlighting that this block is only needed in the case of SPF.

>> +		/*
>> +		 * Since the TLB has been flush, we can safely unmark the
>> +		 * copied VMAs and allows the speculative page fault handler to
>> +		 * process them again.
>> +		 * Walk back the VMA list from the last marked VMA.
>> +		 */
>> +		for (; last; last = last->vm_prev) {
>> +			if (last->vm_flags & VM_DONTCOPY)
>> +				continue;
>> +			if (!(last->vm_flags & VM_WIPEONFORK))
>> +				vm_raw_write_end(last);
>> +		}
>> +	}
>> +
>>   	up_write(&oldmm->mmap_sem);
>>   	dup_userfaultfd_complete(&uf);
>>   fail_uprobe_end:
>> -- 
>> 2.21.0
>>
> 


^ permalink raw reply

* Re: [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
From: Laurent Dufour @ 2019-04-24 10:34 UTC (permalink / raw)
  To: Mark Rutland
  Cc: jack, sergey.senozhatsky.work, peterz, Will Deacon, mhocko,
	linux-mm, paulus, Punit Agrawal, hpa, Michel Lespinasse,
	Alexei Starovoitov, Andrea Arcangeli, ak, Minchan Kim,
	aneesh.kumar, x86, Matthew Wilcox, Daniel Jordan, Ingo Molnar,
	David Rientjes, paulmck, Haiyan Song, npiggin, sj38.park,
	Jerome Glisse, dave, kemi.wang, kirill, Thomas Gleixner,
	zhong jiang, Ganesh Mahendran, Yang Shi, Mike Rapoport,
	linuxppc-dev, linux-kernel, Sergey Senozhatsky, vinayak menon,
	akpm, Tim Chen, haren
In-Reply-To: <20190423161931.GE56999@lakrids.cambridge.arm.com>

Le 23/04/2019 à 18:19, Mark Rutland a écrit :
> On Tue, Apr 23, 2019 at 05:36:31PM +0200, Laurent Dufour wrote:
>> Le 18/04/2019 à 23:51, Jerome Glisse a écrit :
>>> On Tue, Apr 16, 2019 at 03:41:56PM +0100, Mark Rutland wrote:
>>>> On Tue, Apr 16, 2019 at 04:31:27PM +0200, Laurent Dufour wrote:
>>>>> Le 16/04/2019 à 16:27, Mark Rutland a écrit :
>>>>>> On Tue, Apr 16, 2019 at 03:44:55PM +0200, Laurent Dufour wrote:
>>>>>>> From: Mahendran Ganesh <opensource.ganesh@gmail.com>
>>>>>>>
>>>>>>> Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
>>>>>>> enables Speculative Page Fault handler.
>>>>>>>
>>>>>>> Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
>>>>>>
>>>>>> This is missing your S-o-B.
>>>>>
>>>>> You're right, I missed that...
>>>>>
>>>>>> The first patch noted that the ARCH_SUPPORTS_* option was there because
>>>>>> the arch code had to make an explicit call to try to handle the fault
>>>>>> speculatively, but that isn't addeed until patch 30.
>>>>>>
>>>>>> Why is this separate from that code?
>>>>>
>>>>> Andrew was recommended this a long time ago for bisection purpose. This
>>>>> allows to build the code with CONFIG_SPECULATIVE_PAGE_FAULT before the code
>>>>> that trigger the spf handler is added to the per architecture's code.
>>>>
>>>> Ok. I think it would be worth noting that in the commit message, to
>>>> avoid anyone else asking the same question. :)
>>>
>>> Should have read this thread before looking at x86 and ppc :)
>>>
>>> In any case the patch is:
>>>
>>> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
>>
>> Thanks Mark and Jérôme for reviewing this.
>>
>> Regarding the change in the commit message, I'm wondering if this would be
>> better to place it in the Series's letter head.
>>
>> But I'm fine to put it in each architecture's commit.
> 
> I think noting it in both the cover letter and specific patches is best.
> 
> Having something in the commit message means that the intent will be
> clear when the patch is viewed in isolation (e.g. as they will be once
> merged).
> 
> All that's necessary is something like:
> 
>    Note that this patch only enables building the common speculative page
>    fault code such that this can be bisected, and has no functional
>    impact. The architecture-specific code to make use of this and enable
>    the feature will be addded in a subsequent patch.

Thanks Mark, will do it this way.


> Thanks,
> Mark.


^ permalink raw reply

* [PATCH] memblock: make keeping memblock memory opt-in rather than opt-out
From: Mike Rapoport @ 2019-04-24 10:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rich Felker, linux-ia64, linux-sh, Heiko Carstens, linux-mips,
	linux-mm, Paul Mackerras, H. Peter Anvin, linux-s390,
	Yoshinori Sato, x86, Russell King, Mike Rapoport, Ingo Molnar,
	Geert Uytterhoeven, Catalin Marinas, James Hogan, Fenghua Yu,
	Will Deacon, linux-m68k, Borislav Petkov, nios2-dev,
	Thomas Gleixner, linux-arm-kernel, Tony Luck, kexec, linux-kernel,
	Ralf Baechle, Richard Kuo, Paul Burton, Eric Biederman,
	linux-hexagon, Martin Schwidefsky, Ley Foon Tan, linuxppc-dev

Most architectures do not need the memblock memory after the page allocator
is initialized, but only few enable ARCH_DISCARD_MEMBLOCK in the
arch Kconfig.

Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the
logic makes it clear which architectures actually use memblock after system
initialization and skips the necessity to add ARCH_DISCARD_MEMBLOCK to the
architectures that are still missing that option.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/arm/Kconfig         |  2 +-
 arch/arm64/Kconfig       |  1 +
 arch/hexagon/Kconfig     |  1 -
 arch/ia64/Kconfig        |  1 -
 arch/m68k/Kconfig        |  1 -
 arch/mips/Kconfig        |  1 -
 arch/nios2/Kconfig       |  1 -
 arch/powerpc/Kconfig     |  1 +
 arch/s390/Kconfig        |  1 +
 arch/sh/Kconfig          |  1 -
 arch/x86/Kconfig         |  1 -
 include/linux/memblock.h |  3 ++-
 kernel/kexec_file.c      | 16 ++++++++--------
 mm/Kconfig               |  2 +-
 mm/memblock.c            |  6 +++---
 mm/page_alloc.c          |  3 +--
 16 files changed, 19 insertions(+), 23 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 850b480..7073436 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -4,7 +4,6 @@ config ARM
 	default y
 	select ARCH_32BIT_OFF_T
 	select ARCH_CLOCKSOURCE_DATA
-	select ARCH_DISCARD_MEMBLOCK if !HAVE_ARCH_PFN_VALID && !KEXEC
 	select ARCH_HAS_DEBUG_VIRTUAL if MMU
 	select ARCH_HAS_DEVMEM_IS_ALLOWED
 	select ARCH_HAS_ELF_RANDOMIZE
@@ -21,6 +20,7 @@ config ARM
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAVE_CUSTOM_GPIO_H
 	select ARCH_HAS_GCOV_PROFILE_ALL
+	select ARCH_KEEP_MEMBLOCK if HAVE_ARCH_PFN_VALID || KEXEC
 	select ARCH_MIGHT_HAVE_PC_PARPORT
 	select ARCH_NO_SG_CHAIN if !ARM_HAS_SG_CHAIN
 	select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7e34b9e..d71f043 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -58,6 +58,7 @@ config ARM64
 	select ARCH_INLINE_SPIN_UNLOCK_BH if !PREEMPT
 	select ARCH_INLINE_SPIN_UNLOCK_IRQ if !PREEMPT
 	select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPT
+	select ARCH_KEEP_MEMBLOCK
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig
index ac44168..bbe3819 100644
--- a/arch/hexagon/Kconfig
+++ b/arch/hexagon/Kconfig
@@ -22,7 +22,6 @@ config HEXAGON
 	select GENERIC_IRQ_SHOW
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_TRACEHOOK
-	select ARCH_DISCARD_MEMBLOCK
 	select NEED_SG_DMA_LENGTH
 	select NO_IOPORT_MAP
 	select GENERIC_IOMAP
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 8d7396b..bd51d3b 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -33,7 +33,6 @@ config IA64
 	select ARCH_HAS_DMA_COHERENT_TO_PFN if SWIOTLB
 	select ARCH_HAS_SYNC_DMA_FOR_CPU if SWIOTLB
 	select VIRT_TO_BUS
-	select ARCH_DISCARD_MEMBLOCK
 	select GENERIC_IRQ_PROBE
 	select GENERIC_PENDING_IRQ if SMP
 	select GENERIC_IRQ_SHOW
diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index b542064..7d1e5d9 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -27,7 +27,6 @@ config M68K
 	select MODULES_USE_ELF_RELA
 	select OLD_SIGSUSPEND3
 	select OLD_SIGACTION
-	select ARCH_DISCARD_MEMBLOCK
 
 config CPU_BIG_ENDIAN
 	def_bool y
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 4a5f5b0..8b9298b 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -5,7 +5,6 @@ config MIPS
 	select ARCH_32BIT_OFF_T if !64BIT
 	select ARCH_BINFMT_ELF_STATE if MIPS_FP_SUPPORT
 	select ARCH_CLOCKSOURCE_DATA
-	select ARCH_DISCARD_MEMBLOCK
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig
index 4ef15a6..dc4239c 100644
--- a/arch/nios2/Kconfig
+++ b/arch/nios2/Kconfig
@@ -23,7 +23,6 @@ config NIOS2
 	select SPARSE_IRQ
 	select USB_ARCH_HAS_HCD if USB_SUPPORT
 	select CPU_NO_EFFICIENT_FFS
-	select ARCH_DISCARD_MEMBLOCK
 
 config GENERIC_CSUM
 	def_bool y
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2d0be82..39877b9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -143,6 +143,7 @@ config PPC
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAS_ZONE_DEVICE		if PPC_BOOK3S_64
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select ARCH_KEEP_MEMBLOCK
 	select ARCH_MIGHT_HAVE_PC_PARPORT
 	select ARCH_MIGHT_HAVE_PC_SERIO
 	select ARCH_OPTIONAL_KERNEL_RWX		if ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index b6e3d06..3eaf660 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -106,6 +106,7 @@ config S390
 	select ARCH_INLINE_WRITE_UNLOCK_BH
 	select ARCH_INLINE_WRITE_UNLOCK_IRQ
 	select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE
+	select ARCH_KEEP_MEMBLOCK
 	select ARCH_SAVE_PAGE_KEYS if HIBERNATION
 	select ARCH_SUPPORTS_ATOMIC_RMW
 	select ARCH_SUPPORTS_NUMA_BALANCING
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index b1c91ea..418e928 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -10,7 +10,6 @@ config SUPERH
 	select DMA_DECLARE_COHERENT
 	select HAVE_IDE if HAS_IOPORT_MAP
 	select HAVE_MEMBLOCK_NODE_MAP
-	select ARCH_DISCARD_MEMBLOCK
 	select HAVE_OPROFILE
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_PERF_EVENTS
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 62fc3fd..80bc582 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -48,7 +48,6 @@ config X86
 	select ARCH_32BIT_OFF_T			if X86_32
 	select ARCH_CLOCKSOURCE_DATA
 	select ARCH_CLOCKSOURCE_INIT
-	select ARCH_DISCARD_MEMBLOCK
 	select ARCH_HAS_ACPI_TABLE_UPGRADE	if ACPI
 	select ARCH_HAS_DEBUG_VIRTUAL
 	select ARCH_HAS_DEVMEM_IS_ALLOWED
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 294d5d8..6ba69ba 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -96,13 +96,14 @@ struct memblock {
 extern struct memblock memblock;
 extern int memblock_debug;
 
-#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
+#ifndef CONFIG_ARCH_KEEP_MEMBLOCK
 #define __init_memblock __meminit
 #define __initdata_memblock __meminitdata
 void memblock_discard(void);
 #else
 #define __init_memblock
 #define __initdata_memblock
+static inline void memblock_discard(void) {}
 #endif
 
 #define memblock_dbg(fmt, ...) \
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index f1d0e00..623b022 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -500,13 +500,7 @@ static int locate_mem_hole_callback(struct resource *res, void *arg)
 	return locate_mem_hole_bottom_up(start, end, kbuf);
 }
 
-#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
-static int kexec_walk_memblock(struct kexec_buf *kbuf,
-			       int (*func)(struct resource *, void *))
-{
-	return 0;
-}
-#else
+#ifdef CONFIG_ARCH_KEEP_MEMBLOCK
 static int kexec_walk_memblock(struct kexec_buf *kbuf,
 			       int (*func)(struct resource *, void *))
 {
@@ -550,6 +544,12 @@ static int kexec_walk_memblock(struct kexec_buf *kbuf,
 
 	return ret;
 }
+#else
+static int kexec_walk_memblock(struct kexec_buf *kbuf,
+			       int (*func)(struct resource *, void *))
+{
+	return 0;
+}
 #endif
 
 /**
@@ -589,7 +589,7 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
 	if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN)
 		return 0;
 
-	if (IS_ENABLED(CONFIG_ARCH_DISCARD_MEMBLOCK))
+	if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
 		ret = kexec_walk_resources(kbuf, locate_mem_hole_callback);
 	else
 		ret = kexec_walk_memblock(kbuf, locate_mem_hole_callback);
diff --git a/mm/Kconfig b/mm/Kconfig
index 25c71eb..1e1c6ce 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -136,7 +136,7 @@ config HAVE_MEMBLOCK_PHYS_MAP
 config HAVE_GENERIC_GUP
 	bool
 
-config ARCH_DISCARD_MEMBLOCK
+config ARCH_KEEP_MEMBLOCK
 	bool
 
 config MEMORY_ISOLATION
diff --git a/mm/memblock.c b/mm/memblock.c
index e7665cf..d6b5755 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -94,7 +94,7 @@
  * :c:func:`mem_init` function frees all the memory to the buddy page
  * allocator.
  *
- * If an architecure enables %CONFIG_ARCH_DISCARD_MEMBLOCK, the
+ * Unless an architecure enables %CONFIG_ARCH_KEEP_MEMBLOCK, the
  * memblock data structures will be discarded after the system
  * initialization compltes.
  */
@@ -375,7 +375,7 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u
 	}
 }
 
-#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
+#ifndef CONFIG_ARCH_KEEP_MEMBLOCK
 /**
  * memblock_discard - discard memory and reserved arrays if they were allocated
  */
@@ -1923,7 +1923,7 @@ unsigned long __init memblock_free_all(void)
 	return pages;
 }
 
-#if defined(CONFIG_DEBUG_FS) && !defined(CONFIG_ARCH_DISCARD_MEMBLOCK)
+#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_ARCH_KEEP_MEMBLOCK)
 
 static int memblock_debug_show(struct seq_file *m, void *private)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c6ce20a..a1825e9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1831,10 +1831,9 @@ void __init page_alloc_init_late(void)
 	/* Reinit limits that are based on free pages after the kernel is up */
 	files_maxfiles_init();
 #endif
-#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
+
 	/* Discard memblock private memory */
 	memblock_discard();
-#endif
 
 	for_each_populated_zone(zone)
 		set_zone_contiguous(zone);
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH 5/6] powerpc/mmu: drop mmap_sem now that locked_vm is atomic
From: Jason Gunthorpe @ 2019-04-24 11:10 UTC (permalink / raw)
  To: Daniel Jordan, Christophe Leroy, akpm@linux-foundation.org,
	Alexey Kardashevskiy, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Paul Mackerras, Christoph Lameter,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20190424021544.ygqa4hvwbyb6nuxp@linux-r8p5>

On Tue, Apr 23, 2019 at 07:15:44PM -0700, Davidlohr Bueso wrote:
> On Wed, 03 Apr 2019, Daniel Jordan wrote:
> 
> > On Wed, Apr 03, 2019 at 06:58:45AM +0200, Christophe Leroy wrote:
> > > Le 02/04/2019 à 22:41, Daniel Jordan a écrit :
> > > > With locked_vm now an atomic, there is no need to take mmap_sem as
> > > > writer.  Delete and refactor accordingly.
> > > 
> > > Could you please detail the change ?
> > 
> > Ok, I'll be more specific in the next version, using some of your language in
> > fact.  :)
> > 
> > > It looks like this is not the only
> > > change. I'm wondering what the consequences are.
> > > 
> > > Before we did:
> > > - lock
> > > - calculate future value
> > > - check the future value is acceptable
> > > - update value if future value acceptable
> > > - return error if future value non acceptable
> > > - unlock
> > > 
> > > Now we do:
> > > - atomic update with future (possibly too high) value
> > > - check the new value is acceptable
> > > - atomic update back with older value if new value not acceptable and return
> > > error
> > > 
> > > So if a concurrent action wants to increase locked_vm with an acceptable
> > > step while another one has temporarily set it too high, it will now fail.
> > > 
> > > I think we should keep the previous approach and do a cmpxchg after
> > > validating the new value.
> 
> Wouldn't the cmpxchg alternative also be exposed the locked_vm changing between
> validating the new value and the cmpxchg() and we'd bogusly fail even when there
> is still just because the value changed (I'm assuming we don't hold any locks,
> otherwise all this is pointless).
> 
>   current_locked = atomic_read(&mm->locked_vm);
>   new_locked = current_locked + npages;
>   if (new_locked < lock_limit)
>      if (cmpxchg(&mm->locked_vm, current_locked, new_locked) == current_locked)
>      	 /* ENOMEM */

Well it needs a loop..

again:
   current_locked = atomic_read(&mm->locked_vm);
   new_locked = current_locked + npages;
   if (new_locked < lock_limit)
      if (cmpxchg(&mm->locked_vm, current_locked, new_locked) != current_locked)
            goto again;

So it won't have bogus failures as there is no unwind after
error. Basically this is a load locked/store conditional style of
locking pattern.

> > That's a good idea, and especially worth doing considering that an arbitrary
> > number of threads that charge a low amount of locked_vm can fail just because
> > one thread charges lots of it.
> 
> Yeah but the window for this is quite small, I doubt it would be a real issue.
> 
> What if before doing the atomic_add_return(), we first did the racy new_locked
> check for ENOMEM, then do the speculative add and cleanup, if necessary. This
> would further reduce the scope of the window where false ENOMEM can occur.
> 
> > pinned_vm appears to be broken the same way, so I can fix it too unless someone
> > beats me to it.
> 
> This should not be a surprise for the rdma folks. Cc'ing Jason nonetheless.

I think we accepted this tiny race as a side effect of removing the
lock, which was very beneficial. Really the time window between the
atomic failing and unwind is very small, and there are enough other
ways a hostile user could DOS locked_vm that I don't think it really
matters in practice..

However, the cmpxchg seems better, so a helper to implement that would
probably be the best thing to do.

Jason

^ permalink raw reply

* Re: [PATCH v3 5/6] powerpc/mm: Reduce memory usage for mm_context_t for radix
From: Christophe Leroy @ 2019-04-24 12:54 UTC (permalink / raw)
  To: Aneesh Kumar K.V, npiggin, paulus, mpe; +Cc: linuxppc-dev
In-Reply-To: <20190417130351.3805-6-aneesh.kumar@linux.ibm.com>



Le 17/04/2019 à 15:03, Aneesh Kumar K.V a écrit :
> Currently, our mm_context_t on book3s64 include all hash specific
> context details like slice mask and subpage protection details. We
> can skip allocating these with radix translation. This will help us to save
> 8K per mm_context with radix translation.
> 
> With the patch applied we have
> 
> sizeof(mm_context_t)  = 136
> sizeof(struct hash_mm_context)  = 8288
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>   arch/powerpc/include/asm/book3s/64/mmu-hash.h | 33 ++++++++++++-
>   arch/powerpc/include/asm/book3s/64/mmu.h      | 49 +++++--------------
>   arch/powerpc/kernel/setup-common.c            |  6 +++
>   arch/powerpc/mm/hash_utils_64.c               |  4 +-
>   arch/powerpc/mm/mmu_context_book3s64.c        | 16 +++++-
>   5 files changed, 68 insertions(+), 40 deletions(-)
> 

[...]

> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
> index a07de8608484..21b1ce200b22 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -947,6 +947,12 @@ void __init setup_arch(char **cmdline_p)
>   	init_mm.end_data = (unsigned long) _edata;
>   	init_mm.brk = klimit;
>   
> +#ifdef CONFIG_PPC_MM_SLICES
> +#if defined(CONFIG_PPC_8xx)
> +	init_mm.context.slb_addr_limit = DEFAULT_MAP_WINDOW;
> +#endif
> +#endif
> +

In the previous patch, you moved the above into early_init_mmu(). Why 
bringing it back here ?

Christophe

>   #ifdef CONFIG_SPAPR_TCE_IOMMU
>   	mm_iommu_init(&init_mm);
>   #endif
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index 2cb3a456f5b5..04ac7c36d380 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -968,6 +968,7 @@ void __init hash__early_init_devtree(void)
>   	htab_scan_page_sizes();
>   }
>   
> +struct hash_mm_context init_hash_mm_context;
>   void __init hash__early_init_mmu(void)
>   {
>   #ifndef CONFIG_PPC_64K_PAGES
> @@ -1041,7 +1042,8 @@ void __init hash__early_init_mmu(void)
>   	 */
>   	htab_initialize();
>   
> -	init_mm.context.slb_addr_limit = DEFAULT_MAP_WINDOW_USER64;
> +	init_mm.context.hash_context = &init_hash_mm_context;
> +	init_mm.context.hash_context->slb_addr_limit = DEFAULT_MAP_WINDOW_USER64;
>   
>   	pr_info("Initializing hash mmu with SLB\n");
>   	/* Initialize SLB management */
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
> index f720c5cc0b5e..6eef5a36b2e9 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -63,6 +63,12 @@ static int hash__init_new_context(struct mm_struct *mm)
>   	if (index < 0)
>   		return index;
>   
> +	mm->context.hash_context = kmalloc(sizeof(struct hash_mm_context), GFP_KERNEL);
> +	if (!mm->context.hash_context) {
> +		ida_free(&mmu_context_ida, index);
> +		return -ENOMEM;
> +	}
> +
>   	/*
>   	 * The old code would re-promote on fork, we don't do that when using
>   	 * slices as it could cause problem promoting slices that have been
> @@ -77,8 +83,14 @@ static int hash__init_new_context(struct mm_struct *mm)
>   	 * We should not be calling init_new_context() on init_mm. Hence a
>   	 * check against 0 is OK.
>   	 */
> -	if (mm->context.id == 0)
> +	if (mm->context.id == 0) {
> +		memset(mm->context.hash_context, 0, sizeof(struct hash_mm_context));
>   		slice_init_new_context_exec(mm);
> +	} else {
> +		/* This is fork. Copy hash_context details from current->mm */
> +		memcpy(mm->context.hash_context, current->mm->context.hash_context, sizeof(struct hash_mm_context));
> +
> +	}
>   
>   	subpage_prot_init_new_context(mm);
>   
> @@ -118,6 +130,7 @@ static int radix__init_new_context(struct mm_struct *mm)
>   	asm volatile("ptesync;isync" : : : "memory");
>   
>   	mm->context.npu_context = NULL;
> +	mm->context.hash_context = NULL;
>   
>   	return index;
>   }
> @@ -162,6 +175,7 @@ static void destroy_contexts(mm_context_t *ctx)
>   		if (context_id)
>   			ida_free(&mmu_context_ida, context_id);
>   	}
> +	kfree(ctx->hash_context);
>   }
>   
>   static void pmd_frag_destroy(void *pmd_frag)
> 

^ permalink raw reply

* Re: [PATCH stable v4.4 00/52] powerpc spectre backports for 4.4
From: Greg KH @ 2019-04-24 13:48 UTC (permalink / raw)
  To: Diana Madalina Craciun
  Cc: npiggin@gmail.com, linuxppc-dev@ozlabs.org,
	stable@vger.kernel.org, msuchanek@suse.de
In-Reply-To: <VI1PR0401MB24639DFFA532DF374BB86070FF220@VI1PR0401MB2463.eurprd04.prod.outlook.com>

On Mon, Apr 22, 2019 at 03:27:56PM +0000, Diana Madalina Craciun wrote:
> On 4/21/2019 7:34 PM, Greg KH wrote:
> > On Mon, Apr 22, 2019 at 12:19:45AM +1000, Michael Ellerman wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA1
> >>
> >> Hi Greg/Sasha,
> >>
> >> Please queue up these powerpc patches for 4.4 if you have no objections.
> > why?  Do you, or someone else, really care about spectre issues in 4.4?
> > Who is using ppc for 4.4 becides a specific enterprise distro (and they
> > don't seem to be pulling in my stable updates anyway...)?
> 
> We (NXP) received questions from customers regarding Spectre mitigations
> on kernel 4.4. Not sure if they really need them as some systems are
> enclosed embedded ones, but they asked for them.

"Asking about", and "actually needing them" are two different things, as
you state.  It would be good to get confirmation from someone that these
are "actually needed".

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v1 01/27] powerpc/mm: Don't BUG() in hugepd_page()
From: Christophe Leroy @ 2019-04-24 14:09 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <87r2a93xto.fsf@linux.ibm.com>



Le 11/04/2019 à 07:39, Aneesh Kumar K.V a écrit :
> Christophe Leroy <christophe.leroy@c-s.fr> writes:
> 
>> Don't BUG(), just warn and return NULL.
>> If the NULL value is not handled, it will get catched anyway.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>>   arch/powerpc/include/asm/hugetlb.h | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
>> index 8d40565ad0c3..48c29686c78e 100644
>> --- a/arch/powerpc/include/asm/hugetlb.h
>> +++ b/arch/powerpc/include/asm/hugetlb.h
>> @@ -14,7 +14,8 @@
>>    */
>>   static inline pte_t *hugepd_page(hugepd_t hpd)
>>   {
>> -	BUG_ON(!hugepd_ok(hpd));
>> +	if (WARN_ON(!hugepd_ok(hpd)))
>> +		return NULL;
> 
> We should not find that true. That BUG_ON was there to catch errors
> when changing pte formats. May be switch that VM_BUG_ON()?

Ok, I'll switch to VM_BUG_ON()

Christophe

> 
>>   	/*
>>   	 * We have only four bits to encode, MMU page size
>>   	 */
>> @@ -42,7 +43,8 @@ static inline void flush_hugetlb_page(struct vm_area_struct *vma,
>>   
>>   static inline pte_t *hugepd_page(hugepd_t hpd)
>>   {
>> -	BUG_ON(!hugepd_ok(hpd));
>> +	if (WARN_ON(!hugepd_ok(hpd)))
>> +		return NULL;
>>   #ifdef CONFIG_PPC_8xx
>>   	return (pte_t *)__va(hpd_val(hpd) & ~HUGEPD_SHIFT_MASK);
>>   #else
>> -- 
>> 2.13.3

^ permalink raw reply

* Re: [PATCH v1 02/27] powerpc/mm: don't BUG in add_huge_page_size()
From: Christophe Leroy @ 2019-04-24 14:09 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <87o95d3xqe.fsf@linux.ibm.com>



Le 11/04/2019 à 07:41, Aneesh Kumar K.V a écrit :
> Christophe Leroy <christophe.leroy@c-s.fr> writes:
> 
>> No reason to BUG() in add_huge_page_size(). Just WARN and
>> reject the add.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>>   arch/powerpc/mm/hugetlbpage.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
>> index 9e732bb2c84a..cf2978e235f3 100644
>> --- a/arch/powerpc/mm/hugetlbpage.c
>> +++ b/arch/powerpc/mm/hugetlbpage.c
>> @@ -634,7 +634,8 @@ static int __init add_huge_page_size(unsigned long long size)
>>   	}
>>   #endif
>>   
>> -	BUG_ON(mmu_psize_defs[mmu_psize].shift != shift);
>> +	if (WARN_ON(mmu_psize_defs[mmu_psize].shift != shift))
>> +		return -EINVAL;
> 
> Same here. There are not catching runtime errors. We should never find
> that true. This is to catch mistakes during development changes. Switch
> to VM_BUG_ON?

Ok, I'll switch to VM_BUG_ON()

Christophe

> 
> 
>>   
>>   	/* Return if huge page size has already been setup */
>>   	if (size_to_hstate(size))
>> -- 
>> 2.13.3

^ permalink raw reply

* Re: [PATCH v1 03/27] powerpc/mm: don't BUG() in slice_mask_for_size()
From: Christophe Leroy @ 2019-04-24 14:09 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <87lg0h3xpo.fsf@linux.ibm.com>



Le 11/04/2019 à 07:41, Aneesh Kumar K.V a écrit :
> Christophe Leroy <christophe.leroy@c-s.fr> writes:
> 
>> When no mask is found for the page size, WARN() and return NULL
>> instead of BUG()ing.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>>   arch/powerpc/mm/slice.c | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
>> index aec91dbcdc0b..011d470ea340 100644
>> --- a/arch/powerpc/mm/slice.c
>> +++ b/arch/powerpc/mm/slice.c
>> @@ -165,7 +165,8 @@ static struct slice_mask *slice_mask_for_size(struct mm_struct *mm, int psize)
>>   	if (psize == MMU_PAGE_16G)
>>   		return &mm->context.mask_16g;
>>   #endif
>> -	BUG();
>> +	WARN_ON(true);
>> +	return NULL;
>>   }
> 
> 
> Same here. There are not catching runtime errors. We should never find
> that true. This is to catch mistakes during development changes. Switch
> to VM_BUG_ON?

Ok, I'll switch to VM_BUG_ON()

Christophe

> 
> 
>>   #elif defined(CONFIG_PPC_8xx)
>>   static struct slice_mask *slice_mask_for_size(struct mm_struct *mm, int psize)
>> @@ -178,7 +179,8 @@ static struct slice_mask *slice_mask_for_size(struct mm_struct *mm, int psize)
>>   	if (psize == MMU_PAGE_8M)
>>   		return &mm->context.mask_8m;
>>   #endif
>> -	BUG();
>> +	WARN_ON(true);
>> +	return NULL;
>>   }
>>   #else
>>   #error "Must define the slice masks for page sizes supported by the platform"
>> -- 
>> 2.13.3

^ permalink raw reply

* Re: [PATCH v2 5/5] arm64/speculation: Support 'mitigations=' cmdline option
From: Will Deacon @ 2019-04-24 14:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Heiko Carstens, Paul Mackerras, H . Peter Anvin,
	Ingo Molnar, Andrea Arcangeli, linux-s390, x86, Steven Price,
	Linus Torvalds, Catalin Marinas, Waiman Long, linux-arch,
	Jon Masters, Jiri Kosina, Borislav Petkov, Andy Lutomirski,
	Josh Poimboeuf, linux-arm-kernel, Phil Auld, Greg Kroah-Hartman,
	Randy Dunlap, linux-kernel, Tyler Hicks, Martin Schwidefsky,
	linuxppc-dev
In-Reply-To: <alpine.DEB.2.21.1904162124020.1780@nanos.tec.linutronix.de>

Hi Thomas,

On Tue, Apr 16, 2019 at 09:26:13PM +0200, Thomas Gleixner wrote:
> On Fri, 12 Apr 2019, Josh Poimboeuf wrote:
> 
> > Configure arm64 runtime CPU speculation bug mitigations in accordance
> > with the 'mitigations=' cmdline option.  This affects Meltdown, Spectre
> > v2, and Speculative Store Bypass.
> > 
> > The default behavior is unchanged.
> > 
> > Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> > ---
> > NOTE: This is based on top of Jeremy Linton's patches:
> >       https://lkml.kernel.org/r/20190410231237.52506-1-jeremy.linton@arm.com
> 
> So I keep that out and we have to revisit that once the ARM64 stuff hits a
> tree, right? I can have a branch with just the 4 first patches applied
> which ARM64 folks can pull in when they apply Jeremy's patches before te
> merge window.

I'm assuming that this refers to the core/speculation branch in tip:

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=core/speculation

but please can you confirm that I'm good to pull that into arm64?

Cheers,

Will

^ permalink raw reply

* Re: [PATCH v12 20/31] mm: introduce vma reference counter
From: Laurent Dufour @ 2019-04-24 14:26 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: jack, sergey.senozhatsky.work, peterz, Will Deacon, mhocko,
	linux-mm, paulus, Punit Agrawal, hpa, Michel Lespinasse,
	Alexei Starovoitov, Andrea Arcangeli, ak, Minchan Kim,
	aneesh.kumar, x86, Matthew Wilcox, Daniel Jordan, Ingo Molnar,
	David Rientjes, paulmck, Haiyan Song, npiggin, sj38.park, dave,
	kemi.wang, kirill, Thomas Gleixner, zhong jiang, Ganesh Mahendran,
	Yang Shi, Mike Rapoport, linuxppc-dev, linux-kernel,
	Sergey Senozhatsky, vinayak menon, akpm, Tim Chen, haren
In-Reply-To: <20190422203647.GK14666@redhat.com>

Le 22/04/2019 à 22:36, Jerome Glisse a écrit :
> On Tue, Apr 16, 2019 at 03:45:11PM +0200, Laurent Dufour wrote:
>> The final goal is to be able to use a VMA structure without holding the
>> mmap_sem and to be sure that the structure will not be freed in our back.
>>
>> The lockless use of the VMA will be done through RCU protection and thus a
>> dedicated freeing service is required to manage it asynchronously.
>>
>> As reported in a 2010's thread [1], this may impact file handling when a
>> file is still referenced while the mapping is no more there.  As the final
>> goal is to handle anonymous VMA in a speculative way and not file backed
>> mapping, we could close and free the file pointer in a synchronous way, as
>> soon as we are guaranteed to not use it without holding the mmap_sem. For
>> sanity reason, in a minimal effort, the vm_file file pointer is unset once
>> the file pointer is put.
>>
>> [1] https://lore.kernel.org/linux-mm/20100104182429.833180340@chello.nl/
>>
>> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
> 
> Using kref would have been better from my POV even with RCU freeing
> but anyway:
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Thanks Jérôme,

I think kref is a good option here, I'll give it a try.


>> ---
>>   include/linux/mm.h       |  4 ++++
>>   include/linux/mm_types.h |  3 +++
>>   mm/internal.h            | 27 +++++++++++++++++++++++++++
>>   mm/mmap.c                | 13 +++++++++----
>>   4 files changed, 43 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index f14b2c9ddfd4..f761a9c65c74 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -529,6 +529,9 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>>   	vma->vm_mm = mm;
>>   	vma->vm_ops = &dummy_vm_ops;
>>   	INIT_LIST_HEAD(&vma->anon_vma_chain);
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +	atomic_set(&vma->vm_ref_count, 1);
>> +#endif
>>   }
>>   
>>   static inline void vma_set_anonymous(struct vm_area_struct *vma)
>> @@ -1418,6 +1421,7 @@ static inline void INIT_VMA(struct vm_area_struct *vma)
>>   	INIT_LIST_HEAD(&vma->anon_vma_chain);
>>   #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>   	seqcount_init(&vma->vm_sequence);
>> +	atomic_set(&vma->vm_ref_count, 1);
>>   #endif
>>   }
>>   
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 24b3f8ce9e42..6a6159e11a3f 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -285,6 +285,9 @@ struct vm_area_struct {
>>   	/* linked list of VM areas per task, sorted by address */
>>   	struct vm_area_struct *vm_next, *vm_prev;
>>   
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +	atomic_t vm_ref_count;
>> +#endif
>>   	struct rb_node vm_rb;
>>   
>>   	/*
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 9eeaf2b95166..302382bed406 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -40,6 +40,33 @@ void page_writeback_init(void);
>>   
>>   vm_fault_t do_swap_page(struct vm_fault *vmf);
>>   
>> +
>> +extern void __free_vma(struct vm_area_struct *vma);
>> +
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +static inline void get_vma(struct vm_area_struct *vma)
>> +{
>> +	atomic_inc(&vma->vm_ref_count);
>> +}
>> +
>> +static inline void put_vma(struct vm_area_struct *vma)
>> +{
>> +	if (atomic_dec_and_test(&vma->vm_ref_count))
>> +		__free_vma(vma);
>> +}
>> +
>> +#else
>> +
>> +static inline void get_vma(struct vm_area_struct *vma)
>> +{
>> +}
>> +
>> +static inline void put_vma(struct vm_area_struct *vma)
>> +{
>> +	__free_vma(vma);
>> +}
>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>> +
>>   void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>>   		unsigned long floor, unsigned long ceiling);
>>   
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index f7f6027a7dff..c106440dcae7 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -188,6 +188,12 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
>>   }
>>   #endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>   
>> +void __free_vma(struct vm_area_struct *vma)
>> +{
>> +	mpol_put(vma_policy(vma));
>> +	vm_area_free(vma);
>> +}
>> +
>>   /*
>>    * Close a vm structure and free it, returning the next.
>>    */
>> @@ -200,8 +206,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>>   		vma->vm_ops->close(vma);
>>   	if (vma->vm_file)
>>   		fput(vma->vm_file);
>> -	mpol_put(vma_policy(vma));
>> -	vm_area_free(vma);
>> +	vma->vm_file = NULL;
>> +	put_vma(vma);
>>   	return next;
>>   }
>>   
>> @@ -990,8 +996,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>   		if (next->anon_vma)
>>   			anon_vma_merge(vma, next);
>>   		mm->map_count--;
>> -		mpol_put(vma_policy(next));
>> -		vm_area_free(next);
>> +		put_vma(next);
>>   		/*
>>   		 * In mprotect's case 6 (see comments on vma_merge),
>>   		 * we must remove another next too. It would clutter
>> -- 
>> 2.21.0
>>
> 


^ permalink raw reply

* Re: [PATCH v4 17/63] Documentation: ACPI: move method-tracing.txt to firmware-guide/acpi and convert to rsST
From: Mauro Carvalho Chehab @ 2019-04-24 14:26 UTC (permalink / raw)
  To: Changbin Du
  Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
	Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
	Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-18-changbin.du@gmail.com>

Em Wed, 24 Apr 2019 00:28:46 +0800
Changbin Du <changbin.du@gmail.com> escreveu:

> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
> 
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
> ---
>  Documentation/acpi/method-tracing.txt         | 192 ---------------
>  Documentation/firmware-guide/acpi/index.rst   |   1 +
>  .../firmware-guide/acpi/method-tracing.rst    | 225 ++++++++++++++++++
>  3 files changed, 226 insertions(+), 192 deletions(-)
>  delete mode 100644 Documentation/acpi/method-tracing.txt
>  create mode 100644 Documentation/firmware-guide/acpi/method-tracing.rst
> 
> diff --git a/Documentation/acpi/method-tracing.txt b/Documentation/acpi/method-tracing.txt
> deleted file mode 100644
> index 0aba14c8f459..000000000000
> --- a/Documentation/acpi/method-tracing.txt
> +++ /dev/null
> @@ -1,192 +0,0 @@
> -ACPICA Trace Facility
> -
> -Copyright (C) 2015, Intel Corporation
> -Author: Lv Zheng <lv.zheng@intel.com>
> -
> -
> -Abstract:
> -
> -This document describes the functions and the interfaces of the method
> -tracing facility.
> -
> -1. Functionalities and usage examples:
> -
> -   ACPICA provides method tracing capability. And two functions are
> -   currently implemented using this capability.
> -
> -   A. Log reducer
> -   ACPICA subsystem provides debugging outputs when CONFIG_ACPI_DEBUG is
> -   enabled. The debugging messages which are deployed via
> -   ACPI_DEBUG_PRINT() macro can be reduced at 2 levels - per-component
> -   level (known as debug layer, configured via
> -   /sys/module/acpi/parameters/debug_layer) and per-type level (known as
> -   debug level, configured via /sys/module/acpi/parameters/debug_level).
> -
> -   But when the particular layer/level is applied to the control method
> -   evaluations, the quantity of the debugging outputs may still be too
> -   large to be put into the kernel log buffer. The idea thus is worked out
> -   to only enable the particular debug layer/level (normally more detailed)
> -   logs when the control method evaluation is started, and disable the
> -   detailed logging when the control method evaluation is stopped.
> -
> -   The following command examples illustrate the usage of the "log reducer"
> -   functionality:
> -   a. Filter out the debug layer/level matched logs when control methods
> -      are being evaluated:
> -      # cd /sys/module/acpi/parameters
> -      # echo "0xXXXXXXXX" > trace_debug_layer
> -      # echo "0xYYYYYYYY" > trace_debug_level
> -      # echo "enable" > trace_state
> -   b. Filter out the debug layer/level matched logs when the specified
> -      control method is being evaluated:
> -      # cd /sys/module/acpi/parameters
> -      # echo "0xXXXXXXXX" > trace_debug_layer
> -      # echo "0xYYYYYYYY" > trace_debug_level
> -      # echo "\PPPP.AAAA.TTTT.HHHH" > trace_method_name
> -      # echo "method" > /sys/module/acpi/parameters/trace_state
> -   c. Filter out the debug layer/level matched logs when the specified
> -      control method is being evaluated for the first time:
> -      # cd /sys/module/acpi/parameters
> -      # echo "0xXXXXXXXX" > trace_debug_layer
> -      # echo "0xYYYYYYYY" > trace_debug_level
> -      # echo "\PPPP.AAAA.TTTT.HHHH" > trace_method_name
> -      # echo "method-once" > /sys/module/acpi/parameters/trace_state
> -   Where:
> -      0xXXXXXXXX/0xYYYYYYYY: Refer to Documentation/acpi/debug.txt for
> -			     possible debug layer/level masking values.
> -      \PPPP.AAAA.TTTT.HHHH: Full path of a control method that can be found
> -			    in the ACPI namespace. It needn't be an entry
> -			    of a control method evaluation.
> -
> -   B. AML tracer
> -
> -   There are special log entries added by the method tracing facility at
> -   the "trace points" the AML interpreter starts/stops to execute a control
> -   method, or an AML opcode. Note that the format of the log entries are
> -   subject to change:
> -     [    0.186427]   exdebug-0398 ex_trace_point        : Method Begin [0xf58394d8:\_SB.PCI0.LPCB.ECOK] execution.
> -     [    0.186630]   exdebug-0398 ex_trace_point        : Opcode Begin [0xf5905c88:If] execution.
> -     [    0.186820]   exdebug-0398 ex_trace_point        : Opcode Begin [0xf5905cc0:LEqual] execution.
> -     [    0.187010]   exdebug-0398 ex_trace_point        : Opcode Begin [0xf5905a20:-NamePath-] execution.
> -     [    0.187214]   exdebug-0398 ex_trace_point        : Opcode End [0xf5905a20:-NamePath-] execution.
> -     [    0.187407]   exdebug-0398 ex_trace_point        : Opcode Begin [0xf5905f60:One] execution.
> -     [    0.187594]   exdebug-0398 ex_trace_point        : Opcode End [0xf5905f60:One] execution.
> -     [    0.187789]   exdebug-0398 ex_trace_point        : Opcode End [0xf5905cc0:LEqual] execution.
> -     [    0.187980]   exdebug-0398 ex_trace_point        : Opcode Begin [0xf5905cc0:Return] execution.
> -     [    0.188146]   exdebug-0398 ex_trace_point        : Opcode Begin [0xf5905f60:One] execution.
> -     [    0.188334]   exdebug-0398 ex_trace_point        : Opcode End [0xf5905f60:One] execution.
> -     [    0.188524]   exdebug-0398 ex_trace_point        : Opcode End [0xf5905cc0:Return] execution.
> -     [    0.188712]   exdebug-0398 ex_trace_point        : Opcode End [0xf5905c88:If] execution.
> -     [    0.188903]   exdebug-0398 ex_trace_point        : Method End [0xf58394d8:\_SB.PCI0.LPCB.ECOK] execution.
> -
> -   Developers can utilize these special log entries to track the AML
> -   interpretion, thus can aid issue debugging and performance tuning. Note
> -   that, as the "AML tracer" logs are implemented via ACPI_DEBUG_PRINT()
> -   macro, CONFIG_ACPI_DEBUG is also required to be enabled for enabling
> -   "AML tracer" logs.
> -
> -   The following command examples illustrate the usage of the "AML tracer"
> -   functionality:
> -   a. Filter out the method start/stop "AML tracer" logs when control
> -      methods are being evaluated:
> -      # cd /sys/module/acpi/parameters
> -      # echo "0x80" > trace_debug_layer
> -      # echo "0x10" > trace_debug_level
> -      # echo "enable" > trace_state
> -   b. Filter out the method start/stop "AML tracer" when the specified
> -      control method is being evaluated:
> -      # cd /sys/module/acpi/parameters
> -      # echo "0x80" > trace_debug_layer
> -      # echo "0x10" > trace_debug_level
> -      # echo "\PPPP.AAAA.TTTT.HHHH" > trace_method_name
> -      # echo "method" > trace_state
> -   c. Filter out the method start/stop "AML tracer" logs when the specified
> -      control method is being evaluated for the first time:
> -      # cd /sys/module/acpi/parameters
> -      # echo "0x80" > trace_debug_layer
> -      # echo "0x10" > trace_debug_level
> -      # echo "\PPPP.AAAA.TTTT.HHHH" > trace_method_name
> -      # echo "method-once" > trace_state
> -   d. Filter out the method/opcode start/stop "AML tracer" when the
> -      specified control method is being evaluated:
> -      # cd /sys/module/acpi/parameters
> -      # echo "0x80" > trace_debug_layer
> -      # echo "0x10" > trace_debug_level
> -      # echo "\PPPP.AAAA.TTTT.HHHH" > trace_method_name
> -      # echo "opcode" > trace_state
> -   e. Filter out the method/opcode start/stop "AML tracer" when the
> -      specified control method is being evaluated for the first time:
> -      # cd /sys/module/acpi/parameters
> -      # echo "0x80" > trace_debug_layer
> -      # echo "0x10" > trace_debug_level
> -      # echo "\PPPP.AAAA.TTTT.HHHH" > trace_method_name
> -      # echo "opcode-opcode" > trace_state
> -
> -  Note that all above method tracing facility related module parameters can
> -  be used as the boot parameters, for example:
> -      acpi.trace_debug_layer=0x80 acpi.trace_debug_level=0x10 \
> -      acpi.trace_method_name=\_SB.LID0._LID acpi.trace_state=opcode-once
> -
> -2. Interface descriptions:
> -
> -   All method tracing functions can be configured via ACPI module
> -   parameters that are accessible at /sys/module/acpi/parameters/:
> -
> -   trace_method_name
> -	The full path of the AML method that the user wants to trace.
> -	Note that the full path shouldn't contain the trailing "_"s in its
> -	name segments but may contain "\" to form an absolute path.
> -
> -   trace_debug_layer
> -	The temporary debug_layer used when the tracing feature is enabled.
> -	Using ACPI_EXECUTER (0x80) by default, which is the debug_layer
> -	used to match all "AML tracer" logs.
> -
> -   trace_debug_level
> -	The temporary debug_level used when the tracing feature is enabled.
> -	Using ACPI_LV_TRACE_POINT (0x10) by default, which is the
> -	debug_level used to match all "AML tracer" logs.
> -
> -   trace_state
> -	The status of the tracing feature.
> -	Users can enable/disable this debug tracing feature by executing
> -	the following command:
> -	    # echo string > /sys/module/acpi/parameters/trace_state
> -	Where "string" should be one of the following:
> -	"disable"
> -	    Disable the method tracing feature.
> -	"enable"
> -	    Enable the method tracing feature.
> -	    ACPICA debugging messages matching
> -	    "trace_debug_layer/trace_debug_level" during any method
> -	    execution will be logged.
> -	"method"
> -	    Enable the method tracing feature.
> -	    ACPICA debugging messages matching
> -	    "trace_debug_layer/trace_debug_level" during method execution
> -	    of "trace_method_name" will be logged.
> -	"method-once"
> -	    Enable the method tracing feature.
> -	    ACPICA debugging messages matching
> -	    "trace_debug_layer/trace_debug_level" during method execution
> -	    of "trace_method_name" will be logged only once.
> -	"opcode"
> -	    Enable the method tracing feature.
> -	    ACPICA debugging messages matching
> -	    "trace_debug_layer/trace_debug_level" during method/opcode
> -	    execution of "trace_method_name" will be logged.
> -	"opcode-once"
> -	    Enable the method tracing feature.
> -	    ACPICA debugging messages matching
> -	    "trace_debug_layer/trace_debug_level" during method/opcode
> -	    execution of "trace_method_name" will be logged only once.
> -	Note that, the difference between the "enable" and other feature
> -        enabling options are:
> -	1. When "enable" is specified, since
> -	   "trace_debug_layer/trace_debug_level" shall apply to all control
> -	   method evaluations, after configuring "trace_state" to "enable",
> -	   "trace_method_name" will be reset to NULL.
> -	2. When "method/opcode" is specified, if
> -	   "trace_method_name" is NULL when "trace_state" is configured to
> -	   these options, the "trace_debug_layer/trace_debug_level" will
> -	   apply to all control method evaluations.
> diff --git a/Documentation/firmware-guide/acpi/index.rst b/Documentation/firmware-guide/acpi/index.rst
> index a45fea11f998..287a7cbd82ac 100644
> --- a/Documentation/firmware-guide/acpi/index.rst
> +++ b/Documentation/firmware-guide/acpi/index.rst
> @@ -13,6 +13,7 @@ ACPI Support
>     enumeration
>     osi
>     method-customizing
> +   method-tracing
>     DSD-properties-rules
>     debug
>     gpio-properties
> diff --git a/Documentation/firmware-guide/acpi/method-tracing.rst b/Documentation/firmware-guide/acpi/method-tracing.rst
> new file mode 100644
> index 000000000000..7a997ba168d7
> --- /dev/null
> +++ b/Documentation/firmware-guide/acpi/method-tracing.rst
> @@ -0,0 +1,225 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. include:: <isonum.txt>
> +
> +=====================
> +ACPICA Trace Facility
> +=====================
> +
> +:Copyright: |copy| 2015, Intel Corporation
> +:Author: Lv Zheng <lv.zheng@intel.com>
> +
> +
> +:Abstract: This document describes the functions and the interfaces of the
> +  method tracing facility.

Same comment as on other patches.

> +
> +1. Functionalities and usage examples
> +=====================================
> +
> +ACPICA provides method tracing capability. And two functions are
> +currently implemented using this capability.
> +
> +Log reducer
> +--------------
> +
> +ACPICA subsystem provides debugging outputs when CONFIG_ACPI_DEBUG is
> +enabled. The debugging messages which are deployed via
> +ACPI_DEBUG_PRINT() macro can be reduced at 2 levels - per-component
> +level (known as debug layer, configured via
> +/sys/module/acpi/parameters/debug_layer) and per-type level (known as
> +debug level, configured via /sys/module/acpi/parameters/debug_level).
> +
> +But when the particular layer/level is applied to the control method
> +evaluations, the quantity of the debugging outputs may still be too
> +large to be put into the kernel log buffer. The idea thus is worked out
> +to only enable the particular debug layer/level (normally more detailed)
> +logs when the control method evaluation is started, and disable the
> +detailed logging when the control method evaluation is stopped.
> +
> +The following command examples illustrate the usage of the "log reducer"
> +functionality:
> +
> +a. Filter out the debug layer/level matched logs when control methods
> +   are being evaluated::
> +
> +      # cd /sys/module/acpi/parameters
> +      # echo "0xXXXXXXXX" > trace_debug_layer
> +      # echo "0xYYYYYYYY" > trace_debug_level
> +      # echo "enable" > trace_state
> +
> +b. Filter out the debug layer/level matched logs when the specified
> +   control method is being evaluated::
> +
> +      # cd /sys/module/acpi/parameters
> +      # echo "0xXXXXXXXX" > trace_debug_layer
> +      # echo "0xYYYYYYYY" > trace_debug_level
> +      # echo "\PPPP.AAAA.TTTT.HHHH" > trace_method_name
> +      # echo "method" > /sys/module/acpi/parameters/trace_state
> +
> +c. Filter out the debug layer/level matched logs when the specified
> +   control method is being evaluated for the first time::
> +
> +      # cd /sys/module/acpi/parameters
> +      # echo "0xXXXXXXXX" > trace_debug_layer
> +      # echo "0xYYYYYYYY" > trace_debug_level
> +      # echo "\PPPP.AAAA.TTTT.HHHH" > trace_method_name
> +      # echo "method-once" > /sys/module/acpi/parameters/trace_state
> +
> +Where:
> +   0xXXXXXXXX/0xYYYYYYYY
> +     Refer to Documentation/acpi/debug.txt for possible debug layer/level
> +     masking values.
> +   \PPPP.AAAA.TTTT.HHHH
> +     Full path of a control method that can be found in the ACPI namespace.
> +     It needn't be an entry of a control method evaluation.
> +
> +AML tracer
> +-------------

The markup is bigger than the line. You should have seen a Sphinx
warning here.

> +
> +There are special log entries added by the method tracing facility at
> +the "trace points" the AML interpreter starts/stops to execute a control
> +method, or an AML opcode. Note that the format of the log entries are
> +subject to change::
> +
> +   [    0.186427]   exdebug-0398 ex_trace_point        : Method Begin [0xf58394d8:\_SB.PCI0.LPCB.ECOK] execution.
> +   [    0.186630]   exdebug-0398 ex_trace_point        : Opcode Begin [0xf5905c88:If] execution.
> +   [    0.186820]   exdebug-0398 ex_trace_point        : Opcode Begin [0xf5905cc0:LEqual] execution.
> +   [    0.187010]   exdebug-0398 ex_trace_point        : Opcode Begin [0xf5905a20:-NamePath-] execution.
> +   [    0.187214]   exdebug-0398 ex_trace_point        : Opcode End [0xf5905a20:-NamePath-] execution.
> +   [    0.187407]   exdebug-0398 ex_trace_point        : Opcode Begin [0xf5905f60:One] execution.
> +   [    0.187594]   exdebug-0398 ex_trace_point        : Opcode End [0xf5905f60:One] execution.
> +   [    0.187789]   exdebug-0398 ex_trace_point        : Opcode End [0xf5905cc0:LEqual] execution.
> +   [    0.187980]   exdebug-0398 ex_trace_point        : Opcode Begin [0xf5905cc0:Return] execution.
> +   [    0.188146]   exdebug-0398 ex_trace_point        : Opcode Begin [0xf5905f60:One] execution.
> +   [    0.188334]   exdebug-0398 ex_trace_point        : Opcode End [0xf5905f60:One] execution.
> +   [    0.188524]   exdebug-0398 ex_trace_point        : Opcode End [0xf5905cc0:Return] execution.
> +   [    0.188712]   exdebug-0398 ex_trace_point        : Opcode End [0xf5905c88:If] execution.
> +   [    0.188903]   exdebug-0398 ex_trace_point        : Method End [0xf58394d8:\_SB.PCI0.LPCB.ECOK] execution.
> +
> +Developers can utilize these special log entries to track the AML
> +interpretion, thus can aid issue debugging and performance tuning. Note
> +that, as the "AML tracer" logs are implemented via ACPI_DEBUG_PRINT()
> +macro, CONFIG_ACPI_DEBUG is also required to be enabled for enabling
> +"AML tracer" logs.
> +
> +The following command examples illustrate the usage of the "AML tracer"
> +functionality:
> +
> +a. Filter out the method start/stop "AML tracer" logs when control
> +   methods are being evaluated::
> +
> +      # cd /sys/module/acpi/parameters
> +      # echo "0x80" > trace_debug_layer
> +      # echo "0x10" > trace_debug_level
> +      # echo "enable" > trace_state
> +
> +b. Filter out the method start/stop "AML tracer" when the specified
> +   control method is being evaluated::
> +
> +      # cd /sys/module/acpi/parameters
> +      # echo "0x80" > trace_debug_layer
> +      # echo "0x10" > trace_debug_level
> +      # echo "\PPPP.AAAA.TTTT.HHHH" > trace_method_name
> +      # echo "method" > trace_state
> +
> +c. Filter out the method start/stop "AML tracer" logs when the specified
> +   control method is being evaluated for the first time::
> +
> +      # cd /sys/module/acpi/parameters
> +      # echo "0x80" > trace_debug_layer
> +      # echo "0x10" > trace_debug_level
> +      # echo "\PPPP.AAAA.TTTT.HHHH" > trace_method_name
> +      # echo "method-once" > trace_state
> +
> +d. Filter out the method/opcode start/stop "AML tracer" when the
> +   specified control method is being evaluated::
> +
> +      # cd /sys/module/acpi/parameters
> +      # echo "0x80" > trace_debug_layer
> +      # echo "0x10" > trace_debug_level
> +      # echo "\PPPP.AAAA.TTTT.HHHH" > trace_method_name
> +      # echo "opcode" > trace_state
> +
> +e. Filter out the method/opcode start/stop "AML tracer" when the
> +   specified control method is being evaluated for the first time::
> +
> +      # cd /sys/module/acpi/parameters
> +      # echo "0x80" > trace_debug_layer
> +      # echo "0x10" > trace_debug_level
> +      # echo "\PPPP.AAAA.TTTT.HHHH" > trace_method_name
> +      # echo "opcode-opcode" > trace_state
> +
> +Note that all above method tracing facility related module parameters can
> +be used as the boot parameters, for example::
> +
> +   acpi.trace_debug_layer=0x80 acpi.trace_debug_level=0x10 \
> +   acpi.trace_method_name=\_SB.LID0._LID acpi.trace_state=opcode-once
> +
> +2. Interface descriptions
> +=========================
> +
> +All method tracing functions can be configured via ACPI module
> +parameters that are accessible at /sys/module/acpi/parameters/:
> +
> +trace_method_name
> +The full path of the AML method that the user wants to trace.
> +Note that the full path shouldn't contain the trailing "_"s in its
> +name segments but may contain "\" to form an absolute path.
> +


> +trace_debug_layer
> +The temporary debug_layer used when the tracing feature is enabled.
> +Using ACPI_EXECUTER (0x80) by default, which is the debug_layer
> +used to match all "AML tracer" logs.
> +
> +trace_debug_level
> +The temporary debug_level used when the tracing feature is enabled.
> +Using ACPI_LV_TRACE_POINT (0x10) by default, which is the
> +debug_level used to match all "AML tracer" logs.
> +
> +trace_state
> +The status of the tracing feature.
> +Users can enable/disable this debug tracing feature by executing
> +the following command::

For the above, please indent, in order to properly change the
sysfs node font to bold. Also, mark paragraphs with a \n, e. g:

   trace_method_name
	The full path of the AML method that the user wants to trace.

	Note that the full path shouldn't contain the trailing "_"s in its
	name segments but may contain "\" to form an absolute path.

   trace_debug_layer
	The temporary debug_layer used when the tracing feature is enabled.

	Using ACPI_EXECUTER (0x80) by default, which is the debug_layer
	used to match all "AML tracer" logs.

   trace_debug_level
	The temporary debug_level used when the tracing feature is enabled.

	Using ACPI_LV_TRACE_POINT (0x10) by default, which is the
	debug_level used to match all "AML tracer" logs.

   trace_state
	The status of the tracing feature.

	Users can enable/disable this debug tracing feature by executing
	the following command::

After doing such changes:

Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>


> +
> +   # echo string > /sys/module/acpi/parameters/trace_state
> +
> +Where "string" should be one of the following:
> +
> +"disable"
> +      Disable the method tracing feature.
> +"enable"
> +      Enable the method tracing feature.
> +      ACPICA debugging messages matching
> +      "trace_debug_layer/trace_debug_level" during any method
> +      execution will be logged.
> +"method"
> +      Enable the method tracing feature.
> +      ACPICA debugging messages matching
> +      "trace_debug_layer/trace_debug_level" during method execution
> +      of "trace_method_name" will be logged.
> +"method-once"
> +      Enable the method tracing feature.
> +      ACPICA debugging messages matching
> +      "trace_debug_layer/trace_debug_level" during method execution
> +      of "trace_method_name" will be logged only once.
> +"opcode"
> +      Enable the method tracing feature.
> +      ACPICA debugging messages matching
> +      "trace_debug_layer/trace_debug_level" during method/opcode
> +      execution of "trace_method_name" will be logged.
> +"opcode-once"
> +      Enable the method tracing feature.
> +      ACPICA debugging messages matching
> +      "trace_debug_layer/trace_debug_level" during method/opcode
> +      execution of "trace_method_name" will be logged only once.
> +
> +Note that, the difference between the "enable" and other feature
> +enabling options are:
> +
> +1. When "enable" is specified, since
> +   "trace_debug_layer/trace_debug_level" shall apply to all control
> +   method evaluations, after configuring "trace_state" to "enable",
> +   "trace_method_name" will be reset to NULL.
> +2. When "method/opcode" is specified, if
> +   "trace_method_name" is NULL when "trace_state" is configured to
> +   these options, the "trace_debug_layer/trace_debug_level" will
> +   apply to all control method evaluations.



Thanks,
Mauro

^ permalink raw reply

* Re: [PATCH v4 18/63] Documentation: ACPI: move aml-debugger.txt to firmware-guide/acpi and convert to reST
From: Mauro Carvalho Chehab @ 2019-04-24 14:28 UTC (permalink / raw)
  To: Changbin Du
  Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
	Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
	Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-19-changbin.du@gmail.com>

Em Wed, 24 Apr 2019 00:28:47 +0800
Changbin Du <changbin.du@gmail.com> escreveu:

> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
> 
> Signed-off-by: Changbin Du <changbin.du@gmail.com>

For the conversion changes:

Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

> ---
>  Documentation/acpi/aml-debugger.txt           | 66 ----------------
>  .../firmware-guide/acpi/aml-debugger.rst      | 75 +++++++++++++++++++
>  Documentation/firmware-guide/acpi/index.rst   |  1 +
>  3 files changed, 76 insertions(+), 66 deletions(-)
>  delete mode 100644 Documentation/acpi/aml-debugger.txt
>  create mode 100644 Documentation/firmware-guide/acpi/aml-debugger.rst
> 
> diff --git a/Documentation/acpi/aml-debugger.txt b/Documentation/acpi/aml-debugger.txt
> deleted file mode 100644
> index 75ebeb64ab29..000000000000
> --- a/Documentation/acpi/aml-debugger.txt
> +++ /dev/null
> @@ -1,66 +0,0 @@
> -The AML Debugger
> -
> -Copyright (C) 2016, Intel Corporation
> -Author: Lv Zheng <lv.zheng@intel.com>
> -
> -
> -This document describes the usage of the AML debugger embedded in the Linux
> -kernel.
> -
> -1. Build the debugger
> -
> -   The following kernel configuration items are required to enable the AML
> -   debugger interface from the Linux kernel:
> -
> -   CONFIG_ACPI_DEBUGGER=y
> -   CONFIG_ACPI_DEBUGGER_USER=m
> -
> -   The userspace utilities can be built from the kernel source tree using
> -   the following commands:
> -
> -   $ cd tools
> -   $ make acpi
> -
> -   The resultant userspace tool binary is then located at:
> -
> -     tools/power/acpi/acpidbg
> -
> -   It can be installed to system directories by running "make install" (as a
> -   sufficiently privileged user).
> -
> -2. Start the userspace debugger interface
> -
> -   After booting the kernel with the debugger built-in, the debugger can be
> -   started by using the following commands:
> -
> -   # mount -t debugfs none /sys/kernel/debug
> -   # modprobe acpi_dbg
> -   # tools/power/acpi/acpidbg
> -
> -   That spawns the interactive AML debugger environment where you can execute
> -   debugger commands.
> -
> -   The commands are documented in the "ACPICA Overview and Programmer Reference"
> -   that can be downloaded from
> -
> -   https://acpica.org/documentation
> -
> -   The detailed debugger commands reference is located in Chapter 12 "ACPICA
> -   Debugger Reference".  The "help" command can be used for a quick reference.
> -
> -3. Stop the userspace debugger interface
> -
> -   The interactive debugger interface can be closed by pressing Ctrl+C or using
> -   the "quit" or "exit" commands.  When finished, unload the module with:
> -
> -   # rmmod acpi_dbg
> -
> -   The module unloading may fail if there is an acpidbg instance running.
> -
> -4. Run the debugger in a script
> -
> -   It may be useful to run the AML debugger in a test script. "acpidbg" supports
> -   this in a special "batch" mode.  For example, the following command outputs
> -   the entire ACPI namespace:
> -
> -   # acpidbg -b "namespace"
> diff --git a/Documentation/firmware-guide/acpi/aml-debugger.rst b/Documentation/firmware-guide/acpi/aml-debugger.rst
> new file mode 100644
> index 000000000000..a889d43bc6c5
> --- /dev/null
> +++ b/Documentation/firmware-guide/acpi/aml-debugger.rst
> @@ -0,0 +1,75 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. include:: <isonum.txt>
> +
> +================
> +The AML Debugger
> +================
> +
> +:Copyright: |copy| 2016, Intel Corporation
> +:Author: Lv Zheng <lv.zheng@intel.com>
> +
> +
> +This document describes the usage of the AML debugger embedded in the Linux
> +kernel.
> +
> +1. Build the debugger
> +=====================
> +
> +The following kernel configuration items are required to enable the AML
> +debugger interface from the Linux kernel::
> +
> +   CONFIG_ACPI_DEBUGGER=y
> +   CONFIG_ACPI_DEBUGGER_USER=m
> +
> +The userspace utilities can be built from the kernel source tree using
> +the following commands::
> +
> +   $ cd tools
> +   $ make acpi
> +
> +The resultant userspace tool binary is then located at::
> +
> +   tools/power/acpi/acpidbg
> +
> +It can be installed to system directories by running "make install" (as a
> +sufficiently privileged user).
> +
> +2. Start the userspace debugger interface
> +=========================================
> +
> +After booting the kernel with the debugger built-in, the debugger can be
> +started by using the following commands::
> +
> +   # mount -t debugfs none /sys/kernel/debug
> +   # modprobe acpi_dbg
> +   # tools/power/acpi/acpidbg
> +
> +That spawns the interactive AML debugger environment where you can execute
> +debugger commands.
> +
> +The commands are documented in the "ACPICA Overview and Programmer Reference"
> +that can be downloaded from
> +
> +https://acpica.org/documentation
> +
> +The detailed debugger commands reference is located in Chapter 12 "ACPICA
> +Debugger Reference".  The "help" command can be used for a quick reference.
> +
> +3. Stop the userspace debugger interface
> +========================================
> +
> +The interactive debugger interface can be closed by pressing Ctrl+C or using
> +the "quit" or "exit" commands.  When finished, unload the module with::
> +
> +   # rmmod acpi_dbg
> +
> +The module unloading may fail if there is an acpidbg instance running.
> +
> +4. Run the debugger in a script
> +===============================
> +
> +It may be useful to run the AML debugger in a test script. "acpidbg" supports
> +this in a special "batch" mode.  For example, the following command outputs
> +the entire ACPI namespace::
> +
> +   # acpidbg -b "namespace"
> diff --git a/Documentation/firmware-guide/acpi/index.rst b/Documentation/firmware-guide/acpi/index.rst
> index 287a7cbd82ac..e9f253d54897 100644
> --- a/Documentation/firmware-guide/acpi/index.rst
> +++ b/Documentation/firmware-guide/acpi/index.rst
> @@ -16,6 +16,7 @@ ACPI Support
>     method-tracing
>     DSD-properties-rules
>     debug
> +   aml-debugger
>     gpio-properties
>     i2c-muxes
>     acpi-lid



Thanks,
Mauro

^ permalink raw reply

* Re: [PATCH v4 19/63] Documentation: ACPI: move apei/output_format.txt to firmware-guide/acpi and convert to reST
From: Mauro Carvalho Chehab @ 2019-04-24 14:29 UTC (permalink / raw)
  To: Changbin Du
  Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
	Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
	Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-20-changbin.du@gmail.com>

Em Wed, 24 Apr 2019 00:28:48 +0800
Changbin Du <changbin.du@gmail.com> escreveu:

> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
> 
> Signed-off-by: Changbin Du <changbin.du@gmail.com>

For the conversion changes:

Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

> ---
>  Documentation/acpi/apei/output_format.txt     | 147 -----------------
>  .../acpi/apei/output_format.rst               | 150 ++++++++++++++++++
>  Documentation/firmware-guide/acpi/index.rst   |   1 +
>  3 files changed, 151 insertions(+), 147 deletions(-)
>  delete mode 100644 Documentation/acpi/apei/output_format.txt
>  create mode 100644 Documentation/firmware-guide/acpi/apei/output_format.rst
> 
> diff --git a/Documentation/acpi/apei/output_format.txt b/Documentation/acpi/apei/output_format.txt
> deleted file mode 100644
> index 0c49c197c47a..000000000000
> --- a/Documentation/acpi/apei/output_format.txt
> +++ /dev/null
> @@ -1,147 +0,0 @@
> -                     APEI output format
> -                     ~~~~~~~~~~~~~~~~~~
> -
> -APEI uses printk as hardware error reporting interface, the output
> -format is as follow.
> -
> -<error record> :=
> -APEI generic hardware error status
> -severity: <integer>, <severity string>
> -section: <integer>, severity: <integer>, <severity string>
> -flags: <integer>
> -<section flags strings>
> -fru_id: <uuid string>
> -fru_text: <string>
> -section_type: <section type string>
> -<section data>
> -
> -<severity string>* := recoverable | fatal | corrected | info
> -
> -<section flags strings># :=
> -[primary][, containment warning][, reset][, threshold exceeded]\
> -[, resource not accessible][, latent error]
> -
> -<section type string> := generic processor error | memory error | \
> -PCIe error | unknown, <uuid string>
> -
> -<section data> :=
> -<generic processor section data> | <memory section data> | \
> -<pcie section data> | <null>
> -
> -<generic processor section data> :=
> -[processor_type: <integer>, <proc type string>]
> -[processor_isa: <integer>, <proc isa string>]
> -[error_type: <integer>
> -<proc error type strings>]
> -[operation: <integer>, <proc operation string>]
> -[flags: <integer>
> -<proc flags strings>]
> -[level: <integer>]
> -[version_info: <integer>]
> -[processor_id: <integer>]
> -[target_address: <integer>]
> -[requestor_id: <integer>]
> -[responder_id: <integer>]
> -[IP: <integer>]
> -
> -<proc type string>* := IA32/X64 | IA64
> -
> -<proc isa string>* := IA32 | IA64 | X64
> -
> -<processor error type strings># :=
> -[cache error][, TLB error][, bus error][, micro-architectural error]
> -
> -<proc operation string>* := unknown or generic | data read | data write | \
> -instruction execution
> -
> -<proc flags strings># :=
> -[restartable][, precise IP][, overflow][, corrected]
> -
> -<memory section data> :=
> -[error_status: <integer>]
> -[physical_address: <integer>]
> -[physical_address_mask: <integer>]
> -[node: <integer>]
> -[card: <integer>]
> -[module: <integer>]
> -[bank: <integer>]
> -[device: <integer>]
> -[row: <integer>]
> -[column: <integer>]
> -[bit_position: <integer>]
> -[requestor_id: <integer>]
> -[responder_id: <integer>]
> -[target_id: <integer>]
> -[error_type: <integer>, <mem error type string>]
> -
> -<mem error type string>* :=
> -unknown | no error | single-bit ECC | multi-bit ECC | \
> -single-symbol chipkill ECC | multi-symbol chipkill ECC | master abort | \
> -target abort | parity error | watchdog timeout | invalid address | \
> -mirror Broken | memory sparing | scrub corrected error | \
> -scrub uncorrected error
> -
> -<pcie section data> :=
> -[port_type: <integer>, <pcie port type string>]
> -[version: <integer>.<integer>]
> -[command: <integer>, status: <integer>]
> -[device_id: <integer>:<integer>:<integer>.<integer>
> -slot: <integer>
> -secondary_bus: <integer>
> -vendor_id: <integer>, device_id: <integer>
> -class_code: <integer>]
> -[serial number: <integer>, <integer>]
> -[bridge: secondary_status: <integer>, control: <integer>]
> -[aer_status: <integer>, aer_mask: <integer>
> -<aer status string>
> -[aer_uncor_severity: <integer>]
> -aer_layer=<aer layer string>, aer_agent=<aer agent string>
> -aer_tlp_header: <integer> <integer> <integer> <integer>]
> -
> -<pcie port type string>* := PCIe end point | legacy PCI end point | \
> -unknown | unknown | root port | upstream switch port | \
> -downstream switch port | PCIe to PCI/PCI-X bridge | \
> -PCI/PCI-X to PCIe bridge | root complex integrated endpoint device | \
> -root complex event collector
> -
> -if section severity is fatal or recoverable
> -<aer status string># :=
> -unknown | unknown | unknown | unknown | Data Link Protocol | \
> -unknown | unknown | unknown | unknown | unknown | unknown | unknown | \
> -Poisoned TLP | Flow Control Protocol | Completion Timeout | \
> -Completer Abort | Unexpected Completion | Receiver Overflow | \
> -Malformed TLP | ECRC | Unsupported Request
> -else
> -<aer status string># :=
> -Receiver Error | unknown | unknown | unknown | unknown | unknown | \
> -Bad TLP | Bad DLLP | RELAY_NUM Rollover | unknown | unknown | unknown | \
> -Replay Timer Timeout | Advisory Non-Fatal
> -fi
> -
> -<aer layer string> :=
> -Physical Layer | Data Link Layer | Transaction Layer
> -
> -<aer agent string> :=
> -Receiver ID | Requester ID | Completer ID | Transmitter ID
> -
> -Where, [] designate corresponding content is optional
> -
> -All <field string> description with * has the following format:
> -
> -field: <integer>, <field string>
> -
> -Where value of <integer> should be the position of "string" in <field
> -string> description. Otherwise, <field string> will be "unknown".  
> -
> -All <field strings> description with # has the following format:
> -
> -field: <integer>
> -<field strings>
> -
> -Where each string in <fields strings> corresponding to one set bit of
> -<integer>. The bit position is the position of "string" in <field
> -strings> description.  
> -
> -For more detailed explanation of every field, please refer to UEFI
> -specification version 2.3 or later, section Appendix N: Common
> -Platform Error Record.
> diff --git a/Documentation/firmware-guide/acpi/apei/output_format.rst b/Documentation/firmware-guide/acpi/apei/output_format.rst
> new file mode 100644
> index 000000000000..c2e7ebddb529
> --- /dev/null
> +++ b/Documentation/firmware-guide/acpi/apei/output_format.rst
> @@ -0,0 +1,150 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==================
> +APEI output format
> +==================
> +
> +APEI uses printk as hardware error reporting interface, the output
> +format is as follow::
> +
> +        <error record> :=
> +        APEI generic hardware error status
> +        severity: <integer>, <severity string>
> +        section: <integer>, severity: <integer>, <severity string>
> +        flags: <integer>
> +        <section flags strings>
> +        fru_id: <uuid string>
> +        fru_text: <string>
> +        section_type: <section type string>
> +        <section data>
> +
> +        <severity string>* := recoverable | fatal | corrected | info
> +
> +        <section flags strings># :=
> +        [primary][, containment warning][, reset][, threshold exceeded]\
> +        [, resource not accessible][, latent error]
> +
> +        <section type string> := generic processor error | memory error | \
> +        PCIe error | unknown, <uuid string>
> +
> +        <section data> :=
> +        <generic processor section data> | <memory section data> | \
> +        <pcie section data> | <null>
> +
> +        <generic processor section data> :=
> +        [processor_type: <integer>, <proc type string>]
> +        [processor_isa: <integer>, <proc isa string>]
> +        [error_type: <integer>
> +        <proc error type strings>]
> +        [operation: <integer>, <proc operation string>]
> +        [flags: <integer>
> +        <proc flags strings>]
> +        [level: <integer>]
> +        [version_info: <integer>]
> +        [processor_id: <integer>]
> +        [target_address: <integer>]
> +        [requestor_id: <integer>]
> +        [responder_id: <integer>]
> +        [IP: <integer>]
> +
> +        <proc type string>* := IA32/X64 | IA64
> +
> +        <proc isa string>* := IA32 | IA64 | X64
> +
> +        <processor error type strings># :=
> +        [cache error][, TLB error][, bus error][, micro-architectural error]
> +
> +        <proc operation string>* := unknown or generic | data read | data write | \
> +        instruction execution
> +
> +        <proc flags strings># :=
> +        [restartable][, precise IP][, overflow][, corrected]
> +
> +        <memory section data> :=
> +        [error_status: <integer>]
> +        [physical_address: <integer>]
> +        [physical_address_mask: <integer>]
> +        [node: <integer>]
> +        [card: <integer>]
> +        [module: <integer>]
> +        [bank: <integer>]
> +        [device: <integer>]
> +        [row: <integer>]
> +        [column: <integer>]
> +        [bit_position: <integer>]
> +        [requestor_id: <integer>]
> +        [responder_id: <integer>]
> +        [target_id: <integer>]
> +        [error_type: <integer>, <mem error type string>]
> +
> +        <mem error type string>* :=
> +        unknown | no error | single-bit ECC | multi-bit ECC | \
> +        single-symbol chipkill ECC | multi-symbol chipkill ECC | master abort | \
> +        target abort | parity error | watchdog timeout | invalid address | \
> +        mirror Broken | memory sparing | scrub corrected error | \
> +        scrub uncorrected error
> +
> +        <pcie section data> :=
> +        [port_type: <integer>, <pcie port type string>]
> +        [version: <integer>.<integer>]
> +        [command: <integer>, status: <integer>]
> +        [device_id: <integer>:<integer>:<integer>.<integer>
> +        slot: <integer>
> +        secondary_bus: <integer>
> +        vendor_id: <integer>, device_id: <integer>
> +        class_code: <integer>]
> +        [serial number: <integer>, <integer>]
> +        [bridge: secondary_status: <integer>, control: <integer>]
> +        [aer_status: <integer>, aer_mask: <integer>
> +        <aer status string>
> +        [aer_uncor_severity: <integer>]
> +        aer_layer=<aer layer string>, aer_agent=<aer agent string>
> +        aer_tlp_header: <integer> <integer> <integer> <integer>]
> +
> +        <pcie port type string>* := PCIe end point | legacy PCI end point | \
> +        unknown | unknown | root port | upstream switch port | \
> +        downstream switch port | PCIe to PCI/PCI-X bridge | \
> +        PCI/PCI-X to PCIe bridge | root complex integrated endpoint device | \
> +        root complex event collector
> +
> +        if section severity is fatal or recoverable
> +        <aer status string># :=
> +        unknown | unknown | unknown | unknown | Data Link Protocol | \
> +        unknown | unknown | unknown | unknown | unknown | unknown | unknown | \
> +        Poisoned TLP | Flow Control Protocol | Completion Timeout | \
> +        Completer Abort | Unexpected Completion | Receiver Overflow | \
> +        Malformed TLP | ECRC | Unsupported Request
> +        else
> +        <aer status string># :=
> +        Receiver Error | unknown | unknown | unknown | unknown | unknown | \
> +        Bad TLP | Bad DLLP | RELAY_NUM Rollover | unknown | unknown | unknown | \
> +        Replay Timer Timeout | Advisory Non-Fatal
> +        fi
> +
> +        <aer layer string> :=
> +        Physical Layer | Data Link Layer | Transaction Layer
> +
> +        <aer agent string> :=
> +        Receiver ID | Requester ID | Completer ID | Transmitter ID
> +
> +Where, [] designate corresponding content is optional
> +
> +All <field string> description with * has the following format::
> +
> +        field: <integer>, <field string>
> +
> +Where value of <integer> should be the position of "string" in <field
> +string> description. Otherwise, <field string> will be "unknown".  
> +
> +All <field strings> description with # has the following format::
> +
> +        field: <integer>
> +        <field strings>
> +
> +Where each string in <fields strings> corresponding to one set bit of
> +<integer>. The bit position is the position of "string" in <field
> +strings> description.  
> +
> +For more detailed explanation of every field, please refer to UEFI
> +specification version 2.3 or later, section Appendix N: Common
> +Platform Error Record.
> diff --git a/Documentation/firmware-guide/acpi/index.rst b/Documentation/firmware-guide/acpi/index.rst
> index e9f253d54897..869badba6d7a 100644
> --- a/Documentation/firmware-guide/acpi/index.rst
> +++ b/Documentation/firmware-guide/acpi/index.rst
> @@ -17,6 +17,7 @@ ACPI Support
>     DSD-properties-rules
>     debug
>     aml-debugger
> +   apei/output_format
>     gpio-properties
>     i2c-muxes
>     acpi-lid



Thanks,
Mauro

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox