LinuxPPC-Dev Archive on lore.kernel.org


* Re: [RFC PATCH 12/23] kernel/watchdog: Introduce a struct for NMI watchdog operations
From: Thomas Gleixner @ 2018-06-14  8:32 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Ricardo Neri, Peter Zijlstra, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, Ashok Raj, Borislav Petkov, Tony Luck,
	Ravi V. Shankar, x86, sparclinux, linuxppc-dev, linux-kernel,
	Jacob Pan, Don Zickus, Michael Ellerman, Frederic Weisbecker,
	Babu Moger, David S. Miller, Benjamin Herrenschmidt,
	Paul Mackerras, Mathieu Desnoyers, Masami Hiramatsu,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Luis R. Rodriguez, iommu
In-Reply-To: <20180614123250.230667f6@roar.ozlabs.ibm.com>

On Thu, 14 Jun 2018, Nicholas Piggin wrote:
> On Wed, 13 Jun 2018 18:31:17 -0700
> > I could look into creating the library for
> > common code and relocate the hpet watchdog into arch/x86 for the hpet-
> > specific parts.
> 
> If you can investigate that approach, that would be appreciated. I hope
> I did not misunderstand you there, Thomas.

I'm not against cleanups and consolidation, quite the contrary.

But this stuff just adds new infrastructure w/o showing that it's actually
a cleanup and consolidation.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH v2] powerpc/64s/radix: Fix MADV_[FREE|DONTNEED] TLB flush miss problem with THP
From: Nicholas Piggin @ 2018-06-14  6:54 UTC (permalink / raw)
  Cc: linuxppc-dev
In-Reply-To: <201806141310.bQbDyW9s%fengguang.wu@intel.com>

On Thu, 14 Jun 2018 13:51:40 +0800
kbuild test robot <lkp@intel.com> wrote:

> Hi Nicholas,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on powerpc/next]
> [also build test ERROR on next-20180613]
> [cannot apply to v4.17]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-64s-radix-Fix-MADV_-FREE-DONTNEED-TLB-flush-miss-problem-with-THP/20180614-114728
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> config: microblaze-nommu_defconfig (attached as .config)
> compiler: microblaze-linux-gcc (GCC) 8.1.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         GCC_VERSION=8.1.0 make.cross ARCH=microblaze 

Oops, I absent-mindedly edited a header without noticing that it is not
powerpc-specific. Will fix it somehow.

^ permalink raw reply

* Re: [RFC PATCH 3/3] powerpc/64s/radix: optimise TLB flush with precise TLB ranges in mmu_gather
From: Nicholas Piggin @ 2018-06-14  6:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm, ppc-dev, linux-arch, Aneesh Kumar K. V, Minchan Kim,
	Mel Gorman, Nadav Amit, Andrew Morton
In-Reply-To: <CA+55aFwP-6QZ0u2ZYCjTebP6OmkeTpbUHyLT0ih-57TbvJBPxg@mail.gmail.com>

On Thu, 14 Jun 2018 15:15:47 +0900
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Jun 14, 2018 at 11:49 AM Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > +#ifndef pte_free_tlb
> >  #define pte_free_tlb(tlb, ptep, address)                       \
> >         do {                                                    \
> >                 __tlb_adjust_range(tlb, address, PAGE_SIZE);    \
> >                 __pte_free_tlb(tlb, ptep, address);             \
> >         } while (0)
> > +#endif  
> 
> Do you really want to / need to take over the whole pte_free_tlb macro?
> 
> I was hoping that you'd just replace the __tlb_adjust_range() instead.
> 
> Something like
> 
>  - replace the
> 
>         __tlb_adjust_range(tlb, address, PAGE_SIZE);
> 
>    with a "page directory" version:
> 
>         __tlb_free_directory(tlb, address, size);
> 
>  - have the default implementation for that be the old code:
> 
>         #ifndef __tlb_free_directory
>           #define __tlb_free_directory(tlb,addr,size) \
>                 __tlb_adjust_range(tlb, addr, PAGE_SIZE)
>         #endif
> 
> and that way architectures can now just hook into that
> "__tlb_free_directory()" thing.
> 
> Hmm?

Isn't it just easier and less indirection for the arch to just take
over the pte_free_tlb instead? 

I don't see what the __tlb_free_directory gets you except having to
follow another macro -- if the arch has something special they want
to do there, just do it in their __pte_free_tlb and call it
pte_free_tlb instead.

Thanks,
Nick

> 
>              Linus

^ permalink raw reply

* Re: [PATCH kernel 6/6] powerpc/powernv/ioda: Allocate indirect TCE levels on demand
From: Alexey Kardashevskiy @ 2018-06-14  6:35 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, kvm-ppc, kvm, Alex Williamson,
	Benjamin Herrenschmidt
In-Reply-To: <20180612041700.GZ2737@umbus.fritz.box>

On 12/6/18 2:17 pm, David Gibson wrote:
> On Fri, Jun 08, 2018 at 03:46:33PM +1000, Alexey Kardashevskiy wrote:
>> At the moment we allocate the entire TCE table, twice (hardware part and
>> userspace translation cache). This normally works as we normally have
>> contiguous memory and the guest will map entire RAM for 64bit DMA.
>>
>> However if we have sparse RAM (one example is a memory device), then
>> we will allocate TCEs which will never be used as the guest only maps
>> actual memory for DMA. If it is a single level TCE table, there is nothing
>> we can really do but if it is a multilevel table, we can skip allocating
>> TCEs we know we won't need.
>>
>> This adds ability to allocate only first level, saving memory.
>>
>> This changes iommu_table::free() to avoid allocating an extra level;
>> iommu_table::set() will do this when needed.
>>
>> This adds @alloc parameter to iommu_table::exchange() to tell the callback
>> if it can allocate an extra level; the flag is set to "false" for
>> the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns
>> H_TOO_HARD.
>>
>> This still requires the entire table to be counted in mm::locked_vm.
>>
>> To be conservative, this only does on-demand allocation when
>> the userspace cache table is requested which is the case of VFIO.
>>
>> The example math for a system replicating a powernv setup with NVLink2
>> in a guest:
>> 16GB RAM mapped at 0x0
>> 128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000
>>
>> the table to cover that all with 64K pages takes:
>> (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB
>>
>> If we allocate only necessary TCE levels, we will only need:
>> (((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect
>> levels).
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  arch/powerpc/include/asm/iommu.h              |  7 ++-
>>  arch/powerpc/platforms/powernv/pci.h          |  6 ++-
>>  arch/powerpc/kvm/book3s_64_vio_hv.c           |  4 +-
>>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 69 ++++++++++++++++++++-------
>>  arch/powerpc/platforms/powernv/pci-ioda.c     |  8 ++--
>>  drivers/vfio/vfio_iommu_spapr_tce.c           |  2 +-
>>  6 files changed, 69 insertions(+), 27 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index 4bdcf22..daa3ee5 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -70,7 +70,7 @@ struct iommu_table_ops {
>>  			unsigned long *hpa,
>>  			enum dma_data_direction *direction);
>>  
>> -	__be64 *(*useraddrptr)(struct iommu_table *tbl, long index);
>> +	__be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc);
>>  #endif
>>  	void (*clear)(struct iommu_table *tbl,
>>  			long index, long npages);
>> @@ -122,10 +122,13 @@ struct iommu_table {
>>  	__be64 *it_userspace; /* userspace view of the table */
>>  	struct iommu_table_ops *it_ops;
>>  	struct kref    it_kref;
>> +	int it_nid;
>>  };
>>  
>> +#define IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry) \
>> +		((tbl)->it_ops->useraddrptr((tbl), (entry), false))
> 
> Is real mode really the only case where you want to inhibit new
> allocations?  I would have thought some paths would be read-only and
> you wouldn't want to allocate, even in virtual mode.


There are paths where I do not want allocation, but I can figure that out
from the DMA direction flag. For example, when I am cleaning up the table I
do not want any extra allocation to happen there; at the moment it does
happen because I made a mistake, so I'll repost. Other than that, this
@alloc flag is for real mode only.


> 
>>  #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
>> -		((tbl)->it_ops->useraddrptr((tbl), (entry)))
>> +		((tbl)->it_ops->useraddrptr((tbl), (entry), true))
>>  
>>  /* Pure 2^n version of get_order */
>>  static inline __attribute_const__
>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>> index 5e02408..1fa5590 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -267,8 +267,10 @@ extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
>>  		unsigned long attrs);
>>  extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
>>  extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
>> -		unsigned long *hpa, enum dma_data_direction *direction);
>> -extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index);
>> +		unsigned long *hpa, enum dma_data_direction *direction,
>> +		bool alloc);
>> +extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index,
>> +		bool alloc);
>>  extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
>>  
>>  extern long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index db0490c..05b4865 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -200,7 +200,7 @@ static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
>>  {
>>  	struct mm_iommu_table_group_mem_t *mem = NULL;
>>  	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
>> -	__be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	__be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry);
>>  
>>  	if (!pua)
>>  		/* it_userspace allocation might be delayed */
>> @@ -264,7 +264,7 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
>>  {
>>  	long ret;
>>  	unsigned long hpa = 0;
>> -	__be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +	__be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry);
>>  	struct mm_iommu_table_group_mem_t *mem;
>>  
>>  	if (!pua)
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
>> index 36c2eb0..a7debfb 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
>> @@ -48,7 +48,7 @@ static __be64 *pnv_alloc_tce_level(int nid, unsigned int shift)
>>  	return addr;
>>  }
>>  
>> -static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx)
>> +static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx, bool alloc)
>>  {
>>  	__be64 *tmp = user ? tbl->it_userspace : (__be64 *) tbl->it_base;
>>  	int  level = tbl->it_indirect_levels;
>> @@ -57,7 +57,20 @@ static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx)
>>  
>>  	while (level) {
>>  		int n = (idx & mask) >> (level * shift);
>> -		unsigned long tce = be64_to_cpu(tmp[n]);
>> +		unsigned long tce;
>> +
>> +		if (tmp[n] == 0) {
>> +			__be64 *tmp2;
>> +
>> +			if (!alloc)
>> +				return NULL;
>> +
>> +			tmp2 = pnv_alloc_tce_level(tbl->it_nid,
>> +					ilog2(tbl->it_level_size) + 3);
> 
> What if the allocation fails?


Fair question; this needs to be handled with at least H_TOO_HARD in real
mode and H_HARDWARE in virtual mode. I'll fix.
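
For illustration only, here is a rough userspace sketch of an on-demand
multilevel walk that reports "level absent" and "allocation failed"
separately, so that a caller could map them to something like H_TOO_HARD or
H_HARDWARE. It is not the patch's code: all demo_* names, the LOOKUP_*
status values and LEVEL_BITS are hypothetical, and the sketch stores raw
pointers where the real table holds big-endian physical addresses with
permission bits.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define LEVEL_BITS 9				/* assumed: 512 entries per level */
#define LEVEL_SIZE (1UL << LEVEL_BITS)

/* Hypothetical status codes; a caller would translate them to hcall results. */
enum lookup_status { LOOKUP_OK, LOOKUP_ABSENT, LOOKUP_NOMEM };

struct demo_table {
	uint64_t *root;
	int indirect_levels;			/* levels above the leaf level */
};

/* Allocate one zeroed level; returns NULL on failure. */
static uint64_t *demo_alloc_level(void)
{
	return calloc(LEVEL_SIZE, sizeof(uint64_t));
}

/*
 * Walk down to the leaf entry for idx. A missing intermediate level is
 * allocated only if alloc is true; allocation failure is reported to the
 * caller instead of being dereferenced.
 */
static enum lookup_status demo_lookup(struct demo_table *tbl, unsigned long idx,
				      bool alloc, uint64_t **entry)
{
	uint64_t *tmp = tbl->root;
	int level = tbl->indirect_levels;

	*entry = NULL;
	while (level) {
		unsigned long n = (idx >> (level * LEVEL_BITS)) & (LEVEL_SIZE - 1);

		if (tmp[n] == 0) {
			uint64_t *next;

			if (!alloc)
				return LOOKUP_ABSENT;	/* e.g. real mode */

			next = demo_alloc_level();
			if (!next)
				return LOOKUP_NOMEM;	/* the failure case */

			tmp[n] = (uint64_t)(uintptr_t)next;
		}
		tmp = (uint64_t *)(uintptr_t)tmp[n];
		level--;
	}

	*entry = &tmp[idx & (LEVEL_SIZE - 1)];
	return LOOKUP_OK;
}
```

With alloc set to false the walk never allocates, which is the behaviour the
real mode handlers would rely on; with alloc set to true a failed allocation
is surfaced as LOOKUP_NOMEM rather than followed as a pointer.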




-- 
Alexey
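
As a quick check of the table-size arithmetic in the quoted commit message,
the two expressions can be evaluated with a small helper. The function name
tce_table_mb is made up for this note; the 8 bytes per entry and the 64K
page shift of 16 come from the expressions themselves.

```c
#include <stdint.h>

/*
 * Size in MB of a flat table that covers `span` bytes of DMA address space
 * with one 8-byte entry per page of (1 << page_shift) bytes.
 */
static uint64_t tce_table_mb(uint64_t span, unsigned int page_shift)
{
	return ((span >> page_shift) * 8) >> 20;
}
```

Evaluated this way, tce_table_mb(0x244000000000 + 0x2000000000, 16) gives
4656 and tce_table_mb(0x400000000 + 0x400000000, 16) gives 4, not counting
the indirect levels.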

^ permalink raw reply

* Re: [RFC PATCH 3/3] powerpc/64s/radix: optimise TLB flush with precise TLB ranges in mmu_gather
From: Linus Torvalds @ 2018-06-14  6:15 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, ppc-dev, linux-arch, Aneesh Kumar K. V, Minchan Kim,
	Mel Gorman, Nadav Amit, Andrew Morton
In-Reply-To: <20180614124931.703e5b54@roar.ozlabs.ibm.com>

On Thu, Jun 14, 2018 at 11:49 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> +#ifndef pte_free_tlb
>  #define pte_free_tlb(tlb, ptep, address)                       \
>         do {                                                    \
>                 __tlb_adjust_range(tlb, address, PAGE_SIZE);    \
>                 __pte_free_tlb(tlb, ptep, address);             \
>         } while (0)
> +#endif

Do you really want to / need to take over the whole pte_free_tlb macro?

I was hoping that you'd just replace the __tlb_adjust_range() instead.

Something like

 - replace the

        __tlb_adjust_range(tlb, address, PAGE_SIZE);

   with a "page directory" version:

        __tlb_free_directory(tlb, address, size);

 - have the default implementation for that be the old code:

        #ifndef __tlb_free_directory
          #define __tlb_free_directory(tlb,addr,size) \
                __tlb_adjust_range(tlb, addr, PAGE_SIZE)
        #endif

and that way architectures can now just hook into that
"__tlb_free_directory()" thing.

Hmm?

             Linus
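
For readers following along, the hook suggested above could look roughly
like this when mocked up outside the kernel. This is only a sketch: struct
mmu_gather is reduced to two fields, __pte_free_tlb is an empty stand-in,
and the PAGE_SIZE and PMD_SIZE values are placeholders chosen for the
example.

```c
#define PAGE_SIZE 4096UL
#define PMD_SIZE  (1UL << 21)	/* placeholder span of one PTE page */

/* Reduced stand-in for the kernel's struct mmu_gather. */
struct mmu_gather {
	unsigned long start;
	unsigned long end;
};

/* Widen the pending flush range so it covers [address, address + range_size). */
static inline void __tlb_adjust_range(struct mmu_gather *tlb,
				      unsigned long address,
				      unsigned long range_size)
{
	if (address < tlb->start)
		tlb->start = address;
	if (address + range_size > tlb->end)
		tlb->end = address + range_size;
}

/*
 * An architecture that wants different behaviour defines
 * __tlb_free_directory before this point; everyone else gets the old code.
 */
#ifndef __tlb_free_directory
#define __tlb_free_directory(tlb, addr, size) \
	__tlb_adjust_range(tlb, addr, PAGE_SIZE)
#endif

/* Empty stand-in for the arch-specific page table free. */
static inline void __pte_free_tlb(struct mmu_gather *tlb, void *ptep,
				  unsigned long address)
{
	(void)tlb;
	(void)ptep;
	(void)address;
}

/* The generic macro stays in one place and only calls the hook. */
#define pte_free_tlb(tlb, ptep, address)				\
	do {								\
		__tlb_free_directory(tlb, address, PMD_SIZE);		\
		__pte_free_tlb(tlb, ptep, address);			\
	} while (0)
```

With the default in place, pte_free_tlb() still extends the range by
PAGE_SIZE as before; an arch override would see the size argument and could
do something more precise with it.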

^ permalink raw reply

* Re: [PATCH v2] powerpc/64s/radix: Fix MADV_[FREE|DONTNEED] TLB flush miss problem with THP
From: kbuild test robot @ 2018-06-14  5:51 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: kbuild-all, linuxppc-dev, Nicholas Piggin
In-Reply-To: <20180614032256.5440-1-npiggin@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 17691 bytes --]

Hi Nicholas,

I love your patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on next-20180613]
[cannot apply to v4.17]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-64s-radix-Fix-MADV_-FREE-DONTNEED-TLB-flush-miss-problem-with-THP/20180614-114728
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: microblaze-nommu_defconfig (attached as .config)
compiler: microblaze-linux-gcc (GCC) 8.1.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=8.1.0 make.cross ARCH=microblaze 

All errors (new ones prefixed by >>):

   In file included from include/linux/mm.h:478,
                    from arch/microblaze/include/asm/io.h:17,
                    from include/linux/clocksource.h:21,
                    from include/linux/clockchips.h:14,
                    from include/linux/tick.h:8,
                    from include/linux/sched/isolation.h:6,
                    from kernel/sched/sched.h:17,
                    from kernel/sched/loadavg.c:9:
   include/linux/migrate.h: In function 'new_page_nodemask':
>> include/linux/huge_mm.h:82:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'NMI_SHIFT'?
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^~~~~~~~~
   include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
   include/linux/migrate.h:47:11: note: in expansion of macro 'HPAGE_PMD_ORDER'
      order = HPAGE_PMD_ORDER;
              ^~~~~~~~~~~~~~~
   include/linux/huge_mm.h:82:25: note: each undeclared identifier is reported only once for each function it appears in
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^~~~~~~~~
   include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
   include/linux/migrate.h:47:11: note: in expansion of macro 'HPAGE_PMD_ORDER'
      order = HPAGE_PMD_ORDER;
              ^~~~~~~~~~~~~~~
--
   In file included from include/linux/mm.h:478,
                    from arch/microblaze/include/asm/io.h:17,
                    from include/linux/clocksource.h:21,
                    from include/linux/clockchips.h:14,
                    from include/linux/tick.h:8,
                    from include/linux/sched/isolation.h:6,
                    from kernel/sched/sched.h:17,
                    from kernel/sched/core.c:8:
   include/linux/migrate.h: In function 'new_page_nodemask':
>> include/linux/huge_mm.h:82:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'NMI_SHIFT'?
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^~~~~~~~~
   include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
   include/linux/migrate.h:47:11: note: in expansion of macro 'HPAGE_PMD_ORDER'
      order = HPAGE_PMD_ORDER;
              ^~~~~~~~~~~~~~~
   include/linux/huge_mm.h:82:25: note: each undeclared identifier is reported only once for each function it appears in
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^~~~~~~~~
   include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
   include/linux/migrate.h:47:11: note: in expansion of macro 'HPAGE_PMD_ORDER'
      order = HPAGE_PMD_ORDER;
              ^~~~~~~~~~~~~~~
   In file included from kernel/sched/sched.h:63,
                    from kernel/sched/core.c:8:
   kernel/sched/core.c: At top level:
   include/linux/syscalls.h:233:18: warning: 'sys_sched_rr_get_interval' alias between functions of incompatible types 'long int(pid_t,  struct timespec *)' {aka 'long int(int,  struct timespec *)'} and 'long int(long int,  long int)' [-Wattribute-alias]
     asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
                     ^~~
   include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
     __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
     ^~~~~~~~~~~~~~~~~
   include/linux/syscalls.h:212:36: note: in expansion of macro 'SYSCALL_DEFINEx'
    #define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
                                       ^~~~~~~~~~~~~~~
   kernel/sched/core.c:5274:1: note: in expansion of macro 'SYSCALL_DEFINE2'
    SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,
    ^~~~~~~~~~~~~~~
   include/linux/syscalls.h:238:18: note: aliased declaration here
     asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
                     ^~~~~~~~
   include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
     __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
     ^~~~~~~~~~~~~~~~~
   include/linux/syscalls.h:212:36: note: in expansion of macro 'SYSCALL_DEFINEx'
    #define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
                                       ^~~~~~~~~~~~~~~
   kernel/sched/core.c:5274:1: note: in expansion of macro 'SYSCALL_DEFINE2'
    SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,
    ^~~~~~~~~~~~~~~
   include/linux/syscalls.h:233:18: warning: 'sys_sched_getaffinity' alias between functions of incompatible types 'long int(pid_t,  unsigned int,  long unsigned int *)' {aka 'long int(int,  unsigned int,  long unsigned int *)'} and 'long int(long int,  long int,  long int)' [-Wattribute-alias]
     asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
                     ^~~
   include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
     __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
     ^~~~~~~~~~~~~~~~~
   include/linux/syscalls.h:213:36: note: in expansion of macro 'SYSCALL_DEFINEx'
    #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
                                       ^~~~~~~~~~~~~~~
   kernel/sched/core.c:4909:1: note: in expansion of macro 'SYSCALL_DEFINE3'
    SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
    ^~~~~~~~~~~~~~~
   include/linux/syscalls.h:238:18: note: aliased declaration here
     asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
                     ^~~~~~~~
   include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
     __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
     ^~~~~~~~~~~~~~~~~
   include/linux/syscalls.h:213:36: note: in expansion of macro 'SYSCALL_DEFINEx'
    #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
                                       ^~~~~~~~~~~~~~~
   kernel/sched/core.c:4909:1: note: in expansion of macro 'SYSCALL_DEFINE3'
    SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
    ^~~~~~~~~~~~~~~
   include/linux/syscalls.h:233:18: warning: 'sys_sched_setaffinity' alias between functions of incompatible types 'long int(pid_t,  unsigned int,  long unsigned int *)' {aka 'long int(int,  unsigned int,  long unsigned int *)'} and 'long int(long int,  long int,  long int)' [-Wattribute-alias]
     asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
                     ^~~
   include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
     __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
     ^~~~~~~~~~~~~~~~~
   include/linux/syscalls.h:213:36: note: in expansion of macro 'SYSCALL_DEFINEx'
    #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
                                       ^~~~~~~~~~~~~~~
   kernel/sched/core.c:4857:1: note: in expansion of macro 'SYSCALL_DEFINE3'
    SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
    ^~~~~~~~~~~~~~~
   include/linux/syscalls.h:238:18: note: aliased declaration here
     asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
                     ^~~~~~~~
   include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
     __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
     ^~~~~~~~~~~~~~~~~
   include/linux/syscalls.h:213:36: note: in expansion of macro 'SYSCALL_DEFINEx'
    #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
                                       ^~~~~~~~~~~~~~~
   kernel/sched/core.c:4857:1: note: in expansion of macro 'SYSCALL_DEFINE3'
    SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
    ^~~~~~~~~~~~~~~
   include/linux/syscalls.h:233:18: warning: 'sys_sched_getattr' alias between functions of incompatible types 'long int(pid_t,  struct sched_attr *, unsigned int,  unsigned int)' {aka 'long int(int,  struct sched_attr *, unsigned int,  unsigned int)'} and 'long int(long int,  long int,  long int,  long int)' [-Wattribute-alias]
     asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
                     ^~~
   include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
     __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
     ^~~~~~~~~~~~~~~~~
   include/linux/syscalls.h:214:36: note: in expansion of macro 'SYSCALL_DEFINEx'
    #define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
--
   In file included from include/linux/mm.h:478,
                    from arch/microblaze/include/asm/io.h:17,
                    from include/linux/clocksource.h:21,
                    from include/linux/clockchips.h:14,
                    from include/linux/tick.h:8,
                    from include/linux/sched/isolation.h:6,
                    from kernel//sched/sched.h:17,
                    from kernel//sched/core.c:8:
   include/linux/migrate.h: In function 'new_page_nodemask':
>> include/linux/huge_mm.h:82:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'NMI_SHIFT'?
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^~~~~~~~~
   include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
   include/linux/migrate.h:47:11: note: in expansion of macro 'HPAGE_PMD_ORDER'
      order = HPAGE_PMD_ORDER;
              ^~~~~~~~~~~~~~~
   include/linux/huge_mm.h:82:25: note: each undeclared identifier is reported only once for each function it appears in
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^~~~~~~~~
   include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
   include/linux/migrate.h:47:11: note: in expansion of macro 'HPAGE_PMD_ORDER'
      order = HPAGE_PMD_ORDER;
              ^~~~~~~~~~~~~~~
   In file included from kernel//sched/sched.h:63,
                    from kernel//sched/core.c:8:
   kernel//sched/core.c: At top level:
   include/linux/syscalls.h:233:18: warning: 'sys_sched_rr_get_interval' alias between functions of incompatible types 'long int(pid_t,  struct timespec *)' {aka 'long int(int,  struct timespec *)'} and 'long int(long int,  long int)' [-Wattribute-alias]
     asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
                     ^~~
   include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
     __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
     ^~~~~~~~~~~~~~~~~
   include/linux/syscalls.h:212:36: note: in expansion of macro 'SYSCALL_DEFINEx'
    #define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
                                       ^~~~~~~~~~~~~~~
   kernel//sched/core.c:5274:1: note: in expansion of macro 'SYSCALL_DEFINE2'
    SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,
    ^~~~~~~~~~~~~~~
   include/linux/syscalls.h:238:18: note: aliased declaration here
     asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
                     ^~~~~~~~
   include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
     __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
     ^~~~~~~~~~~~~~~~~
   include/linux/syscalls.h:212:36: note: in expansion of macro 'SYSCALL_DEFINEx'
    #define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
                                       ^~~~~~~~~~~~~~~
   kernel//sched/core.c:5274:1: note: in expansion of macro 'SYSCALL_DEFINE2'
    SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,
    ^~~~~~~~~~~~~~~
   include/linux/syscalls.h:233:18: warning: 'sys_sched_getaffinity' alias between functions of incompatible types 'long int(pid_t,  unsigned int,  long unsigned int *)' {aka 'long int(int,  unsigned int,  long unsigned int *)'} and 'long int(long int,  long int,  long int)' [-Wattribute-alias]
     asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
                     ^~~
   include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
     __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
     ^~~~~~~~~~~~~~~~~
   include/linux/syscalls.h:213:36: note: in expansion of macro 'SYSCALL_DEFINEx'
    #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
                                       ^~~~~~~~~~~~~~~
   kernel//sched/core.c:4909:1: note: in expansion of macro 'SYSCALL_DEFINE3'
    SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
    ^~~~~~~~~~~~~~~
   include/linux/syscalls.h:238:18: note: aliased declaration here
     asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
                     ^~~~~~~~
   include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
     __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
     ^~~~~~~~~~~~~~~~~
   include/linux/syscalls.h:213:36: note: in expansion of macro 'SYSCALL_DEFINEx'
    #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
                                       ^~~~~~~~~~~~~~~
   kernel//sched/core.c:4909:1: note: in expansion of macro 'SYSCALL_DEFINE3'
    SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
    ^~~~~~~~~~~~~~~
   include/linux/syscalls.h:233:18: warning: 'sys_sched_setaffinity' alias between functions of incompatible types 'long int(pid_t,  unsigned int,  long unsigned int *)' {aka 'long int(int,  unsigned int,  long unsigned int *)'} and 'long int(long int,  long int,  long int)' [-Wattribute-alias]
     asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
                     ^~~
   include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
     __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
     ^~~~~~~~~~~~~~~~~
   include/linux/syscalls.h:213:36: note: in expansion of macro 'SYSCALL_DEFINEx'
    #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
                                       ^~~~~~~~~~~~~~~
   kernel//sched/core.c:4857:1: note: in expansion of macro 'SYSCALL_DEFINE3'
    SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
    ^~~~~~~~~~~~~~~
   include/linux/syscalls.h:238:18: note: aliased declaration here
     asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
                     ^~~~~~~~
   include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
     __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
     ^~~~~~~~~~~~~~~~~
   include/linux/syscalls.h:213:36: note: in expansion of macro 'SYSCALL_DEFINEx'
    #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
                                       ^~~~~~~~~~~~~~~
   kernel//sched/core.c:4857:1: note: in expansion of macro 'SYSCALL_DEFINE3'
    SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
    ^~~~~~~~~~~~~~~
   include/linux/syscalls.h:233:18: warning: 'sys_sched_getattr' alias between functions of incompatible types 'long int(pid_t,  struct sched_attr *, unsigned int,  unsigned int)' {aka 'long int(int,  struct sched_attr *, unsigned int,  unsigned int)'} and 'long int(long int,  long int,  long int,  long int)' [-Wattribute-alias]
     asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
                     ^~~
   include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
     __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
     ^~~~~~~~~~~~~~~~~
   include/linux/syscalls.h:214:36: note: in expansion of macro 'SYSCALL_DEFINEx'
    #define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)

vim +82 include/linux/huge_mm.h

d8c37c48 Naoya Horiguchi  2012-03-21  81  
fde52796 Aneesh Kumar K.V 2013-06-05 @82  #define HPAGE_PMD_SHIFT PMD_SHIFT
fde52796 Aneesh Kumar K.V 2013-06-05  83  #define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
fde52796 Aneesh Kumar K.V 2013-06-05  84  #define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))
71e3aac0 Andrea Arcangeli 2011-01-13  85  

:::::: The code at line 82 was first introduced by commit
:::::: fde52796d487b675cde55427e3347ff3e59f9a7f mm/THP: don't use HPAGE_SHIFT in transparent hugepage code

:::::: TO: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
:::::: CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 12120 bytes --]

^ permalink raw reply

* Re: [PATCH v2] powerpc/64s/radix: Fix MADV_[FREE|DONTNEED] TLB flush miss problem with THP
From: kbuild test robot @ 2018-06-14  5:43 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: kbuild-all, linuxppc-dev, Nicholas Piggin
In-Reply-To: <20180614032256.5440-1-npiggin@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 12185 bytes --]

Hi Nicholas,

I love your patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on next-20180613]
[cannot apply to v4.17]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-64s-radix-Fix-MADV_-FREE-DONTNEED-TLB-flush-miss-problem-with-THP/20180614-114728
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: arm-allnoconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=arm 

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/mm.h:478:0,
                    from include/linux/memcontrol.h:29,
                    from include/linux/swap.h:9,
                    from mm/compaction.c:12:
   include/linux/migrate.h: In function 'new_page_nodemask':
>> include/linux/huge_mm.h:82:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'PUD_SHIFT'?
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^
>> include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
>> include/linux/migrate.h:47:11: note: in expansion of macro 'HPAGE_PMD_ORDER'
      order = HPAGE_PMD_ORDER;
              ^~~~~~~~~~~~~~~
   include/linux/huge_mm.h:82:25: note: each undeclared identifier is reported only once for each function it appears in
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^
>> include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
>> include/linux/migrate.h:47:11: note: in expansion of macro 'HPAGE_PMD_ORDER'
      order = HPAGE_PMD_ORDER;
              ^~~~~~~~~~~~~~~
--
   In file included from include/linux/mm.h:478:0,
                    from include/linux/dax.h:6,
                    from mm/filemap.c:14:
   mm/filemap.c: In function 'page_cache_free_page':
>> include/linux/huge_mm.h:82:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'PUD_SHIFT'?
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^
>> include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
>> include/linux/huge_mm.h:80:26: note: in expansion of macro 'HPAGE_PMD_ORDER'
    #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
                             ^~~~~~~~~~~~~~~
>> mm/filemap.c:279:22: note: in expansion of macro 'HPAGE_PMD_NR'
      page_ref_sub(page, HPAGE_PMD_NR);
                         ^~~~~~~~~~~~
   include/linux/huge_mm.h:82:25: note: each undeclared identifier is reported only once for each function it appears in
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^
>> include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
>> include/linux/huge_mm.h:80:26: note: in expansion of macro 'HPAGE_PMD_ORDER'
    #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
                             ^~~~~~~~~~~~~~~
>> mm/filemap.c:279:22: note: in expansion of macro 'HPAGE_PMD_NR'
      page_ref_sub(page, HPAGE_PMD_NR);
                         ^~~~~~~~~~~~
   mm/filemap.c: In function 'page_cache_tree_delete_batch':
>> include/linux/huge_mm.h:82:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'PUD_SHIFT'?
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^
>> include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
>> include/linux/huge_mm.h:80:26: note: in expansion of macro 'HPAGE_PMD_ORDER'
    #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
                             ^~~~~~~~~~~~~~~
   mm/filemap.c:351:18: note: in expansion of macro 'HPAGE_PMD_NR'
        tail_pages = HPAGE_PMD_NR - 1;
                     ^~~~~~~~~~~~
--
   In file included from include/linux/mm.h:478:0,
                    from include/linux/dax.h:6,
                    from mm/truncate.c:12:
   mm/truncate.c: In function 'truncate_cleanup_page':
>> include/linux/huge_mm.h:82:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'PUD_SHIFT'?
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^
>> include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
>> include/linux/huge_mm.h:80:26: note: in expansion of macro 'HPAGE_PMD_ORDER'
    #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
                             ^~~~~~~~~~~~~~~
>> mm/truncate.c:182:38: note: in expansion of macro 'HPAGE_PMD_NR'
      pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
                                         ^~~~~~~~~~~~
   include/linux/huge_mm.h:82:25: note: each undeclared identifier is reported only once for each function it appears in
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^
>> include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
>> include/linux/huge_mm.h:80:26: note: in expansion of macro 'HPAGE_PMD_ORDER'
    #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
                             ^~~~~~~~~~~~~~~
>> mm/truncate.c:182:38: note: in expansion of macro 'HPAGE_PMD_NR'
      pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
                                         ^~~~~~~~~~~~
   mm/truncate.c: In function 'invalidate_mapping_pages':
>> include/linux/huge_mm.h:82:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'PUD_SHIFT'?
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^
>> include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
>> include/linux/huge_mm.h:80:26: note: in expansion of macro 'HPAGE_PMD_ORDER'
    #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
                             ^~~~~~~~~~~~~~~
   mm/truncate.c:580:14: note: in expansion of macro 'HPAGE_PMD_NR'
        index += HPAGE_PMD_NR - 1;
                 ^~~~~~~~~~~~
--
   In file included from include/linux/mm.h:478:0,
                    from mm/vmscan.c:17:
   include/linux/migrate.h: In function 'new_page_nodemask':
>> include/linux/huge_mm.h:82:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'PUD_SHIFT'?
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^
>> include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
>> include/linux/migrate.h:47:11: note: in expansion of macro 'HPAGE_PMD_ORDER'
      order = HPAGE_PMD_ORDER;
              ^~~~~~~~~~~~~~~
   include/linux/huge_mm.h:82:25: note: each undeclared identifier is reported only once for each function it appears in
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^
>> include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
>> include/linux/migrate.h:47:11: note: in expansion of macro 'HPAGE_PMD_ORDER'
      order = HPAGE_PMD_ORDER;
              ^~~~~~~~~~~~~~~
   mm/vmscan.c: In function 'is_page_cache_freeable':
>> include/linux/huge_mm.h:82:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'PUD_SHIFT'?
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^
>> include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
>> include/linux/huge_mm.h:80:26: note: in expansion of macro 'HPAGE_PMD_ORDER'
    #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
                             ^~~~~~~~~~~~~~~
>> mm/vmscan.c:579:3: note: in expansion of macro 'HPAGE_PMD_NR'
      HPAGE_PMD_NR : 1;
      ^~~~~~~~~~~~
   mm/vmscan.c: In function '__remove_mapping':
>> include/linux/huge_mm.h:82:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'PUD_SHIFT'?
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^
>> include/linux/huge_mm.h:79:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
>> include/linux/huge_mm.h:80:26: note: in expansion of macro 'HPAGE_PMD_ORDER'
    #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
                             ^~~~~~~~~~~~~~~
   mm/vmscan.c:742:18: note: in expansion of macro 'HPAGE_PMD_NR'
      refcount = 1 + HPAGE_PMD_NR;
                     ^~~~~~~~~~~~
--
   In file included from include/linux/mm.h:478:0,
                    from include/linux/pagemap.h:8,
                    from mm/shmem.c:29:
   mm/shmem.c: In function 'shmem_zero_setup':
>> include/linux/huge_mm.h:82:25: error: 'PMD_SHIFT' undeclared (first use in this function); did you mean 'PUD_SHIFT'?
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^
   include/linux/huge_mm.h:83:34: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
                                     ^~~~~~~~~~~~~~~
>> include/linux/huge_mm.h:84:27: note: in expansion of macro 'HPAGE_PMD_SIZE'
    #define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
                              ^~~~~~~~~~~~~~
>> mm/shmem.c:4322:23: note: in expansion of macro 'HPAGE_PMD_MASK'
       ((vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK) <
                          ^~~~~~~~~~~~~~
   include/linux/huge_mm.h:82:25: note: each undeclared identifier is reported only once for each function it appears in
    #define HPAGE_PMD_SHIFT PMD_SHIFT
                            ^
   include/linux/huge_mm.h:83:34: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
                                     ^~~~~~~~~~~~~~~
>> include/linux/huge_mm.h:84:27: note: in expansion of macro 'HPAGE_PMD_SIZE'
    #define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
                              ^~~~~~~~~~~~~~
>> mm/shmem.c:4322:23: note: in expansion of macro 'HPAGE_PMD_MASK'
       ((vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK) <
                          ^~~~~~~~~~~~~~

vim +82 include/linux/huge_mm.h

5a6e75f8 Kirill A. Shutemov 2016-07-26  78  
d8c37c48 Naoya Horiguchi    2012-03-21 @79  #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
d8c37c48 Naoya Horiguchi    2012-03-21 @80  #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
d8c37c48 Naoya Horiguchi    2012-03-21  81  
fde52796 Aneesh Kumar K.V   2013-06-05 @82  #define HPAGE_PMD_SHIFT PMD_SHIFT
fde52796 Aneesh Kumar K.V   2013-06-05  83  #define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
fde52796 Aneesh Kumar K.V   2013-06-05 @84  #define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))
71e3aac0 Andrea Arcangeli   2011-01-13  85  

:::::: The code at line 82 was first introduced by commit
:::::: fde52796d487b675cde55427e3347ff3e59f9a7f mm/THP: don't use HPAGE_SHIFT in transparent hugepage code

:::::: TO: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
:::::: CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 5439 bytes --]
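
For readers following the macro chain in the log above, here is a minimal standalone sketch of how the HPAGE_PMD_* definitions compose. The EX_ prefix, the helper ex_thp_nr_pages() and the values EX_PAGE_SHIFT=12 and EX_PMD_SHIFT=21 are assumptions chosen for illustration; they are not taken from any particular architecture in this report. Macros are only expanded at the point of use, so the definition order inside the header is harmless, but every use fails to compile if the innermost name (PMD_SHIFT in the log) is never defined for the configuration being built.

```c
#include <stdio.h>

/* Same shape as include/linux/huge_mm.h lines 79-84, renamed with EX_. */
#define EX_HPAGE_PMD_ORDER (EX_HPAGE_PMD_SHIFT - EX_PAGE_SHIFT)
#define EX_HPAGE_PMD_NR    (1 << EX_HPAGE_PMD_ORDER)

#define EX_HPAGE_PMD_SHIFT EX_PMD_SHIFT
#define EX_HPAGE_PMD_SIZE  ((1UL) << EX_HPAGE_PMD_SHIFT)
#define EX_HPAGE_PMD_MASK  (~(EX_HPAGE_PMD_SIZE - 1))

/*
 * Assumed values for illustration only. Defining them after the macros
 * above is fine, because expansion happens where the macros are used.
 * Removing EX_PMD_SHIFT reproduces the class of error in the log.
 */
#define EX_PAGE_SHIFT 12
#define EX_PMD_SHIFT  21

/* Number of base pages covered by one PMD-sized huge page. */
static unsigned long ex_thp_nr_pages(void)
{
	return EX_HPAGE_PMD_NR;
}

/* Print the derived values for the assumed shifts. */
static void ex_print_thp_geometry(void)
{
	printf("order=%d nr=%lu size=%#lx\n",
	       EX_HPAGE_PMD_ORDER, ex_thp_nr_pages(), EX_HPAGE_PMD_SIZE);
}
```

With the assumed shifts, one huge page covers 512 base pages and 2 MiB.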

* Re: [RFC PATCH 22/23] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter
From: Randy Dunlap @ 2018-06-14  3:30 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Ashok Raj, Borislav Petkov, Tony Luck, Ravi V. Shankar, x86,
	sparclinux, linuxppc-dev, linux-kernel, Jacob Pan,
	Rafael J. Wysocki, Don Zickus, Nicholas Piggin, Michael Ellerman,
	Frederic Weisbecker, Alexei Starovoitov, Babu Moger,
	Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Byungchul Park, Paul E. McKenney, Luis R. Rodriguez, Waiman Long,
	Josh Poimboeuf, Davidlohr Bueso, Christoffer Dall, Marc Zyngier,
	Kai-Heng Feng, Konrad Rzeszutek Wilk, David Rientjes, iommu
In-Reply-To: <20180614005839.GA22358@voyager>

On 06/13/2018 05:58 PM, Ricardo Neri wrote:
> On Tue, Jun 12, 2018 at 10:26:57PM -0700, Randy Dunlap wrote:
>> On 06/12/2018 05:57 PM, Ricardo Neri wrote:
>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>> index f2040d4..a8833c7 100644
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -2577,7 +2577,7 @@
>>>  			Format: [state][,regs][,debounce][,die]
>>>  
>>>  	nmi_watchdog=	[KNL,BUGS=X86] Debugging features for SMP kernels
>>> -			Format: [panic,][nopanic,][num]
>>> +			Format: [panic,][nopanic,][num,][hpet]
>>>  			Valid num: 0 or 1
>>>  			0 - turn hardlockup detector in nmi_watchdog off
>>>  			1 - turn hardlockup detector in nmi_watchdog on
>>
>> This says that I can use "nmi_watchdog=hpet" without using 0 or 1.
>> Is that correct?
> 
> Yes, this is what I meant. In my view, if you set nmi_watchdog=hpet it
> implies that you want to activate the NMI watchdog. In this case, perf.
> 
> I can see how this will be ambiguous for the case of perf and arch NMI
> watchdogs.
> 
> Alternatively, a new parameter could be added, such as nmi_watchdog_type. I
> didn't want to add it in this patchset as I think that a single parameter
> can handle the enablement and type of the NMI watchdog.
> 
> What do you think?

I think it's OK like it is.

thanks,
-- 
~Randy
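
To make the semantics discussed above concrete, here is a rough sketch of how a comma-separated nmi_watchdog= value in the proposed format [panic,][nopanic,][num,][hpet] could be interpreted. This is not the kernel's parser: struct ex_nmi_cfg and ex_parse_nmi_watchdog() are hypothetical names, and treating "hpet" as implying "enabled" is an assumption based on the explanation quoted above.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical result of parsing; not a kernel structure. */
struct ex_nmi_cfg {
	bool panic;	/* panic on hard lockup */
	bool enabled;	/* hardlockup detector on/off */
	bool hpet;	/* HPET-based detector requested */
};

/* Return true if the token arg[0..len) equals the string lit. */
static bool ex_tok_is(const char *arg, size_t len, const char *lit)
{
	return len == strlen(lit) && strncmp(arg, lit, len) == 0;
}

/* Walk the comma-separated tokens and update cfg; unknown tokens are ignored. */
static void ex_parse_nmi_watchdog(const char *arg, struct ex_nmi_cfg *cfg)
{
	while (*arg) {
		size_t len = strcspn(arg, ",");

		if (ex_tok_is(arg, len, "panic"))
			cfg->panic = true;
		else if (ex_tok_is(arg, len, "nopanic"))
			cfg->panic = false;
		else if (ex_tok_is(arg, len, "0"))
			cfg->enabled = false;
		else if (ex_tok_is(arg, len, "1"))
			cfg->enabled = true;
		else if (ex_tok_is(arg, len, "hpet")) {
			/* Assumption: asking for hpet implies enabling. */
			cfg->hpet = true;
			cfg->enabled = true;
		}

		arg += len;
		if (*arg == ',')
			arg++;
	}
}
```

Under this reading, "nmi_watchdog=hpet" alone turns the detector on, while a combination such as "0,hpet" stays ambiguous and is resolved here purely by token order.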

* [PATCH v2] powerpc/64s/radix: Fix MADV_[FREE|DONTNEED] TLB flush miss problem with THP
From: Nicholas Piggin @ 2018-06-14  3:22 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

The patch 99baac21e4 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss
problem") added a force flush mode to the mmu_gather flush, which
unconditionally flushes the entire address range being invalidated
(even if actual ptes only covered a smaller range), to solve a problem
with concurrent threads invalidating the same PTEs causing them to
miss TLBs that need flushing.

This does not work on powerpc, which invalidates mmu_gather batches
according to page size. Have powerpc flush all possible page sizes in
the range if it encounters this concurrency condition.

Patch 4647706ebe ("mm: always flush VMA ranges affected by
zap_page_range") does add a TLB flush for all page sizes on powerpc for
the zap_page_range case, but that is to be removed and replaced with
the mmu_gather flush to avoid redundant flushing. It is also thought to
not cover other obscure race conditions:

https://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com

Hash does not have a problem because it invalidates TLBs inside the
page table locks.

Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
Since v1:
- Compile fix for !THP
- Fixed missing PWC flush case
- Fixed concurrent TLB flush test
- Expanded changelog

I think this is a required fix for existing kernels. At the least, to be
safe and to bring the flushing into line with other architectures, I
think we should add this as a fix. For the next kernel release I will
remove the duplicate flush in zap_page_range, so this would definitely
be needed then.

 arch/powerpc/mm/tlb-radix.c | 92 +++++++++++++++++++++++++++++--------
 include/linux/huge_mm.h     | 10 +---
 2 files changed, 76 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 67a6e86d3e7e..919232a59ea1 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -689,22 +689,17 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
 static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
 
-void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
-		     unsigned long end)
+static inline void __radix__flush_tlb_range(struct mm_struct *mm,
+					unsigned long start, unsigned long end,
+					bool flush_all_sizes)
 
 {
-	struct mm_struct *mm = vma->vm_mm;
 	unsigned long pid;
 	unsigned int page_shift = mmu_psize_defs[mmu_virtual_psize].shift;
 	unsigned long page_size = 1UL << page_shift;
 	unsigned long nr_pages = (end - start) >> page_shift;
 	bool local, full;
 
-#ifdef CONFIG_HUGETLB_PAGE
-	if (is_vm_hugetlb_page(vma))
-		return radix__flush_hugetlb_tlb_range(vma, start, end);
-#endif
-
 	pid = mm->context.id;
 	if (unlikely(pid == MMU_NO_CONTEXT))
 		return;
@@ -738,16 +733,27 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 				_tlbie_pid(pid, RIC_FLUSH_TLB);
 		}
 	} else {
-		bool hflush = false;
+		bool hflush = flush_all_sizes;
+		bool gflush = flush_all_sizes;
 		unsigned long hstart, hend;
+		unsigned long gstart, gend;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		hstart = (start + HPAGE_PMD_SIZE - 1) >> HPAGE_PMD_SHIFT;
-		hend = end >> HPAGE_PMD_SHIFT;
-		if (hstart < hend) {
-			hstart <<= HPAGE_PMD_SHIFT;
-			hend <<= HPAGE_PMD_SHIFT;
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
+		if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 			hflush = true;
+
+		if (hflush) {
+			hstart = (start + HPAGE_PMD_SIZE - 1) & HPAGE_PMD_MASK;
+			hend = end & HPAGE_PMD_MASK;
+			if (hstart == hend)
+				hflush = false;
+		}
+
+		if (gflush) {
+			gstart = (start + HPAGE_PUD_SIZE - 1) & HPAGE_PUD_MASK;
+			gend = end & HPAGE_PUD_MASK;
+			if (gstart == gend)
+				gflush = false;
 		}
 #endif
 
@@ -757,18 +763,36 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 			if (hflush)
 				__tlbiel_va_range(hstart, hend, pid,
 						HPAGE_PMD_SIZE, MMU_PAGE_2M);
+			if (gflush)
+				__tlbiel_va_range(gstart, gend, pid,
+						HPAGE_PUD_SIZE, MMU_PAGE_1G);
 			asm volatile("ptesync": : :"memory");
 		} else {
 			__tlbie_va_range(start, end, pid, page_size, mmu_virtual_psize);
 			if (hflush)
 				__tlbie_va_range(hstart, hend, pid,
 						HPAGE_PMD_SIZE, MMU_PAGE_2M);
+			if (gflush)
+				__tlbie_va_range(gstart, gend, pid,
+						HPAGE_PUD_SIZE, MMU_PAGE_1G);
 			fixup_tlbie();
 			asm volatile("eieio; tlbsync; ptesync": : :"memory");
 		}
 	}
 	preempt_enable();
 }
+
+void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
+		     unsigned long end)
+
+{
+#ifdef CONFIG_HUGETLB_PAGE
+	if (is_vm_hugetlb_page(vma))
+		return radix__flush_hugetlb_tlb_range(vma, start, end);
+#endif
+
+	__radix__flush_tlb_range(vma->vm_mm, start, end, false);
+}
 EXPORT_SYMBOL(radix__flush_tlb_range);
 
 static int radix_get_mmu_psize(int page_size)
@@ -837,6 +861,8 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 	int psize = 0;
 	struct mm_struct *mm = tlb->mm;
 	int page_size = tlb->page_size;
+	unsigned long start = tlb->start;
+	unsigned long end = tlb->end;
 
 	/*
 	 * if page size is not something we understand, do a full mm flush
@@ -847,15 +873,45 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 	 */
 	if (tlb->fullmm) {
 		__flush_all_mm(mm, true);
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
+	} else if (mm_tlb_flush_nested(mm)) {
+		/*
+		 * If there is a concurrent invalidation that is clearing ptes,
+		 * then it's possible this invalidation will miss one of those
+		 * cleared ptes and miss flushing the TLB. If this invalidate
+		 * returns before the other one flushes TLBs, that can result
+		 * in it returning while there are still valid TLBs inside the
+		 * range to be invalidated.
+		 *
+		 * See mm/memory.c:tlb_finish_mmu() for more details.
+		 *
+		 * The solution to this is to ensure the entire range is always
+		 * flushed here. The problem for powerpc is that the flushes
+		 * are page size specific, so this "forced flush" would not
+		 * do the right thing if there is a mix of page sizes in
+		 * the range to be invalidated. So use __radix__flush_tlb_range
+		 * which invalidates all possible page sizes in the range.
+		 *
+		 * PWC flush probably is not required because the core code
+		 * shouldn't free page tables in this path, but accounting
+		 * for the possibility makes us a bit more robust.
+		 *
+		 * need_flush_all is an uncommon case because page table
+		 * teardown should be done with exclusive locks held (but
+		 * after locks are dropped another invalidate could come
+		 * in), it could be optimized further if necessary.
+		 */
+		if (!tlb->need_flush_all)
+			__radix__flush_tlb_range(mm, start, end, true);
+		else
+			radix__flush_all_mm(mm);
+#endif
 	} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
 		if (!tlb->need_flush_all)
 			radix__flush_tlb_mm(mm);
 		else
 			radix__flush_all_mm(mm);
 	} else {
-		unsigned long start = tlb->start;
-		unsigned long end = tlb->end;
-
 		if (!tlb->need_flush_all)
 			radix__flush_tlb_range_psize(mm, start, end, psize);
 		else
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a8a126259bc4..f7fe2b20efb3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -79,7 +79,6 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define HPAGE_PMD_SHIFT PMD_SHIFT
 #define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
 #define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))
@@ -88,6 +87,8 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
 #define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+
 extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
 extern unsigned long transparent_hugepage_flags;
@@ -246,13 +247,6 @@ static inline bool thp_migration_supported(void)
 }
 
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
-#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
-#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
-#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
-
-#define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
-#define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
-#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
 
 #define hpage_nr_pages(x) 1
 
-- 
2.17.0
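
As a side note on the hstart/hend and gstart/gend arithmetic in the hunk above, here is a small userspace sketch of rounding a range inward to huge-page boundaries. The names ex_range and ex_huge_subrange() and the 2M page size are assumptions for illustration. Unlike the patch, which tests hstart == hend, this sketch uses start < end as the non-empty test (as the pre-patch code did), which also covers a range lying wholly inside one huge page, where the rounded-up start ends up above the rounded-down end.

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumed huge page geometry for illustration: 2M pages. */
#define EX_HPAGE_SHIFT 21
#define EX_HPAGE_SIZE  (1UL << EX_HPAGE_SHIFT)
#define EX_HPAGE_MASK  (~(EX_HPAGE_SIZE - 1))

struct ex_range {
	unsigned long start;
	unsigned long end;
	bool nonempty;	/* true if at least one whole huge page fits */
};

/*
 * Round start up and end down to huge-page boundaries. Only huge pages
 * that lie entirely within [start, end) can be mapped there, so only
 * that inner sub-range needs a huge-page-sized invalidation.
 */
static struct ex_range ex_huge_subrange(unsigned long start, unsigned long end)
{
	struct ex_range r;

	r.start = (start + EX_HPAGE_SIZE - 1) & EX_HPAGE_MASK;
	r.end = end & EX_HPAGE_MASK;
	r.nonempty = r.start < r.end;
	return r;
}

/* Print the sub-range that would receive a huge-page flush, if any. */
static void ex_show(unsigned long start, unsigned long end)
{
	struct ex_range r = ex_huge_subrange(start, end);

	if (r.nonempty)
		printf("[%#lx, %#lx): huge flush %#lx-%#lx\n",
		       start, end, r.start, r.end);
	else
		printf("[%#lx, %#lx): no whole huge page\n", start, end);
}
```

For example, ex_show(0x1ff000, 0x600000) reports the inner range 0x200000-0x600000, while ex_show(0x100000, 0x180000) reports that no whole huge page fits.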

* Re: [RFC PATCH 3/3] powerpc/64s/radix: optimise TLB flush with precise TLB ranges in mmu_gather
From: Nicholas Piggin @ 2018-06-14  2:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm, ppc-dev, linux-arch, Aneesh Kumar K. V, Minchan Kim,
	Mel Gorman, Nadav Amit, Andrew Morton
In-Reply-To: <CA+55aFzJRknbQD6Mv3OSOvUVozQ4H8ni8jPP7UEEi9wKXmVhQA@mail.gmail.com>

On Tue, 12 Jun 2018 18:10:26 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Jun 12, 2018 at 5:12 PM Nicholas Piggin <npiggin@gmail.com> wrote:
> > >
> > > And in _theory_, maybe you could have just used "invlpg" with a
> > > targeted address instead. In fact, I think a single invlpg invalidates
> > > _all_ caches for the associated MM, but don't quote me on that.  
> 
> Confirmed. The SDM says
> 
>  "INVLPG also invalidates all entries in all paging-structure caches
>   associated with the current PCID, regardless of the linear addresses
>   to which they correspond"

Interesting, so that's very much like powerpc.

> so if x86 wants to do this "separate invalidation for page directory
> entries", then it would want to
> 
>  (a) remove the __tlb_adjust_range() operation entirely from
> pud_free_tlb() and friends

Revised patch below (only the generic part this time, but powerpc
implementation gives the same result as the last patch).

> 
>  (b) instead just have a single field for "invalidate_tlb_caches",
> which could be a boolean, or could just be one of the addresses

Yeah well powerpc hijacks one of the existing bools in the mmu_gather
for exactly that, and sets it when a page table page is to be freed.

> and then the logic would be that IFF no other tlb invalidate is done
> due to an actual page range, then we look at that
> invalidate_tlb_caches field, and do a single INVLPG instead.
> 
> I still am not sure if this would actually make a difference in
> practice, but I guess it does mean that x86 could at least participate
> in some kind of scheme where we have architecture-specific actions for
> those page directory entries.

I think it could. But yes I don't know how much it would help, I think
x86 tlb invalidation is very fast, and I noticed this mostly at exec
time when you probably lose all your TLBs anyway.

> 
> And we could make the default behavior - if no architecture-specific
> tlb page directory invalidation function exists - be the current
> "__tlb_adjust_range()" case. So the default would be to not change
> behavior, and architectures could opt in to something like this.
> 
>             Linus

Yep, is this a bit more to your liking?

---
 include/asm-generic/tlb.h | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index faddde44de8c..fa44321bc8dd 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -262,36 +262,49 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
  * architecture to do its own odd thing, not cause pain for others
  * http://lkml.kernel.org/r/CA+55aFzBggoXtNXQeng5d_mRoDnaMBE5Y+URs+PHR67nUpMtaw@mail.gmail.com
  *
+ * Powerpc (Book3S 64-bit) with the radix MMU has an architected "page
+ * walk cache" that is invalidated with a specific instruction. It uses
+ * need_flush_all to issue this instruction, which is set by its own
+ * __p??_free_tlb functions.
+ *
  * For now w.r.t page table cache, mark the range_size as PAGE_SIZE
  */
 
+#ifndef pte_free_tlb
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
+#endif
 
+#ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)			\
 	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);		\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
 		__pmd_free_tlb(tlb, pmdp, address);		\
 	} while (0)
+#endif
 
 #ifndef __ARCH_HAS_4LEVEL_HACK
+#ifndef pud_free_tlb
 #define pud_free_tlb(tlb, pudp, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
 		__pud_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
+#endif
 
 #ifndef __ARCH_HAS_5LEVEL_HACK
+#ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)			\
 	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);		\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
 		__p4d_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
+#endif
 
 #define tlb_migrate_finish(mm) do {} while (0)
 
-- 
2.17.0
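
The #ifndef guards in the patch above follow the usual "generic default unless the architecture already defined it" pattern. Below is a compressed single-file sketch of that pattern. All ex_ names are hypothetical, and the override body (setting a flag instead of widening the range) only mirrors the description in the new comment about need_flush_all; it is not the powerpc implementation.

```c
#include <stdio.h>

#define EX_PAGE_SIZE 4096UL	/* assumed page size for illustration */

struct ex_gather {
	unsigned long start, end;	/* range that must be flushed */
	int need_flush_all;		/* stand-in for "flush the walk cache" */
};

/* Widen the tracked range to cover [addr, addr + size). */
static void ex_adjust_range(struct ex_gather *tlb, unsigned long addr,
			    unsigned long size)
{
	if (addr < tlb->start)
		tlb->start = addr;
	if (addr + size > tlb->end)
		tlb->end = addr + size;
}

/* "Arch header": opts out of range widening for pmd frees. */
#define ex_pmd_free_tlb(tlb, address)			\
	do {						\
		(void)(address);			\
		(tlb)->need_flush_all = 1;		\
	} while (0)

/* "Generic header": defaults apply only where the arch defined nothing. */
#ifndef ex_pte_free_tlb
#define ex_pte_free_tlb(tlb, address)				\
	do {							\
		ex_adjust_range(tlb, address, EX_PAGE_SIZE);	\
	} while (0)
#endif

#ifndef ex_pmd_free_tlb
#define ex_pmd_free_tlb(tlb, address)				\
	do {							\
		ex_adjust_range(tlb, address, EX_PAGE_SIZE);	\
	} while (0)
#endif

/* Print the gather state after some frees. */
static void ex_dump(const struct ex_gather *tlb)
{
	printf("start=%#lx end=%#lx need_flush_all=%d\n",
	       tlb->start, tlb->end, tlb->need_flush_all);
}
```

Here the pte path keeps the default behaviour and widens the range, while the overridden pmd path leaves the range alone and only records that a page table page was freed.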

* Re: [RFC PATCH 12/23] kernel/watchdog: Introduce a struct for NMI watchdog operations
From: Nicholas Piggin @ 2018-06-14  2:32 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, Ashok Raj, Borislav Petkov, Tony Luck,
	Ravi V. Shankar, x86, sparclinux, linuxppc-dev, linux-kernel,
	Jacob Pan, Don Zickus, Michael Ellerman, Frederic Weisbecker,
	Babu Moger, David S. Miller, Benjamin Herrenschmidt,
	Paul Mackerras, Mathieu Desnoyers, Masami Hiramatsu,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Luis R. Rodriguez, iommu
In-Reply-To: <20180614013117.GC22652@voyager>

On Wed, 13 Jun 2018 18:31:17 -0700
Ricardo Neri <ricardo.neri-calderon@linux.intel.com> wrote:

> On Wed, Jun 13, 2018 at 09:52:25PM +1000, Nicholas Piggin wrote:
> > On Wed, 13 Jun 2018 11:26:49 +0200 (CEST)
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> >   
> > > On Wed, 13 Jun 2018, Peter Zijlstra wrote:  
> > > > On Wed, Jun 13, 2018 at 05:41:41PM +1000, Nicholas Piggin wrote:    
> > > > > On Tue, 12 Jun 2018 17:57:32 -0700
> > > > > Ricardo Neri <ricardo.neri-calderon@linux.intel.com> wrote:
> > > > >     
> > > > > > Instead of exposing individual functions for the operations of the NMI
> > > > > > watchdog, define a common interface that can be used across multiple
> > > > > > implementations.
> > > > > > 
> > > > > > The struct nmi_watchdog_ops is defined for such operations. These initial
> > > > > > definitions include the enable, disable, start, stop, and cleanup
> > > > > > operations.
> > > > > > 
> > > > > > > Only a single NMI watchdog can be used in the system. The operations of
> > > > > > > this NMI watchdog are accessed via the new variable nmi_wd_ops. This
> > > > > > > variable is set to point to the operations of the first NMI watchdog that
> > > > > > > initializes successfully. At this moment, the only available NMI
> > > > > > > watchdog is the perf-based hardlockup detector, but more implementations
> > > > > > > can be added in the future.
> > > > > 
> > > > > Cool, this looks pretty nice at a quick glance. sparc and powerpc at
> > > > > least have their own NMI watchdogs, it would be good to have those
> > > > > converted as well.    
> > > > 
> > > > Yeah, agreed, this looks like half a patch.    
> > > 
> > > > Though I'm not seeing the advantage of it. These kinds of NMI watchdogs are
> > > > low level architecture details, so having yet another 'ops' data structure
> > > > with a gazillion callbacks, checks and indirections does not provide
> > > > value over the currently available weak stubs.
> > 
> > > The other way to go of course is to librify the perf watchdog and make an
> > x86 watchdog that selects between perf and hpet... I also probably
> > prefer that for code such as this, but I wouldn't strongly object to
> > ops struct if I'm not writing the code. It's not that bad is it?  
> 
> My motivation to add the ops was that the hpet and perf watchdog share
> significant portions of code.

Right, a good motivation.

> I could look into creating the library for
> common code and relocate the hpet watchdog into arch/x86 for the hpet-
> specific parts.

If you can investigate that approach, that would be appreciated. I hope
I did not misunderstand you there, Thomas.

Basically you would have perf infrastructure and hpet infrastructure,
and then the x86 watchdog driver will use one or the other of those. The
generic watchdog driver will be just a simple shim that uses the perf
infrastructure. Then hopefully the powerpc driver would require almost
no change.

Thanks,
Nick
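
For anyone skimming the thread, here is a rough sketch of the ops-struct idea under discussion: a table of callbacks per watchdog implementation, with the first implementation that initializes successfully being selected. Only the operation names enable, disable, start, stop and cleanup come from the quoted changelog. The init hook, the ex_ names and the two dummy backends are assumptions made up for this illustration, not the code in the series.

```c
#include <stddef.h>

struct ex_nmi_watchdog_ops {
	int  (*init)(void);	/* hypothetical probe: 0 on success */
	void (*enable)(void);
	void (*disable)(void);
	void (*start)(void);
	void (*stop)(void);
	void (*cleanup)(void);
};

/* Dummy callbacks standing in for real hardware handling. */
static void ex_nop(void) { }
static int ex_hpet_init(void) { return -1; }	/* pretend HPET is absent */
static int ex_perf_init(void) { return 0; }	/* pretend perf is usable */

static const struct ex_nmi_watchdog_ops ex_hpet_ops = {
	.init = ex_hpet_init, .enable = ex_nop, .disable = ex_nop,
	.start = ex_nop, .stop = ex_nop, .cleanup = ex_nop,
};

static const struct ex_nmi_watchdog_ops ex_perf_ops = {
	.init = ex_perf_init, .enable = ex_nop, .disable = ex_nop,
	.start = ex_nop, .stop = ex_nop, .cleanup = ex_nop,
};

/* Return the first candidate whose init succeeds, or NULL if none does. */
static const struct ex_nmi_watchdog_ops *
ex_nmi_watchdog_select(const struct ex_nmi_watchdog_ops *const *cands,
		       size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (cands[i]->init && cands[i]->init() == 0)
			return cands[i];
	return NULL;
}
```

The library alternative discussed above would drop this indirection: the arch driver would call perf or hpet helper functions directly and keep the existing weak stubs as the interface to the core watchdog code.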

* Re: [RFC PATCH 14/23] watchdog/hardlockup: Decouple the hardlockup detector from perf
From: Nicholas Piggin @ 2018-06-14  1:41 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, Ashok Raj, Borislav Petkov, Tony Luck,
	Ravi V. Shankar, x86, sparclinux, linuxppc-dev, linux-kernel,
	Jacob Pan, Don Zickus, Michael Ellerman, Frederic Weisbecker,
	Babu Moger, David S. Miller, Benjamin Herrenschmidt,
	Paul Mackerras, Mathieu Desnoyers, Masami Hiramatsu,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Luis R. Rodriguez, iommu
In-Reply-To: <20180614011901.GA22652@voyager>

On Wed, 13 Jun 2018 18:19:01 -0700
Ricardo Neri <ricardo.neri-calderon@linux.intel.com> wrote:

> On Wed, Jun 13, 2018 at 10:43:24AM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 12, 2018 at 05:57:34PM -0700, Ricardo Neri wrote:  
> > > The current default implementation of the hardlockup detector assumes that
> > > it is implemented using perf events.  
> > 
> > The sparc and powerpc things are very much not using perf.  
> 
> Isn't it true that the current hardlockup detector
> (under kernel/watchdog_hld.c) is based on perf?

arch/powerpc/kernel/watchdog.c is a powerpc implementation that uses
the kernel/watchdog_hld.c framework.

> As far as I understand,
> this hardlockup detector is constructed using perf events for architectures
> that don't provide an NMI watchdog. Perhaps I can be more specific and say
> that this synthetized detector is based on perf.

The perf detector is like that, but we want NMI watchdogs to share
the watchdog_hld code as much as possible even for arch specific NMI
watchdogs, so that kernel and user interfaces and behaviour are
consistent.

Other arch watchdogs like sparc are a little older so they are not
using HLD. You don't have to change those for your series, but it
would be good to bring them into the fold if possible at some time.
IIRC sparc was slightly non-trivial because it has some differences
in sysctl or cmdline APIs that we don't want to break.

But powerpc at least needs to be updated if you change the HLD APIs.

Thanks,
Nick

* Re: [RFC PATCH 12/23] kernel/watchdog: Introduce a struct for NMI watchdog operations
From: Ricardo Neri @ 2018-06-14  1:31 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, Ashok Raj, Borislav Petkov, Tony Luck,
	Ravi V. Shankar, x86, sparclinux, linuxppc-dev, linux-kernel,
	Jacob Pan, Don Zickus, Michael Ellerman, Frederic Weisbecker,
	Babu Moger, David S. Miller, Benjamin Herrenschmidt,
	Paul Mackerras, Mathieu Desnoyers, Masami Hiramatsu,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Luis R. Rodriguez, iommu
In-Reply-To: <20180613215225.2a938abc@roar.ozlabs.ibm.com>

On Wed, Jun 13, 2018 at 09:52:25PM +1000, Nicholas Piggin wrote:
> On Wed, 13 Jun 2018 11:26:49 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Wed, 13 Jun 2018, Peter Zijlstra wrote:
> > > On Wed, Jun 13, 2018 at 05:41:41PM +1000, Nicholas Piggin wrote:  
> > > > On Tue, 12 Jun 2018 17:57:32 -0700
> > > > Ricardo Neri <ricardo.neri-calderon@linux.intel.com> wrote:
> > > >   
> > > > > Instead of exposing individual functions for the operations of the NMI
> > > > > watchdog, define a common interface that can be used across multiple
> > > > > implementations.
> > > > > 
> > > > > The struct nmi_watchdog_ops is defined for such operations. These initial
> > > > > definitions include the enable, disable, start, stop, and cleanup
> > > > > operations.
> > > > > 
> > > > > Only a single NMI watchdog can be used in the system. The operations of
> > > > > this NMI watchdog are accessed via the new variable nmi_wd_ops. This
> > > > > variable is set to point the operations of the first NMI watchdog that
> > > > > initializes successfully. Even though at this moment, the only available
> > > > > NMI watchdog is the perf-based hardlockup detector. More implementations
> > > > > can be added in the future.  
> > > > 
> > > > Cool, this looks pretty nice at a quick glance. sparc and powerpc at
> > > > least have their own NMI watchdogs, it would be good to have those
> > > > converted as well.  
> > > 
> > > Yeah, agreed, this looks like half a patch.  
> > 
> > Though I'm not seeing the advantage of it. That kind of NMI watchdogs are
> > low level architecture details so having yet another 'ops' data structure
> > with a gazillion of callbacks, checks and indirections does not provide
> > value over the currently available weak stubs.
> 
> The other way to go of course is librify the perf watchdog and make an
> x86 watchdog that selects between perf and hpet... I also probably
> prefer that for code such as this, but I wouldn't strongly object to
> ops struct if I'm not writing the code. It's not that bad is it?

My motivation to add the ops was that the hpet and perf watchdog share
significant portions of code. I could look into creating the library for
common code and relocate the hpet watchdog into arch/x86 for the hpet-
specific parts.

Thanks and BR,
Ricardo

* Re: [RFC PATCH 12/23] kernel/watchdog: Introduce a struct for NMI watchdog operations
From: Ricardo Neri @ 2018-06-14  1:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nicholas Piggin, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, Ashok Raj, Borislav Petkov, Tony Luck,
	Ravi V. Shankar, x86, sparclinux, linuxppc-dev, linux-kernel,
	Jacob Pan, Don Zickus, Michael Ellerman, Frederic Weisbecker,
	Babu Moger, David S. Miller, Benjamin Herrenschmidt,
	Paul Mackerras, Mathieu Desnoyers, Masami Hiramatsu,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Luis R. Rodriguez, iommu
In-Reply-To: <20180613084219.GT12258@hirez.programming.kicks-ass.net>

On Wed, Jun 13, 2018 at 10:42:19AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 13, 2018 at 05:41:41PM +1000, Nicholas Piggin wrote:
> > On Tue, 12 Jun 2018 17:57:32 -0700
> > Ricardo Neri <ricardo.neri-calderon@linux.intel.com> wrote:
> > 
> > > Instead of exposing individual functions for the operations of the NMI
> > > watchdog, define a common interface that can be used across multiple
> > > implementations.
> > > 
> > > The struct nmi_watchdog_ops is defined for such operations. These initial
> > > definitions include the enable, disable, start, stop, and cleanup
> > > operations.
> > > 
> > > Only a single NMI watchdog can be used in the system. The operations of
> > > this NMI watchdog are accessed via the new variable nmi_wd_ops. This
> > > variable is set to point the operations of the first NMI watchdog that
> > > initializes successfully. Even though at this moment, the only available
> > > NMI watchdog is the perf-based hardlockup detector. More implementations
> > > can be added in the future.
> > 
> > Cool, this looks pretty nice at a quick glance. sparc and powerpc at
> > least have their own NMI watchdogs, it would be good to have those
> > converted as well.
> 
> Yeah, agreed, this looks like half a patch.

I planned to look into the conversion of sparc and powerpc. I just wanted
to see the reception to these patches before jumping in and doing
potentially useless work. Comments in this thread lean towards continuing
to use the weak stubs.

Thanks and BR,
Ricardo

* Re: [RFC PATCH 14/23] watchdog/hardlockup: Decouple the hardlockup detector from perf
From: Ricardo Neri @ 2018-06-14  1:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Ashok Raj, Borislav Petkov, Tony Luck, Ravi V. Shankar, x86,
	sparclinux, linuxppc-dev, linux-kernel, Jacob Pan, Don Zickus,
	Nicholas Piggin, Michael Ellerman, Frederic Weisbecker,
	Babu Moger, David S. Miller, Benjamin Herrenschmidt,
	Paul Mackerras, Mathieu Desnoyers, Masami Hiramatsu,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Luis R. Rodriguez, iommu
In-Reply-To: <20180613084324.GU12258@hirez.programming.kicks-ass.net>

On Wed, Jun 13, 2018 at 10:43:24AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 12, 2018 at 05:57:34PM -0700, Ricardo Neri wrote:
> > The current default implementation of the hardlockup detector assumes that
> > it is implemented using perf events.
> 
> The sparc and powerpc things are very much not using perf.

Isn't it true that the current hardlockup detector
(under kernel/watchdog_hld.c) is based on perf? As far as I understand,
this hardlockup detector is constructed using perf events for architectures
that don't provide an NMI watchdog. Perhaps I can be more specific and say
that this synthetized detector is based on perf.

On a side note, I saw that powerpc might use a perf-based hardlockup
detector if it has perf events [1].

Please let me know if my understanding is not correct.

Thanks and BR,
Ricardo

[1]. https://elixir.bootlin.com/linux/v4.17/source/arch/powerpc/Kconfig#L218

* Re: [RFC PATCH 16/23] watchdog/hardlockup: Add an HPET-based hardlockup detector
From: Ricardo Neri @ 2018-06-14  1:00 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Ashok Raj, Borislav Petkov, Tony Luck, Ravi V. Shankar, x86,
	sparclinux, linuxppc-dev, linux-kernel, Jacob Pan,
	Rafael J. Wysocki, Don Zickus, Nicholas Piggin, Michael Ellerman,
	Frederic Weisbecker, Alexei Starovoitov, Babu Moger,
	Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Byungchul Park, Paul E. McKenney, Luis R. Rodriguez, Waiman Long,
	Josh Poimboeuf, Davidlohr Bueso, Christoffer Dall, Marc Zyngier,
	Kai-Heng Feng, Konrad Rzeszutek Wilk, David Rientjes, iommu
In-Reply-To: <1e5bc136-4123-328a-2d2e-e6f2faef5bf4@infradead.org>

On Tue, Jun 12, 2018 at 10:23:47PM -0700, Randy Dunlap wrote:
> Hi,

Hi Randy,

> 
> On 06/12/2018 05:57 PM, Ricardo Neri wrote:
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index c40c7b7..6e79833 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -828,6 +828,16 @@ config HARDLOCKUP_DETECTOR_PERF
> >  	bool
> >  	select SOFTLOCKUP_DETECTOR
> >  
> > +config HARDLOCKUP_DETECTOR_HPET
> > +	bool "Use HPET Timer for Hard Lockup Detection"
> > +	select SOFTLOCKUP_DETECTOR
> > +	select HARDLOCKUP_DETECTOR
> > +	depends on HPET_TIMER && HPET
> > +	help
> > +	  Say y to enable a hardlockup detector that is driven by an High-Precision
> > +	  Event Timer. In addition to selecting this option, the command-line
> > +	  parameter nmi_watchdog option. See Documentation/admin-guide/kernel-parameters.rst
> 
> The "In addition ..." thing is a broken (incomplete) sentence.

Oops. I apologize. I missed this; I will fix it in my next version.

Thanks and BR,
Ricardo

* Re: [RFC PATCH 22/23] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter
From: Ricardo Neri @ 2018-06-14  0:58 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Ashok Raj, Borislav Petkov, Tony Luck, Ravi V. Shankar, x86,
	sparclinux, linuxppc-dev, linux-kernel, Jacob Pan,
	Rafael J. Wysocki, Don Zickus, Nicholas Piggin, Michael Ellerman,
	Frederic Weisbecker, Alexei Starovoitov, Babu Moger,
	Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Byungchul Park, Paul E. McKenney, Luis R. Rodriguez, Waiman Long,
	Josh Poimboeuf, Davidlohr Bueso, Christoffer Dall, Marc Zyngier,
	Kai-Heng Feng, Konrad Rzeszutek Wilk, David Rientjes, iommu
In-Reply-To: <c2edf778-79cf-009d-6617-13e54ad8b93b@infradead.org>

On Tue, Jun 12, 2018 at 10:26:57PM -0700, Randy Dunlap wrote:
> On 06/12/2018 05:57 PM, Ricardo Neri wrote:
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index f2040d4..a8833c7 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -2577,7 +2577,7 @@
> >  			Format: [state][,regs][,debounce][,die]
> >  
> >  	nmi_watchdog=	[KNL,BUGS=X86] Debugging features for SMP kernels
> > -			Format: [panic,][nopanic,][num]
> > +			Format: [panic,][nopanic,][num,][hpet]
> >  			Valid num: 0 or 1
> >  			0 - turn hardlockup detector in nmi_watchdog off
> >  			1 - turn hardlockup detector in nmi_watchdog on
> 
> This says that I can use "nmi_watchdog=hpet" without using 0 or 1.
> Is that correct?

Yes, this is what I meant. In my view, if you set nmi_watchdog=hpet it
implies that you want to activate the NMI watchdog. In this case, perf.

I can see how this will be ambiguous for the case of perf and arch NMI
watchdogs.

Alternatively, a new parameter could be added, such as nmi_watchdog_type. I
didn't want to add it in this patchset as I think that a single parameter
can handle the enablement and type of the NMI watchdog.
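
To make the single-parameter idea concrete, here is a minimal user-space
sketch of parsing the proposed [panic,][nopanic,][num,][hpet] format. It
is not the kernel's parser: struct nmi_wd_opts and parse_nmi_watchdog()
are hypothetical names, and treating "hpet" alone as "enable the watchdog"
is the interpretation described above, applied token by token so that a
later token overrides an earlier one.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical result of parsing nmi_watchdog=...; not a kernel struct. */
struct nmi_wd_opts {
	bool panic;	/* panic on hardlockup */
	bool enabled;	/* hardlockup detector on/off */
	bool use_hpet;	/* drive the detector from the HPET */
};

/* True if the first 'len' bytes of 's' are exactly the word 'tok'. */
static bool tok_is(const char *s, size_t len, const char *tok)
{
	return strlen(tok) == len && strncmp(s, tok, len) == 0;
}

/* Walk the comma-separated tokens; unknown tokens are ignored. */
static void parse_nmi_watchdog(const char *arg, struct nmi_wd_opts *o)
{
	while (arg && *arg) {
		size_t len = strcspn(arg, ",");

		if (tok_is(arg, len, "panic"))
			o->panic = true;
		else if (tok_is(arg, len, "nopanic"))
			o->panic = false;
		else if (tok_is(arg, len, "0"))
			o->enabled = false;
		else if (tok_is(arg, len, "1"))
			o->enabled = true;
		else if (tok_is(arg, len, "hpet")) {
			/* Assumption: asking for hpet implies enabling. */
			o->use_hpet = true;
			o->enabled = true;
		}

		arg += len;
		if (*arg == ',')
			arg++;
	}
}
```

With this reading, "nmi_watchdog=hpet" and "nmi_watchdog=1,hpet" give the
same result, while "nmi_watchdog=0,hpet" is exactly the ambiguous case: the
sketch enables the detector because the last token wins.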

What do you think?

Thanks and BR,
Ricardo

* [PATCH v13 24/24] selftests/vm: test correct behavior of pkey-0
From: Ram Pai @ 2018-06-14  0:45 UTC (permalink / raw)
  To: shuahkh, linux-kselftest
  Cc: mpe, linuxppc-dev, linux-mm, x86, linux-arch, mingo, dave.hansen,
	mhocko, bauerman, linuxram, fweimer, msuchanek, aneesh.kumar
In-Reply-To: <1528937115-10132-1-git-send-email-linuxram@us.ibm.com>

Ensure pkey-0 is allocated on start.  Ensure pkey-0 can be attached
dynamically in various modes, without failures.  Ensure pkey-0 can be
freed and allocated.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 tools/testing/selftests/vm/protection_keys.c |   66 +++++++++++++++++++++++++-
 1 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
index cbd87f8..f37b031 100644
--- a/tools/testing/selftests/vm/protection_keys.c
+++ b/tools/testing/selftests/vm/protection_keys.c
@@ -1003,6 +1003,67 @@ void close_test_fds(void)
 	return *ptr;
 }
 
+void test_pkey_alloc_free_attach_pkey0(int *ptr, u16 pkey)
+{
+	int i, err;
+	int max_nr_pkey_allocs;
+	int alloced_pkeys[NR_PKEYS];
+	int nr_alloced = 0;
+	int newpkey;
+	long size;
+
+	assert(pkey_last_malloc_record);
+	size = pkey_last_malloc_record->size;
+	/*
+	 * This is a bit of a hack.  But mprotect() requires
+	 * huge-page-aligned sizes when operating on hugetlbfs.
+	 * So, make sure that we use something that's a multiple
+	 * of a huge page when we can.
+	 */
+	if (size >= HPAGE_SIZE)
+		size = HPAGE_SIZE;
+
+
+	/* allocate every possible key and make sure key-0 never got allocated */
+	max_nr_pkey_allocs = NR_PKEYS;
+	for (i = 0; i < max_nr_pkey_allocs; i++) {
+		int new_pkey = alloc_pkey();
+		assert(new_pkey != 0);
+
+		if (new_pkey < 0)
+			break;
+		alloced_pkeys[nr_alloced++] = new_pkey;
+	}
+	/* free all the allocated keys */
+	for (i = 0; i < nr_alloced; i++) {
+		int free_ret;
+
+		if (!alloced_pkeys[i])
+			continue;
+		free_ret = sys_pkey_free(alloced_pkeys[i]);
+		pkey_assert(!free_ret);
+	}
+
+	/* attach key-0 in various modes */
+	err = sys_mprotect_pkey(ptr, size, PROT_READ, 0);
+	pkey_assert(!err);
+	err = sys_mprotect_pkey(ptr, size, PROT_WRITE, 0);
+	pkey_assert(!err);
+	err = sys_mprotect_pkey(ptr, size, PROT_EXEC, 0);
+	pkey_assert(!err);
+	err = sys_mprotect_pkey(ptr, size, PROT_READ|PROT_WRITE, 0);
+	pkey_assert(!err);
+	err = sys_mprotect_pkey(ptr, size, PROT_READ|PROT_WRITE|PROT_EXEC, 0);
+	pkey_assert(!err);
+
+	/* free key-0 */
+	err = sys_pkey_free(0);
+	pkey_assert(!err);
+
+	newpkey = sys_pkey_alloc(0, 0x0);
+	assert(newpkey == 0);
+}
+
 void test_read_of_write_disabled_region(int *ptr, u16 pkey)
 {
 	int ptr_contents;
@@ -1153,10 +1214,10 @@ void test_kernel_gup_write_to_write_disabled_region(int *ptr, u16 pkey)
 void test_pkey_syscalls_on_non_allocated_pkey(int *ptr, u16 pkey)
 {
 	int err;
-	int i = get_start_key();
+	int i;
 
 	/* Note: 0 is the default pkey, so don't mess with it */
-	for (; i < NR_PKEYS; i++) {
+	for (i=1; i < NR_PKEYS; i++) {
 		if (pkey == i)
 			continue;
 
@@ -1465,6 +1526,7 @@ void test_mprotect_pkey_on_unsupported_cpu(int *ptr, u16 pkey)
 	test_pkey_syscalls_on_non_allocated_pkey,
 	test_pkey_syscalls_bad_args,
 	test_pkey_alloc_exhaust,
+	test_pkey_alloc_free_attach_pkey0,
 };
 
 void run_tests_once(void)
-- 
1.7.1

* [PATCH v13 23/24] selftests/vm: sub-page allocator
From: Ram Pai @ 2018-06-14  0:45 UTC (permalink / raw)
  To: shuahkh, linux-kselftest
  Cc: mpe, linuxppc-dev, linux-mm, x86, linux-arch, mingo, dave.hansen,
	mhocko, bauerman, linuxram, fweimer, msuchanek, aneesh.kumar
In-Reply-To: <1528937115-10132-1-git-send-email-linuxram@us.ibm.com>

Introduce a new allocator that allocates 4k hardware pages to back a
64k Linux page. This allocator is only applicable on powerpc.

cc: Dave Hansen <dave.hansen@intel.com>
cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
---
 tools/testing/selftests/vm/pkey-helpers.h    |    6 ++++++
 tools/testing/selftests/vm/pkey-powerpc.h    |   25 +++++++++++++++++++++++++
 tools/testing/selftests/vm/pkey-x86.h        |    5 +++++
 tools/testing/selftests/vm/protection_keys.c |    1 +
 4 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/tools/testing/selftests/vm/pkey-helpers.h b/tools/testing/selftests/vm/pkey-helpers.h
index 288ccff..a00eee6 100644
--- a/tools/testing/selftests/vm/pkey-helpers.h
+++ b/tools/testing/selftests/vm/pkey-helpers.h
@@ -28,6 +28,9 @@
 extern int dprint_in_signal;
 extern char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
 
+extern int test_nr;
+extern int iteration_nr;
+
 #ifdef __GNUC__
 __attribute__((format(printf, 1, 2)))
 #endif
@@ -78,6 +81,9 @@ static inline void sigsafe_printf(const char *format, ...)
 void expected_pkey_fault(int pkey);
 int sys_pkey_alloc(unsigned long flags, u64 init_val);
 int sys_pkey_free(unsigned long pkey);
+int mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
+		unsigned long pkey);
+void record_pkey_malloc(void *ptr, long size, int prot);
 
 #if defined(__i386__) || defined(__x86_64__) /* arch */
 #include "pkey-x86.h"
diff --git a/tools/testing/selftests/vm/pkey-powerpc.h b/tools/testing/selftests/vm/pkey-powerpc.h
index 957f6f6..af44eed 100644
--- a/tools/testing/selftests/vm/pkey-powerpc.h
+++ b/tools/testing/selftests/vm/pkey-powerpc.h
@@ -98,4 +98,29 @@ void expect_fault_on_read_execonly_key(void *p1, u16 pkey)
 /* 8-bytes of instruction * 16384bytes = 1 page */
 #define __page_o_noops() asm(".rept 16384 ; nop; .endr")
 
+void *malloc_pkey_with_mprotect_subpage(long size, int prot, u16 pkey)
+{
+	void *ptr;
+	int ret;
+
+	dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__,
+			size, prot, pkey);
+	pkey_assert(pkey < NR_PKEYS);
+	ptr = mmap(NULL, size, prot, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+	pkey_assert(ptr != (void *)-1);
+
+	ret = syscall(__NR_subpage_prot, ptr, size, NULL);
+	if (ret) {
+		perror("subpage_perm");
+		return PTR_ERR_ENOTSUP;
+	}
+
+	ret = mprotect_pkey((void *)ptr, PAGE_SIZE, prot, pkey);
+	pkey_assert(!ret);
+	record_pkey_malloc(ptr, size, prot);
+
+	dprintf1("%s() for pkey %d @ %p\n", __func__, pkey, ptr);
+	return ptr;
+}
+
 #endif /* _PKEYS_POWERPC_H */
diff --git a/tools/testing/selftests/vm/pkey-x86.h b/tools/testing/selftests/vm/pkey-x86.h
index 6820c10..322da49 100644
--- a/tools/testing/selftests/vm/pkey-x86.h
+++ b/tools/testing/selftests/vm/pkey-x86.h
@@ -176,4 +176,9 @@ void expect_fault_on_read_execonly_key(void *p1, u16 pkey)
 	expected_pkey_fault(pkey);
 }
 
+void *malloc_pkey_with_mprotect_subpage(long size, int prot, u16 pkey)
+{
+	return PTR_ERR_ENOTSUP;
+}
+
 #endif /* _PKEYS_X86_H */
diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
index b5a9e6c..cbd87f8 100644
--- a/tools/testing/selftests/vm/protection_keys.c
+++ b/tools/testing/selftests/vm/protection_keys.c
@@ -887,6 +887,7 @@ void setup_hugetlbfs(void)
 void *(*pkey_malloc[])(long size, int prot, u16 pkey) = {
 
 	malloc_pkey_with_mprotect,
+	malloc_pkey_with_mprotect_subpage,
 	malloc_pkey_anon_huge,
 	malloc_pkey_hugetlb
 /* can not do direct with the pkey_mprotect() API:
-- 
1.7.1

* [PATCH v13 22/24] selftests/vm: testcases must restore pkey-permissions
From: Ram Pai @ 2018-06-14  0:45 UTC (permalink / raw)
  To: shuahkh, linux-kselftest
  Cc: mpe, linuxppc-dev, linux-mm, x86, linux-arch, mingo, dave.hansen,
	mhocko, bauerman, linuxram, fweimer, msuchanek, aneesh.kumar
In-Reply-To: <1528937115-10132-1-git-send-email-linuxram@us.ibm.com>

Generally the signal handler restores the state of the pkey register
before returning. However there are times when the read/write operation
can legitimately fail without invoking the signal handler.  E.g.: A
sys_read() operation to a write-protected page should be disallowed.  In
such a case the state of the pkey register is not restored to its
original state.  The test case is responsible for restoring the key
register state to its original value.
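
The pattern can be illustrated with a small user-space simulation. This is
a sketch only: fake_pkey_reg, the FAKE_* constants and the helpers below
are hypothetical stand-ins, not the selftest's pkey_write_deny() or
pkey_write_allow(), and the two-bits-per-key layout is assumed purely for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_DISABLE_ACCESS	0x1u
#define FAKE_DISABLE_WRITE	0x2u
#define FAKE_BITS_PER_PKEY	2

/* Simulated pkey register; the real one is a per-thread CPU register. */
static uint32_t fake_pkey_reg;

static void fake_pkey_deny(int pkey, uint32_t bits)
{
	fake_pkey_reg |= bits << (pkey * FAKE_BITS_PER_PKEY);
}

static void fake_pkey_allow(int pkey, uint32_t bits)
{
	fake_pkey_reg &= ~(bits << (pkey * FAKE_BITS_PER_PKEY));
}

/*
 * Simulated kernel write into a page tagged with 'pkey'. It fails with
 * -1, as a refused syscall would, and raises no signal, so no handler
 * runs to put the register back.
 */
static int fake_kernel_write(int pkey)
{
	uint32_t bits = fake_pkey_reg >> (pkey * FAKE_BITS_PER_PKEY);

	return (bits & (FAKE_DISABLE_ACCESS | FAKE_DISABLE_WRITE)) ? -1 : 0;
}

/* The test denies write, expects failure, then restores the key itself. */
static int test_kernel_write_denied(int pkey)
{
	int ret;

	fake_pkey_deny(pkey, FAKE_DISABLE_WRITE);
	ret = fake_kernel_write(pkey);
	fake_pkey_allow(pkey, FAKE_DISABLE_WRITE);	/* explicit restore */
	return ret;
}
```

Without the final fake_pkey_allow() call, the next test to use the same
key would start with write access still denied.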

cc: Dave Hansen <dave.hansen@intel.com>
cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 tools/testing/selftests/vm/protection_keys.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
index caf634e..b5a9e6c 100644
--- a/tools/testing/selftests/vm/protection_keys.c
+++ b/tools/testing/selftests/vm/protection_keys.c
@@ -1011,6 +1011,7 @@ void test_read_of_write_disabled_region(int *ptr, u16 pkey)
 	ptr_contents = read_ptr(ptr);
 	dprintf1("*ptr: %d\n", ptr_contents);
 	dprintf1("\n");
+	pkey_write_allow(pkey);
 }
 void test_read_of_access_disabled_region(int *ptr, u16 pkey)
 {
@@ -1090,6 +1091,7 @@ void test_kernel_write_of_access_disabled_region(int *ptr, u16 pkey)
 	ret = read(test_fd, ptr, 1);
 	dprintf1("read ret: %d\n", ret);
 	pkey_assert(ret);
+	pkey_access_allow(pkey);
 }
 void test_kernel_write_of_write_disabled_region(int *ptr, u16 pkey)
 {
@@ -1102,6 +1104,7 @@ void test_kernel_write_of_write_disabled_region(int *ptr, u16 pkey)
 	if (ret < 0 && (DEBUG_LEVEL > 0))
 		perror("verbose read result (OK for this to be bad)");
 	pkey_assert(ret);
+	pkey_write_allow(pkey);
 }
 
 void test_kernel_gup_of_access_disabled_region(int *ptr, u16 pkey)
@@ -1121,6 +1124,7 @@ void test_kernel_gup_of_access_disabled_region(int *ptr, u16 pkey)
 	vmsplice_ret = vmsplice(pipe_fds[1], &iov, 1, SPLICE_F_GIFT);
 	dprintf1("vmsplice() ret: %d\n", vmsplice_ret);
 	pkey_assert(vmsplice_ret == -1);
+	pkey_access_allow(pkey);
 
 	close(pipe_fds[0]);
 	close(pipe_fds[1]);
@@ -1141,6 +1145,7 @@ void test_kernel_gup_write_to_write_disabled_region(int *ptr, u16 pkey)
 	if (DEBUG_LEVEL > 0)
 		perror("futex");
 	dprintf1("futex() ret: %d\n", futex_ret);
+	pkey_write_allow(pkey);
 }
 
 /* Assumes that all pkeys other than 'pkey' are unallocated */
-- 
1.7.1

* [PATCH v13 21/24] selftests/vm: detect write violation on a mapped access-denied-key page
From: Ram Pai @ 2018-06-14  0:45 UTC (permalink / raw)
  To: shuahkh, linux-kselftest
  Cc: mpe, linuxppc-dev, linux-mm, x86, linux-arch, mingo, dave.hansen,
	mhocko, bauerman, linuxram, fweimer, msuchanek, aneesh.kumar
In-Reply-To: <1528937115-10132-1-git-send-email-linuxram@us.ibm.com>

Detect a write violation on a page with which an access-disabled
key is associated long after the page is mapped.

cc: Dave Hansen <dave.hansen@intel.com>
cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
---
 tools/testing/selftests/vm/protection_keys.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
index f4acd72..caf634e 100644
--- a/tools/testing/selftests/vm/protection_keys.c
+++ b/tools/testing/selftests/vm/protection_keys.c
@@ -1067,6 +1067,18 @@ void test_write_of_access_disabled_region(int *ptr, u16 pkey)
 	*ptr = __LINE__;
 	expected_pkey_fault(pkey);
 }
+
+void test_write_of_access_disabled_region_with_page_already_mapped(int *ptr,
+			u16 pkey)
+{
+	*ptr = __LINE__;
+	dprintf1("disabling access; after accessing the page, "
+		" to PKEY[%02d], doing write\n", pkey);
+	pkey_access_deny(pkey);
+	*ptr = __LINE__;
+	expected_pkey_fault(pkey);
+}
+
 void test_kernel_write_of_access_disabled_region(int *ptr, u16 pkey)
 {
 	int ret;
@@ -1435,6 +1447,7 @@ void test_mprotect_pkey_on_unsupported_cpu(int *ptr, u16 pkey)
 	test_write_of_write_disabled_region,
 	test_write_of_write_disabled_region_with_page_already_mapped,
 	test_write_of_access_disabled_region,
+	test_write_of_access_disabled_region_with_page_already_mapped,
 	test_kernel_write_of_access_disabled_region,
 	test_kernel_write_of_write_disabled_region,
 	test_kernel_gup_of_access_disabled_region,
-- 
1.7.1

* [PATCH v13 20/24] selftests/vm: associate key on a mapped page and detect write violation
From: Ram Pai @ 2018-06-14  0:45 UTC (permalink / raw)
  To: shuahkh, linux-kselftest
  Cc: mpe, linuxppc-dev, linux-mm, x86, linux-arch, mingo, dave.hansen,
	mhocko, bauerman, linuxram, fweimer, msuchanek, aneesh.kumar
In-Reply-To: <1528937115-10132-1-git-send-email-linuxram@us.ibm.com>

Detect a write violation on a page with which a write-disabled
key is associated long after the page is mapped.

cc: Dave Hansen <dave.hansen@intel.com>
cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
---
 tools/testing/selftests/vm/protection_keys.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
index 04d0249..f4acd72 100644
--- a/tools/testing/selftests/vm/protection_keys.c
+++ b/tools/testing/selftests/vm/protection_keys.c
@@ -1042,6 +1042,17 @@ void test_read_of_access_disabled_region_with_page_already_mapped(int *ptr,
 	expected_pkey_fault(pkey);
 }
 
+void test_write_of_write_disabled_region_with_page_already_mapped(int *ptr,
+		u16 pkey)
+{
+	*ptr = __LINE__;
+	dprintf1("disabling write access; after accessing the page, "
+		"to PKEY[%02d], doing write\n", pkey);
+	pkey_write_deny(pkey);
+	*ptr = __LINE__;
+	expected_pkey_fault(pkey);
+}
+
 void test_write_of_write_disabled_region(int *ptr, u16 pkey)
 {
 	dprintf1("disabling write access to PKEY[%02d], doing write\n", pkey);
@@ -1422,6 +1433,7 @@ void test_mprotect_pkey_on_unsupported_cpu(int *ptr, u16 pkey)
 	test_read_of_access_disabled_region,
 	test_read_of_access_disabled_region_with_page_already_mapped,
 	test_write_of_write_disabled_region,
+	test_write_of_write_disabled_region_with_page_already_mapped,
 	test_write_of_access_disabled_region,
 	test_kernel_write_of_access_disabled_region,
 	test_kernel_write_of_write_disabled_region,
-- 
1.7.1

* [PATCH v13 19/24] selftests/vm: associate key on a mapped page and detect access violation
From: Ram Pai @ 2018-06-14  0:45 UTC (permalink / raw)
  To: shuahkh, linux-kselftest
  Cc: mpe, linuxppc-dev, linux-mm, x86, linux-arch, mingo, dave.hansen,
	mhocko, bauerman, linuxram, fweimer, msuchanek, aneesh.kumar
In-Reply-To: <1528937115-10132-1-git-send-email-linuxram@us.ibm.com>

Detect an access violation on a page with which an access-disabled
key is associated long after the page is mapped.

cc: Dave Hansen <dave.hansen@intel.com>
cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
---
 tools/testing/selftests/vm/protection_keys.c |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
index e8ad970..04d0249 100644
--- a/tools/testing/selftests/vm/protection_keys.c
+++ b/tools/testing/selftests/vm/protection_keys.c
@@ -1024,6 +1024,24 @@ void test_read_of_access_disabled_region(int *ptr, u16 pkey)
 	dprintf1("*ptr: %d\n", ptr_contents);
 	expected_pkey_fault(pkey);
 }
+
+void test_read_of_access_disabled_region_with_page_already_mapped(int *ptr,
+		u16 pkey)
+{
+	int ptr_contents;
+
+	dprintf1("disabling access to PKEY[%02d], doing read @ %p\n",
+				pkey, ptr);
+	ptr_contents = read_ptr(ptr);
+	dprintf1("reading ptr before disabling the read : %d\n",
+			ptr_contents);
+	read_pkey_reg();
+	pkey_access_deny(pkey);
+	ptr_contents = read_ptr(ptr);
+	dprintf1("*ptr: %d\n", ptr_contents);
+	expected_pkey_fault(pkey);
+}
+
 void test_write_of_write_disabled_region(int *ptr, u16 pkey)
 {
 	dprintf1("disabling write access to PKEY[%02d], doing write\n", pkey);
@@ -1402,6 +1420,7 @@ void test_mprotect_pkey_on_unsupported_cpu(int *ptr, u16 pkey)
 void (*pkey_tests[])(int *ptr, u16 pkey) = {
 	test_read_of_write_disabled_region,
 	test_read_of_access_disabled_region,
+	test_read_of_access_disabled_region_with_page_already_mapped,
 	test_write_of_write_disabled_region,
 	test_write_of_access_disabled_region,
 	test_kernel_write_of_access_disabled_region,
-- 
1.7.1

* [PATCH v13 18/24] selftests/vm: fix an assertion in test_pkey_alloc_exhaust()
From: Ram Pai @ 2018-06-14  0:45 UTC (permalink / raw)
  To: shuahkh, linux-kselftest
  Cc: mpe, linuxppc-dev, linux-mm, x86, linux-arch, mingo, dave.hansen,
	mhocko, bauerman, linuxram, fweimer, msuchanek, aneesh.kumar
In-Reply-To: <1528937115-10132-1-git-send-email-linuxram@us.ibm.com>

The maximum number of keys that can be allocated has to take into
consideration that some keys are reserved by the architecture for
specific purposes and hence cannot be allocated.

Fix the assertion in test_pkey_alloc_exhaust()
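
As a worked example of the new lower bound, the sketch below computes the
same expression the assertion uses. min_allocatable_pkeys() is a
hypothetical helper, and the numbers in the usage note are illustrative
only, not the actual per-architecture values.

```c
#include <assert.h>

/*
 * Lower bound on the number of keys the test should be able to allocate:
 * everything the hardware has, minus the keys the architecture reserves
 * (the default key 0 among them), minus one that an execute-only mapping
 * may have consumed.
 */
static int min_allocatable_pkeys(int nr_pkeys, int nr_reserved)
{
	return nr_pkeys - nr_reserved - 1;
}
```

For instance, with 32 hardware keys of which 3 are reserved, the test
should be able to allocate at least 32 - 3 - 1 = 28 keys.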

cc: Dave Hansen <dave.hansen@intel.com>
cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 tools/testing/selftests/vm/protection_keys.c |   13 +++++--------
 1 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
index cb81a47..e8ad970 100644
--- a/tools/testing/selftests/vm/protection_keys.c
+++ b/tools/testing/selftests/vm/protection_keys.c
@@ -1175,15 +1175,12 @@ void test_pkey_alloc_exhaust(int *ptr, u16 pkey)
 	pkey_assert(i < NR_PKEYS*2);
 
 	/*
-	 * There are 16 pkeys supported in hardware.  Three are
-	 * allocated by the time we get here:
-	 *   1. The default key (0)
-	 *   2. One possibly consumed by an execute-only mapping.
-	 *   3. One allocated by the test code and passed in via
-	 *      'pkey' to this function.
-	 * Ensure that we can allocate at least another 13 (16-3).
+	 * There are NR_PKEYS pkeys supported in hardware. arch_reserved_keys()
+	 * are reserved. One of which is the default key(0). One can be taken
+	 * up by an execute-only mapping.
+	 * Ensure that we can allocate at least the remaining.
 	 */
-	pkey_assert(i >= NR_PKEYS-3);
+	pkey_assert(i >= (NR_PKEYS-arch_reserved_keys()-1));
 
 	for (i = 0; i < nr_allocated_pkeys; i++) {
 		err = sys_pkey_free(allocated_pkeys[i]);
-- 
1.7.1

* [PATCH v13 17/24] selftests/vm: powerpc implementation to check support for pkey
From: Ram Pai @ 2018-06-14  0:45 UTC (permalink / raw)
  To: shuahkh, linux-kselftest
  Cc: mpe, linuxppc-dev, linux-mm, x86, linux-arch, mingo, dave.hansen,
	mhocko, bauerman, linuxram, fweimer, msuchanek, aneesh.kumar
In-Reply-To: <1528937115-10132-1-git-send-email-linuxram@us.ibm.com>

The pkey subsystem is supported if both the hardware and the kernel support it.
We determine that by checking if allocation of a key succeeds or not.
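
A standalone sketch of the same probe, for readers without the selftest
helpers at hand. It calls the raw pkey_alloc/pkey_free system calls through
syscall(2) instead of the selftest's sys_pkey_alloc() wrapper;
pkeys_supported() is a hypothetical name, and the #if guard is an
assumption to keep the sketch building against older headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Probe for protection-key support by trying to allocate a key.
 * Key 0 is the default key and is already allocated, so a successful
 * allocation returns a value greater than zero. The key is freed
 * again so the probe does not consume one.
 */
static bool pkeys_supported(void)
{
#if defined(SYS_pkey_alloc) && defined(SYS_pkey_free)
	long pkey = syscall(SYS_pkey_alloc, 0UL, 0UL);

	if (pkey > 0) {
		syscall(SYS_pkey_free, pkey);
		return true;
	}
#endif
	return false;
}
```

Because the key is released before returning, the probe can be called
repeatedly and should give the same answer each time.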

cc: Dave Hansen <dave.hansen@intel.com>
cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 tools/testing/selftests/vm/pkey-helpers.h    |    2 ++
 tools/testing/selftests/vm/pkey-powerpc.h    |   14 ++++++++++++--
 tools/testing/selftests/vm/pkey-x86.h        |    8 ++++----
 tools/testing/selftests/vm/protection_keys.c |    9 +++++----
 4 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/vm/pkey-helpers.h b/tools/testing/selftests/vm/pkey-helpers.h
index 321bbbd..288ccff 100644
--- a/tools/testing/selftests/vm/pkey-helpers.h
+++ b/tools/testing/selftests/vm/pkey-helpers.h
@@ -76,6 +76,8 @@ static inline void sigsafe_printf(const char *format, ...)
 
 __attribute__((noinline)) int read_ptr(int *ptr);
 void expected_pkey_fault(int pkey);
+int sys_pkey_alloc(unsigned long flags, u64 init_val);
+int sys_pkey_free(unsigned long pkey);
 
 #if defined(__i386__) || defined(__x86_64__) /* arch */
 #include "pkey-x86.h"
diff --git a/tools/testing/selftests/vm/pkey-powerpc.h b/tools/testing/selftests/vm/pkey-powerpc.h
index ec6f5d7..957f6f6 100644
--- a/tools/testing/selftests/vm/pkey-powerpc.h
+++ b/tools/testing/selftests/vm/pkey-powerpc.h
@@ -62,9 +62,19 @@ static inline void __write_pkey_reg(pkey_reg_t pkey_reg)
 			pkey_reg);
 }
 
-static inline int cpu_has_pku(void)
+static inline bool is_pkey_supported(void)
 {
-	return 1;
+	/*
+	 * There is no simple way to determine this.
+	 * Let's try allocating a key and see if it succeeds.
+	 */
+	int ret = sys_pkey_alloc(0, 0);
+
+	if (ret > 0) {
+		sys_pkey_free(ret);
+		return true;
+	}
+	return false;
 }
 
 static inline int arch_reserved_keys(void)
diff --git a/tools/testing/selftests/vm/pkey-x86.h b/tools/testing/selftests/vm/pkey-x86.h
index 95ee952..6820c10 100644
--- a/tools/testing/selftests/vm/pkey-x86.h
+++ b/tools/testing/selftests/vm/pkey-x86.h
@@ -105,7 +105,7 @@ static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
 #define X86_FEATURE_PKU        (1<<3) /* Protection Keys for Userspace */
 #define X86_FEATURE_OSPKE      (1<<4) /* OS Protection Keys Enable */
 
-static inline int cpu_has_pku(void)
+static inline bool is_pkey_supported(void)
 {
 	unsigned int eax;
 	unsigned int ebx;
@@ -118,13 +118,13 @@ static inline int cpu_has_pku(void)
 
 	if (!(ecx & X86_FEATURE_PKU)) {
 		dprintf2("cpu does not have PKU\n");
-		return 0;
+		return false;
 	}
 	if (!(ecx & X86_FEATURE_OSPKE)) {
 		dprintf2("cpu does not have OSPKE\n");
-		return 0;
+		return false;
 	}
-	return 1;
+	return true;
 }
 
 #define XSTATE_PKEY_BIT	(9)
diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
index ba184ca..cb81a47 100644
--- a/tools/testing/selftests/vm/protection_keys.c
+++ b/tools/testing/selftests/vm/protection_keys.c
@@ -1393,8 +1393,8 @@ void test_mprotect_pkey_on_unsupported_cpu(int *ptr, u16 pkey)
 	int size = PAGE_SIZE;
 	int sret;
 
-	if (cpu_has_pku()) {
-		dprintf1("SKIP: %s: no CPU support\n", __func__);
+	if (is_pkey_supported()) {
+		dprintf1("SKIP: %s: no CPU/kernel support\n", __func__);
 		return;
 	}
 
@@ -1458,12 +1458,13 @@ void run_tests_once(void)
 int main(void)
 {
 	int nr_iterations = 22;
+	int pkey_supported = is_pkey_supported();
 
 	setup_handlers();
 
-	printf("has pkey: %d\n", cpu_has_pku());
+	printf("has pkey: %s\n", pkey_supported ? "Yes" : "No");
 
-	if (!cpu_has_pku()) {
+	if (!pkey_supported) {
 		int size = PAGE_SIZE;
 		int *ptr;
 
-- 
1.7.1

^ permalink raw reply related
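
The allocation-based check described in the patch above can also be tried
outside the selftest harness. The following is a minimal sketch, assuming a
Linux system whose headers define SYS_pkey_alloc and SYS_pkey_free; the name
probe_pkey_support() is hypothetical, and the raw syscall() calls stand in for
the selftest's sys_pkey_alloc()/sys_pkey_free() wrappers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical standalone probe (not the selftest's helper): report
 * whether protection keys are usable by trying to allocate one.
 */
static bool probe_pkey_support(void)
{
#ifdef SYS_pkey_alloc
	/* pkey_alloc(flags = 0, access_rights = 0) */
	long pkey = syscall(SYS_pkey_alloc, 0UL, 0UL);

	/* Same test as the patch: only a positive key counts as success. */
	if (pkey > 0) {
		/* Give the key back so the probe has no lasting effect. */
		long err = syscall(SYS_pkey_free, pkey);

		assert(err == 0);
		return true;
	}
#endif
	/* Syscall unavailable, or hardware/kernel support missing. */
	return false;
}
```

Because the key is freed before returning, the probe can be called repeatedly
and should give the same answer each time on a given system.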
