LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v5 7/7] KVM: PPC: Book3S HV: rework secure mem slot dropping
From: Bharata B Rao @ 2020-07-24  3:03 UTC (permalink / raw)
  To: Ram Pai
  Cc: ldufour, cclaudio, kvm-ppc, sathnaga, aneesh.kumar, sukadev,
	linuxppc-dev, bauerman, david
In-Reply-To: <1595534844-16188-8-git-send-email-linuxram@us.ibm.com>

On Thu, Jul 23, 2020 at 01:07:24PM -0700, Ram Pai wrote:
> From: Laurent Dufour <ldufour@linux.ibm.com>
> 
> When a secure memslot is dropped, all the pages backed in the secure
> device (aka really backed by secure memory by the Ultravisor)
> should be paged out to a normal page. Previously, this was
> achieved by triggering the page fault mechanism which is calling
> kvmppc_svm_page_out() on each pages.
> 
> This can't work when hot unplugging a memory slot because the memory
> slot is flagged as invalid and gfn_to_pfn() is then not trying to access
> the page, so the page fault mechanism is not triggered.
> 
> Since the final goal is to make a call to kvmppc_svm_page_out() it seems
> simpler to call directly instead of triggering such a mechanism. This
> way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
> memslot.
> 
> Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
> the call to __kvmppc_svm_page_out() is made.  As
> __kvmppc_svm_page_out needs the vma pointer to migrate the pages,
> the VMA is fetched in a lazy way, to not trigger find_vma() all
> the time. In addition, the mmap_sem is held in read mode during
> that time, not in write mode since the virual memory layout is not
> impacted, and kvm->arch.uvmem_lock prevents concurrent operation
> on the secure device.
> 
> Cc: Ram Pai <linuxram@us.ibm.com>
> Cc: Bharata B Rao <bharata@linux.ibm.com>
> Cc: Paul Mackerras <paulus@ozlabs.org>
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> 	[modified the changelog description]
> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c | 54 ++++++++++++++++++++++++++------------
>  1 file changed, 37 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index c772e92..daffa6e 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -632,35 +632,55 @@ static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
>   * fault on them, do fault time migration to replace the device PTEs in
>   * QEMU page table with normal PTEs from newly allocated pages.
>   */
> -void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
> +void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
>  			     struct kvm *kvm, bool skip_page_out)
>  {
>  	int i;
>  	struct kvmppc_uvmem_page_pvt *pvt;
> -	unsigned long pfn, uvmem_pfn;
> -	unsigned long gfn = free->base_gfn;
> +	struct page *uvmem_page;
> +	struct vm_area_struct *vma = NULL;
> +	unsigned long uvmem_pfn, gfn;
> +	unsigned long addr, end;
> +
> +	mmap_read_lock(kvm->mm);
> +
> +	addr = slot->userspace_addr;
> +	end = addr + (slot->npages * PAGE_SIZE);
>  
> -	for (i = free->npages; i; --i, ++gfn) {
> -		struct page *uvmem_page;
> +	gfn = slot->base_gfn;
> +	for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
> +
> +		/* Fetch the VMA if addr is not in the latest fetched one */
> +		if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
> +			vma = find_vma_intersection(kvm->mm, addr, end);
> +			if (!vma ||
> +			    vma->vm_start > addr || vma->vm_end < end) {
> +				pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
> +				break;
> +			}

There is a potential issue with the boundary condition check here
which I discussed with Laurent yesterday. Guess he hasn't gotten around
to look at it yet.

Regards,
Bharata.

^ permalink raw reply

* Re: [PATCH 1/2] lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state
From: kernel test robot @ 2020-07-24  3:15 UTC (permalink / raw)
  To: Nicholas Piggin, linux-kernel
  Cc: linux-arch, kbuild-all, Peter Zijlstra, Will Deacon,
	Nicholas Piggin, Alexey Kardashevskiy, Ingo Molnar, linuxppc-dev
In-Reply-To: <20200723105615.1268126-1-npiggin@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 6352 bytes --]

Hi Nicholas,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on powerpc/next linus/master v5.8-rc6 next-20200723]
[cannot apply to tip/locking/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Nicholas-Piggin/lockdep-improve-current-hard-soft-irqs_enabled-synchronisation-with-actual-irq-state/20200723-185938
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 9ebcfadb0610322ac537dd7aa5d9cbc2b2894c68
config: i386-randconfig-s001-20200723 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-14) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.2-93-g4c6cbe55-dirty
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

   kernel/locking/spinlock.c:149:17: sparse: sparse: context imbalance in '_raw_spin_lock' - wrong count at exit
   kernel/locking/spinlock.c: note: in included file (through include/linux/preempt.h):
>> arch/x86/include/asm/preempt.h:79:9: sparse: sparse: context imbalance in '_raw_spin_lock_irqsave' - wrong count at exit
>> arch/x86/include/asm/preempt.h:79:9: sparse: sparse: context imbalance in '_raw_spin_lock_irq' - wrong count at exit
   kernel/locking/spinlock.c:173:17: sparse: sparse: context imbalance in '_raw_spin_lock_bh' - wrong count at exit
   kernel/locking/spinlock.c:181:17: sparse: sparse: context imbalance in '_raw_spin_unlock' - unexpected unlock
   kernel/locking/spinlock.c:189:17: sparse: sparse: context imbalance in '_raw_spin_unlock_irqrestore' - unexpected unlock
   kernel/locking/spinlock.c:197:17: sparse: sparse: context imbalance in '_raw_spin_unlock_irq' - unexpected unlock
   kernel/locking/spinlock.c:205:17: sparse: sparse: context imbalance in '_raw_spin_unlock_bh' - unexpected unlock
   kernel/locking/spinlock.c:221:17: sparse: sparse: context imbalance in '_raw_read_lock' - wrong count at exit
>> arch/x86/include/asm/preempt.h:79:9: sparse: sparse: context imbalance in '_raw_read_lock_irqsave' - wrong count at exit
>> arch/x86/include/asm/preempt.h:79:9: sparse: sparse: context imbalance in '_raw_read_lock_irq' - wrong count at exit
   kernel/locking/spinlock.c:245:17: sparse: sparse: context imbalance in '_raw_read_lock_bh' - wrong count at exit
   kernel/locking/spinlock.c:253:17: sparse: sparse: context imbalance in '_raw_read_unlock' - unexpected unlock
   kernel/locking/spinlock.c:261:17: sparse: sparse: context imbalance in '_raw_read_unlock_irqrestore' - unexpected unlock
   kernel/locking/spinlock.c:269:17: sparse: sparse: context imbalance in '_raw_read_unlock_irq' - unexpected unlock
   kernel/locking/spinlock.c:277:17: sparse: sparse: context imbalance in '_raw_read_unlock_bh' - unexpected unlock
   kernel/locking/spinlock.c:293:17: sparse: sparse: context imbalance in '_raw_write_lock' - wrong count at exit
>> arch/x86/include/asm/preempt.h:79:9: sparse: sparse: context imbalance in '_raw_write_lock_irqsave' - wrong count at exit
>> arch/x86/include/asm/preempt.h:79:9: sparse: sparse: context imbalance in '_raw_write_lock_irq' - wrong count at exit
   kernel/locking/spinlock.c:317:17: sparse: sparse: context imbalance in '_raw_write_lock_bh' - wrong count at exit
   kernel/locking/spinlock.c:325:17: sparse: sparse: context imbalance in '_raw_write_unlock' - unexpected unlock
   kernel/locking/spinlock.c:333:17: sparse: sparse: context imbalance in '_raw_write_unlock_irqrestore' - unexpected unlock
   kernel/locking/spinlock.c:341:17: sparse: sparse: context imbalance in '_raw_write_unlock_irq' - unexpected unlock
   kernel/locking/spinlock.c:349:17: sparse: sparse: context imbalance in '_raw_write_unlock_bh' - unexpected unlock
   kernel/locking/spinlock.c:358:17: sparse: sparse: context imbalance in '_raw_spin_lock_nested' - wrong count at exit
>> arch/x86/include/asm/preempt.h:79:9: sparse: sparse: context imbalance in '_raw_spin_lock_irqsave_nested' - wrong count at exit
   kernel/locking/spinlock.c:380:17: sparse: sparse: context imbalance in '_raw_spin_lock_nest_lock' - wrong count at exit
--
   kernel/trace/ring_buffer.c:699:32: sparse: sparse: incorrect type in return expression (different base types) @@     expected restricted __poll_t @@     got int @@
   kernel/trace/ring_buffer.c:699:32: sparse:     expected restricted __poll_t
   kernel/trace/ring_buffer.c:699:32: sparse:     got int
   kernel/trace/ring_buffer.c: note: in included file (through include/linux/irqflags.h, arch/x86/include/asm/special_insns.h, arch/x86/include/asm/processor.h, ...):
>> arch/x86/include/asm/irqflags.h:162:28: sparse: sparse: context imbalance in 'ring_buffer_peek' - different lock contexts for basic block
>> arch/x86/include/asm/irqflags.h:162:28: sparse: sparse: context imbalance in 'ring_buffer_consume' - different lock contexts for basic block
>> arch/x86/include/asm/irqflags.h:162:28: sparse: sparse: context imbalance in 'ring_buffer_empty' - different lock contexts for basic block
>> arch/x86/include/asm/irqflags.h:162:28: sparse: sparse: context imbalance in 'ring_buffer_empty_cpu' - different lock contexts for basic block

vim +/_raw_spin_lock_irqsave +79 arch/x86/include/asm/preempt.h

c2daa3bed53a811 Peter Zijlstra    2013-08-14  72  
c2daa3bed53a811 Peter Zijlstra    2013-08-14  73  /*
c2daa3bed53a811 Peter Zijlstra    2013-08-14  74   * The various preempt_count add/sub methods
c2daa3bed53a811 Peter Zijlstra    2013-08-14  75   */
c2daa3bed53a811 Peter Zijlstra    2013-08-14  76  
c2daa3bed53a811 Peter Zijlstra    2013-08-14  77  static __always_inline void __preempt_count_add(int val)
c2daa3bed53a811 Peter Zijlstra    2013-08-14  78  {
b3ca1c10d7b32fd Christoph Lameter 2014-04-07 @79  	raw_cpu_add_4(__preempt_count, val);
c2daa3bed53a811 Peter Zijlstra    2013-08-14  80  }
c2daa3bed53a811 Peter Zijlstra    2013-08-14  81  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 35341 bytes --]

^ permalink raw reply

* Re: [PATCH 15/15] powerpc/powernv/sriov: Make single PE mode a per-BAR setting
From: Oliver O'Halloran @ 2020-07-24  3:40 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <fdd88062-0b62-fc6b-4de7-a4e099768cd9@ozlabs.ru>

On Wed, Jul 22, 2020 at 8:06 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> >> Well, realistically the segment size should be 8MB to make this matter
> >> (or the whole window 2GB) which does not seem to happen so it does not
> >> matter.
> >
> > I'm not sure what you mean.
>
> I mean how can we possibly hit this case, what m64_segsize would the
> platform have to trigger this. The whole check seems useless but whatever.

Yeah maybe.

IIRC some old P8 FSP systems had tiny M64 windows so it might have
been an issue there. Maybe we can get rid of it., but I'd rather just
leave the behaviour as-is for now.

^ permalink raw reply

* Re: [PATCH 1/2] lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state
From: Alexey Kardashevskiy @ 2020-07-24  4:16 UTC (permalink / raw)
  To: Nicholas Piggin, Peter Zijlstra
  Cc: linux-arch, Will Deacon, Ingo Molnar, linuxppc-dev, linux-kernel
In-Reply-To: <1595506730.3mvrxktem5.astroid@bobo.none>



On 23/07/2020 23:11, Nicholas Piggin wrote:
> Excerpts from Peter Zijlstra's message of July 23, 2020 9:40 pm:
>> On Thu, Jul 23, 2020 at 08:56:14PM +1000, Nicholas Piggin wrote:
>>
>>> diff --git a/arch/powerpc/include/asm/hw_irq.h b/arch/powerpc/include/asm/hw_irq.h
>>> index 3a0db7b0b46e..35060be09073 100644
>>> --- a/arch/powerpc/include/asm/hw_irq.h
>>> +++ b/arch/powerpc/include/asm/hw_irq.h
>>> @@ -200,17 +200,14 @@ static inline bool arch_irqs_disabled(void)
>>>  #define powerpc_local_irq_pmu_save(flags)			\
>>>  	 do {							\
>>>  		raw_local_irq_pmu_save(flags);			\
>>> -		trace_hardirqs_off();				\
>>> +		if (!raw_irqs_disabled_flags(flags))		\
>>> +			trace_hardirqs_off();			\
>>>  	} while(0)
>>>  #define powerpc_local_irq_pmu_restore(flags)			\
>>>  	do {							\
>>> -		if (raw_irqs_disabled_flags(flags)) {		\
>>> -			raw_local_irq_pmu_restore(flags);	\
>>> -			trace_hardirqs_off();			\
>>> -		} else {					\
>>> +		if (!raw_irqs_disabled_flags(flags))		\
>>>  			trace_hardirqs_on();			\
>>> -			raw_local_irq_pmu_restore(flags);	\
>>> -		}						\
>>> +		raw_local_irq_pmu_restore(flags);		\
>>>  	} while(0)
>>
>> You shouldn't be calling lockdep from NMI context!
> 
> After this patch it doesn't.
> 
> trace_hardirqs_on/off implementation appears to expect to be called in NMI 
> context though, for some reason.
> 
>> That is, I recently
>> added suport for that on x86:
>>
>>   https://lkml.kernel.org/r/20200623083721.155449112@infradead.org
>>   https://lkml.kernel.org/r/20200623083721.216740948@infradead.org
>>
>> But you need to be very careful on how you order things, as you can see
>> the above relies on preempt_count() already having been incremented with
>> NMI_MASK.
> 
> Hmm. My patch seems simpler.

And your patches fix my error while Peter's do not:


IRQs not enabled as expected
WARNING: CPU: 0 PID: 1377 at /home/aik/p/kernel/kernel/softirq.c:169
__local_bh_enable_ip+0x118/0x190


> 
> I don't know this stuff very well, I don't really understand what your patch 
> enables for x86 but at least it shouldn't be incompatible with this one 
> AFAIKS.
> 
> Thanks,
> Nick
> 

-- 
Alexey

^ permalink raw reply

* Re: [PATCH v5 4/7] KVM: PPC: Book3S HV: in H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs.
From: Bharata B Rao @ 2020-07-24  4:27 UTC (permalink / raw)
  To: Ram Pai
  Cc: ldufour, cclaudio, kvm-ppc, sathnaga, aneesh.kumar, sukadev,
	linuxppc-dev, bauerman, david
In-Reply-To: <1595534844-16188-5-git-send-email-linuxram@us.ibm.com>

On Thu, Jul 23, 2020 at 01:07:21PM -0700, Ram Pai wrote:
> The Ultravisor is expected to explicitly call H_SVM_PAGE_IN for all the
> pages of the SVM before calling H_SVM_INIT_DONE. This causes a huge
> delay in tranistioning the VM to SVM. The Ultravisor is only interested
> in the pages that contain the kernel, initrd and other important data
> structures. The rest contain throw-away content.
> 
> However if not all pages are requested by the Ultravisor, the Hypervisor
> continues to consider the GFNs corresponding to the non-requested pages
> as normal GFNs. This can lead to data-corruption and undefined behavior.
> 
> In H_SVM_INIT_DONE handler, move all the PFNs associated with the SVM's
> GFNs to secure-PFNs. Skip the GFNs that are already Paged-in or Shared
> or Paged-in followed by a Paged-out.
> 
> Cc: Paul Mackerras <paulus@ozlabs.org>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Bharata B Rao <bharata@linux.ibm.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> Cc: Laurent Dufour <ldufour@linux.ibm.com>
> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Claudio Carvalho <cclaudio@linux.ibm.com>
> Cc: kvm-ppc@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
>  Documentation/powerpc/ultravisor.rst        |   2 +
>  arch/powerpc/include/asm/kvm_book3s_uvmem.h |   2 +
>  arch/powerpc/kvm/book3s_hv_uvmem.c          | 136 +++++++++++++++++++++++++---
>  3 files changed, 127 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/powerpc/ultravisor.rst b/Documentation/powerpc/ultravisor.rst
> index a1c8c37..ba6b1bf 100644
> --- a/Documentation/powerpc/ultravisor.rst
> +++ b/Documentation/powerpc/ultravisor.rst
> @@ -934,6 +934,8 @@ Return values
>  	* H_UNSUPPORTED		if called from the wrong context (e.g.
>  				from an SVM or before an H_SVM_INIT_START
>  				hypercall).
> +	* H_STATE		if the hypervisor could not successfully
> +                                transition the VM to Secure VM.
>  
>  Description
>  ~~~~~~~~~~~
> diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> index 9cb7d8b..f229ab5 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> @@ -23,6 +23,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
>  unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm);
>  void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
>  			     struct kvm *kvm, bool skip_page_out);
> +int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
> +			const struct kvm_memory_slot *memslot);

I still don't see why this be a global function. You should be able
to move around a few functions in book3s_hv_uvmem.c up/down and
satisfy the calling order dependencies.

Otherwise, Reviewed-by: Bharata B Rao <bharata@linux.ibm.com>

^ permalink raw reply

* [powerpc:fixes-test] BUILD SUCCESS 590ce02bd148cd35721560c140e3759e39a6e56a
From: kernel test robot @ 2020-07-24  4:54 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  fixes-test
branch HEAD: 590ce02bd148cd35721560c140e3759e39a6e56a  powerpc/64s: Fix irq tracing corruption in interrupt/syscall return caused by perf interrupts

elapsed time: 905m

configs tested: 74
configs skipped: 1

The following configs have been built successfully.
More configs may be tested in the coming days.

arm                                 defconfig
arm                              allyesconfig
arm                              allmodconfig
arm                               allnoconfig
arm64                            allyesconfig
arm64                               defconfig
arm64                            allmodconfig
arm64                             allnoconfig
i386                             allyesconfig
i386                                defconfig
i386                              debian-10.3
i386                              allnoconfig
ia64                             allmodconfig
ia64                                defconfig
ia64                              allnoconfig
ia64                             allyesconfig
m68k                             allmodconfig
m68k                              allnoconfig
m68k                           sun3_defconfig
m68k                                defconfig
m68k                             allyesconfig
nds32                               defconfig
nds32                             allnoconfig
csky                             allyesconfig
csky                                defconfig
alpha                               defconfig
alpha                            allyesconfig
xtensa                           allyesconfig
h8300                            allyesconfig
h8300                            allmodconfig
xtensa                              defconfig
arc                                 defconfig
arc                              allyesconfig
sh                               allmodconfig
sh                                allnoconfig
microblaze                        allnoconfig
nios2                               defconfig
nios2                            allyesconfig
openrisc                            defconfig
c6x                              allyesconfig
c6x                               allnoconfig
openrisc                         allyesconfig
mips                             allyesconfig
mips                              allnoconfig
mips                             allmodconfig
parisc                            allnoconfig
parisc                              defconfig
parisc                           allyesconfig
parisc                           allmodconfig
powerpc                             defconfig
powerpc                          allyesconfig
powerpc                          rhel-kconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
riscv                            allyesconfig
riscv                             allnoconfig
riscv                               defconfig
riscv                            allmodconfig
s390                             allyesconfig
s390                              allnoconfig
s390                             allmodconfig
s390                                defconfig
sparc                            allyesconfig
sparc                               defconfig
sparc64                             defconfig
sparc64                           allnoconfig
sparc64                          allyesconfig
sparc64                          allmodconfig
x86_64                    rhel-7.6-kselftests
x86_64                               rhel-8.3
x86_64                                  kexec
x86_64                                   rhel
x86_64                                    lkp
x86_64                              fedora-25

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* Re: [PATCH v2 14/14] powerpc/eeh: Move PE tree setup into the platform
From: Alexey Kardashevskiy @ 2020-07-24  5:01 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev
In-Reply-To: <20200722042628.1425880-14-oohall@gmail.com>



On 22/07/2020 14:26, Oliver O'Halloran wrote:
> The EEH core has a concept of a "PE tree" to support PowerNV. The PE tree
> follows the PCI bus structures because a reset asserted on an upstream
> bridge will be propagated to the downstream bridges. On pseries there's a
> 1-1 correspondence between what the guest sees are a PHB and a PE so the
> "tree" is really just a single node.
> 
> Current the EEH core is reponsible for setting up this PE tree which it
> does by traversing the pci_dn tree. The structure of the pci_dn tree
> matches the bus tree on PowerNV which leads to the PE tree being "correct"
> this setup method doesn't make a whole lot of sense and it's actively
> confusing for the pseries case where it doesn't really do anything.
> 
> We want to remove the dependence on pci_dn anyway so this patch move
> choosing where to insert a new PE into the platform code rather than
> being part of the generic EEH code. For PowerNV this simplifies the
> tree building logic and removes the use of pci_dn. For pseries we
> keep the existing logic. I'm not really convinced it does anything
> due to the 1-1 PE-to-PHB correspondence so every device under that
> PHB should be in the same PE, but I'd rather not remove it entirely
> until we've had a chance to look at it more deeply.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> v2: Reworked pseries PE setup slightly. NOT DONE YET. mostly done needs test

So far it was looking good and now this :)

When is it going to be done? Is this the broken stuff you mentioned
elsewhere?


-- 
Alexey

^ permalink raw reply

* Re: [PATCH v2 01/14] powerpc/eeh: Remove eeh_dev_phb_init_dynamic()
From: Alexey Kardashevskiy @ 2020-07-24  5:05 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev
In-Reply-To: <20200722042628.1425880-1-oohall@gmail.com>



On 22/07/2020 14:26, Oliver O'Halloran wrote:
> This function is a one line wrapper around eeh_phb_pe_create() and despite
> the name it doesn't create any eeh_dev structures.

The "eeh_dev_phb_init_dynamic" name does not suggest anything really but
the comment does.


Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>



> Replace it with direct
> calls to eeh_phb_pe_create() since that does what it says on the tin
> and removes a layer of indirection.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> ---
> v2: Added sub prototype of eeh_phb_pe_create() for the !CONFIG_EEH case.
> ---
>  arch/powerpc/include/asm/eeh.h             |  3 ++-
>  arch/powerpc/kernel/eeh.c                  |  2 +-
>  arch/powerpc/kernel/eeh_dev.c              | 13 -------------
>  arch/powerpc/kernel/of_platform.c          |  4 ++--
>  arch/powerpc/platforms/pseries/pci_dlpar.c |  2 +-
>  5 files changed, 6 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index 964a54292b36..64487b88c569 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -294,7 +294,6 @@ const char *eeh_pe_loc_get(struct eeh_pe *pe);
>  struct pci_bus *eeh_pe_bus_get(struct eeh_pe *pe);
>  
>  struct eeh_dev *eeh_dev_init(struct pci_dn *pdn);
> -void eeh_dev_phb_init_dynamic(struct pci_controller *phb);
>  void eeh_show_enabled(void);
>  int __init eeh_ops_register(struct eeh_ops *ops);
>  int __exit eeh_ops_unregister(const char *name);
> @@ -370,6 +369,8 @@ void pseries_eeh_init_edev_recursive(struct pci_dn *pdn);
>  #else
>  static inline void pseries_eeh_add_device_early(struct pci_dn *pdn) { }
>  static inline void pseries_eeh_add_device_tree_early(struct pci_dn *pdn) { }
> +
> +static inline int eeh_phb_pe_create(struct pci_controller *phb) { return 0; }
>  #endif
>  
>  #ifdef CONFIG_PPC64
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index d407981dec76..859f76020256 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1096,7 +1096,7 @@ static int eeh_init(void)
>  
>  	/* Initialize PHB PEs */
>  	list_for_each_entry_safe(hose, tmp, &hose_list, list_node)
> -		eeh_dev_phb_init_dynamic(hose);
> +		eeh_phb_pe_create(hose);
>  
>  	eeh_addr_cache_init();
>  
> diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
> index 7370185c7a05..8e159a12f10c 100644
> --- a/arch/powerpc/kernel/eeh_dev.c
> +++ b/arch/powerpc/kernel/eeh_dev.c
> @@ -52,16 +52,3 @@ struct eeh_dev *eeh_dev_init(struct pci_dn *pdn)
>  
>  	return edev;
>  }
> -
> -/**
> - * eeh_dev_phb_init_dynamic - Create EEH devices for devices included in PHB
> - * @phb: PHB
> - *
> - * Scan the PHB OF node and its child association, then create the
> - * EEH devices accordingly
> - */
> -void eeh_dev_phb_init_dynamic(struct pci_controller *phb)
> -{
> -	/* EEH PE for PHB */
> -	eeh_phb_pe_create(phb);
> -}
> diff --git a/arch/powerpc/kernel/of_platform.c b/arch/powerpc/kernel/of_platform.c
> index 71a3f97dc988..f89376ff633e 100644
> --- a/arch/powerpc/kernel/of_platform.c
> +++ b/arch/powerpc/kernel/of_platform.c
> @@ -62,8 +62,8 @@ static int of_pci_phb_probe(struct platform_device *dev)
>  	/* Init pci_dn data structures */
>  	pci_devs_phb_init_dynamic(phb);
>  
> -	/* Create EEH PEs for the PHB */
> -	eeh_dev_phb_init_dynamic(phb);
> +	/* Create EEH PE for the PHB */
> +	eeh_phb_pe_create(phb);
>  
>  	/* Scan the bus */
>  	pcibios_scan_phb(phb);
> diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c
> index b3a38f5a6b68..f9ae17e8a0f4 100644
> --- a/arch/powerpc/platforms/pseries/pci_dlpar.c
> +++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
> @@ -34,7 +34,7 @@ struct pci_controller *init_phb_dynamic(struct device_node *dn)
>  	pci_devs_phb_init_dynamic(phb);
>  
>  	/* Create EEH devices for the PHB */
> -	eeh_dev_phb_init_dynamic(phb);
> +	eeh_phb_pe_create(phb);
>  
>  	if (dn->child)
>  		pseries_eeh_init_edev_recursive(PCI_DN(dn));
> 

-- 
Alexey

^ permalink raw reply

* Re: [PATCH v2 14/14] powerpc/eeh: Move PE tree setup into the platform
From: Oliver O'Halloran @ 2020-07-24  5:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev
In-Reply-To: <983435c2-0e6b-ded7-d28d-e6728c0a001e@ozlabs.ru>

On Fri, Jul 24, 2020 at 3:01 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>
>
> On 22/07/2020 14:26, Oliver O'Halloran wrote:
> > The EEH core has a concept of a "PE tree" to support PowerNV. The PE tree
> > follows the PCI bus structures because a reset asserted on an upstream
> > bridge will be propagated to the downstream bridges. On pseries there's a
> > 1-1 correspondence between what the guest sees are a PHB and a PE so the
> > "tree" is really just a single node.
> >
> > Current the EEH core is reponsible for setting up this PE tree which it
> > does by traversing the pci_dn tree. The structure of the pci_dn tree
> > matches the bus tree on PowerNV which leads to the PE tree being "correct"
> > this setup method doesn't make a whole lot of sense and it's actively
> > confusing for the pseries case where it doesn't really do anything.
> >
> > We want to remove the dependence on pci_dn anyway so this patch move
> > choosing where to insert a new PE into the platform code rather than
> > being part of the generic EEH code. For PowerNV this simplifies the
> > tree building logic and removes the use of pci_dn. For pseries we
> > keep the existing logic. I'm not really convinced it does anything
> > due to the 1-1 PE-to-PHB correspondence so every device under that
> > PHB should be in the same PE, but I'd rather not remove it entirely
> > until we've had a chance to look at it more deeply.
> >
> > Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> > Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> > v2: Reworked pseries PE setup slightly. NOT DONE YET. mostly done needs test
>
> So far it was looking good and now this :)
>
> When is it going to be done? Is this the broken stuff you mentioned
> elsewhere?

I am a dumb.

I put those in there to remind myself what I have / haven't done when
respinning a series. I added that before I tested it and forgot to
remove the comment.

>
>
> --
> Alexey

^ permalink raw reply

* Re: [PATCH] powerpc/64s: Fix irq tracing corruption in interrupt/syscall return caused by perf interrupts
From: Alexey Kardashevskiy @ 2020-07-24  5:14 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <1595499918.mg25810wnp.astroid@bobo.none>



On 23/07/2020 20:29, Nicholas Piggin wrote:
> Excerpts from Alexey Kardashevskiy's message of July 22, 2020 8:50 pm:
>>
>>
>> On 22/07/2020 17:34, Nicholas Piggin wrote:
>>> Alexey reports lockdep_assert_irqs_enabled() warnings when stress testing perf, e.g.,
>>>
>>> WARNING: CPU: 0 PID: 1556 at kernel/softirq.c:169 __local_bh_enable_ip+0x258/0x270
>>> CPU: 0 PID: 1556 Comm: syz-executor
>>> NIP:  c0000000001ec888 LR: c0000000001ec884 CTR: c000000000ef0610
>>> REGS: c000000022d4f8a0 TRAP: 0700   Not tainted  (5.8.0-rc3-x)
>>> MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28008844  XER: 20040000
>>> CFAR: c0000000001dc1d0 IRQMASK: 0
>>>
>>> The interesting thing is MSR[EE] and IRQMASK shows interrupts are enabled,
>>> suggesting the current->hardirqs_enabled irq tracing state is going out of sync
>>> with the actual interrupt enable state.
>>>
>>> The cause is a window in interrupt/syscall return where irq tracing state is being
>>> adjusted for an irqs-enabled return while MSR[EE] is still enabled. A perf
>>> interrupt hits and ends up calling trace_hardirqs_off() when restoring
>>> interrupt flags to a disable state.
>>>
>>> Fix this by disabling perf interrupts as well while adjusting irq tracing state.
>>>
>>> Add a debug check that catches the condition sooner.
>>>
>>> Fixes: 68b34588e202 ("powerpc/64/sycall: Implement syscall entry/exit logic in C")
>>> Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>>> ---
>>>
>>> I can reproduce similar symptoms and this patch fixes my test case,
>>> still trying to confirm Alexey's test case or whether there's another
>>> similar bug causing it.
>>
>>
>> This does not fix my testcase. I applied this on top of 4fa640dc5230
>> ("Merge tag 'vfio-v5.8-rc7' of git://github.com/awilliam/linux-vfio into
>> master")  without any of my testing code, just to be clear. Sorry...
> 
> Okay it seems to be a bigger problem and not actually caused by that
> patch but was possible for lockdep hardirqs_enabled state to get out 
> of synch with the local_irq_disable() state before that too. Root
> cause is similar -- perf interrupts hitting between updating the two
> different bits of state.
> 
> Not quite sure why Alexey's test wasn't hitting it before the patch,
> but possibly the way masked interrupts get replayed. But I was able
> to hit the problem with a different assertion.
> 
> I think I have a fix, but it seems to be a generic irq tracing code
> issue. So this patch can be dropped, and it's not an urgent issue for
> the next release (it only triggers warns on rare occasions and only
> when lockdep is enabled).


I would still like to understand how the last
curr->hardirq_disable_event misses the ftrace buffer and we end up in
the original interrupted kernel code...


-- 
Alexey

^ permalink raw reply

* Re: [PATCH v2 10/16] powerpc/powernv/pci: Refactor pnv_ioda_alloc_pe()
From: Alexey Kardashevskiy @ 2020-07-24  5:20 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev
In-Reply-To: <20200722065715.1432738-10-oohall@gmail.com>



On 22/07/2020 16:57, Oliver O'Halloran wrote:
> Rework the PE allocation logic to allow allocating blocks of PEs rather
> than individually. We'll use this to allocate contigious blocks of PEs for
> the SR-IOVs.
> 
> This patch also adds code to pnv_ioda_alloc_pe() and pnv_ioda_reserve_pe() to
> use the existing, but unused, phb->pe_alloc_mutex. Currently these functions
> use atomic bit ops to release a currently allocated PE number. However,
> the pnv_ioda_alloc_pe() wants to have exclusive access to the bit map while
> scanning for hole large enough to accomodate the allocation size.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> ---
> v2: Add some details about the pe_alloc mutex and why we're using it.


Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>




> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 41 ++++++++++++++++++-----
>  arch/powerpc/platforms/powernv/pci.h      |  2 +-
>  2 files changed, 34 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 2d36a9ebf0e9..c9c25fb0783c 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -145,23 +145,45 @@ static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
>  		return;
>  	}
>  
> +	mutex_lock(&phb->ioda.pe_alloc_mutex);
>  	if (test_and_set_bit(pe_no, phb->ioda.pe_alloc))
>  		pr_debug("%s: PE %x was reserved on PHB#%x\n",
>  			 __func__, pe_no, phb->hose->global_number);
> +	mutex_unlock(&phb->ioda.pe_alloc_mutex);
>  
>  	pnv_ioda_init_pe(phb, pe_no);
>  }
>  
> -struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
> +struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb, int count)
>  {
> -	long pe;
> +	struct pnv_ioda_pe *ret = NULL;
> +	int run = 0, pe, i;
>  
> +	mutex_lock(&phb->ioda.pe_alloc_mutex);
> +
> +	/* scan backwards for a run of @count cleared bits */
>  	for (pe = phb->ioda.total_pe_num - 1; pe >= 0; pe--) {
> -		if (!test_and_set_bit(pe, phb->ioda.pe_alloc))
> -			return pnv_ioda_init_pe(phb, pe);
> +		if (test_bit(pe, phb->ioda.pe_alloc)) {
> +			run = 0;
> +			continue;
> +		}
> +
> +		run++;
> +		if (run == count)
> +			break;
>  	}
> +	if (run != count)
> +		goto out;
>  
> -	return NULL;
> +	for (i = pe; i < pe + count; i++) {
> +		set_bit(i, phb->ioda.pe_alloc);
> +		pnv_ioda_init_pe(phb, i);
> +	}
> +	ret = &phb->ioda.pe_array[pe];
> +
> +out:
> +	mutex_unlock(&phb->ioda.pe_alloc_mutex);
> +	return ret;
>  }
>  
>  void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
> @@ -173,7 +195,10 @@ void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
>  	WARN_ON(pe->npucomp); /* NPUs for nvlink are not supposed to be freed */
>  	kfree(pe->npucomp);
>  	memset(pe, 0, sizeof(struct pnv_ioda_pe));
> +
> +	mutex_lock(&phb->ioda.pe_alloc_mutex);
>  	clear_bit(pe_num, phb->ioda.pe_alloc);
> +	mutex_unlock(&phb->ioda.pe_alloc_mutex);
>  }
>  
>  /* The default M64 BAR is shared by all PEs */
> @@ -976,7 +1001,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
>  	if (pdn->pe_number != IODA_INVALID_PE)
>  		return NULL;
>  
> -	pe = pnv_ioda_alloc_pe(phb);
> +	pe = pnv_ioda_alloc_pe(phb, 1);
>  	if (!pe) {
>  		pr_warn("%s: Not enough PE# available, disabling device\n",
>  			pci_name(dev));
> @@ -1047,7 +1072,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, bool all)
>  
>  	/* The PE number isn't pinned by M64 */
>  	if (!pe)
> -		pe = pnv_ioda_alloc_pe(phb);
> +		pe = pnv_ioda_alloc_pe(phb, 1);
>  
>  	if (!pe) {
>  		pr_warn("%s: Not enough PE# available for PCI bus %04x:%02x\n",
> @@ -3065,7 +3090,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>  		pnv_ioda_reserve_pe(phb, phb->ioda.root_pe_idx);
>  	} else {
>  		/* otherwise just allocate one */
> -		root_pe = pnv_ioda_alloc_pe(phb);
> +		root_pe = pnv_ioda_alloc_pe(phb, 1);
>  		phb->ioda.root_pe_idx = root_pe->pe_number;
>  	}
>  
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 23fc5e391c7f..06431a452130 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -224,7 +224,7 @@ int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe);
>  void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe);
>  void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe);
>  
> -struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb);
> +struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb, int count);
>  void pnv_ioda_free_pe(struct pnv_ioda_pe *pe);
>  
>  #ifdef CONFIG_PCI_IOV
> 

-- 
Alexey

^ permalink raw reply

* [powerpc:next] BUILD SUCCESS 335aca5f65f1a39670944930131da5f2276f888f
From: kernel test robot @ 2020-07-24  5:41 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  next
branch HEAD: 335aca5f65f1a39670944930131da5f2276f888f  Merge branch 'scv' support into next

elapsed time: 951m

configs tested: 74
configs skipped: 1

The following configs have been built successfully.
More configs may be tested in the coming days.

arm64                            allyesconfig
arm64                               defconfig
arm64                            allmodconfig
arm64                             allnoconfig
arm                                 defconfig
arm                              allyesconfig
arm                              allmodconfig
arm                               allnoconfig
i386                                defconfig
i386                             allyesconfig
i386                              debian-10.3
i386                              allnoconfig
ia64                             allmodconfig
ia64                                defconfig
ia64                              allnoconfig
ia64                             allyesconfig
m68k                             allmodconfig
m68k                              allnoconfig
m68k                           sun3_defconfig
m68k                                defconfig
m68k                             allyesconfig
nios2                               defconfig
nios2                            allyesconfig
c6x                              allyesconfig
c6x                               allnoconfig
openrisc                         allyesconfig
openrisc                            defconfig
nds32                               defconfig
nds32                             allnoconfig
csky                             allyesconfig
csky                                defconfig
alpha                               defconfig
alpha                            allyesconfig
xtensa                           allyesconfig
h8300                            allyesconfig
h8300                            allmodconfig
xtensa                              defconfig
arc                                 defconfig
arc                              allyesconfig
sh                               allmodconfig
sh                                allnoconfig
microblaze                        allnoconfig
mips                             allyesconfig
mips                              allnoconfig
mips                             allmodconfig
parisc                            allnoconfig
parisc                              defconfig
parisc                           allyesconfig
parisc                           allmodconfig
powerpc                             defconfig
powerpc                          allyesconfig
powerpc                          rhel-kconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
riscv                            allyesconfig
riscv                             allnoconfig
riscv                               defconfig
riscv                            allmodconfig
s390                             allyesconfig
s390                              allnoconfig
s390                             allmodconfig
s390                                defconfig
sparc                            allyesconfig
sparc                               defconfig
sparc64                           allnoconfig
sparc64                          allmodconfig
sparc64                             defconfig
sparc64                          allyesconfig
x86_64                    rhel-7.6-kselftests
x86_64                               rhel-8.3
x86_64                                  kexec
x86_64                                   rhel
x86_64                                    lkp
x86_64                              fedora-25

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* Re: [PATCH 1/2] lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state
From: Nicholas Piggin @ 2020-07-24  5:59 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Peter Zijlstra
  Cc: linux-arch, Will Deacon, Ingo Molnar, linuxppc-dev, linux-kernel
In-Reply-To: <1884dcea-9ecd-a1f3-21bb-213c655e2480@ozlabs.ru>

Excerpts from Alexey Kardashevskiy's message of July 24, 2020 2:16 pm:
> 
> 
> On 23/07/2020 23:11, Nicholas Piggin wrote:
>> Excerpts from Peter Zijlstra's message of July 23, 2020 9:40 pm:
>>> On Thu, Jul 23, 2020 at 08:56:14PM +1000, Nicholas Piggin wrote:
>>>
>>>> diff --git a/arch/powerpc/include/asm/hw_irq.h b/arch/powerpc/include/asm/hw_irq.h
>>>> index 3a0db7b0b46e..35060be09073 100644
>>>> --- a/arch/powerpc/include/asm/hw_irq.h
>>>> +++ b/arch/powerpc/include/asm/hw_irq.h
>>>> @@ -200,17 +200,14 @@ static inline bool arch_irqs_disabled(void)
>>>>  #define powerpc_local_irq_pmu_save(flags)			\
>>>>  	 do {							\
>>>>  		raw_local_irq_pmu_save(flags);			\
>>>> -		trace_hardirqs_off();				\
>>>> +		if (!raw_irqs_disabled_flags(flags))		\
>>>> +			trace_hardirqs_off();			\
>>>>  	} while(0)
>>>>  #define powerpc_local_irq_pmu_restore(flags)			\
>>>>  	do {							\
>>>> -		if (raw_irqs_disabled_flags(flags)) {		\
>>>> -			raw_local_irq_pmu_restore(flags);	\
>>>> -			trace_hardirqs_off();			\
>>>> -		} else {					\
>>>> +		if (!raw_irqs_disabled_flags(flags))		\
>>>>  			trace_hardirqs_on();			\
>>>> -			raw_local_irq_pmu_restore(flags);	\
>>>> -		}						\
>>>> +		raw_local_irq_pmu_restore(flags);		\
>>>>  	} while(0)
>>>
>>> You shouldn't be calling lockdep from NMI context!
>> 
>> After this patch it doesn't.
>> 
>> trace_hardirqs_on/off implementation appears to expect to be called in NMI 
>> context though, for some reason.
>> 
>>> That is, I recently
>>> added suport for that on x86:
>>>
>>>   https://lkml.kernel.org/r/20200623083721.155449112@infradead.org
>>>   https://lkml.kernel.org/r/20200623083721.216740948@infradead.org
>>>
>>> But you need to be very careful on how you order things, as you can see
>>> the above relies on preempt_count() already having been incremented with
>>> NMI_MASK.
>> 
>> Hmm. My patch seems simpler.
> 
> And your patches fix my error while Peter's do not:
> 
> 
> IRQs not enabled as expected
> WARNING: CPU: 0 PID: 1377 at /home/aik/p/kernel/kernel/softirq.c:169
> __local_bh_enable_ip+0x118/0x190

I think they would have needed some powerpc bits as well. But I don't
see a reason we can't merge my patches, at least they fix this case and
don't seem to make things worse in any way.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH 1/2] lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state
From: Athira Rajeev @ 2020-07-24  6:16 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linux-arch, Peter Zijlstra, linuxppc-dev, linux-kernel,
	Nicholas Piggin, Ingo Molnar, Will Deacon
In-Reply-To: <1884dcea-9ecd-a1f3-21bb-213c655e2480@ozlabs.ru>



> On 24-Jul-2020, at 9:46 AM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> 
> 
> On 23/07/2020 23:11, Nicholas Piggin wrote:
>> Excerpts from Peter Zijlstra's message of July 23, 2020 9:40 pm:
>>> On Thu, Jul 23, 2020 at 08:56:14PM +1000, Nicholas Piggin wrote:
>>> 
>>>> diff --git a/arch/powerpc/include/asm/hw_irq.h b/arch/powerpc/include/asm/hw_irq.h
>>>> index 3a0db7b0b46e..35060be09073 100644
>>>> --- a/arch/powerpc/include/asm/hw_irq.h
>>>> +++ b/arch/powerpc/include/asm/hw_irq.h
>>>> @@ -200,17 +200,14 @@ static inline bool arch_irqs_disabled(void)
>>>> #define powerpc_local_irq_pmu_save(flags)			\
>>>> 	 do {							\
>>>> 		raw_local_irq_pmu_save(flags);			\
>>>> -		trace_hardirqs_off();				\
>>>> +		if (!raw_irqs_disabled_flags(flags))		\
>>>> +			trace_hardirqs_off();			\
>>>> 	} while(0)
>>>> #define powerpc_local_irq_pmu_restore(flags)			\
>>>> 	do {							\
>>>> -		if (raw_irqs_disabled_flags(flags)) {		\
>>>> -			raw_local_irq_pmu_restore(flags);	\
>>>> -			trace_hardirqs_off();			\
>>>> -		} else {					\
>>>> +		if (!raw_irqs_disabled_flags(flags))		\
>>>> 			trace_hardirqs_on();			\
>>>> -			raw_local_irq_pmu_restore(flags);	\
>>>> -		}						\
>>>> +		raw_local_irq_pmu_restore(flags);		\
>>>> 	} while(0)
>>> 
>>> You shouldn't be calling lockdep from NMI context!
>> 
>> After this patch it doesn't.
>> 
>> trace_hardirqs_on/off implementation appears to expect to be called in NMI 
>> context though, for some reason.
>> 
>>> That is, I recently
>>> added suport for that on x86:
>>> 
>>>  https://lkml.kernel.org/r/20200623083721.155449112@infradead.org
>>>  https://lkml.kernel.org/r/20200623083721.216740948@infradead.org
>>> 
>>> But you need to be very careful on how you order things, as you can see
>>> the above relies on preempt_count() already having been incremented with
>>> NMI_MASK.
>> 
>> Hmm. My patch seems simpler.
> 
> And your patches fix my error while Peter's do not:
> 
> 
> IRQs not enabled as expected
> WARNING: CPU: 0 PID: 1377 at /home/aik/p/kernel/kernel/softirq.c:169
> __local_bh_enable_ip+0x118/0x190

Hi Nicholas, Alexey

I was able to reproduce the warning which Alexey reported using perf_fuzzer test suite. 
With the patch provided by Nick, I don’t see the issue anymore. This patch fixes the
warnings I got with perf fuzzer run.

Thanks Nick for the fix. 

Tested-by: Athira Rajeev<atrajeev@linux.ibm.com>


> 
> 
>> 
>> I don't know this stuff very well, I don't really understand what your patch 
>> enables for x86 but at least it shouldn't be incompatible with this one 
>> AFAIKS.
>> 
>> Thanks,
>> Nick
>> 
> 
> -- 
> Alexey


^ permalink raw reply

* Re: [v4] powerpc/perf: Initialize power10 PMU registers in cpu setup routine
From: Athira Rajeev @ 2020-07-24  6:32 UTC (permalink / raw)
  To: kernel test robot
  Cc: Michael Neuling, maddy, kbuild-all, Jordan Niethe, linuxppc-dev
In-Reply-To: <202007241030.wtjRnkIR%lkp@intel.com>



> On 24-Jul-2020, at 8:32 AM, kernel test robot <lkp@intel.com> wrote:
> 
> Hi Athira,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on powerpc/next]
> [also build test ERROR on v5.8-rc6 next-20200723]
> [cannot apply to mpe/next scottwood/next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
> 
> url:    https://github.com/0day-ci/linux/commits/Athira-Rajeev/powerpc-perf-Initialize-power10-PMU-registers-in-cpu-setup-routine/20200723-153537
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> config: powerpc-defconfig (attached as .config)
> compiler: powerpc64-linux-gcc (GCC) 9.3.0
> reproduce (this is a W=1 build):
>        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>        chmod +x ~/bin/make.cross
>        # save the attached .config to linux build tree
>        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc 
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All errors (new ones prefixed by >>):
> 
>   arch/powerpc/kernel/cpu_setup_power.S: Assembler messages:
>>> arch/powerpc/kernel/cpu_setup_power.S:244: Error: non-constant expression in ".if" statement
>>> arch/powerpc/kernel/cpu_setup_power.S:244: Error: non-constant expression in ".if" statement
>>> arch/powerpc/kernel/cpu_setup_power.S:243: Error: unsupported relocation against SPRN_MMCR3

Hi,

As mentioned in changelog:

This patch was initially part of Power10 PMU base enablement
 series ( https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=190462 ).
 Moving this as separate patch as suggested by Michael
 Ellerman. Hence there is a dependency on initial series ( Patch 7 ) which defines
 MMCRA_BHRB_DISABLE and new SPR’s. And the build error is because of that.

Thanks
Athira

> 
> vim +244 arch/powerpc/kernel/cpu_setup_power.S
> 
>    14	
>    15	/* Entry: r3 = crap, r4 = ptr to cputable entry
>    16	 *
>    17	 * Note that we can be called twice for pseudo-PVRs
>    18	 */
>    19	_GLOBAL(__setup_cpu_power7)
>    20		mflr	r11
>    21		bl	__init_hvmode_206
>    22		mtlr	r11
>    23		beqlr
>    24		li	r0,0
>    25		mtspr	SPRN_LPID,r0
>    26		LOAD_REG_IMMEDIATE(r0, PCR_MASK)
>    27		mtspr	SPRN_PCR,r0
>    28		mfspr	r3,SPRN_LPCR
>    29		li	r4,(LPCR_LPES1 >> LPCR_LPES_SH)
>    30		bl	__init_LPCR_ISA206
>    31		mtlr	r11
>    32		blr
>    33	
>    34	_GLOBAL(__restore_cpu_power7)
>    35		mflr	r11
>    36		mfmsr	r3
>    37		rldicl.	r0,r3,4,63
>    38		beqlr
>    39		li	r0,0
>    40		mtspr	SPRN_LPID,r0
>    41		LOAD_REG_IMMEDIATE(r0, PCR_MASK)
>    42		mtspr	SPRN_PCR,r0
>    43		mfspr	r3,SPRN_LPCR
>    44		li	r4,(LPCR_LPES1 >> LPCR_LPES_SH)
>    45		bl	__init_LPCR_ISA206
>    46		mtlr	r11
>    47		blr
>    48	
>    49	_GLOBAL(__setup_cpu_power8)
>    50		mflr	r11
>    51		bl	__init_FSCR
>    52		bl	__init_PMU
>    53		bl	__init_PMU_ISA207
>    54		bl	__init_hvmode_206
>    55		mtlr	r11
>    56		beqlr
>    57		li	r0,0
>    58		mtspr	SPRN_LPID,r0
>    59		LOAD_REG_IMMEDIATE(r0, PCR_MASK)
>    60		mtspr	SPRN_PCR,r0
>    61		mfspr	r3,SPRN_LPCR
>    62		ori	r3, r3, LPCR_PECEDH
>    63		li	r4,0 /* LPES = 0 */
>    64		bl	__init_LPCR_ISA206
>    65		bl	__init_HFSCR
>    66		bl	__init_PMU_HV
>    67		bl	__init_PMU_HV_ISA207
>    68		mtlr	r11
>    69		blr
>    70	
>    71	_GLOBAL(__restore_cpu_power8)
>    72		mflr	r11
>    73		bl	__init_FSCR
>    74		bl	__init_PMU
>    75		bl	__init_PMU_ISA207
>    76		mfmsr	r3
>    77		rldicl.	r0,r3,4,63
>    78		mtlr	r11
>    79		beqlr
>    80		li	r0,0
>    81		mtspr	SPRN_LPID,r0
>    82		LOAD_REG_IMMEDIATE(r0, PCR_MASK)
>    83		mtspr	SPRN_PCR,r0
>    84		mfspr   r3,SPRN_LPCR
>    85		ori	r3, r3, LPCR_PECEDH
>    86		li	r4,0 /* LPES = 0 */
>    87		bl	__init_LPCR_ISA206
>    88		bl	__init_HFSCR
>    89		bl	__init_PMU_HV
>    90		bl	__init_PMU_HV_ISA207
>    91		mtlr	r11
>    92		blr
>    93	
>    94	_GLOBAL(__setup_cpu_power10)
>    95		mflr	r11
>    96		bl	__init_FSCR_power10
>    97		bl	__init_PMU
>    98		bl	__init_PMU_ISA31
>    99		b	1f
>   100	
>   101	_GLOBAL(__setup_cpu_power9)
>   102		mflr	r11
>   103		bl	__init_FSCR
>   104		bl	__init_PMU
>   105	1:	bl	__init_hvmode_206
>   106		mtlr	r11
>   107		beqlr
>   108		li	r0,0
>   109		mtspr	SPRN_PSSCR,r0
>   110		mtspr	SPRN_LPID,r0
>   111		mtspr	SPRN_PID,r0
>   112		LOAD_REG_IMMEDIATE(r0, PCR_MASK)
>   113		mtspr	SPRN_PCR,r0
>   114		mfspr	r3,SPRN_LPCR
>   115		LOAD_REG_IMMEDIATE(r4, LPCR_PECEDH | LPCR_PECE_HVEE | LPCR_HVICE  | LPCR_HEIC)
>   116		or	r3, r3, r4
>   117		LOAD_REG_IMMEDIATE(r4, LPCR_UPRT | LPCR_HR)
>   118		andc	r3, r3, r4
>   119		li	r4,0 /* LPES = 0 */
>   120		bl	__init_LPCR_ISA300
>   121		bl	__init_HFSCR
>   122		bl	__init_PMU_HV
>   123		mtlr	r11
>   124		blr
>   125	
>   126	_GLOBAL(__restore_cpu_power10)
>   127		mflr	r11
>   128		bl	__init_FSCR_power10
>   129		bl	__init_PMU
>   130		bl	__init_PMU_ISA31
>   131		b	1f
>   132	
>   133	_GLOBAL(__restore_cpu_power9)
>   134		mflr	r11
>   135		bl	__init_FSCR
>   136		bl	__init_PMU
>   137	1:	mfmsr	r3
>   138		rldicl.	r0,r3,4,63
>   139		mtlr	r11
>   140		beqlr
>   141		li	r0,0
>   142		mtspr	SPRN_PSSCR,r0
>   143		mtspr	SPRN_LPID,r0
>   144		mtspr	SPRN_PID,r0
>   145		LOAD_REG_IMMEDIATE(r0, PCR_MASK)
>   146		mtspr	SPRN_PCR,r0
>   147		mfspr   r3,SPRN_LPCR
>   148		LOAD_REG_IMMEDIATE(r4, LPCR_PECEDH | LPCR_PECE_HVEE | LPCR_HVICE | LPCR_HEIC)
>   149		or	r3, r3, r4
>   150		LOAD_REG_IMMEDIATE(r4, LPCR_UPRT | LPCR_HR)
>   151		andc	r3, r3, r4
>   152		li	r4,0 /* LPES = 0 */
>   153		bl	__init_LPCR_ISA300
>   154		bl	__init_HFSCR
>   155		bl	__init_PMU_HV
>   156		mtlr	r11
>   157		blr
>   158	
>   159	__init_hvmode_206:
>   160		/* Disable CPU_FTR_HVMODE and exit if MSR:HV is not set */
>   161		mfmsr	r3
>   162		rldicl.	r0,r3,4,63
>   163		bnelr
>   164		ld	r5,CPU_SPEC_FEATURES(r4)
>   165		LOAD_REG_IMMEDIATE(r6,CPU_FTR_HVMODE | CPU_FTR_P9_TM_HV_ASSIST)
>   166		andc	r5,r5,r6
>   167		std	r5,CPU_SPEC_FEATURES(r4)
>   168		blr
>   169	
>   170	__init_LPCR_ISA206:
>   171		/* Setup a sane LPCR:
>   172		 *   Called with initial LPCR in R3 and desired LPES 2-bit value in R4
>   173		 *
>   174		 *   LPES = 0b01 (HSRR0/1 used for 0x500)
>   175		 *   PECE = 0b111
>   176		 *   DPFD = 4
>   177		 *   HDICE = 0
>   178		 *   VC = 0b100 (VPM0=1, VPM1=0, ISL=0)
>   179		 *   VRMASD = 0b10000 (L=1, LP=00)
>   180		 *
>   181		 * Other bits untouched for now
>   182		 */
>   183		li	r5,0x10
>   184		rldimi	r3,r5, LPCR_VRMASD_SH, 64-LPCR_VRMASD_SH-5
>   185	
>   186		/* POWER9 has no VRMASD */
>   187	__init_LPCR_ISA300:
>   188		rldimi	r3,r4, LPCR_LPES_SH, 64-LPCR_LPES_SH-2
>   189		ori	r3,r3,(LPCR_PECE0|LPCR_PECE1|LPCR_PECE2)
>   190		li	r5,4
>   191		rldimi	r3,r5, LPCR_DPFD_SH, 64-LPCR_DPFD_SH-3
>   192		clrrdi	r3,r3,1		/* clear HDICE */
>   193		li	r5,4
>   194		rldimi	r3,r5, LPCR_VC_SH, 0
>   195		mtspr	SPRN_LPCR,r3
>   196		isync
>   197		blr
>   198	
>   199	__init_FSCR_power10:
>   200		mfspr	r3, SPRN_FSCR
>   201		ori	r3, r3, FSCR_PREFIX
>   202		mtspr	SPRN_FSCR, r3
>   203		// fall through
>   204	
>   205	__init_FSCR:
>   206		mfspr	r3,SPRN_FSCR
>   207		ori	r3,r3,FSCR_TAR|FSCR_EBB
>   208		mtspr	SPRN_FSCR,r3
>   209		blr
>   210	
>   211	__init_HFSCR:
>   212		mfspr	r3,SPRN_HFSCR
>   213		ori	r3,r3,HFSCR_TAR|HFSCR_TM|HFSCR_BHRB|HFSCR_PM|\
>   214			      HFSCR_DSCR|HFSCR_VECVSX|HFSCR_FP|HFSCR_EBB|HFSCR_MSGP
>   215		mtspr	SPRN_HFSCR,r3
>   216		blr
>   217	
>   218	__init_PMU_HV:
>   219		li	r5,0
>   220		mtspr	SPRN_MMCRC,r5
>   221		blr
>   222	
>   223	__init_PMU_HV_ISA207:
>   224		li	r5,0
>   225		mtspr	SPRN_MMCRH,r5
>   226		blr
>   227	
>   228	__init_PMU:
>   229		li	r5,0
>   230		mtspr	SPRN_MMCRA,r5
>   231		mtspr	SPRN_MMCR0,r5
>   232		mtspr	SPRN_MMCR1,r5
>   233		mtspr	SPRN_MMCR2,r5
>   234		blr
>   235	
>   236	__init_PMU_ISA207:
>   237		li	r5,0
>   238		mtspr	SPRN_MMCRS,r5
>   239		blr
>   240	
>   241	__init_PMU_ISA31:
>   242		li	r5,0
>> 243		mtspr	SPRN_MMCR3,r5
>> 244		LOAD_REG_IMMEDIATE(r5, MMCRA_BHRB_DISABLE)
> 
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
> <.config.gz>


^ permalink raw reply

* Re: [PATCH v2 2/3] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
From: Pratik Sampat @ 2020-07-24  6:34 UTC (permalink / raw)
  To: Michael Neuling, mpe, benh, paulus, ravi.bangoria, ego, svaidy,
	pratik.r.sampat, linuxppc-dev, linux-kernel
In-Reply-To: <b9507631629bfc1f36893a280b2b83ea484516f9.camel@neuling.org>



On 24/07/20 6:55 am, Michael Neuling wrote:
> On Fri, 2020-07-10 at 10:52 +0530, Pratik Rajesh Sampat wrote:
>> Additional registers DAWR0, DAWRX0 may be lost on Power 10 for
>> stop levels < 4.
>> Therefore save the values of these SPRs before entering a  "stop"
>> state and restore their values on wakeup.
>>
>> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
>> ---
>>   arch/powerpc/platforms/powernv/idle.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/powernv/idle.c
>> b/arch/powerpc/platforms/powernv/idle.c
>> index 19d94d021357..f2e2a6a4c274 100644
>> --- a/arch/powerpc/platforms/powernv/idle.c
>> +++ b/arch/powerpc/platforms/powernv/idle.c
>> @@ -600,6 +600,8 @@ struct p9_sprs {
>>   	u64 iamr;
>>   	u64 amor;
>>   	u64 uamor;
>> +	u64 dawr0;
>> +	u64 dawrx0;
>>   };
>>   
>>   static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
>> @@ -687,6 +689,10 @@ static unsigned long power9_idle_stop(unsigned long
>> psscr, bool mmu_on)
>>   	sprs.iamr	= mfspr(SPRN_IAMR);
>>   	sprs.amor	= mfspr(SPRN_AMOR);
>>   	sprs.uamor	= mfspr(SPRN_UAMOR);
>> +	if (cpu_has_feature(CPU_FTR_ARCH_31)) {

You are actually viewing an old version of the patches
The main point of change were based on comments from Nick Piggin, I
have changed the top level function check from ARCH_300 to a P9 PVR
check instead.

A similar thing needs to be done for P10, however as the P10 PVR isn't
exposed yet, I've shelved this particular patch.

Nick's comment to check based on PVR:https://lkml.org/lkml/2020/7/13/1018
v4 of the series:https://lkml.org/lkml/2020/7/21/784

Thanks for your review,
Pratik

> Can you add a comment here saying even though DAWR0 is ARCH_30, it's only
> required to be saved on 31. Otherwise this looks pretty odd.
>
>> +		sprs.dawr0 = mfspr(SPRN_DAWR0);
>> +		sprs.dawrx0 = mfspr(SPRN_DAWRX0);
>> +	}
>>   
>>   	srr1 = isa300_idle_stop_mayloss(psscr);		/* go idle */
>>   
>> @@ -710,6 +716,10 @@ static unsigned long power9_idle_stop(unsigned long
>> psscr, bool mmu_on)
>>   		mtspr(SPRN_IAMR,	sprs.iamr);
>>   		mtspr(SPRN_AMOR,	sprs.amor);
>>   		mtspr(SPRN_UAMOR,	sprs.uamor);
>> +		if (cpu_has_feature(CPU_FTR_ARCH_31)) {
>> +			mtspr(SPRN_DAWR0, sprs.dawr0);
>> +			mtspr(SPRN_DAWRX0, sprs.dawrx0);
>> +		}
>>   
>>   		/*
>>   		 * Workaround for POWER9 DD2.0, if we lost resources, the ERAT


^ permalink raw reply

* [PATCH v4] powerpc/pseries: Avoid using addr_to_pfn in real mode
From: Ganesh Goudar @ 2020-07-24  6:39 UTC (permalink / raw)
  To: mpe, linuxppc-dev; +Cc: mahesh, Ganesh Goudar, npiggin, aneesh.kumar

When an UE or memory error exception is encountered the MCE handler
tries to find the pfn using addr_to_pfn() which takes effective
address as an argument, later pfn is used to poison the page where
memory error occurred, recent rework in this area made addr_to_pfn
to run in real mode, which can be fatal as it may try to access
memory outside RMO region.

Have two helper functions to separate things to be done in real mode
and virtual mode without changing any functionality. This also fixes
the following error as the use of addr_to_pfn is now moved to virtual
mode.

Without this change following kernel crash is seen on hitting UE.

[  485.128036] Oops: Kernel access of bad area, sig: 11 [#1]
[  485.128040] LE SMP NR_CPUS=2048 NUMA pSeries
[  485.128047] Modules linked in:
[  485.128067] CPU: 15 PID: 6536 Comm: insmod Kdump: loaded Tainted: G OE 5.7.0 #22
[  485.128074] NIP:  c00000000009b24c LR: c0000000000398d8 CTR: c000000000cd57c0
[  485.128078] REGS: c000000003f1f970 TRAP: 0300   Tainted: G OE (5.7.0)
[  485.128082] MSR:  8000000000001003 <SF,ME,RI,LE>  CR: 28008284  XER: 00000001
[  485.128088] CFAR: c00000000009b190 DAR: c0000001fab00000 DSISR: 40000000 IRQMASK: 1
[  485.128088] GPR00: 0000000000000001 c000000003f1fbf0 c000000001634300 0000b0fa01000000
[  485.128088] GPR04: d000000002220000 0000000000000000 00000000fab00000 0000000000000022
[  485.128088] GPR08: c0000001fab00000 0000000000000000 c0000001fab00000 c000000003f1fc14
[  485.128088] GPR12: 0000000000000008 c000000003ff5880 d000000002100008 0000000000000000
[  485.128088] GPR16: 000000000000ff20 000000000000fff1 000000000000fff2 d0000000021a1100
[  485.128088] GPR20: d000000002200000 c00000015c893c50 c000000000d49b28 c00000015c893c50
[  485.128088] GPR24: d0000000021a0d08 c0000000014e5da8 d0000000021a0818 000000000000000a
[  485.128088] GPR28: 0000000000000008 000000000000000a c0000000017e2970 000000000000000a
[  485.128125] NIP [c00000000009b24c] __find_linux_pte+0x11c/0x310
[  485.128130] LR [c0000000000398d8] addr_to_pfn+0x138/0x170
[  485.128133] Call Trace:
[  485.128135] Instruction dump:
[  485.128138] 3929ffff 7d4a3378 7c883c36 7d2907b4 794a1564 7d294038 794af082 3900ffff
[  485.128144] 79291f24 790af00e 78e70020 7d095214 <7c69502a> 2fa30000 419e011c 70690040
[  485.128152] ---[ end trace d34b27e29ae0e340 ]---

Fixes: 9ca766f9891d ("powerpc/64s/pseries: machine check convert to use common event code")
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
---
V2: Leave bare metal code and save_mce_event as is.

V3: Have separate functions for realmode and virtual mode handling.

V4: Fix build warning, rephrase commit message.
---
 arch/powerpc/platforms/pseries/ras.c | 118 ++++++++++++++++-----------
 1 file changed, 69 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index f3736fcd98fc..c509e43bac23 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -522,18 +522,55 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
 	return 0; /* need to perform reset */
 }
 
+static int mce_handle_err_realmode(int disposition, u8 error_type)
+{
+#ifdef CONFIG_PPC_BOOK3S_64
+	if (disposition == RTAS_DISP_NOT_RECOVERED) {
+		switch (error_type) {
+		case	MC_ERROR_TYPE_SLB:
+		case	MC_ERROR_TYPE_ERAT:
+			/*
+			 * Store the old slb content in paca before flushing.
+			 * Print this when we go to virtual mode.
+			 * There are chances that we may hit MCE again if there
+			 * is a parity error on the SLB entry we trying to read
+			 * for saving. Hence limit the slb saving to single
+			 * level of recursion.
+			 */
+			if (local_paca->in_mce == 1)
+				slb_save_contents(local_paca->mce_faulty_slbs);
+			flush_and_reload_slb();
+			disposition = RTAS_DISP_FULLY_RECOVERED;
+			break;
+		default:
+			break;
+		}
+	} else if (disposition == RTAS_DISP_LIMITED_RECOVERY) {
+		/* Platform corrected itself but could be degraded */
+		pr_err("MCE: limited recovery, system may be degraded\n");
+		disposition = RTAS_DISP_FULLY_RECOVERED;
+	}
+#endif
+	return disposition;
+}
 
-static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
+static int mce_handle_err_virtmode(struct pt_regs *regs,
+				   struct rtas_error_log *errp,
+				   struct pseries_mc_errorlog *mce_log,
+				   int disposition)
 {
 	struct mce_error_info mce_err = { 0 };
-	unsigned long eaddr = 0, paddr = 0;
-	struct pseries_errorlog *pseries_log;
-	struct pseries_mc_errorlog *mce_log;
-	int disposition = rtas_error_disposition(errp);
 	int initiator = rtas_error_initiator(errp);
 	int severity = rtas_error_severity(errp);
+	unsigned long eaddr = 0, paddr = 0;
 	u8 error_type, err_sub_type;
 
+	if (!mce_log)
+		goto out;
+
+	error_type = mce_log->error_type;
+	err_sub_type = rtas_mc_error_sub_type(mce_log);
+
 	if (initiator == RTAS_INITIATOR_UNKNOWN)
 		mce_err.initiator = MCE_INITIATOR_UNKNOWN;
 	else if (initiator == RTAS_INITIATOR_CPU)
@@ -572,18 +609,7 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
 	mce_err.error_type = MCE_ERROR_TYPE_UNKNOWN;
 	mce_err.error_class = MCE_ECLASS_UNKNOWN;
 
-	if (!rtas_error_extended(errp))
-		goto out;
-
-	pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
-	if (pseries_log == NULL)
-		goto out;
-
-	mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
-	error_type = mce_log->error_type;
-	err_sub_type = rtas_mc_error_sub_type(mce_log);
-
-	switch (mce_log->error_type) {
+	switch (error_type) {
 	case MC_ERROR_TYPE_UE:
 		mce_err.error_type = MCE_ERROR_TYPE_UE;
 		mce_common_process_ue(regs, &mce_err);
@@ -683,37 +709,31 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
 		mce_err.error_type = MCE_ERROR_TYPE_UNKNOWN;
 		break;
 	}
+out:
+	save_mce_event(regs, disposition == RTAS_DISP_FULLY_RECOVERED,
+		       &mce_err, regs->nip, eaddr, paddr);
+	return disposition;
+}
 
-#ifdef CONFIG_PPC_BOOK3S_64
-	if (disposition == RTAS_DISP_NOT_RECOVERED) {
-		switch (error_type) {
-		case	MC_ERROR_TYPE_SLB:
-		case	MC_ERROR_TYPE_ERAT:
-			/*
-			 * Store the old slb content in paca before flushing.
-			 * Print this when we go to virtual mode.
-			 * There are chances that we may hit MCE again if there
-			 * is a parity error on the SLB entry we trying to read
-			 * for saving. Hence limit the slb saving to single
-			 * level of recursion.
-			 */
-			if (local_paca->in_mce == 1)
-				slb_save_contents(local_paca->mce_faulty_slbs);
-			flush_and_reload_slb();
-			disposition = RTAS_DISP_FULLY_RECOVERED;
-			break;
-		default:
-			break;
-		}
-	} else if (disposition == RTAS_DISP_LIMITED_RECOVERY) {
-		/* Platform corrected itself but could be degraded */
-		printk(KERN_ERR "MCE: limited recovery, system may "
-		       "be degraded\n");
-		disposition = RTAS_DISP_FULLY_RECOVERED;
-	}
-#endif
+static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
+{
+	struct pseries_errorlog *pseries_log;
+	struct pseries_mc_errorlog *mce_log = NULL;
+	int disposition = rtas_error_disposition(errp);
+	u8 error_type;
+
+	if (!rtas_error_extended(errp))
+		goto out;
+
+	pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
+	if (!pseries_log)
+		goto out;
+
+	mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
+	error_type = mce_log->error_type;
+
+	disposition = mce_handle_err_realmode(disposition, error_type);
 
-out:
 	/*
 	 * Enable translation as we will be accessing per-cpu variables
 	 * in save_mce_event() which may fall outside RMO region, also
@@ -724,10 +744,10 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
 	 * Note: All the realmode handling like flushing SLB entries for
 	 *       SLB multihit is done by now.
 	 */
+out:
 	mtmsr(mfmsr() | MSR_IR | MSR_DR);
-	save_mce_event(regs, disposition == RTAS_DISP_FULLY_RECOVERED,
-			&mce_err, regs->nip, eaddr, paddr);
-
+	disposition = mce_handle_err_virtmode(regs, errp, mce_log,
+					      disposition);
 	return disposition;
 }
 
-- 
2.17.2


^ permalink raw reply related

* Re: [PATCH v3 02/10] powerpc/smp: Merge Power9 topology with Power topology
From: Gautham R Shenoy @ 2020-07-24  7:06 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	LKML, Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	Jordan Niethe, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200723085116.4731-3-srikar@linux.vnet.ibm.com>

On Thu, Jul 23, 2020 at 02:21:08PM +0530, Srikar Dronamraju wrote:
> A new sched_domain_topology_level was added just for Power9. However the
> same can be achieved by merging powerpc_topology with power9_topology
> and makes the code more simpler especially when adding a new sched
> domain.
> 
> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

LGTM.


Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

> ---
> Changelog v1 -> v2:
> 	Replaced a reference to cpu_smt_mask with per_cpu(cpu_sibling_map, cpu)
> 	since cpu_smt_mask is only defined under CONFIG_SCHED_SMT
> 
>  arch/powerpc/kernel/smp.c | 33 ++++++++++-----------------------
>  1 file changed, 10 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index edf94ca64eea..283a04e54f52 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1313,7 +1313,7 @@ int setup_profiling_timer(unsigned int multiplier)
>  }
> 
>  #ifdef CONFIG_SCHED_SMT
> -/* cpumask of CPUs with asymetric SMT dependancy */
> +/* cpumask of CPUs with asymmetric SMT dependency */
>  static int powerpc_smt_flags(void)
>  {
>  	int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
> @@ -1326,14 +1326,6 @@ static int powerpc_smt_flags(void)
>  }
>  #endif
> 
> -static struct sched_domain_topology_level powerpc_topology[] = {
> -#ifdef CONFIG_SCHED_SMT
> -	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
> -#endif
> -	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
> -	{ NULL, },
> -};
> -
>  /*
>   * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
>   * This topology makes it *much* cheaper to migrate tasks between adjacent cores
> @@ -1351,7 +1343,13 @@ static int powerpc_shared_cache_flags(void)
>   */
>  static const struct cpumask *shared_cache_mask(int cpu)
>  {
> -	return cpu_l2_cache_mask(cpu);
> +	if (shared_caches)
> +		return cpu_l2_cache_mask(cpu);
> +
> +	if (has_big_cores)
> +		return cpu_smallcore_mask(cpu);
> +
> +	return per_cpu(cpu_sibling_map, cpu);
>  }
> 
>  #ifdef CONFIG_SCHED_SMT
> @@ -1361,7 +1359,7 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
>  }
>  #endif
> 
> -static struct sched_domain_topology_level power9_topology[] = {
> +static struct sched_domain_topology_level powerpc_topology[] = {
>  #ifdef CONFIG_SCHED_SMT
>  	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
> @@ -1386,21 +1384,10 @@ void __init smp_cpus_done(unsigned int max_cpus)
>  #ifdef CONFIG_SCHED_SMT
>  	if (has_big_cores) {
>  		pr_info("Big cores detected but using small core scheduling\n");
> -		power9_topology[0].mask = smallcore_smt_mask;
>  		powerpc_topology[0].mask = smallcore_smt_mask;
>  	}
>  #endif
> -	/*
> -	 * If any CPU detects that it's sharing a cache with another CPU then
> -	 * use the deeper topology that is aware of this sharing.
> -	 */
> -	if (shared_caches) {
> -		pr_info("Using shared cache scheduler topology\n");
> -		set_sched_topology(power9_topology);
> -	} else {
> -		pr_info("Using standard scheduler topology\n");
> -		set_sched_topology(powerpc_topology);
> -	}
> +	set_sched_topology(powerpc_topology);
>  }
> 
>  #ifdef CONFIG_HOTPLUG_CPU
> -- 
> 2.18.2
> 

^ permalink raw reply

* Re: [PATCH v3 04/10] powerpc/smp: Move topology fixups into  a new function
From: Gautham R Shenoy @ 2020-07-24  7:08 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	LKML, Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	Jordan Niethe, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200723085116.4731-5-srikar@linux.vnet.ibm.com>

On Thu, Jul 23, 2020 at 02:21:10PM +0530, Srikar Dronamraju wrote:
> Move topology fixup based on the platform attributes into its own
> function which is called just before set_sched_topology.
> 
> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>


Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
> Changelog v2 -> v3:
> 	Rewrote changelog (Gautham)
> 	Renamed to powerpc/smp: Move topology fixups into  a new function
> 
>  arch/powerpc/kernel/smp.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index a685915e5941..da27f6909be1 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1368,6 +1368,16 @@ int setup_profiling_timer(unsigned int multiplier)
>  	return 0;
>  }
> 
> +static void fixup_topology(void)
> +{
> +#ifdef CONFIG_SCHED_SMT
> +	if (has_big_cores) {
> +		pr_info("Big cores detected but using small core scheduling\n");
> +		powerpc_topology[0].mask = smallcore_smt_mask;
> +	}
> +#endif
> +}
> +
>  void __init smp_cpus_done(unsigned int max_cpus)
>  {
>  	/*
> @@ -1381,12 +1391,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
> 
>  	dump_numa_cpu_topology();
> 
> -#ifdef CONFIG_SCHED_SMT
> -	if (has_big_cores) {
> -		pr_info("Big cores detected but using small core scheduling\n");
> -		powerpc_topology[0].mask = smallcore_smt_mask;
> -	}
> -#endif
> +	fixup_topology();
>  	set_sched_topology(powerpc_topology);
>  }
> 
> -- 
> 2.18.2
> 

^ permalink raw reply

* Re: [PATCH v2 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
From: Gautham R Shenoy @ 2020-07-24  7:10 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Michael Ellerman, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200722065747.GB9290@linux.vnet.ibm.com>

On Wed, Jul 22, 2020 at 12:27:47PM +0530, Srikar Dronamraju wrote:
> * Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-22 11:51:14]:
> 
> > Hi Srikar,
> > 
> > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > > index 72f16dc0cb26..57468877499a 100644
> > > --- a/arch/powerpc/kernel/smp.c
> > > +++ b/arch/powerpc/kernel/smp.c
> > > @@ -1196,6 +1196,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
> > >  	if (!l2_cache)
> > >  		return false;
> > > 
> > > +	cpumask_set_cpu(cpu, mask_fn(cpu));
> > 
> > 
> > Ok, we need to do this because "cpu" is not yet set in the
> > cpu_online_mask. Prior to your patch the "cpu" was getting set in
> > cpu_l2_cache_map(cpu) as a side-effect of the code that is removed in
> > the patch.
> > 
> 
> Right.
> 
> > 
> > >  	for_each_cpu(i, cpu_online_mask) {
> > >  		/*
> > >  		 * when updating the marks the current CPU has not been marked
> > > @@ -1278,29 +1279,30 @@ static void add_cpu_to_masks(int cpu)
> > >  	 * add it to it's own thread sibling mask.
> > >  	 */
> > >  	cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
> > > +	cpumask_set_cpu(cpu, cpu_core_mask(cpu));
> 
> Note: Above, we are explicitly setting the cpu_core_mask.

You are right. I missed this.

> 
> > > 
> > >  	for (i = first_thread; i < first_thread + threads_per_core; i++)
> > >  		if (cpu_online(i))
> > >  			set_cpus_related(i, cpu, cpu_sibling_mask);
> > > 
> > >  	add_cpu_to_smallcore_masks(cpu);
> > > -	/*
> > > -	 * Copy the thread sibling mask into the cache sibling mask
> > > -	 * and mark any CPUs that share an L2 with this CPU.
> > > -	 */
> > > -	for_each_cpu(i, cpu_sibling_mask(cpu))
> > > -		set_cpus_related(cpu, i, cpu_l2_cache_mask);
> > >  	update_mask_by_l2(cpu, cpu_l2_cache_mask);
> > > 
> > > -	/*
> > > -	 * Copy the cache sibling mask into core sibling mask and mark
> > > -	 * any CPUs on the same chip as this CPU.
> > > -	 */
> > > -	for_each_cpu(i, cpu_l2_cache_mask(cpu))
> > > -		set_cpus_related(cpu, i, cpu_core_mask);
> > > +	if (pkg_id == -1) {
> > 
> > I suppose this "if" condition is an optimization, since if pkg_id != -1,
> > we anyway set these CPUs in the cpu_core_mask below.
> > 
> > However...
> 
> This is not just an optimization.
> The hunk removed would only work if cpu_l2_cache_mask is bigger than
> cpu_sibling_mask. (this was the previous assumption that we want to break)
> If the cpu_sibling_mask is bigger than cpu_l2_cache_mask and pkg_id is -1,
> then setting only cpu_l2_cache_mask in cpu_core_mask will result in a broken 
> topology.
> 
> > 
> > > +		struct cpumask *(*mask)(int) = cpu_sibling_mask;
> > > +
> > > +		/*
> > > +		 * Copy the sibling mask into core sibling mask and
> > > +		 * mark any CPUs on the same chip as this CPU.
> > > +		 */
> > > +		if (shared_caches)
> > > +			mask = cpu_l2_cache_mask;
> > > +
> > > +		for_each_cpu(i, mask(cpu))
> > > +			set_cpus_related(cpu, i, cpu_core_mask);
> > > 
> > > -	if (pkg_id == -1)
> > >  		return;
> > > +	}
> > 
> > 
> > ... since "cpu" is not yet set in the cpu_online_mask, do we not miss setting
> > "cpu" in the cpu_core_mask(cpu) in the for-loop below ?
> > 
> > 
> 
> As noted above, we are setting before. So we don't missing the cpu and hence
> have not different from before.


Fair enough.

> 
> > --
> > Thanks and Regards
> > gautham.
> 
> -- 
> Thanks and Regards
> Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH v3 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
From: Gautham R Shenoy @ 2020-07-24  7:11 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	LKML, Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	Jordan Niethe, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200723085116.4731-6-srikar@linux.vnet.ibm.com>


On Thu, Jul 23, 2020 at 02:21:11PM +0530, Srikar Dronamraju wrote:
> Current code assumes that cpumask of cpus sharing a l2-cache mask will
> always be a superset of cpu_sibling_mask.
> 
> Lets stop that assumption. cpu_l2_cache_mask is a superset of
> cpu_sibling_mask if and only if shared_caches is set.
> 
> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

> ---
> Changelog v1 -> v2:
> 	Set cpumask after verifying l2-cache. (Gautham)
> 
>  arch/powerpc/kernel/smp.c | 28 +++++++++++++++-------------
>  1 file changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index da27f6909be1..d997c7411664 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1194,6 +1194,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
>  	if (!l2_cache)
>  		return false;
> 
> +	cpumask_set_cpu(cpu, mask_fn(cpu));
>  	for_each_cpu(i, cpu_online_mask) {
>  		/*
>  		 * when updating the marks the current CPU has not been marked
> @@ -1276,29 +1277,30 @@ static void add_cpu_to_masks(int cpu)
>  	 * add it to it's own thread sibling mask.
>  	 */
>  	cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
> +	cpumask_set_cpu(cpu, cpu_core_mask(cpu));
> 
>  	for (i = first_thread; i < first_thread + threads_per_core; i++)
>  		if (cpu_online(i))
>  			set_cpus_related(i, cpu, cpu_sibling_mask);
> 
>  	add_cpu_to_smallcore_masks(cpu);
> -	/*
> -	 * Copy the thread sibling mask into the cache sibling mask
> -	 * and mark any CPUs that share an L2 with this CPU.
> -	 */
> -	for_each_cpu(i, cpu_sibling_mask(cpu))
> -		set_cpus_related(cpu, i, cpu_l2_cache_mask);
>  	update_mask_by_l2(cpu, cpu_l2_cache_mask);
> 
> -	/*
> -	 * Copy the cache sibling mask into core sibling mask and mark
> -	 * any CPUs on the same chip as this CPU.
> -	 */
> -	for_each_cpu(i, cpu_l2_cache_mask(cpu))
> -		set_cpus_related(cpu, i, cpu_core_mask);
> +	if (pkg_id == -1) {
> +		struct cpumask *(*mask)(int) = cpu_sibling_mask;
> +
> +		/*
> +		 * Copy the sibling mask into core sibling mask and
> +		 * mark any CPUs on the same chip as this CPU.
> +		 */
> +		if (shared_caches)
> +			mask = cpu_l2_cache_mask;
> +
> +		for_each_cpu(i, mask(cpu))
> +			set_cpus_related(cpu, i, cpu_core_mask);
> 
> -	if (pkg_id == -1)
>  		return;
> +	}
> 
>  	for_each_cpu(i, cpu_online_mask)
>  		if (get_physical_package_id(i) == pkg_id)
> -- 
> 2.18.2
> 

^ permalink raw reply

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Arnd Bergmann @ 2020-07-24  7:20 UTC (permalink / raw)
  To: Atish Patra
  Cc: Albert Ou, Alexandre Ghiti, Linux-MM, Anup Patel,
	linux-kernel@vger.kernel.org, Atish Patra, Palmer Dabbelt,
	Zong Li, Paul Walmsley, Paul Mackerras, linux-riscv, linuxppc-dev
In-Reply-To: <CAOnJCULmX+vUcpEmBd5w7xjtZSFk=Ju2V=wBJCOXHQ8m9yG9-Q@mail.gmail.com>

On Wed, Jul 22, 2020 at 11:06 PM Atish Patra <atishp@atishpatra.org> wrote:
>
> On Wed, Jul 22, 2020 at 1:23 PM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > I just noticed that rv32 allows 2GB of lowmem rather than just the usual
> > 768MB or 1GB, at the expense of addressable user memory. This seems
> > like an unusual choice, but I also don't see any reason to change this
> > or make it more flexible unless actual users appear.
> >
>
> I am a bit confused here. As per my understanding, RV32 supports 1GB
> of lowmem only
> as the page offset is set to 0xC0000000. The config option
> MAXPHYSMEM_2GB is misleading
> as RV32 actually allows 1GB of physical memory only.

Ok, in that case I was apparently misled by the Kconfig option name.

I just tried building a kernel to see what the boundaries actually are,
as this is not the only confusing bit. Here is what I see:

0x9dc00000 TASK_SIZE/FIXADDR_START   /* code comment says 0x9fc00000 */
0x9e000000 FIXADDR_TOP/PCI_IO_START
0x9f000000 PCI_IO_END/VMEMMAP_START
0xa0000000 VMEMMAP_END/VMALLOC_START
0xc0000000 VMALLOC_END/PAGE_OFFSET

Having exactly 1GB of linear map does make a lot of sense. Having PCI I/O,
vmemmap and fixmap come out of the user range means you get slightly
different behavior in user space if there are any changes to that set, but
that is probably fine as well, if you want the flexibility to go to a 2GB linear
map and expect user space to deal with that as well.

There is one common trick from arm32 however that you might want to
consider: if vmalloc was moved above the linear map rather than below,
the size of the vmalloc area can dynamically depend on the amount of
RAM that is actually present rather than be set to a fixed value.

On arm32, there is around 240MB of vmalloc space if the linear map
is fully populated with RAM, but it can grow to use all of the avaialable
address space if less RAM was detected at boot time (up to 3GB
depending on CONFIG_VMSPLIT).

> Any memory blocks beyond
> DRAM + 1GB are removed in setup_bootmem. IMHO, The current config
> should clarify that.
>
> Moreover, we should add 2G split under a separate configuration if we
> want to support that.

Right. It's probably not needed immediately, but can't hurt either.

    Arnd

^ permalink raw reply

* Re: [PATCH v5 7/7] KVM: PPC: Book3S HV: rework secure mem slot dropping
From: Laurent Dufour @ 2020-07-24  7:43 UTC (permalink / raw)
  To: bharata, Ram Pai
  Cc: cclaudio, kvm-ppc, sathnaga, aneesh.kumar, sukadev, linuxppc-dev,
	bauerman, david
In-Reply-To: <20200724030337.GC1082478@in.ibm.com>

Le 24/07/2020 à 05:03, Bharata B Rao a écrit :
> On Thu, Jul 23, 2020 at 01:07:24PM -0700, Ram Pai wrote:
>> From: Laurent Dufour <ldufour@linux.ibm.com>
>>
>> When a secure memslot is dropped, all the pages backed in the secure
>> device (aka really backed by secure memory by the Ultravisor)
>> should be paged out to a normal page. Previously, this was
>> achieved by triggering the page fault mechanism which is calling
>> kvmppc_svm_page_out() on each pages.
>>
>> This can't work when hot unplugging a memory slot because the memory
>> slot is flagged as invalid and gfn_to_pfn() is then not trying to access
>> the page, so the page fault mechanism is not triggered.
>>
>> Since the final goal is to make a call to kvmppc_svm_page_out() it seems
>> simpler to call directly instead of triggering such a mechanism. This
>> way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
>> memslot.
>>
>> Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
>> the call to __kvmppc_svm_page_out() is made.  As
>> __kvmppc_svm_page_out needs the vma pointer to migrate the pages,
>> the VMA is fetched in a lazy way, to not trigger find_vma() all
>> the time. In addition, the mmap_sem is held in read mode during
>> that time, not in write mode since the virual memory layout is not
>> impacted, and kvm->arch.uvmem_lock prevents concurrent operation
>> on the secure device.
>>
>> Cc: Ram Pai <linuxram@us.ibm.com>
>> Cc: Bharata B Rao <bharata@linux.ibm.com>
>> Cc: Paul Mackerras <paulus@ozlabs.org>
>> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
>> 	[modified the changelog description]
>> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
>> ---
>>   arch/powerpc/kvm/book3s_hv_uvmem.c | 54 ++++++++++++++++++++++++++------------
>>   1 file changed, 37 insertions(+), 17 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
>> index c772e92..daffa6e 100644
>> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
>> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
>> @@ -632,35 +632,55 @@ static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
>>    * fault on them, do fault time migration to replace the device PTEs in
>>    * QEMU page table with normal PTEs from newly allocated pages.
>>    */
>> -void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
>> +void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
>>   			     struct kvm *kvm, bool skip_page_out)
>>   {
>>   	int i;
>>   	struct kvmppc_uvmem_page_pvt *pvt;
>> -	unsigned long pfn, uvmem_pfn;
>> -	unsigned long gfn = free->base_gfn;
>> +	struct page *uvmem_page;
>> +	struct vm_area_struct *vma = NULL;
>> +	unsigned long uvmem_pfn, gfn;
>> +	unsigned long addr, end;
>> +
>> +	mmap_read_lock(kvm->mm);
>> +
>> +	addr = slot->userspace_addr;
>> +	end = addr + (slot->npages * PAGE_SIZE);
>>   
>> -	for (i = free->npages; i; --i, ++gfn) {
>> -		struct page *uvmem_page;
>> +	gfn = slot->base_gfn;
>> +	for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
>> +
>> +		/* Fetch the VMA if addr is not in the latest fetched one */
>> +		if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
>> +			vma = find_vma_intersection(kvm->mm, addr, end);
>> +			if (!vma ||
>> +			    vma->vm_start > addr || vma->vm_end < end) {
>> +				pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
>> +				break;
>> +			}
> 
> There is a potential issue with the boundary condition check here
> which I discussed with Laurent yesterday. Guess he hasn't gotten around
> to look at it yet.

Right, I'm working on that..



^ permalink raw reply

* [PATCH] powerpc/book3s64/radix: Add kernel command line option to disable radix GTSE
From: Aneesh Kumar K.V @ 2020-07-24  7:56 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V, Bharata B Rao

This adds a kernel command line option that can be used to disable GTSE support.
Disabling GTSE implies kernel will make hcalls to invalidate TLB entries.

This was done so that we can do VM migration between configs that enable/disable
GTSE support via hypervisor. To migrate a VM from a system that supports
GTSE to a system that doesn't, we can boot the guest with radix_gtse=off, thereby
forcing the guest to use hcalls for TLB invalidates.

The check for hcall availability is done in pSeries_setup_arch so that
the panic message appears on the console. This should only happen on
a hypervisor that doesn't force the guest to hash translation even
though it can't handle the radix GTSE=0 request via CAS. With radix_gtse=off
if the hypervisor doesn't support hcall_rpt_invalidate hcall it should
force the LPAR to hash translation.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 arch/powerpc/include/asm/firmware.h             |  4 +++-
 arch/powerpc/kernel/prom_init.c                 | 13 +++++++++----
 arch/powerpc/platforms/pseries/firmware.c       |  1 +
 arch/powerpc/platforms/pseries/setup.c          |  5 +++++
 5 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..df20c98a8920 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -896,6 +896,9 @@
 	disable_radix	[PPC]
 			Disable RADIX MMU mode on POWER9
 
+	radix_gtse=off	[PPC/PSERIES]
+			Disable RADIX GTSE feature.
+
 	disable_tlbie	[PPC]
 			Disable TLBIE instruction. Currently does not work
 			with KVM, with HASH MMU, or with coherent accelerators.
diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 6003c2e533a0..aa6a5ef5d483 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -52,6 +52,7 @@
 #define FW_FEATURE_PAPR_SCM 	ASM_CONST(0x0000002000000000)
 #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0000004000000000)
 #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0000008000000000)
+#define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0000010000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -71,7 +72,8 @@ enum {
 		FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
 		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
 		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
-		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR,
+		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
+		FW_FEATURE_RPT_INVALIDATE,
 	FW_FEATURE_PSERIES_ALWAYS = 0,
 	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
 	FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index cbc605cfdec0..e7e91965fe6c 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -169,6 +169,7 @@ static unsigned long __prombss prom_tce_alloc_end;
 
 #ifdef CONFIG_PPC_PSERIES
 static bool __prombss prom_radix_disable;
+static bool __prombss prom_radix_gtse_disable;
 static bool __prombss prom_xive_disable;
 #endif
 
@@ -823,6 +824,12 @@ static void __init early_cmdline_parse(void)
 	if (prom_radix_disable)
 		prom_debug("Radix disabled from cmdline\n");
 
+	opt = prom_strstr(prom_cmd_line, "radix_gtse=off");
+	if (opt) {
+		prom_radix_gtse_disable = true;
+		prom_debug("Radix GTSE disabled from cmdline\n");
+	}
+
 	opt = prom_strstr(prom_cmd_line, "xive=off");
 	if (opt) {
 		prom_xive_disable = true;
@@ -1285,10 +1292,8 @@ static void __init prom_parse_platform_support(u8 index, u8 val,
 		prom_parse_mmu_model(val & OV5_FEAT(OV5_MMU_SUPPORT), support);
 		break;
 	case OV5_INDX(OV5_RADIX_GTSE): /* Radix Extensions */
-		if (val & OV5_FEAT(OV5_RADIX_GTSE)) {
-			prom_debug("Radix - GTSE supported\n");
-			support->radix_gtse = true;
-		}
+		if (val & OV5_FEAT(OV5_RADIX_GTSE))
+			support->radix_gtse = !prom_radix_gtse_disable;
 		break;
 	case OV5_INDX(OV5_XIVE_SUPPORT): /* Interrupt mode */
 		prom_parse_xive_model(val & OV5_FEAT(OV5_XIVE_SUPPORT),
diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 3e49cc23a97a..4c7b7f5a2ebc 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -65,6 +65,7 @@ hypertas_fw_features_table[] = {
 	{FW_FEATURE_HPT_RESIZE,		"hcall-hpt-resize"},
 	{FW_FEATURE_BLOCK_REMOVE,	"hcall-block-remove"},
 	{FW_FEATURE_PAPR_SCM,		"hcall-scm"},
+	{FW_FEATURE_RPT_INVALIDATE,	"hcall-rpt-invalidate"},
 };
 
 /* Build up the firmware features bitmask using the contents of
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 2db8469e475f..9e5e098665f7 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -745,6 +745,11 @@ static void __init pSeries_setup_arch(void)
 	smp_init_pseries();
 
 
+	if (radix_enabled() && !mmu_has_feature(MMU_FTR_GTSE))
+		if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE))
+			panic("BUG: Radix support requires either GTSE or RPT_INVALIDATE\n");
+
+
 	/* openpic global configuration register (64-bit format). */
 	/* openpic Interrupt Source Unit pointer (64-bit format). */
 	/* python0 facility area (mmio) (64-bit format) REAL address. */
-- 
2.26.2


^ permalink raw reply related

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Arnd Bergmann @ 2020-07-24  8:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Albert Ou, Alex Ghiti, Atish Patra, Anup Patel,
	linux-kernel@vger.kernel.org, Linux-MM, Palmer Dabbelt, Zong Li,
	Paul Walmsley, Paul Mackerras, linux-riscv, linuxppc-dev
In-Reply-To: <418d5f3d3f42bbc79c5cf30e18ec89edfe2dbd26.camel@kernel.crashing.org>

On Fri, Jul 24, 2020 at 12:34 AM Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Thu, 2020-07-23 at 01:21 -0400, Alex Ghiti wrote:
> > > works fine with huge pages, what is your problem there ? You rely on
> > > punching small-page size holes in there ?
> > >
> >
> > ARCH_HAS_STRICT_KERNEL_RWX prevents the use of a hugepage for the kernel
> > mapping in the direct mapping as it sets different permissions to
> > different part of the kernel (data, text..etc).
>
> Ah ok, that can be solved in a couple of ways...
>
> One is to use the linker script to ensure those sections are linked
> HUGE_PAGE_SIZE appart and moved appropriately by early boot code. One
> is to selectively degrade just those huge pages.
>
> I'm not familiar with the RiscV MMU (I should probably go have a look)
> but if it's a classic radix tree with huge pages at PUD/PMD level, then
> you could just degrade the one(s) that cross those boundaries.

That would work, but if the system can otherwise use 1GB-sized pages,
that might mean degrading the first gigabyte into a mix of 2MB pages
and 4KB pages.

If the kernel is in vmalloc space and vmap is able to use 2MB pages
for contiguous chunks of the mapping, you get a somewhat better
TLB usage.

However, this also means that a writable mapping exists in the
linear mapping for any executable part of the kernel (.text in
both vmlinux and modules). Do we have that on other architectures
as well, or is this something that ought to be prevented with
STRICT_KERNEL_RWX/STRICT_MODULE_RWX?

     Arnd

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox