LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/3] KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
From: Suraj Jitindar Singh @ 2019-06-20  1:46 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: clg, kvm-ppc, sjitindarsingh

When a guest vcpu moves from one physical thread to another it is
necessary for the host to perform a tlb flush on the previous core if
another vcpu from the same guest is going to run there. This is because the
guest may use the local form of the tlb invalidation instruction meaning
stale tlb entries would persist where it previously ran. This is handled
on guest entry in kvmppc_check_need_tlb_flush() which calls
flush_guest_tlb() to perform the tlb flush.

Previously the generic radix__local_flush_tlb_lpid_guest() function was
used, however the functionality was reimplemented in flush_guest_tlb()
to avoid the trace_tlbie() call as the flushing may be done in real
mode. The reimplementation in flush_guest_tlb() was missing an erat
invalidation after flushing the tlb.

This lead to observable memory corruption in the guest due to the
caching of stale translations. Fix this by adding the erat invalidation.

Fixes: 70ea13f6e609 "KVM: PPC: Book3S HV: Flush TLB on secondary radix threads"

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
---
 arch/powerpc/kvm/book3s_hv_builtin.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c
index 6035d24f1d1d..a46286f73eec 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -833,6 +833,7 @@ static void flush_guest_tlb(struct kvm *kvm)
 		}
 	}
 	asm volatile("ptesync": : :"memory");
+	asm volatile(PPC_INVALIDATE_ERAT : : :"memory");
 }

 void kvmppc_check_need_tlb_flush(struct kvm *kvm, int pcpu,
-- 
2.13.6

^ permalink raw reply related

* Re: [PATCH 3/4] powerpc/powernv: remove dead NPU DMA code
From: Alexey Kardashevskiy @ 2019-06-20  1:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <20190619072837.GA6858@lst.de>



On 19/06/2019 17:28, Christoph Hellwig wrote:
> On Wed, Jun 19, 2019 at 10:34:54AM +1000, Alexey Kardashevskiy wrote:
>>
>>
>> On 23/05/2019 17:49, Christoph Hellwig wrote:
>>> None of these routines were ever used since they were added to the
>>> kernel.
>>
>>
>> It is still being used exactly in the way as it was explained before in
>> previous respins. Thanks.
> 
> Please point to the in-kernel user, because that is the only relevant
> one.  This is not just my opinion but we had a clear discussion on that
> at least years kernel summit.


There is no in-kernel user which still does not mean that the code is
dead. If it is irrelevant - put this to the commit log instead of saying
it is dead; also if there was a clear outcome from that discussion, then
please point me to that, I do not get to attend these discussions. Thanks,


-- 
Alexey

^ permalink raw reply

* Re: [PATCH 0/2] Fix handling of h_set_dawr
From: Suraj Jitindar Singh @ 2019-06-20  1:45 UTC (permalink / raw)
  To: Cédric Le Goater, linuxppc-dev; +Cc: mikey, kvm-ppc
In-Reply-To: <87e219c8-1db7-9976-03ce-5a566a8df7ab@kaod.org>

On Mon, 2019-06-17 at 11:06 +0200, Cédric Le Goater wrote:
> On 17/06/2019 09:16, Suraj Jitindar Singh wrote:
> > Series contains 2 patches to fix the host in kernel handling of the
> > hcall
> > h_set_dawr.
> > 
> > First patch from Michael Neuling is just a resend added here for
> > clarity.
> > 
> > Michael Neuling (1):
> >   KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()
> > 
> > Suraj Jitindar Singh (1):
> >   KVM: PPC: Book3S HV: Only write DAWR[X] when handling h_set_dawr
> > in
> >     real mode
> 
> 
> 
> Reviewed-by: Cédric Le Goater <clg@kaod.org>
> 
> and 
> 
> Tested-by: Cédric Le Goater <clg@kaod.org>
> 
> 
> but I see slowdowns in nested as if the IPIs were not delivered. Have
> we
> touch this part in 5.2 ? 

Hi,

I've seen the same and tracked it down to decrementer exceptions not
being delivered when the guest is using large decrementer. I've got a
patch I'm about to send so I'll CC you.

Another option is to disable the large decrementer with:
-machine pseries,cap-large-decr=false

Thanks,
Suraj

> 
> Thanks,
> 
> C.
> 

^ permalink raw reply

* Re: [PATCH v5 2/2] powerpc: Fix compile issue with force DAWR
From: Christophe Leroy @ 2019-06-19 18:03 UTC (permalink / raw)
  To: Michael Neuling, mpe; +Cc: Mathieu Malaterre, hch, linuxppc-dev
In-Reply-To: <3426e38c9028694f2ea55f6adaf3b679a1bce19f.camel@neuling.org>



Le 19/06/2019 à 03:11, Michael Neuling a écrit :
> On Tue, 2019-06-18 at 18:28 +0200, Christophe Leroy wrote:
>>
>> Le 04/06/2019 à 05:00, Michael Neuling a écrit :
>>> If you compile with KVM but without CONFIG_HAVE_HW_BREAKPOINT you fail
>>> at linking with:
>>>     arch/powerpc/kvm/book3s_hv_rmhandlers.o:(.text+0x708): undefined
>>> reference to `dawr_force_enable'
>>>
>>> This was caused by commit c1fe190c0672 ("powerpc: Add force enable of
>>> DAWR on P9 option").
>>>
>>> This moves a bunch of code around to fix this. It moves a lot of the
>>> DAWR code in a new file and creates a new CONFIG_PPC_DAWR to enable
>>> compiling it.
>>
>> After looking at all this once more, I'm just wondering: why are we
>> creating stuff specific to DAWR ?
>>
>> In the old days, we only add DABR, and everything was named on DABR.
>> When DAWR was introduced some years ago we renamed stuff like do_dabr()
>> to do_break() so that we could regroup things together. And now we are
>> taking dawr() out of the rest. Why not keep dabr() stuff and dawr()
>> stuff all together in something dedicated to breakpoints, and try to
>> regroup all breakpoint stuff in a single place ? I see some
>> breakpointing stuff done in kernel/process.c and other things done in
>> hw_breakpoint.c, to common functions call from one file to the other,
>> preventing GCC to fully optimise, etc ...
>>
>> Also, behing this thinking, I have the idea that we could easily
>> implement 512 bytes breakpoints on the 8xx too. The 8xx have neither
>> DABR nor DAWR, but is using a set of comparators. And as you can see in
>> the 8xx version of __set_dabr() in kernel/process.c, we emulate the DABR
>> behaviour by setting two comparators. By using the same comparators with
>> a different setup, we should be able to implement breakpoints on larger
>> ranges of address.
> 
> Christophe
> 
> I agree that their are opportunities to refactor this code and I appreciate your
> efforts in making this code better but...
> 
> We have a problem here of not being able to compile an odd ball case that almost
> no one ever hits (it was just an odd mpe CI case). We're up to v5 of a simple
> fix which is just silly.
> 
> So let's get this fix in and move on to the whole bunch of refactoring we can do
> in this code which is already documented in the github issue tracking.
> 

Agreed.

I've filed the following issue to keep that in mind: 
https://github.com/linuxppc/issues/issues/251

Thanks
Christophe

^ permalink raw reply

* Re: [PATCH 4/7] powerpc/ftrace: Additionally nop out the preceding mflr with -mprofile-kernel
From: Naveen N. Rao @ 2019-06-19 17:14 UTC (permalink / raw)
  To: Masami Hiramatsu, Ingo Molnar, Michael Ellerman, Nicholas Piggin,
	Steven Rostedt
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1560939496.ovo51ph4i4.astroid@bobo.none>

Nicholas Piggin wrote:
> Naveen N. Rao's on June 19, 2019 7:53 pm:
>> Nicholas Piggin wrote:
>>> Michael Ellerman's on June 19, 2019 3:14 pm:
>>>> 
>>>> I'm also not convinced the ordering between the two patches is
>>>> guaranteed by the ISA, given that there's possibly no isync on the other
>>>> CPU.
>>> 
>>> Will they go through a context synchronizing event?
>>> 
>>> synchronize_rcu_tasks() should ensure a thread is scheduled away, but
>>> I'm not actually sure it guarantees CSI if it's kernel->kernel. Could
>>> do a smp_call_function to do the isync on each CPU to be sure.
>> 
>> Good point. Per 
>> Documentation/RCU/Design/Requirements/Requirements.html#Tasks RCU:
>> "The solution, in the form of Tasks RCU, is to have implicit read-side 
>> critical sections that are delimited by voluntary context switches, that 
>> is, calls to schedule(), cond_resched(), and synchronize_rcu_tasks(). In 
>> addition, transitions to and from userspace execution also delimit 
>> tasks-RCU read-side critical sections."
>> 
>> I suppose transitions to/from userspace, as well as calls to schedule() 
>> result in context synchronizing instruction being executed. But, if some 
>> tasks call cond_resched() and synchronize_rcu_tasks(), we probably won't 
>> have a CSI executed.
>> 
>> Also:
>> "In CONFIG_PREEMPT=n kernels, trampolines cannot be preempted, so these 
>> APIs map to call_rcu(), synchronize_rcu(), and rcu_barrier(), 
>> respectively."
>> 
>> In this scenario as well, I think we won't have a CSI executed in case 
>> of cond_resched().
>> 
>> Should we enhance patch_instruction() to handle that?
> 
> Well, not sure. Do we have many post-boot callers of it? Should
> they take care of their own synchronization requirements?

Kprobes and ftrace are the two users (along with anything else that may 
use jump labels).

Looking at this from the CMODX perspective: the main example quoted of 
an erratic behavior is when any variant of the patched instruction 
causes an exception.

With ftrace, I think we are ok since we only ever patch a 'nop' or a 
'bl' (and the 'mflr' now), none of which should cause an exception. As 
such, the existing patch_instruction() should suffice.

However, with kprobes, we patch a 'trap' (or a branch in case of 
optprobes) on most instructions. I wonder if we should be issuing an 
'isync' on all cpus in this case. Or, even if that is sufficient or 
necessary.


Thanks,
Naveen



^ permalink raw reply

* Re: [PATCH 2/3] powerpc/64s/radix: ioremap use ioremap_page_range
From: Christophe Leroy @ 2019-06-19 16:25 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <1560915874.eudrz3r20a.astroid@bobo.none>

Le 19/06/2019 à 05:59, Nicholas Piggin a écrit :
> Christophe Leroy's on June 11, 2019 4:46 pm:
>>
>>
>> Le 10/06/2019 à 05:08, Nicholas Piggin a écrit :
> I would like to remove the early ioremap or make it into its own
> function. Re-implement map_kernel_page with ioremap_page_range,
> allow page tables that don't use slab to avoid the early check,
> unbolt the hptes mapped in early boot, etc.

Getting early ioremap out of the picture is a very good idea, it will 
help making things more common between all platform types. Today we face 
the fact that PPC32 allocates early io from the top of memory while 
PPC64 allocates it from the bottom of memory.

Any idea on how to proceed ?

Christophe

^ permalink raw reply

* [PATCH v3] KVM: PPC: Report single stepping capability
From: Fabiano Rosas @ 2019-06-19 16:01 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm, rkrcmar, aik, pbonzini, linuxppc-dev, david

When calling the KVM_SET_GUEST_DEBUG ioctl, userspace might request
the next instruction to be single stepped via the
KVM_GUESTDBG_SINGLESTEP control bit of the kvm_guest_debug structure.

This patch adds the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability in order
to inform userspace about the state of single stepping support.

We currently don't have support for guest single stepping implemented
in Book3S HV so the capability is only present for Book3S PR and
BookE.

Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
---

v1 -> v2:
 - add capability description to Documentation/virtual/kvm/api.txt

v2 -> v3:
 - be explicit in the commit message about when the capability is
   present
 - remove unnecessary check for CONFIG_BOOKE

 Documentation/virtual/kvm/api.txt | 3 +++
 arch/powerpc/kvm/powerpc.c        | 2 ++
 include/uapi/linux/kvm.h          | 1 +
 3 files changed, 6 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ba6c42c576dd..a77643bfa917 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2969,6 +2969,9 @@ can be determined by querying the KVM_CAP_GUEST_DEBUG_HW_BPS and
 KVM_CAP_GUEST_DEBUG_HW_WPS capabilities which return a positive number
 indicating the number of supported registers.

+For ppc, the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability indicates whether
+the single-step debug event (KVM_GUESTDBG_SINGLESTEP) is supported.
+
 When debug events exit the main run loop with the reason
 KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run
 structure containing architecture specific debug information.
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6d704ad2472b..bd0a73eaf7ba 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -527,6 +527,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_IMMEDIATE_EXIT:
 		r = 1;
 		break;
+	case KVM_CAP_PPC_GUEST_DEBUG_SSTEP:
+		/* fall through */
 	case KVM_CAP_PPC_PAIRED_SINGLES:
 	case KVM_CAP_PPC_OSI:
 	case KVM_CAP_PPC_GET_PVINFO:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2fe12b40d503..cad9fcd90f39 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -993,6 +993,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_SVE 170
 #define KVM_CAP_ARM_PTRAUTH_ADDRESS 171
 #define KVM_CAP_ARM_PTRAUTH_GENERIC 172
+#define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 173

 #ifdef KVM_CAP_IRQ_ROUTING

--
2.20.1


^ permalink raw reply related

* Re: [RFC PATCH v0] powerpc: Fix BUG_ON during memory unplug on radix
From: Bharata B Rao @ 2019-06-19 14:40 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, aneesh.kumar, npiggin
In-Reply-To: <87tvcm0wr5.fsf@linux.ibm.com>

On Wed, Jun 19, 2019 at 02:36:54PM +0530, Aneesh Kumar K.V wrote:
> Bharata B Rao <bharata@linux.ibm.com> writes:
> 
> > We hit the following BUG_ON when memory hotplugged before reboot
> > is unplugged after reboot:
> >
> > kernel BUG at arch/powerpc/mm/pgtable-frag.c:113!
> >
> >  remove_pagetable+0x594/0x6a0
> >  (unreliable)
> >  remove_pagetable+0x94/0x6a0
> >  vmemmap_free+0x394/0x410
> >  sparse_remove_one_section+0x26c/0x2e8
> >  __remove_pages+0x428/0x540
> >  arch_remove_memory+0xd0/0x170
> >  __remove_memory+0xd4/0x1a0
> >  dlpar_remove_lmb+0xbc/0x110
> >  dlpar_memory+0xa80/0xd20
> >  handle_dlpar_errorlog+0xa8/0x160
> >  pseries_hp_work_fn+0x2c/0x60
> >  process_one_work+0x46c/0x860
> >  worker_thread+0x364/0x5e0
> >  kthread+0x1b0/0x1c0
> >  ret_from_kernel_thread+0x5c/0x68
> >
> > This occurs because, during reboot-after-hotplug, the hotplugged
> > memory range gets initialized as regular memory and page
> > tables are setup using memblock allocator. This means that we
> > wouldn't have initialized the PMD or PTE fragment count for
> > those PMD or PTE pages.
> >
> > Fixing this includes 3 aspects:
> >
> > - Walk the init_mm page tables from mem_init() and initialize
> >   the PMD and PTE fragment counts appropriately.
> > - When we do early allocation of PMD (and PGD as well) pages,
> >   allocate in page size PAGE_SIZE granularity so that we are
> >   sure that the complete page is available for us to set the
> >   fragment count which is part of struct page.
> 
> 
> That is an important change now. For early page table we now allocate
> PAGE_SIZE tables and hencec we consider then as pages with fragment
> count 1. You also may want to explain here why.

Sure will make this clear in my next version.

> I guess the challenge is
> due to the fact that we can't clearly control how the rest of the page
> will get used and we are not sure they all will be allocated for backing
> page table pages.
> 
> > - When PMD or PTE page is freed, check if it comes from memblock
> >   allocator and free it appropriately.
> >
> > Reported-by: Srikanth Aithal <sraithal@linux.vnet.ibm.com>
> > Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> > ---
> >  arch/powerpc/include/asm/book3s/64/radix.h |  1 +
> >  arch/powerpc/include/asm/sparsemem.h       |  1 +
> >  arch/powerpc/mm/book3s64/pgtable.c         | 12 +++-
> >  arch/powerpc/mm/book3s64/radix_pgtable.c   | 67 +++++++++++++++++++++-
> >  arch/powerpc/mm/mem.c                      |  5 ++
> >  arch/powerpc/mm/pgtable-frag.c             |  5 +-
> >  6 files changed, 87 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
> > index 574eca33f893..4320f2790e8d 100644
> > --- a/arch/powerpc/include/asm/book3s/64/radix.h
> > +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> > @@ -285,6 +285,7 @@ static inline unsigned long radix__get_tree_size(void)
> >  #ifdef CONFIG_MEMORY_HOTPLUG
> >  int radix__create_section_mapping(unsigned long start, unsigned long end, int nid);
> >  int radix__remove_section_mapping(unsigned long start, unsigned long end);
> > +void radix__fixup_pgtable_fragments(void);
> >  #endif /* CONFIG_MEMORY_HOTPLUG */
> >  #endif /* __ASSEMBLY__ */
> >  #endif
> > diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h
> > index 3192d454a733..e662f9232d35 100644
> > --- a/arch/powerpc/include/asm/sparsemem.h
> > +++ b/arch/powerpc/include/asm/sparsemem.h
> > @@ -15,6 +15,7 @@
> >  #ifdef CONFIG_MEMORY_HOTPLUG
> >  extern int create_section_mapping(unsigned long start, unsigned long end, int nid);
> >  extern int remove_section_mapping(unsigned long start, unsigned long end);
> > +void fixup_pgtable_fragments(void);
> >
> >  #ifdef CONFIG_PPC_BOOK3S_64
> >  extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
> > diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> > index 01bc9663360d..7efe9cc16b39 100644
> > --- a/arch/powerpc/mm/book3s64/pgtable.c
> > +++ b/arch/powerpc/mm/book3s64/pgtable.c
> > @@ -186,6 +186,13 @@ int __meminit remove_section_mapping(unsigned long start, unsigned long end)
> >
> >  	return hash__remove_section_mapping(start, end);
> >  }
> > +
> > +void fixup_pgtable_fragments(void)
> > +{
> > +	if (radix_enabled())
> > +		radix__fixup_pgtable_fragments();
> > +}
> > +
> >  #endif /* CONFIG_MEMORY_HOTPLUG */
> >
> >  void __init mmu_partition_table_init(void)
> > @@ -320,7 +327,10 @@ void pmd_fragment_free(unsigned long *pmd)
> >  	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
> >  	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
> >  		pgtable_pmd_page_dtor(page);
> > -		__free_page(page);
> > +		if (PageReserved(page))
> > +			free_reserved_page(page);
> > +		else
> > +			__free_page(page);
> >  	}
> >  }
> >
> > diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
> > index 273ae66a9a45..402e8da28cab 100644
> > --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> > +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> > @@ -32,6 +32,69 @@
> >  unsigned int mmu_pid_bits;
> >  unsigned int mmu_base_pid;
> >
> > +static void fixup_pmd_fragments(pmd_t *pmd)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < PTRS_PER_PMD; i++, pmd++) {
> > +		pte_t *pte;
> > +		struct page *page;
> > +
> > +		if (pmd_none(*pmd))
> > +			continue;
> > +		if (pmd_huge(*pmd))
> > +			continue;
> > +
> > +		pte = pte_offset_kernel(pmd, 0);
> > +		if (!pte_none(*pte)) {
> > +			page = virt_to_page(pte);
> > +			atomic_inc(&page->pt_frag_refcount);
> > +		}
> > +	}
> > +}
> 
> 
> That naming is confusing. You are fixing up fragemts used for level 4
> table here. I would call them pte fragments? 

Sure, will rename.

> 
> 
> > +
> > +static void fixup_pud_fragments(pud_t *pud)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
> > +		pmd_t *pmd;
> > +		struct page *page;
> > +
> > +		if (pud_none(*pud))
> > +			continue;
> > +		if (pud_huge(*pud))
> > +			continue;
> > +
> > +		pmd = pmd_offset(pud, 0);
> > +		if (!pmd_none(*pmd)) {
> > +			page = virt_to_page(pmd);
> > +			atomic_inc(&page->pt_frag_refcount);
> > +		}
> > +		fixup_pmd_fragments(pmd);
> > +	}
> > +}
> > +
> 
> 
> How do we handle pud table free. That is going to be tricky for general
> allocation they are allocated out of slab and we free them back to slab.
> With pages allocated out of memblock, we need to special case them? 

Yes. With out addressing the freeing of pud table, we can't say that
memory unplug on radix is working fully. However with this fixing of
frag counts, it does work for a few cases.

May be we handle the pud case separately?

Regards,
Bharata.


^ permalink raw reply

* Re: [RFC PATCH v0] powerpc: Fix BUG_ON during memory unplug on radix
From: Bharata B Rao @ 2019-06-19 14:36 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linuxppc-dev, aneesh.kumar
In-Reply-To: <1560939185.n3y8722qvc.astroid@bobo.none>

On Wed, Jun 19, 2019 at 08:17:01PM +1000, Nicholas Piggin wrote:
> Bharata B Rao's on June 19, 2019 5:45 pm:
> > We hit the following BUG_ON when memory hotplugged before reboot
> > is unplugged after reboot:
> > 
> > kernel BUG at arch/powerpc/mm/pgtable-frag.c:113!
> > 
> >  remove_pagetable+0x594/0x6a0
> >  (unreliable)
> >  remove_pagetable+0x94/0x6a0
> >  vmemmap_free+0x394/0x410
> >  sparse_remove_one_section+0x26c/0x2e8
> >  __remove_pages+0x428/0x540
> >  arch_remove_memory+0xd0/0x170
> >  __remove_memory+0xd4/0x1a0
> >  dlpar_remove_lmb+0xbc/0x110
> >  dlpar_memory+0xa80/0xd20
> >  handle_dlpar_errorlog+0xa8/0x160
> >  pseries_hp_work_fn+0x2c/0x60
> >  process_one_work+0x46c/0x860
> >  worker_thread+0x364/0x5e0
> >  kthread+0x1b0/0x1c0
> >  ret_from_kernel_thread+0x5c/0x68
> > 
> > This occurs because, during reboot-after-hotplug, the hotplugged
> > memory range gets initialized as regular memory and page
> > tables are setup using memblock allocator. This means that we
> > wouldn't have initialized the PMD or PTE fragment count for
> > those PMD or PTE pages.
> > 
> > Fixing this includes 3 aspects:
> > 
> > - Walk the init_mm page tables from mem_init() and initialize
> >   the PMD and PTE fragment counts appropriately.
> > - When we do early allocation of PMD (and PGD as well) pages,
> >   allocate in page size PAGE_SIZE granularity so that we are
> >   sure that the complete page is available for us to set the
> >   fragment count which is part of struct page.
> > - When PMD or PTE page is freed, check if it comes from memblock
> >   allocator and free it appropriately.
> > 
> > Reported-by: Srikanth Aithal <sraithal@linux.vnet.ibm.com>
> > Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> > ---
> >  arch/powerpc/include/asm/book3s/64/radix.h |  1 +
> >  arch/powerpc/include/asm/sparsemem.h       |  1 +
> >  arch/powerpc/mm/book3s64/pgtable.c         | 12 +++-
> >  arch/powerpc/mm/book3s64/radix_pgtable.c   | 67 +++++++++++++++++++++-
> >  arch/powerpc/mm/mem.c                      |  5 ++
> >  arch/powerpc/mm/pgtable-frag.c             |  5 +-
> >  6 files changed, 87 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
> > index 574eca33f893..4320f2790e8d 100644
> > --- a/arch/powerpc/include/asm/book3s/64/radix.h
> > +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> > @@ -285,6 +285,7 @@ static inline unsigned long radix__get_tree_size(void)
> >  #ifdef CONFIG_MEMORY_HOTPLUG
> >  int radix__create_section_mapping(unsigned long start, unsigned long end, int nid);
> >  int radix__remove_section_mapping(unsigned long start, unsigned long end);
> > +void radix__fixup_pgtable_fragments(void);
> >  #endif /* CONFIG_MEMORY_HOTPLUG */
> >  #endif /* __ASSEMBLY__ */
> >  #endif
> > diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h
> > index 3192d454a733..e662f9232d35 100644
> > --- a/arch/powerpc/include/asm/sparsemem.h
> > +++ b/arch/powerpc/include/asm/sparsemem.h
> > @@ -15,6 +15,7 @@
> >  #ifdef CONFIG_MEMORY_HOTPLUG
> >  extern int create_section_mapping(unsigned long start, unsigned long end, int nid);
> >  extern int remove_section_mapping(unsigned long start, unsigned long end);
> > +void fixup_pgtable_fragments(void);
> >  
> >  #ifdef CONFIG_PPC_BOOK3S_64
> >  extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
> > diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> > index 01bc9663360d..7efe9cc16b39 100644
> > --- a/arch/powerpc/mm/book3s64/pgtable.c
> > +++ b/arch/powerpc/mm/book3s64/pgtable.c
> > @@ -186,6 +186,13 @@ int __meminit remove_section_mapping(unsigned long start, unsigned long end)
> >  
> >  	return hash__remove_section_mapping(start, end);
> >  }
> > +
> > +void fixup_pgtable_fragments(void)
> > +{
> > +	if (radix_enabled())
> > +		radix__fixup_pgtable_fragments();
> > +}
> > +
> >  #endif /* CONFIG_MEMORY_HOTPLUG */
> >  
> >  void __init mmu_partition_table_init(void)
> > @@ -320,7 +327,10 @@ void pmd_fragment_free(unsigned long *pmd)
> >  	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
> >  	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
> >  		pgtable_pmd_page_dtor(page);
> > -		__free_page(page);
> > +		if (PageReserved(page))
> > +			free_reserved_page(page);
> 
> Hmm. Rather than adding this special case here, I wonder if you can
> just go along in your fixup walk and convert all these pages to
> non-reserved pages?
> 
> ClearPageReserved ; init_page_count ; adjust_managed_page_count ; 
> should do the trick, right?

Yes, that should. We are anyway fixing the frag count during
the walk, might as well do all the above too and avoid the special
case in the free path.

Regards,
Bharata.


^ permalink raw reply

* Re: [PATCH] ocxl: Update for AFU descriptor template version 1.1
From: christophe lombard @ 2019-06-19 14:31 UTC (permalink / raw)
  To: Frederic Barrat, linuxppc-dev, alastair, ajd, clombard; +Cc: groug
In-Reply-To: <20190605111545.19762-1-fbarrat@linux.ibm.com>

On 05/06/2019 13:15, Frederic Barrat wrote:
> From: Alastair D'Silva <alastair@d-silva.org>
> 
> The OpenCAPI discovery and configuration specification has been
> updated and introduces version 1.1 of the AFU descriptor template,
> with new fields to better define the memory layout of an OpenCAPI
> adapter.
> 
> The ocxl driver doesn't do much yet to support LPC memory but as we
> start seeing (non-LPC) AFU images using the new template, this patch
> updates the config space parsing code to avoid spitting a warning.
> 
> Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
> ---
> 

The content of the patch sounds good. Thanks.

Reviewed-by: Christophe Lombard <clombard@linux.vnet.ibm.com>


^ permalink raw reply

* Re: [PATCH] powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac
From: Mathieu Malaterre @ 2019-06-19 14:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aaro.koskinen, LKML, Paul Mackerras, linuxppc-dev,
	Christoph Hellwig, Larry Finger
In-Reply-To: <a5fc355e44fb5edea41274329f7c5d04a8dff6fc.camel@kernel.crashing.org>

On Wed, Jun 19, 2019 at 4:18 PM Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> On Wed, 2019-06-19 at 22:32 +1000, Michael Ellerman wrote:
> > Christoph Hellwig <hch@lst.de> writes:
> > > Any chance this could get picked up to fix the regression?
> >
> > Was hoping Ben would Ack it. He's still powermac maintainer :)
> >
> > I guess he OK'ed it in the other thread, will add it to my queue.
>
> Yeah ack. If I had written it myself, I would have made the DMA bits a
> variable and only set it down to 30 if I see that device in the DT
> early on, but I can't be bothered now, if it works, ship it :-)
>
> Note: The patch affects all ppc32, though I don't think it will cause
> any significant issue on those who don't need it.

Thanks, that answer my earlier question.

> Cheers,
> Ben.
>
> > cheers
> >
> > > On Thu, Jun 13, 2019 at 10:24:46AM +0200, Christoph Hellwig wrote:
> > > > With the strict dma mask checking introduced with the switch to
> > > > the generic DMA direct code common wifi chips on 32-bit
> > > > powerbooks
> > > > stopped working.  Add a 30-bit ZONE_DMA to the 32-bit pmac builds
> > > > to allow them to reliably allocate dma coherent memory.
> > > >
> > > > Fixes: 65a21b71f948 ("powerpc/dma: remove
> > > > dma_nommu_dma_supported")
> > > > Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
> > > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > ---
> > > >  arch/powerpc/include/asm/page.h         | 7 +++++++
> > > >  arch/powerpc/mm/mem.c                   | 3 ++-
> > > >  arch/powerpc/platforms/powermac/Kconfig | 1 +
> > > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/powerpc/include/asm/page.h
> > > > b/arch/powerpc/include/asm/page.h
> > > > index b8286a2013b4..0d52f57fca04 100644
> > > > --- a/arch/powerpc/include/asm/page.h
> > > > +++ b/arch/powerpc/include/asm/page.h
> > > > @@ -319,6 +319,13 @@ struct vm_area_struct;
> > > >  #endif /* __ASSEMBLY__ */
> > > >  #include <asm/slice.h>
> > > >
> > > > +/*
> > > > + * Allow 30-bit DMA for very limited Broadcom wifi chips on many
> > > > powerbooks.
> > > > + */
> > > > +#ifdef CONFIG_PPC32
> > > > +#define ARCH_ZONE_DMA_BITS 30
> > > > +#else
> > > >  #define ARCH_ZONE_DMA_BITS 31
> > > > +#endif
> > > >
> > > >  #endif /* _ASM_POWERPC_PAGE_H */
> > > > diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> > > > index cba29131bccc..2540d3b2588c 100644
> > > > --- a/arch/powerpc/mm/mem.c
> > > > +++ b/arch/powerpc/mm/mem.c
> > > > @@ -248,7 +248,8 @@ void __init paging_init(void)
> > > >          (long int)((top_of_ram - total_ram) >> 20));
> > > >
> > > >  #ifdef CONFIG_ZONE_DMA
> > > > - max_zone_pfns[ZONE_DMA] = min(max_low_pfn, 0x7fffffffUL
> > > > >> PAGE_SHIFT);
> > > > + max_zone_pfns[ZONE_DMA] = min(max_low_pfn,
> > > > +                 ((1UL << ARCH_ZONE_DMA_BITS) - 1) >>
> > > > PAGE_SHIFT);
> > > >  #endif
> > > >   max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
> > > >  #ifdef CONFIG_HIGHMEM
> > > > diff --git a/arch/powerpc/platforms/powermac/Kconfig
> > > > b/arch/powerpc/platforms/powermac/Kconfig
> > > > index f834a19ed772..c02d8c503b29 100644
> > > > --- a/arch/powerpc/platforms/powermac/Kconfig
> > > > +++ b/arch/powerpc/platforms/powermac/Kconfig
> > > > @@ -7,6 +7,7 @@ config PPC_PMAC
> > > >   select PPC_INDIRECT_PCI if PPC32
> > > >   select PPC_MPC106 if PPC32
> > > >   select PPC_NATIVE
> > > > + select ZONE_DMA if PPC32
> > > >   default y
> > > >
> > > >  config PPC_PMAC64
> > > > --
> > > > 2.20.1
> > >
> > > ---end quoted text---
>

^ permalink raw reply

* Re: [PATCH] powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac
From: Benjamin Herrenschmidt @ 2019-06-19 13:31 UTC (permalink / raw)
  To: Michael Ellerman, Christoph Hellwig, paulus
  Cc: aaro.koskinen, linuxppc-dev, linux-kernel, Larry.Finger
In-Reply-To: <87tvcldacn.fsf@concordia.ellerman.id.au>

On Wed, 2019-06-19 at 22:32 +1000, Michael Ellerman wrote:
> Christoph Hellwig <hch@lst.de> writes:
> > Any chance this could get picked up to fix the regression?
> 
> Was hoping Ben would Ack it. He's still powermac maintainer :)
> 
> I guess he OK'ed it in the other thread, will add it to my queue.

Yeah ack. If I had written it myself, I would have made the DMA bits a
variable and only set it down to 30 if I see that device in the DT
early on, but I can't be bothered now, if it works, ship it :-)

Note: The patch affects all ppc32, though I don't think it will cause
any significant issue on those who don't need it.

Cheers,
Ben.

> cheers
> 
> > On Thu, Jun 13, 2019 at 10:24:46AM +0200, Christoph Hellwig wrote:
> > > With the strict dma mask checking introduced with the switch to
> > > the generic DMA direct code common wifi chips on 32-bit
> > > powerbooks
> > > stopped working.  Add a 30-bit ZONE_DMA to the 32-bit pmac builds
> > > to allow them to reliably allocate dma coherent memory.
> > > 
> > > Fixes: 65a21b71f948 ("powerpc/dma: remove
> > > dma_nommu_dma_supported")
> > > Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > >  arch/powerpc/include/asm/page.h         | 7 +++++++
> > >  arch/powerpc/mm/mem.c                   | 3 ++-
> > >  arch/powerpc/platforms/powermac/Kconfig | 1 +
> > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/powerpc/include/asm/page.h
> > > b/arch/powerpc/include/asm/page.h
> > > index b8286a2013b4..0d52f57fca04 100644
> > > --- a/arch/powerpc/include/asm/page.h
> > > +++ b/arch/powerpc/include/asm/page.h
> > > @@ -319,6 +319,13 @@ struct vm_area_struct;
> > >  #endif /* __ASSEMBLY__ */
> > >  #include <asm/slice.h>
> > >  
> > > +/*
> > > + * Allow 30-bit DMA for very limited Broadcom wifi chips on many
> > > powerbooks.
> > > + */
> > > +#ifdef CONFIG_PPC32
> > > +#define ARCH_ZONE_DMA_BITS 30
> > > +#else
> > >  #define ARCH_ZONE_DMA_BITS 31
> > > +#endif
> > >  
> > >  #endif /* _ASM_POWERPC_PAGE_H */
> > > diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> > > index cba29131bccc..2540d3b2588c 100644
> > > --- a/arch/powerpc/mm/mem.c
> > > +++ b/arch/powerpc/mm/mem.c
> > > @@ -248,7 +248,8 @@ void __init paging_init(void)
> > >  	       (long int)((top_of_ram - total_ram) >> 20));
> > >  
> > >  #ifdef CONFIG_ZONE_DMA
> > > -	max_zone_pfns[ZONE_DMA]	= min(max_low_pfn, 0x7fffffffUL
> > > >> PAGE_SHIFT);
> > > +	max_zone_pfns[ZONE_DMA]	= min(max_low_pfn,
> > > +			((1UL << ARCH_ZONE_DMA_BITS) - 1) >>
> > > PAGE_SHIFT);
> > >  #endif
> > >  	max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
> > >  #ifdef CONFIG_HIGHMEM
> > > diff --git a/arch/powerpc/platforms/powermac/Kconfig
> > > b/arch/powerpc/platforms/powermac/Kconfig
> > > index f834a19ed772..c02d8c503b29 100644
> > > --- a/arch/powerpc/platforms/powermac/Kconfig
> > > +++ b/arch/powerpc/platforms/powermac/Kconfig
> > > @@ -7,6 +7,7 @@ config PPC_PMAC
> > >  	select PPC_INDIRECT_PCI if PPC32
> > >  	select PPC_MPC106 if PPC32
> > >  	select PPC_NATIVE
> > > +	select ZONE_DMA if PPC32
> > >  	default y
> > >  
> > >  config PPC_PMAC64
> > > -- 
> > > 2.20.1
> > 
> > ---end quoted text---


^ permalink raw reply

* [RFC 00/11] opencapi: enable card reset and link retraining
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair

This is the linux part of the work to use the PCI hotplug framework to
control an opencapi card so that it can be reset and re-read after
flashing a new FPGA image.

It needs support in skiboot:
http://patchwork.ozlabs.org/project/skiboot/list/?series=114803
On an old skiboot, it will do nothing.

A virtual PCI slot is created for the opencapi adapter, and its state
can be controlled through the pnv-php hotplug driver:

  echo 0|1 > /sys/bus/pci/slots/OPENCAPI-<...>/power

Note that the power to the card is not really turned off, as the card
needs to stay on to be flashed with a new image. Instead the card is
placed in reset.

The first part of the series mostly deals with the pci/ioda state, as
the devices can now go away and the state needs to be cleaned up.

The second part is modifications to the hotplug driver on powernv, so
that a virtual slot is created for the opencapi adapters found in the
device tree

Frederic Barrat (11):
  powerpc/powernv/ioda: Fix ref count for devices with their own PE
  powerpc/powernv/ioda: Protect PE list
  powerpc/powernv/ioda: set up PE on opencapi device when enabling
  powerpc/powernv/ioda: Release opencapi device
  powerpc/powernv/ioda: Find opencapi slot for a device node
  pci/hotplug/pnv-php: Remove erroneous warning
  pci/hotplug/pnv-php: Improve error msg on power state change failure
  pci/hotplug/pnv-php: Register opencapi slots
  pci/hotplug/pnv-php: Relax check when disabling slot
  pci/hotplug/pnv-php: Wrap warnings in macro
  ocxl: Add PCI hotplug dependency to Kconfig

 arch/powerpc/include/asm/pnv-pci.h        |   1 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 106 ++++++++++++++--------
 arch/powerpc/platforms/powernv/pci.c      |  10 +-
 drivers/misc/ocxl/Kconfig                 |   1 +
 drivers/pci/hotplug/pnv_php.c             |  66 ++++++++------
 5 files changed, 115 insertions(+), 69 deletions(-)

-- 
2.21.0

^ permalink raw reply

* [RFC 07/11] pci/hotplug/pnv-php: Improve error msg on power state change failure
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

When changing the slot state, if opal hits an error and tells as such
in the asynchronous reply, the warning "Wrong msg" is logged, which is
rather confusing. Instead we can reuse the better message which is
already used when we couldn't submit the asynchronous opal request
initially.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 drivers/pci/hotplug/pnv_php.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
index 5b5cbf1e636d..5cdd2a3a4dd9 100644
--- a/drivers/pci/hotplug/pnv_php.c
+++ b/drivers/pci/hotplug/pnv_php.c
@@ -336,18 +336,19 @@ int pnv_php_set_slot_power_state(struct hotplug_slot *slot,
 	ret = pnv_pci_set_power_state(php_slot->id, state, &msg);
 	if (ret > 0) {
 		if (be64_to_cpu(msg.params[1]) != php_slot->dn->phandle	||
-		    be64_to_cpu(msg.params[2]) != state			||
-		    be64_to_cpu(msg.params[3]) != OPAL_SUCCESS) {
+		    be64_to_cpu(msg.params[2]) != state) {
 			pci_warn(php_slot->pdev, "Wrong msg (%lld, %lld, %lld)\n",
 				 be64_to_cpu(msg.params[1]),
 				 be64_to_cpu(msg.params[2]),
 				 be64_to_cpu(msg.params[3]));
 			return -ENOMSG;
 		}
+		if (be64_to_cpu(msg.params[3]) != OPAL_SUCCESS) {
+			ret = -ENODEV;
+			goto error;
+		}
 	} else if (ret < 0) {
-		pci_warn(php_slot->pdev, "Error %d powering %s\n",
-			 ret, (state == OPAL_PCI_SLOT_POWER_ON) ? "on" : "off");
-		return ret;
+		goto error;
 	}
 
 	if (state == OPAL_PCI_SLOT_POWER_OFF || state == OPAL_PCI_SLOT_OFFLINE)
@@ -356,6 +357,11 @@ int pnv_php_set_slot_power_state(struct hotplug_slot *slot,
 		ret = pnv_php_add_devtree(php_slot);
 
 	return ret;
+
+error:
+	pci_warn(php_slot->pdev, "Error %d powering %s\n",
+		ret, (state == OPAL_PCI_SLOT_POWER_ON) ? "on" : "off");
+	return ret;
 }
 EXPORT_SYMBOL_GPL(pnv_php_set_slot_power_state);
 
-- 
2.21.0


^ permalink raw reply related

* [RFC 11/11] ocxl: Add PCI hotplug dependency to Kconfig
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

The PCI hotplug framework is used to update the devices when a new
image is written to the FPGA.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 drivers/misc/ocxl/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/misc/ocxl/Kconfig b/drivers/misc/ocxl/Kconfig
index 7fb6d39d4c5a..13a5d9f30369 100644
--- a/drivers/misc/ocxl/Kconfig
+++ b/drivers/misc/ocxl/Kconfig
@@ -12,6 +12,7 @@ config OCXL
 	tristate "OpenCAPI coherent accelerator support"
 	depends on PPC_POWERNV && PCI && EEH
 	select OCXL_BASE
+	select HOTPLUG_PCI_POWERNV
 	default m
 	help
 	  Select this option to enable the ocxl driver for Open
-- 
2.21.0


^ permalink raw reply related

* [RFC 06/11] pci/hotplug/pnv-php: Remove erroneous warning
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

On powernv, when removing a device through hotplug, the following
warning is logged:

     Invalid refcount <.> on <...>

It may be incorrect, the refcount may be set to a higher value than 1
and be valid. of_detach_node() can drop more than one reference. As it
doesn't seem trivial to assert the correct value, let's remove the
warning.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 drivers/pci/hotplug/pnv_php.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
index 6758fd7c382e..5b5cbf1e636d 100644
--- a/drivers/pci/hotplug/pnv_php.c
+++ b/drivers/pci/hotplug/pnv_php.c
@@ -151,17 +151,11 @@ static void pnv_php_rmv_pdns(struct device_node *dn)
 static void pnv_php_detach_device_nodes(struct device_node *parent)
 {
 	struct device_node *dn;
-	int refcount;
 
 	for_each_child_of_node(parent, dn) {
 		pnv_php_detach_device_nodes(dn);
 
 		of_node_put(dn);
-		refcount = kref_read(&dn->kobj.kref);
-		if (refcount != 1)
-			pr_warn("Invalid refcount %d on <%pOF>\n",
-				refcount, dn);
-
 		of_detach_node(dn);
 	}
 }
-- 
2.21.0


^ permalink raw reply related

* [RFC 04/11] powerpc/powernv/ioda: Release opencapi device
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

With hotplug, an opencapi device can now go away. It needs to be
released, mostly to clean up its PE state. We were previously not
defining any device callback. We can reuse the standard PCI release
callback, it does a bit too much for an opencapi device, but it's
harmless, and only needs minor tuning.

Also separate the undo of the PELT-V code in a separate function, it
is not needed for NPU devices and it improves a bit the readability of
the code.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 58 +++++++++++++++--------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2cf06fb98978..33054d00b2c5 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -186,7 +186,7 @@ static void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
 	unsigned int pe_num = pe->pe_number;
 
 	WARN_ON(pe->pdev);
-	WARN_ON(pe->npucomp); /* NPUs are not supposed to be freed */
+	WARN_ON(pe->npucomp); /* NPUs for nvlink are not supposed to be freed */
 	kfree(pe->npucomp);
 	memset(pe, 0, sizeof(struct pnv_ioda_pe));
 	clear_bit(pe_num, phb->ioda.pe_alloc);
@@ -775,6 +775,33 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 	return 0;
 }
 
+static void pnv_ioda_unset_peltv(struct pnv_phb *phb,
+				 struct pnv_ioda_pe *pe,
+				 struct pci_dev *parent)
+{
+	int64_t rc;
+
+	while (parent) {
+		struct pci_dn *pdn = pci_get_pdn(parent);
+		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
+			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
+						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
+			/* XXX What to do in case of error ? */
+		}
+		parent = parent->bus->self;
+	}
+
+	opal_pci_eeh_freeze_clear(phb->opal_id, pe->pe_number,
+				OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+
+	/* Disassociate PE in PELT */
+	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
+				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
+	if (rc)
+		pe_warn(pe, "OPAL error %lld remove self from PELTV\n", rc);
+
+}
+
 static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 {
 	struct pci_dev *parent;
@@ -825,25 +852,13 @@ static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 	for (rid = pe->rid; rid < rid_end; rid++)
 		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
 
-	/* Release from all parents PELT-V */
-	while (parent) {
-		struct pci_dn *pdn = pci_get_pdn(parent);
-		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
-			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
-						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
-			/* XXX What to do in case of error ? */
-		}
-		parent = parent->bus->self;
-	}
-
-	opal_pci_eeh_freeze_clear(phb->opal_id, pe->pe_number,
-				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+	/*
+	 * Release from all parents PELT-V. NPUs don't have a PELTV
+	 * table
+	 */
+	if (phb->type != PNV_PHB_NPU_NVLINK && phb->type != PNV_PHB_NPU_OCAPI)
+		pnv_ioda_unset_peltv(phb, pe, parent);
 
-	/* Disassociate PE in PELT */
-	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
-				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
-	if (rc)
-		pe_warn(pe, "OPAL error %lld remove self from PELTV\n", rc);
 	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
 			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
 	if (rc)
@@ -3528,6 +3543,8 @@ static void pnv_ioda_release_pe(struct pnv_ioda_pe *pe)
 	case PNV_PHB_IODA2:
 		pnv_pci_ioda2_release_pe_dma(pe);
 		break;
+	case PNV_PHB_NPU_OCAPI:
+		break;
 	default:
 		WARN_ON(1);
 	}
@@ -3580,7 +3597,7 @@ static void pnv_pci_release_device(struct pci_dev *pdev)
 	pe = &phb->ioda.pe_array[pdn->pe_number];
 	pdn->pe_number = IODA_INVALID_PE;
 
-	WARN_ON(--pe->device_count < 0);
+	WARN_ON((pe->flags != PNV_IODA_PE_DEV) && (--pe->device_count < 0));
 	if (pe->device_count == 0)
 		pnv_ioda_release_pe(pe);
 }
@@ -3629,6 +3646,7 @@ static const struct pci_controller_ops pnv_npu_ioda_controller_ops = {
 
 static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
 	.enable_device_hook	= pnv_ocapi_enable_device_hook,
+	.release_device		= pnv_pci_release_device,
 	.window_alignment	= pnv_pci_window_alignment,
 	.reset_secondary_bus	= pnv_pci_reset_secondary_bus,
 	.shutdown		= pnv_pci_ioda_shutdown,
-- 
2.21.0


^ permalink raw reply related

* [RFC 08/11] pci/hotplug/pnv-php: Register opencapi slots
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

Add the opencapi PHBs to the list of PHBs being scanned to look for
slots.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 drivers/pci/hotplug/pnv_php.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
index 5cdd2a3a4dd9..f9c624334ef7 100644
--- a/drivers/pci/hotplug/pnv_php.c
+++ b/drivers/pci/hotplug/pnv_php.c
@@ -954,7 +954,8 @@ static int __init pnv_php_init(void)
 	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
 	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
 		pnv_php_register(dn);
-
+	for_each_compatible_node(dn, NULL, "ibm,ioda2-npu2-opencapi-phb")
+		pnv_php_register_one(dn);
 	return 0;
 }
 
@@ -964,6 +965,8 @@ static void __exit pnv_php_exit(void)
 
 	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
 		pnv_php_unregister(dn);
+	for_each_compatible_node(dn, NULL, "ibm,ioda2-npu2-opencapi-phb")
+		pnv_php_unregister(dn);
 }
 
 module_init(pnv_php_init);
-- 
2.21.0


^ permalink raw reply related

* [RFC 10/11] pci/hotplug/pnv-php: Wrap warnings in macro
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

An opencapi slot doesn't have an associated bridge device. It's not
needed for operation, but any warning is displayed through pci_warn()
which uses the pci_dev struct of the assocated bridge device. So wrap
those warning so that a different trace mechanism can be used if it's
an opencapi slot.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 drivers/pci/hotplug/pnv_php.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
index 74b62a8e11e7..08ac8f0df06c 100644
--- a/drivers/pci/hotplug/pnv_php.c
+++ b/drivers/pci/hotplug/pnv_php.c
@@ -18,6 +18,9 @@
 #define DRIVER_AUTHOR	"Gavin Shan, IBM Corporation"
 #define DRIVER_DESC	"PowerPC PowerNV PCI Hotplug Driver"
 
+#define SLOT_WARN(slot, x...) \
+	(slot->pdev ? pci_warn(slot->pdev, x) : dev_warn(&slot->bus->dev, x))
+
 struct pnv_php_event {
 	bool			added;
 	struct pnv_php_slot	*php_slot;
@@ -265,7 +268,7 @@ static int pnv_php_add_devtree(struct pnv_php_slot *php_slot)
 
 	ret = pnv_pci_get_device_tree(php_slot->dn->phandle, fdt1, 0x10000);
 	if (ret) {
-		pci_warn(php_slot->pdev, "Error %d getting FDT blob\n", ret);
+		SLOT_WARN(php_slot, "Error %d getting FDT blob\n", ret);
 		goto free_fdt1;
 	}
 
@@ -279,7 +282,7 @@ static int pnv_php_add_devtree(struct pnv_php_slot *php_slot)
 	dt = of_fdt_unflatten_tree(fdt, php_slot->dn, NULL);
 	if (!dt) {
 		ret = -EINVAL;
-		pci_warn(php_slot->pdev, "Cannot unflatten FDT\n");
+		SLOT_WARN(php_slot, "Cannot unflatten FDT\n");
 		goto free_fdt;
 	}
 
@@ -289,7 +292,7 @@ static int pnv_php_add_devtree(struct pnv_php_slot *php_slot)
 	ret = pnv_php_populate_changeset(&php_slot->ocs, php_slot->dn);
 	if (ret) {
 		pnv_php_reverse_nodes(php_slot->dn);
-		pci_warn(php_slot->pdev, "Error %d populating changeset\n",
+		SLOT_WARN(php_slot, "Error %d populating changeset\n",
 			 ret);
 		goto free_dt;
 	}
@@ -297,7 +300,7 @@ static int pnv_php_add_devtree(struct pnv_php_slot *php_slot)
 	php_slot->dn->child = NULL;
 	ret = of_changeset_apply(&php_slot->ocs);
 	if (ret) {
-		pci_warn(php_slot->pdev, "Error %d applying changeset\n", ret);
+		SLOT_WARN(php_slot, "Error %d applying changeset\n", ret);
 		goto destroy_changeset;
 	}
 
@@ -337,7 +340,7 @@ int pnv_php_set_slot_power_state(struct hotplug_slot *slot,
 	if (ret > 0) {
 		if (be64_to_cpu(msg.params[1]) != php_slot->dn->phandle	||
 		    be64_to_cpu(msg.params[2]) != state) {
-			pci_warn(php_slot->pdev, "Wrong msg (%lld, %lld, %lld)\n",
+			SLOT_WARN(php_slot, "Wrong msg (%lld, %lld, %lld)\n",
 				 be64_to_cpu(msg.params[1]),
 				 be64_to_cpu(msg.params[2]),
 				 be64_to_cpu(msg.params[3]));
@@ -359,7 +362,7 @@ int pnv_php_set_slot_power_state(struct hotplug_slot *slot,
 	return ret;
 
 error:
-	pci_warn(php_slot->pdev, "Error %d powering %s\n",
+	SLOT_WARN(php_slot, "Error %d powering %s\n",
 		ret, (state == OPAL_PCI_SLOT_POWER_ON) ? "on" : "off");
 	return ret;
 }
@@ -378,7 +381,7 @@ static int pnv_php_get_power_state(struct hotplug_slot *slot, u8 *state)
 	 */
 	ret = pnv_pci_get_power_state(php_slot->id, &power_state);
 	if (ret) {
-		pci_warn(php_slot->pdev, "Error %d getting power status\n",
+		SLOT_WARN(php_slot, "Error %d getting power status\n",
 			 ret);
 	} else {
 		*state = power_state;
@@ -402,7 +405,7 @@ static int pnv_php_get_adapter_state(struct hotplug_slot *slot, u8 *state)
 		*state = presence;
 		ret = 0;
 	} else {
-		pci_warn(php_slot->pdev, "Error %d getting presence\n", ret);
+		SLOT_WARN(php_slot, "Error %d getting presence\n", ret);
 	}
 
 	return ret;
@@ -637,7 +640,7 @@ static int pnv_php_register_slot(struct pnv_php_slot *php_slot)
 	ret = pci_hp_register(&php_slot->slot, php_slot->bus,
 			      php_slot->slot_no, php_slot->name);
 	if (ret) {
-		pci_warn(php_slot->pdev, "Error %d registering slot\n", ret);
+		SLOT_WARN(php_slot, "Error %d registering slot\n", ret);
 		return ret;
 	}
 
@@ -690,7 +693,7 @@ static int pnv_php_enable_msix(struct pnv_php_slot *php_slot)
 	/* Enable MSIx */
 	ret = pci_enable_msix_exact(pdev, &entry, 1);
 	if (ret) {
-		pci_warn(pdev, "Error %d enabling MSIx\n", ret);
+		SLOT_WARN(php_slot, "Error %d enabling MSIx\n", ret);
 		return ret;
 	}
 
@@ -734,7 +737,7 @@ static irqreturn_t pnv_php_interrupt(int irq, void *data)
 		   (sts & PCI_EXP_SLTSTA_PDC)) {
 		ret = pnv_pci_get_presence_state(php_slot->id, &presence);
 		if (ret) {
-			pci_warn(pdev, "PCI slot [%s] error %d getting presence (0x%04x), to retry the operation.\n",
+			SLOT_WARN(php_slot, "PCI slot [%s] error %d getting presence (0x%04x), to retry the operation.\n",
 				 php_slot->name, ret, sts);
 			return IRQ_HANDLED;
 		}
@@ -764,7 +767,7 @@ static irqreturn_t pnv_php_interrupt(int irq, void *data)
 	 */
 	event = kzalloc(sizeof(*event), GFP_ATOMIC);
 	if (!event) {
-		pci_warn(pdev, "PCI slot [%s] missed hotplug event 0x%04x\n",
+		SLOT_WARN(php_slot, "PCI slot [%s] missed hotplug event 0x%04x\n",
 			 php_slot->name, sts);
 		return IRQ_HANDLED;
 	}
@@ -789,7 +792,7 @@ static void pnv_php_init_irq(struct pnv_php_slot *php_slot, int irq)
 	/* Allocate workqueue */
 	php_slot->wq = alloc_workqueue("pciehp-%s", 0, 0, php_slot->name);
 	if (!php_slot->wq) {
-		pci_warn(pdev, "Cannot alloc workqueue\n");
+		SLOT_WARN(php_slot, "Cannot alloc workqueue\n");
 		pnv_php_disable_irq(php_slot, true);
 		return;
 	}
@@ -813,7 +816,7 @@ static void pnv_php_init_irq(struct pnv_php_slot *php_slot, int irq)
 			  php_slot->name, php_slot);
 	if (ret) {
 		pnv_php_disable_irq(php_slot, true);
-		pci_warn(pdev, "Error %d enabling IRQ %d\n", ret, irq);
+		SLOT_WARN(php_slot, "Error %d enabling IRQ %d\n", ret, irq);
 		return;
 	}
 
@@ -849,7 +852,7 @@ static void pnv_php_enable_irq(struct pnv_php_slot *php_slot)
 
 	ret = pci_enable_device(pdev);
 	if (ret) {
-		pci_warn(pdev, "Error %d enabling device\n", ret);
+		SLOT_WARN(php_slot, "Error %d enabling device\n", ret);
 		return;
 	}
 
-- 
2.21.0


^ permalink raw reply related

* [RFC 09/11] pci/hotplug/pnv-php: Relax check when disabling slot
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

The driver only allows to disable a slot in the POPULATED
state. However, if an error occurs while enabling the slot, say
because the link couldn't be trained, then the POPULATED state may not
be reached, yet the power state of the slot is on. So allow to disable
a slot in the REGISTERED state. Removing the devices will do nothing
since it's not populated, and we'll set the power state of the slot
back to off.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 drivers/pci/hotplug/pnv_php.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
index f9c624334ef7..74b62a8e11e7 100644
--- a/drivers/pci/hotplug/pnv_php.c
+++ b/drivers/pci/hotplug/pnv_php.c
@@ -523,7 +523,13 @@ static int pnv_php_disable_slot(struct hotplug_slot *slot)
 	struct pnv_php_slot *php_slot = to_pnv_php_slot(slot);
 	int ret;
 
-	if (php_slot->state != PNV_PHP_STATE_POPULATED)
+	/*
+	 * Allow to disable a slot already in the registered state to
+	 * cover cases where the slot couldn't be enabled and never
+	 * reached the populated state
+	 */
+	if (php_slot->state != PNV_PHP_STATE_POPULATED &&
+		php_slot->state != PNV_PHP_STATE_REGISTERED)
 		return 0;
 
 	/* Remove all devices behind the slot */
-- 
2.21.0


^ permalink raw reply related

* [RFC 05/11] powerpc/powernv/ioda: Find opencapi slot for a device node
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

Unlike real PCI slots, opencapi slots are directly associated to
the (virtual) opencapi PHB, there's no intermediate bridge. So when
looking for a slot ID, we must start the search from the device node
itself and not its parent.

Also, the slot ID is not attached to a specific bdfn, so let's build
it from the PHB ID, like skiboot.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 arch/powerpc/include/asm/pnv-pci.h   |  1 +
 arch/powerpc/platforms/powernv/pci.c | 10 +++++++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pnv-pci.h b/arch/powerpc/include/asm/pnv-pci.h
index b5a85f1bb305..4b4dfa6bfdd3 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -15,6 +15,7 @@
 #define PCI_SLOT_ID_PREFIX	(1UL << 63)
 #define PCI_SLOT_ID(phb_id, bdfn)	\
 	(PCI_SLOT_ID_PREFIX | ((uint64_t)(bdfn) << 16) | (phb_id))
+#define PCI_PHB_SLOT_ID(phb_id)		(phb_id)
 
 extern int pnv_pci_get_slot_id(struct device_node *np, uint64_t *id);
 extern int pnv_pci_get_device_tree(uint32_t phandle, void *buf, uint64_t len);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index ff1a33fee8e6..3e4e75a883e1 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -49,13 +49,14 @@ int pnv_pci_get_slot_id(struct device_node *np, uint64_t *id)
 		return -ENXIO;
 
 	bdfn = ((bdfn & 0x00ffff00) >> 8);
-	while ((parent = of_get_parent(parent))) {
+	for (parent = np; parent; parent = of_get_parent(parent)) {
 		if (!PCI_DN(parent)) {
 			of_node_put(parent);
 			break;
 		}
 
-		if (!of_device_is_compatible(parent, "ibm,ioda2-phb")) {
+		if (!of_device_is_compatible(parent, "ibm,ioda2-phb") &&
+		    !of_device_is_compatible(parent, "ibm,ioda2-npu2-opencapi-phb")) {
 			of_node_put(parent);
 			continue;
 		}
@@ -66,7 +67,10 @@ int pnv_pci_get_slot_id(struct device_node *np, uint64_t *id)
 			return -ENXIO;
 		}
 
-		*id = PCI_SLOT_ID(phbid, bdfn);
+		if (of_device_is_compatible(parent, "ibm,ioda2-npu2-opencapi-phb"))
+			*id = PCI_PHB_SLOT_ID(phbid);
+		else
+			*id = PCI_SLOT_ID(phbid, bdfn);
 		return 0;
 	}
 
-- 
2.21.0


^ permalink raw reply related

* [RFC 02/11] powerpc/powernv/ioda: Protect PE list
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

Protect the PHB's list of PE. Probably not needed as long as it was
populated during PHB creation, but it feels right and will become
required once we can add/remove opencapi devices on hotplug.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 3082912e2600..2c063b05bb64 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1078,8 +1078,9 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
 	}
 
 	/* Put PE to the list */
+	mutex_lock(&phb->ioda.pe_list_mutex);
 	list_add_tail(&pe->list, &phb->ioda.pe_list);
-
+	mutex_unlock(&phb->ioda.pe_list_mutex);
 	return pe;
 }
 
@@ -3501,7 +3502,10 @@ static void pnv_ioda_release_pe(struct pnv_ioda_pe *pe)
 	struct pnv_phb *phb = pe->phb;
 	struct pnv_ioda_pe *slave, *tmp;
 
+	mutex_lock(&phb->ioda.pe_list_mutex);
 	list_del(&pe->list);
+	mutex_unlock(&phb->ioda.pe_list_mutex);
+
 	switch (phb->type) {
 	case PNV_PHB_IODA1:
 		pnv_pci_ioda1_release_pe_dma(pe);
-- 
2.21.0


^ permalink raw reply related

* [RFC 03/11] powerpc/powernv/ioda: set up PE on opencapi device when enabling
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

The PE for an opencapi device was set as part of a late PHB fixup
operation, when creating the PHB. To use the PCI hotplug framework,
this is not going to work, as the PHB stays the same, it's only the
devices underneath which are updated. For regular PCI devices, it is
done as part of the reconfiguration of the bridge, but for opencapi
PHBs, we don't have an intermediate bridge. So let's define the PE
when the device is enabled. PEs are meaningless for opencapi, the NPU
doesn't define them and opal is not doing anything with them.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 31 +++++++++++++++++------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2c063b05bb64..2cf06fb98978 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1258,8 +1258,6 @@ static void pnv_pci_ioda_setup_PEs(void)
 {
 	struct pci_controller *hose;
 	struct pnv_phb *phb;
-	struct pci_bus *bus;
-	struct pci_dev *pdev;
 	struct pnv_ioda_pe *pe;
 
 	list_for_each_entry(hose, &hose_list, list_node) {
@@ -1271,11 +1269,6 @@ static void pnv_pci_ioda_setup_PEs(void)
 			if (phb->model == PNV_PHB_MODEL_NPU2)
 				WARN_ON_ONCE(pnv_npu2_init(hose));
 		}
-		if (phb->type == PNV_PHB_NPU_OCAPI) {
-			bus = hose->bus;
-			list_for_each_entry(pdev, &bus->devices, bus_list)
-				pnv_ioda_setup_dev_PE(pdev);
-		}
 	}
 	list_for_each_entry(hose, &hose_list, list_node) {
 		phb = hose->private_data;
@@ -3373,6 +3366,28 @@ static bool pnv_pci_enable_device_hook(struct pci_dev *dev)
 	return true;
 }
 
+static bool pnv_ocapi_enable_device_hook(struct pci_dev *dev)
+{
+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
+	struct pnv_phb *phb = hose->private_data;
+	struct pci_dn *pdn;
+	struct pnv_ioda_pe *pe;
+
+	if (!phb->initialized)
+		return true;
+
+	pdn = pci_get_pdn(dev);
+	if (!pdn)
+		return false;
+
+	if (pdn->pe_number == IODA_INVALID_PE) {
+		pe = pnv_ioda_setup_dev_PE(dev);
+		if (!pe)
+			return false;
+	}
+	return true;
+}
+
 static long pnv_pci_ioda1_unset_window(struct iommu_table_group *table_group,
 				       int num)
 {
@@ -3613,7 +3628,7 @@ static const struct pci_controller_ops pnv_npu_ioda_controller_ops = {
 };
 
 static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
-	.enable_device_hook	= pnv_pci_enable_device_hook,
+	.enable_device_hook	= pnv_ocapi_enable_device_hook,
 	.window_alignment	= pnv_pci_window_alignment,
 	.reset_secondary_bus	= pnv_pci_reset_secondary_bus,
 	.shutdown		= pnv_pci_ioda_shutdown,
-- 
2.21.0


^ permalink raw reply related

* [RFC 01/11] powerpc/powernv/ioda: Fix ref count for devices with their own PE
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

Taking a reference on the pci_dev structure was required with initial
commit 184cd4a3b962 ("powerpc/powernv: PCI support for p7IOC under
OPAL v2"), where we we storing the pci dev in the pci_dn structure.

However, the pci_dev was later removed from the pci_dn structure, but
the reference was kept. See 902bdc57451c ("powerpc/powernv/idoa:
Remove unnecessary pcidev from pci_dn").

The pnv_ioda_pe structure life cycle is the same as the pci_dev
structure, the PE is freed when the device is released. So we don't
need a reference for the pci_dev stored in the PE, otherwise the
pci_dev will never be released. Which is not really a surprise as the
comment (removed here as no longer needed) was stating as much.

Fixes: 902bdc57451c ("powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn")
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 10cc42b9e541..3082912e2600 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1060,14 +1060,6 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
 		return NULL;
 	}
 
-	/* NOTE: We get only one ref to the pci_dev for the pdn, not for the
-	 * pointer in the PE data structure, both should be destroyed at the
-	 * same time. However, this needs to be looked at more closely again
-	 * once we actually start removing things (Hotplug, SR-IOV, ...)
-	 *
-	 * At some point we want to remove the PDN completely anyways
-	 */
-	pci_dev_get(dev);
 	pdn->pe_number = pe->pe_number;
 	pe->flags = PNV_IODA_PE_DEV;
 	pe->pdev = dev;
@@ -1082,7 +1074,6 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
 		pnv_ioda_free_pe(pe);
 		pdn->pe_number = IODA_INVALID_PE;
 		pe->pdev = NULL;
-		pci_dev_put(dev);
 		return NULL;
 	}
 
@@ -1226,7 +1217,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_npu_PE(struct pci_dev *npu_pdev)
 			 */
 			dev_info(&npu_pdev->dev,
 				"Associating to existing PE %x\n", pe_num);
-			pci_dev_get(npu_pdev);
+			pci_dev_get(npu_pdev); // still needed after 902bdc57451c2c64aa139bbe24067f70a186db0a ?
 			npu_pdn = pci_get_pdn(npu_pdev);
 			rid = npu_pdev->bus->number << 8 | npu_pdn->devfn;
 			npu_pdn->pe_number = pe_num;
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH 1/3] powerpc/64: __ioremap_at clean up in the error case
From: Christophe Leroy @ 2019-06-19 13:04 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <1560916886.zyg87enrjs.astroid@bobo.none>



Le 19/06/2019 à 06:04, Nicholas Piggin a écrit :
> Christophe Leroy's on June 11, 2019 4:28 pm:
>>
>>
>> Le 10/06/2019 à 05:08, Nicholas Piggin a écrit :
>>> __ioremap_at error handling is wonky, it requires caller to clean up
>>> after it. Implement a helper that does the map and error cleanup and
>>> remove the requirement from the caller.
>>>
>>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>>> ---
>>>
>>> This series is a different approach to the problem, using the generic
>>> ioremap_page_range directly which reduces added code, and moves
>>> the radix specific code into radix files. Thanks to Christophe for
>>> pointing out various problems with the previous patch.
>>>
>>>    arch/powerpc/mm/pgtable_64.c | 27 ++++++++++++++++++++-------
>>>    1 file changed, 20 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
>>> index d2d976ff8a0e..6bd3660388aa 100644
>>> --- a/arch/powerpc/mm/pgtable_64.c
>>> +++ b/arch/powerpc/mm/pgtable_64.c
>>> @@ -108,14 +108,30 @@ unsigned long ioremap_bot;
>>>    unsigned long ioremap_bot = IOREMAP_BASE;
>>>    #endif
>>>    
>>> +static int ioremap_range(unsigned long ea, phys_addr_t pa, unsigned long size, pgprot_t prot, int nid)
>>> +{
>>> +	unsigned long i;
>>> +
>>> +	for (i = 0; i < size; i += PAGE_SIZE) {
>>> +		int err = map_kernel_page(ea + i, pa + i, prot);
>>
>> Missing a blank line
>>
>>> +		if (err) {
>>
>> I'd have done the following to reduce indentation depth
>>
>> 		if (!err)
>> 			continue
> 
> I'll consider it, line lengths were not too bad.
> 
>>> +			if (slab_is_available())
>>> +				unmap_kernel_range(ea, size);
>>
>> Shouldn't it be unmap_kernel_range(ea, i) ?
> 
> I guess (i - PAGE_SIZE really), although the old code effectively did
> the full range. As a "clean up" it may be better to avoid subtle
> change in behaviour and do that in another patch?

Not sure we have to do it in another patch.
Previous code was doing full range because it was done at upper level so 
it didn't know the boundaries. You are creating a nice brand new 
function that have all necessary information, so why not make it right 
from the start ?

Christophe

> 
> Thanks,
> Nick
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox