LinuxPPC-Dev Archive on lore.kernel.org


* Re: [EXT] Re: [PATCH 2/3] arm64: dts: ls1028a: Add PCIe controller DT nodes
From: Arnd Bergmann @ 2019-05-17  8:58 UTC (permalink / raw)
  To: Xiaowei Bao
  Cc: Mark Rutland, Roy Zang, Lorenzo Pieralisi, DTML, gregkh,
	Kate Stewart, linuxppc-dev, linux-pci, Linux Kernel Mailing List,
	Kishon, M.h. Lian, Rob Herring, Linux ARM, Philippe Ombredanne,
	Bjorn Helgaas, Leo Li, Shawn Guo, Shawn Lin, Mingkai Hu
In-Reply-To: <AM5PR04MB329934765FB8EB1828743D79F50B0@AM5PR04MB3299.eurprd04.prod.outlook.com>

On Fri, May 17, 2019 at 5:21 AM Xiaowei Bao <xiaowei.bao@nxp.com> wrote:
> -----Original Message-----
> From: Arnd Bergmann <arnd@arndb.de>
> On Wed, May 15, 2019 at 9:36 AM Xiaowei Bao <xiaowei.bao@nxp.com> wrote:
> > Signed-off-by: Xiaowei Bao <xiaowei.bao@nxp.com>
> > ---
> >  arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi |   52 ++++++++++++++++++++++++
> >  1 files changed, 52 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi b/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
> > index b045812..50b579b 100644
> > --- a/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
> > +++ b/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
> > @@ -398,6 +398,58 @@
> >                         status = "disabled";
> >                 };
> >
> > +               pcie@3400000 {
> > +                       compatible = "fsl,ls1028a-pcie";
> > +                       reg = <0x00 0x03400000 0x0 0x00100000   /* controller registers */
> > +                              0x80 0x00000000 0x0 0x00002000>; /* configuration space */
> > +                       reg-names = "regs", "config";
> > +                       interrupts = <GIC_SPI 108 IRQ_TYPE_LEVEL_HIGH>, /* PME interrupt */
> > +                                    <GIC_SPI 109 IRQ_TYPE_LEVEL_HIGH>; /* aer interrupt */
> > +                       interrupt-names = "pme", "aer";
> > +                       #address-cells = <3>;
> > +                       #size-cells = <2>;
> > +                       device_type = "pci";
> > +                       dma-coherent;
> > +                       num-lanes = <4>;
> > +                       bus-range = <0x0 0xff>;
> > +                       ranges = <0x81000000 0x0 0x00000000 0x80 0x00010000 0x0 0x00010000   /* downstream I/O */
> > +                                 0x82000000 0x0 0x40000000 0x80 0x40000000 0x0 0x40000000>; /* non-prefetchable memory */
>
> Are you sure there is no support for 64-bit BARs or prefetchable memory?
> [Xiaowei Bao] Sorry for the late reply. Our Layerscape platforms have not yet added prefetchable memory support in DTS, so it has not been added for this platform either. I will submit a separate patch to add prefetchable memory support for all Layerscape platforms.

Ok, thanks.

> Of course, prefetchable PCIe devices can work on our boards, because the RC will
> assign non-prefetchable memory to them. We reserve 1G of non-prefetchable
> memory for PCIe devices, which is enough for general devices.

Sure, many devices work just fine, this is mostly a question of supporting those
devices that do require multiple gigabytes, or that need prefetchable memory
semantics to get the expected performance. GPUs are the obvious example,
but I think there are others (infiniband?).

      Arnd
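
[Editorial sketch] The question about 64-bit BARs and prefetchable memory
comes down to the first cell (phys.hi) of each "ranges" entry in the quoted
node. The helpers below decode that cell following the PCI bus binding as I
understand it: bits 25:24 select the address space and bit 30 marks the
window as prefetchable. The helper and enum names are hypothetical and are
not taken from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Address-space codes held in bits 25:24 of the phys.hi cell. */
enum pci_space {
	PCI_SPACE_CONFIG = 0,
	PCI_SPACE_IO     = 1,
	PCI_SPACE_MEM32  = 2,
	PCI_SPACE_MEM64  = 3,
};

/* Extract the address-space code ("ss" field) from phys.hi. */
static inline enum pci_space pci_range_space(uint32_t phys_hi)
{
	return (enum pci_space)((phys_hi >> 24) & 0x3);
}

/* Bit 30 ("p" field) marks the window as prefetchable. */
static inline bool pci_range_prefetchable(uint32_t phys_hi)
{
	return (phys_hi & 0x40000000u) != 0;
}
```

With these helpers, 0x81000000 decodes as I/O space and 0x82000000 as 32-bit
non-prefetchable memory, matching the comments in the patch. A 64-bit
prefetchable window would typically carry 0x43000000 in the same position.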


* Re: [RFC PATCH] powerpc/64/ftrace: mprofile-kernel patch out mflr
From: Nicholas Piggin @ 2019-05-17  9:12 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman, Naveen N. Rao
In-Reply-To: <1557989161.cjlaryiij4.naveen@linux.ibm.com>

Naveen N. Rao's on May 17, 2019 4:22 am:
> Nicholas Piggin wrote:
>> Naveen N. Rao's on May 14, 2019 6:32 pm:
>>> Michael Ellerman wrote:
>>>> "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> writes:
>>>>> Michael Ellerman wrote:
>>>>>> Nicholas Piggin <npiggin@gmail.com> writes:
>>>>>>> The new mprofile-kernel mcount sequence is
>>>>>>>
>>>>>>>   mflr	r0
>>>>>>>   bl	_mcount
>>>>>>>
>>>>>>> Dynamic ftrace patches the branch instruction with a noop, but leaves
>>>>>>> the mflr. mflr is executed by the branch unit that can only execute one
>>>>>>> per cycle on POWER9 and shared with branches, so it would be nice to
>>>>>>> avoid it where possible.
>>>>>>>
>>>>>>> This patch is a hacky proof of concept to nop out the mflr. Can we do
>>>>>>> this or are there races or other issues with it?
>>>>>> 
>>>>>> There's a race, isn't there?
>>>>>> 
>>>>>> We have a function foo which currently has tracing disabled, so the mflr
>>>>>> and bl are nop'ed out.
>>>>>> 
>>>>>>   CPU 0			CPU 1
>>>>>>   ==================================
>>>>>>   bl foo
>>>>>>   nop (ie. not mflr)
>>>>>>   -> interrupt
>>>>>>   something else	enable tracing for foo
>>>>>>   ...			patch mflr and branch
>>>>>>   <- rfi
>>>>>>   bl _mcount
>>>>>> 
>>>>>> So we end up in _mcount() but with r0 not populated.
>>>>>
>>>>> Good catch! Looks like we need to patch the mflr with a "b +8" similar 
>>>>> to what we do in __ftrace_make_nop().
>>>> 
>>>> Would that actually make it any faster though? Nick?
>>> 
>>> Ok, how about doing this as a 2-step process?
>>> 1. patch 'mflr r0' with a 'b +8'
>>>    synchronize_rcu_tasks()
>>> 2. convert 'b +8' to a 'nop'
>> 
>> Good idea. Well the mflr r0 is harmless, so you can leave that in.
>> You just need to ensure it's not removed before the bl is. So nop
>> the bl _mcount, then synchronize_rcu_tasks(), then nop the mflr?
> 
> The problem actually seems to be when we try to patch in the branch to 
> _mcount(), rather than when we are patching in the nop instructions 
> (i.e., the race is when we try to enable the function tracer, rather 
> than while disabling it).
> 
> When we disable ftrace, we only need to ensure we patch out the branch 
> to _mcount() before patching out the preceding 'mflr r0'. I don't think 
> we need a synchronize_rcu_tasks() in that case.

That's probably right.

> While enabling ftrace, we will first need to patch the preceding 'mflr 
> r0' (which would now be a 'nop') with 'b +8', then use 
> synchronize_rcu_tasks() and finally patch in 'bl _mcount()' followed by 
> 'mflr r0'.
> 
> I think that's what you meant, just that my reference to patching 'mflr 
> r0' with a 'b +8' should have called out that the mflr would have been 
> nop'ed out.

I meant that we don't need the b +8 anywhere, because the mflr r0
is harmless. Enabling ftrace just needs to patch in 'mflr r0', and
then synchronize_rcu_tasks(), and then patch in 'bl _mcount'. I think?

Thanks,
Nick
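
[Editorial sketch] The ordering proposed in this reply can be modelled with a
toy two-slot call site. This is only an illustration of the sequence under
discussion, not kernel code: the enum, the struct and
fake_synchronize_rcu_tasks() are hypothetical stand-ins, and the model does
not reproduce the interrupted-task race itself, which is what the real
synchronize_rcu_tasks() call is meant to cover.

```c
/* Symbolic instructions; real encodings are deliberately left out. */
enum insn { INSN_NOP, INSN_MFLR_R0, INSN_BL_MCOUNT };

/* slot[0] is where 'mflr r0' lives, slot[1] is where 'bl _mcount' lives. */
struct call_site {
	enum insn slot[2];
};

static unsigned int sync_calls;

/* Hypothetical stand-in for synchronize_rcu_tasks(); it only counts calls. */
static void fake_synchronize_rcu_tasks(void)
{
	sync_calls++;
}

/* Invariant: 'bl _mcount' must never be present without 'mflr r0' before it. */
static int site_is_consistent(const struct call_site *s)
{
	return s->slot[1] != INSN_BL_MCOUNT || s->slot[0] == INSN_MFLR_R0;
}

/* Enable: patch in the harmless mflr, wait, then patch in the branch. */
static void site_enable(struct call_site *s)
{
	s->slot[0] = INSN_MFLR_R0;
	fake_synchronize_rcu_tasks();
	s->slot[1] = INSN_BL_MCOUNT;
}

/* Disable: remove the branch first, then the mflr; no wait in between. */
static void site_disable(struct call_site *s)
{
	s->slot[1] = INSN_NOP;
	s->slot[0] = INSN_NOP;
}
```

At every intermediate step of both functions site_is_consistent() holds,
which is the property the thread is trying to preserve.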



* Re: [PATCH] powerpc/book3s/mm: Clear MMU_FTR_HPTE_TABLE when radix is enabled.
From: Nicholas Piggin @ 2019-05-17  9:21 UTC (permalink / raw)
  To: Aneesh Kumar K.V, mpe, paulus; +Cc: linuxppc-dev
In-Reply-To: <df83cf16-669c-ae90-88c9-333700e38dcd@linux.ibm.com>

Aneesh Kumar K.V's on May 16, 2019 11:36 pm:
> On 5/16/19 10:34 AM, Nicholas Piggin wrote:
>> Aneesh Kumar K.V's on May 14, 2019 4:02 pm:
>>> Avoids confusion when printing Oops message like below
>>>
>>>   Faulting instruction address: 0xc00000000008bdb4
>>>   Oops: Kernel access of bad area, sig: 11 [#1]
>>>   LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
>>>
>>> Either ibm,pa-features or ibm,powerpc-cpu-features can be used to enable the
>>> MMU features. We don't clear related MMU feature bits there. We use the kernel
>>> commandline to determine what translation mode we want to use and clear the
>>> HPTE or radix bit accordingly. On LPAR we do have to re-enable the HASH bit if the
>>> hypervisor can't do radix.
>> 
>> Well we have the HPTE feature: the CPU supports hash MMU mode. It's
>> just that the kernel is booted in radix mode.
>> 
> 
> We are not using mmu_features to indicate the capability of the hardware 
> right? ie, mmu_features is an indication of current running config.

It's kind of both.

> We 
> set MMU_FTR_TYPE_RADIX if the kernel is running in radix translation 
> mode and on similar lines we should set MMU_FTR_HPTE_TABLE if the kernel 
> is running in only hash translation mode. Whether the hardware supports 
> these translation modes is different from which mode is currently in use.

I don't see why that logic follows. We have MMU_FTR_TYPE_RADIX to
determine if we are running in radix or HPT mode, why do we need
another bit for the same thing?

>> Could make a difference for KVM, if it will support an HPT guest or
>> not.
>> 
> 
> kvm should not depend on MMU_FTR_HPTE_TABLE to identify whether the 
> hardware supports hash page table translation.

Why not though?

> I don't think we do that.

It doesn't, but the point is the bit is kind of useful now (in
theory if you wanted to do something like that), but if you just
make it an inverse of the current mode bit we already have, then
it's useless.

Point is, just use the existing radix MMU selection bit that we
use everywhere else to fix the problem. If that finishes off the
only 64-bit users of the bit and you want to get rid of that as
well I'm fine with that too.

Thanks,
Nick
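
[Editorial sketch] The suggestion to rely on the existing radix selection bit
could look roughly like the helper below when choosing the "MMU=" string for
the Oops banner on 64-bit Book3S. The bit values are assumptions made for
illustration and should be checked against the kernel headers, and
mmu_mode_name() is a hypothetical helper, not an existing kernel function.

```c
/* Assumed feature-bit values, for illustration only. */
#define MMU_FTR_HPTE_TABLE 0x00000001u
#define MMU_FTR_TYPE_RADIX 0x00000040u

/*
 * Pick the running translation mode from the radix bit alone, so that a
 * CPU which supports both modes is never reported as "Radix" and "Hash"
 * at the same time.
 */
static const char *mmu_mode_name(unsigned int features)
{
	return (features & MMU_FTR_TYPE_RADIX) ? "Radix" : "Hash";
}
```

With this approach MMU_FTR_HPTE_TABLE can stay set as a capability bit
without affecting what is printed.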


* Re: [RFC PATCH 3/3] powerpc/mm/hugetlb: Don't enable HugeTLB if we don't have a page table cache
From: Aneesh Kumar K.V @ 2019-05-17  9:32 UTC (permalink / raw)
  To: Michael Ellerman, npiggin, paulus; +Cc: linuxppc-dev
In-Reply-To: <04403623-d4ae-1d91-d3f4-16bd09e94d34@linux.ibm.com>

On 5/17/19 9:29 AM, Aneesh Kumar K.V wrote:
> On 5/16/19 8:17 PM, Michael Ellerman wrote:
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>> This makes sure we don't enable HugeTLB if the cache is not configured.
>>> I am still not sure about this. IMHO hugetlb support should be a hardware
>>> support derivative and any cache allocation failure should be handled as I did
>>> in the earlier patch. But then if we were not able to create hugetlb page table
>>> cache, we can as well declare hugetlb support disabled thereby avoiding calling
>>> into allocation routines.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ---
>>>   arch/powerpc/mm/hugetlbpage.c | 11 +++++++++--
>>>   1 file changed, 9 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
>>> index ee16a3fb788a..4bf8bc659cc7 100644
>>> --- a/arch/powerpc/mm/hugetlbpage.c
>>> +++ b/arch/powerpc/mm/hugetlbpage.c
>>> @@ -602,6 +602,7 @@ __setup("hugepagesz=", hugepage_setup_sz);
>>>   static int __init hugetlbpage_init(void)
>>>   {
>>>       int psize;
>>> +    bool configured = false;
>>
>> Where's my reverse Christmas tree! :)
> 
> Will fix that :)
> 
>>
>>>       if (hugetlb_disabled) {
>>>           pr_info("HugeTLB support is disabled!\n");
>>> @@ -651,10 +652,16 @@ static int __init hugetlbpage_init(void)
>>>               pgtable_cache_add(pdshift - shift);
>>>           else if (IS_ENABLED(CONFIG_PPC_FSL_BOOK3E) || IS_ENABLED(CONFIG_PPC_8xx))
>>>               pgtable_cache_add(PTE_T_ORDER);
>>> +
>>> +        if (!configured)
>>> +            configured = true;
>>
>> I'd just not worry about the if.
>>
>>>       }
>>> -    if (IS_ENABLED(CONFIG_HUGETLB_PAGE_SIZE_VARIABLE))
>>> -        hugetlbpage_init_default();
>>> +    if (configured) {
>>> +        if (IS_ENABLED(CONFIG_HUGETLB_PAGE_SIZE_VARIABLE))
>>> +            hugetlbpage_init_default();
>>> +    } else
>>> +        pr_info("Disabling HugeTLB");
>>
>> We're not actually doing anything to disable it in the
>> CONFIG_HUGETLB_PAGE_SIZE_VARIABLE=n case, but I guess the print is still
>> correct because we didn't enable a size in the for loop above?
>>
>> Can we make it a bit more explicit? Maybe like:
>>
>>    "Disabling HugeTLB, no usable page sizes found."
>>
> 
> That would be confusing when users find this in the dmesg:
>
> [    0.000000] hash-mmu: Page sizes from device-tree:
> [    0.000000] hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
> [    0.000000] hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
> [    0.000000] hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
> [    0.000000] hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
> [    0.000000] hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8
> [    0.000000] hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0
> [    0.000000] hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3

There is another failure condition which I am not sure how to handle 
along with the page table cache creation failures. With the above, if the 
kernel command line has hugepagesz=x hugepages=y, and x is a gigantic 
hugepage size, we will allocate those pages early even if we don't support 
hugetlb because we failed to create the page table cache.

I am not sure whether we should handle that error gracefully?

-aneesh



* Re: [EXT] Re: [PATCH 2/3] arm64: dts: ls1028a: Add PCIe controller DT nodes
From: Ard Biesheuvel @ 2019-05-17 10:44 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Mark Rutland, Roy Zang, Lorenzo Pieralisi, Xiaowei Bao, DTML,
	gregkh, Shawn Lin, Philippe Ombredanne, Mingkai Hu,
	Linux Kernel Mailing List, Kishon, M.h. Lian, Rob Herring,
	Linux ARM, linux-pci, Bjorn Helgaas, Shawn Guo, Leo Li,
	linuxppc-dev, Kate Stewart
In-Reply-To: <CAK8P3a0kKb7njiJvUkwJYwf-yc-hEyErSiWcvbdf0XnMoctzrg@mail.gmail.com>

On Fri, 17 May 2019 at 10:59, Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Fri, May 17, 2019 at 5:21 AM Xiaowei Bao <xiaowei.bao@nxp.com> wrote:
> > -----Original Message-----
> > From: Arnd Bergmann <arnd@arndb.de>
> > On Wed, May 15, 2019 at 9:36 AM Xiaowei Bao <xiaowei.bao@nxp.com> wrote:
> > > Signed-off-by: Xiaowei Bao <xiaowei.bao@nxp.com>
> > > ---
> > >  arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi |   52 ++++++++++++++++++++++++
> > >  1 files changed, 52 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi b/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
> > > index b045812..50b579b 100644
> > > --- a/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
> > > +++ b/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
> > > @@ -398,6 +398,58 @@
> > >                         status = "disabled";
> > >                 };
> > >
> > > +               pcie@3400000 {
> > > +                       compatible = "fsl,ls1028a-pcie";
> > > +                       reg = <0x00 0x03400000 0x0 0x00100000   /* controller registers */
> > > +                              0x80 0x00000000 0x0 0x00002000>; /* configuration space */
> > > +                       reg-names = "regs", "config";
> > > +                       interrupts = <GIC_SPI 108 IRQ_TYPE_LEVEL_HIGH>, /* PME interrupt */
> > > +                                    <GIC_SPI 109 IRQ_TYPE_LEVEL_HIGH>; /* aer interrupt */
> > > +                       interrupt-names = "pme", "aer";
> > > +                       #address-cells = <3>;
> > > +                       #size-cells = <2>;
> > > +                       device_type = "pci";
> > > +                       dma-coherent;
> > > +                       num-lanes = <4>;
> > > +                       bus-range = <0x0 0xff>;
> > > +                       ranges = <0x81000000 0x0 0x00000000 0x80 0x00010000 0x0 0x00010000   /* downstream I/O */
> > > +                                 0x82000000 0x0 0x40000000 0x80 0x40000000 0x0 0x40000000>; /* non-prefetchable memory */
> >
> > Are you sure there is no support for 64-bit BARs or prefetchable memory?
> > [Xiaowei Bao] Sorry for the late reply. Our Layerscape platforms have not yet added prefetchable memory support in DTS, so it has not been added for this platform either. I will submit a separate patch to add prefetchable memory support for all Layerscape platforms.
>
> Ok, thanks.
>
> > Of course, prefetchable PCIe devices can work on our boards, because the RC will
> > assign non-prefetchable memory to them. We reserve 1G of non-prefetchable
> > memory for PCIe devices, which is enough for general devices.
>
> Sure, many devices work just fine, this is mostly a question of supporting those
> devices that do require multiple gigabytes, or that need prefetchable memory
> semantics to get the expected performance. GPUs are the obvious example,
> but I think there are others (infiniband?).
>

Some implementations of the Synopsys dw PCIe IP contain a 'root port'
(within quotes because it is not actually a root port but an arbitrary
set of MMIO registers that looks like a type 01 config region) that
does not permit the prefetchable bridge window BAR to be written (a
thing which is apparently permitted by the PCIe spec). So while the
host bridge is capable of supporting more than one MMIO BAR window,
the OS visible software interface does not expose this functionality.

In practice, it probably doesn't matter, since the driver uses the
same iATU attributes for prefetchable and non-prefetchable windows,
but I guess 1 GB of MMIO BAR space is a bit restrictive for modern
systems.


* Re: [RFC PATCH 3/3] powerpc/mm/hugetlb: Don't enable HugeTLB if we don't have a page table cache
From: Michael Ellerman @ 2019-05-17 11:12 UTC (permalink / raw)
  To: Aneesh Kumar K.V, npiggin, paulus; +Cc: linuxppc-dev
In-Reply-To: <04403623-d4ae-1d91-d3f4-16bd09e94d34@linux.ibm.com>

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> On 5/16/19 8:17 PM, Michael Ellerman wrote:
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>> This makes sure we don't enable HugeTLB if the cache is not configured.
>>> I am still not sure about this. IMHO hugetlb support should be a hardware
>>> support derivative and any cache allocation failure should be handled as I did
>>> in the earlier patch. But then if we were not able to create hugetlb page table
>>> cache, we can as well declare hugetlb support disabled thereby avoiding calling
>>> into allocation routines.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ---
>>>   arch/powerpc/mm/hugetlbpage.c | 11 +++++++++--
>>>   1 file changed, 9 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
>>> index ee16a3fb788a..4bf8bc659cc7 100644
>>> --- a/arch/powerpc/mm/hugetlbpage.c
>>> +++ b/arch/powerpc/mm/hugetlbpage.c
>>> @@ -602,6 +602,7 @@ __setup("hugepagesz=", hugepage_setup_sz);
>>>   static int __init hugetlbpage_init(void)
>>>   {
>>>   	int psize;
>>> +	bool configured = false;
>> 
>> Where's my reverse Christmas tree! :)
>
> Will fix that :)

Thanks.

>>> @@ -651,10 +652,16 @@ static int __init hugetlbpage_init(void)
>>>   
>>> -	if (IS_ENABLED(CONFIG_HUGETLB_PAGE_SIZE_VARIABLE))
>>> -		hugetlbpage_init_default();
>>> +	if (configured) {
>>> +		if (IS_ENABLED(CONFIG_HUGETLB_PAGE_SIZE_VARIABLE))
>>> +			hugetlbpage_init_default();
>>> +	} else
>>> +		pr_info("Disabling HugeTLB");
>> 
>> We're not actually doing anything to disable it in the
>> CONFIG_HUGETLB_PAGE_SIZE_VARIABLE=n case, but I guess the print is still
>> correct because we didn't enable a size in the for loop above?
>> 
>> Can we make it a bit more explicit? Maybe like:
>> 
>>    "Disabling HugeTLB, no usable page sizes found."
>> 
>
> That would be confusing when users find this in the dmesg:
>
> [    0.000000] hash-mmu: Page sizes from device-tree: 
> [    0.000000] hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0 
> [    0.000000] hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7 
> [    0.000000] hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56 
> [    0.000000] hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1 
> [    0.000000] hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8 
> [    0.000000] hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0 
> [    0.000000] hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3

But aren't they going to be even more confused when all we print is
"Disabling HugeTLB" with no explanation?

cheers
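
[Editorial sketch] The two review comments on the quoted hunk (set the flag
unconditionally, and print a more explicit message) can be illustrated with
a small standalone model. This is not the kernel function: the array of
per-size results and the name hugetlb_init_message() are hypothetical, and
the message text is simply the wording suggested earlier in this thread,
which was still being debated.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * cache_ok[i] says whether a page table cache could be set up for huge page
 * size i. Returns NULL when at least one size is usable, otherwise the
 * message to print.
 */
static const char *hugetlb_init_message(const bool *cache_ok, size_t nsizes)
{
	bool configured = false;
	size_t i;

	for (i = 0; i < nsizes; i++) {
		if (cache_ok[i])
			configured = true; /* no need for 'if (!configured)' */
	}

	return configured ? NULL
			  : "Disabling HugeTLB, no usable page sizes found.";
}
```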


* Re: PROBLEM: Power9: kernel oops on memory hotunplug from ppc64le guest
From: Michael Ellerman @ 2019-05-17 11:20 UTC (permalink / raw)
  To: srikanth, linuxppc-dev; +Cc: linux-next, linux-kernel, bharata
In-Reply-To: <16a7a635-c592-27e2-75b4-d02071833278@linux.vnet.ibm.com>

srikanth <sraithal@linux.vnet.ibm.com> writes:
> Hello,
>
> On power9 host, performing memory hotunplug from ppc64le guest results 
> in kernel oops.

Thanks for the report.

Did this used to work in the past? If so what is the last version that
worked?

> Kernel used : https://github.com/torvalds/linux/tree/v5.1 built using 
> ppc64le_defconfig for host and ppc64le_guest_defconfig for guest.
>
> Recreation steps:
>
> 1. Boot a guest with below mem configuration:
>    <maxMemory slots='32' unit='KiB'>33554432</maxMemory>
>    <memory unit='KiB'>8388608</memory>
>    <currentMemory unit='KiB'>4194304</currentMemory>
>    <cpu>
>      <numa>
>        <cell id='0' cpus='0-31' memory='8388608' unit='KiB'/>
>      </numa>
>    </cpu>
>
> 2. From host hotplug 8G memory -> verify memory hotadded successfully -> 
> now reboot guest -> once guest comes back try to unplug 8G memory

I assume the reboot is required to trigger the bug? ie. if you unplug
without rebooting it doesn't crash?

> mem.xml used:
> <memory model='dimm'>
> <target>
> <size unit='GiB'>8</size>
> <node>0</node>
> </target>
> </memory>
>
> Memory attach and detach commands used:
>      virsh attach-device vm1 ./mem.xml --live
>      virsh detach-device vm1 ./mem.xml --live
>
> Trace seen inside guest after unplug, guest just hangs there forever:
>
> [   21.962986] kernel BUG at arch/powerpc/mm/pgtable-frag.c:113!
> [   21.963064] Oops: Exception in kernel mode, sig: 5 [#1]
> [   21.963090] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA 
> pSeries
> [   21.963131] Modules linked in: xt_tcpudp iptable_filter squashfs fuse 
> vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core libiscsi 
> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_decompress 
> zstd_compress lzo_compress raid10 raid456 async_raid6_recov async_memcpy 
> async_pq async_xor async_tx xor raid6_pq multipath crc32c_vpmsum
> [   21.963281] CPU: 11 PID: 316 Comm: kworker/u64:5 Kdump: loaded Not 
> tainted 5.1.0-dirty #2
> [   21.963323] Workqueue: pseries hotplug workque pseries_hp_work_fn
> [   21.963355] NIP:  c000000000079e18 LR: c000000000c79308 CTR: 
> 0000000000008000
> [   21.963392] REGS: c0000003f88034f0 TRAP: 0700   Not tainted (5.1.0-dirty)
> [   21.963422] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  
> CR: 28002884  XER: 20040000
> [   21.963470] CFAR: c000000000c79304 IRQMASK: 0
> [   21.963470] GPR00: c000000000c79308 c0000003f8803780 c000000001521000 
> 0000000000fff8c0

Can you try not to word wrap these, it makes them much harder to read.

There's some instructions here on configuring Thunderbird:
  https://www.kernel.org/doc/html/latest/process/email-clients.html#thunderbird-gui

> [   21.963470] GPR04: 0000000000000001 00000000ffe30005 0000000000000005 
> 0000000000000020
> [   21.963470] GPR08: 0000000000000000 0000000000000001 c00a000000fff8e0 
> c0000000016d21a0
> [   21.963470] GPR12: c0000000016e7b90 c000000007ff2700 c00a000000a00000 
> c0000003ffe30100
> [   21.963470] GPR16: c0000003ffe30000 c0000000014aa4de c00a0000009f0000 
> c0000000016d21b0
> [   21.963470] GPR20: c0000000014de588 0000000000000001 c0000000016d21b8 
> c00a000000a00000
> [   21.963470] GPR24: 0000000000000000 ffffffffffffffff c00a000000a00000 
> c0000003ffe96000
> [   21.963470] GPR28: c00a000000a00000 c00a000000a00000 c0000003fffec000 
> c00a000000fff8c0
> [   21.963802] NIP [c000000000079e18] pte_fragment_free+0x48/0xd0
> [   21.963838] LR [c000000000c79308] remove_pagetable+0x49c/0x5b4
> [   21.963873] Call Trace:
> [   21.963890] [c0000003f8803780] [c0000003ffe997f0] 0xc0000003ffe997f0 
> (unreliable)
> [   21.963933] [c0000003f88037b0] [0000000000000000] (null)
> [   21.963969] [c0000003f88038c0] [c00000000006f038] 
> vmemmap_free+0x218/0x2e0
> [   21.964006] [c0000003f8803940] [c00000000036f100] 
> sparse_remove_one_section+0xd0/0x138
> [   21.964050] [c0000003f8803980] [c000000000383a50] 
> __remove_pages+0x410/0x560
> [   21.964093] [c0000003f8803a90] [c000000000c784d8] 
> arch_remove_memory+0x68/0xdc
> [   21.964136] [c0000003f8803ad0] [c000000000385d74] 
> __remove_memory+0xc4/0x110
> [   21.964180] [c0000003f8803b10] [c0000000000d44e4] 
> dlpar_remove_lmb+0x94/0x140
> [   21.964223] [c0000003f8803b50] [c0000000000d52b4] 
> dlpar_memory+0x464/0xd00
> [   21.964259] [c0000003f8803be0] [c0000000000cd5c0] 
> handle_dlpar_errorlog+0xc0/0x190
> [   21.964303] [c0000003f8803c50] [c0000000000cd6bc] 
> pseries_hp_work_fn+0x2c/0x60
> [   21.964346] [c0000003f8803c80] [c00000000013a4a0] 
> process_one_work+0x2b0/0x5a0
> [   21.964388] [c0000003f8803d10] [c00000000013a818] 
> worker_thread+0x88/0x610
> [   21.964434] [c0000003f8803db0] [c000000000143884] kthread+0x1a4/0x1b0
> [   21.964468] [c0000003f8803e20] [c00000000000bdc4] 
> ret_from_kernel_thread+0x5c/0x78
> [   21.964506] Instruction dump:
> [   21.964527] fbe1fff8 f821ffd1 78638502 78633664 ebe90000 7fff1a14 
> 395f0020 813f0020
> [   21.964569] 7d2907b4 7d2900d0 79290fe0 69290001 <0b090000> 7c0004ac 
> 7d205028 3129ffff
> [   21.964613] ---[ end trace aaa571aa1636fee6 ]---
> [   21.966349]
> [   21.966383] Sending IPI to other CPUs
> [   21.978335] IPI complete
> [   21.981354] kexec: Starting switchover sequence.
> I'm in purgatory

It's not hung here, it's just not executing what we want it to :)

If you break into the qemu monitor and issue `info registers` it should
give you some idea of what's going on.

cheers


* Re: [PATCH] powerpc/pseries: Fix xive=off command line
From: Greg Kurz @ 2019-05-17 10:57 UTC (permalink / raw)
  To: Sasha Levin; +Cc: linuxppc-dev, linux-kernel, stable, Cédric Le Goater
In-Reply-To: <20190515105443.835E72084E@mail.kernel.org>

On Wed, 15 May 2019 10:54:42 +0000
Sasha Levin <sashal@kernel.org> wrote:

> Hi,
> 

Hi,

> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: eac1e731b59e powerpc/xive: guest exploitation of the XIVE interrupt controller.
> 
> The bot has tested the following trees: v5.1.1, v5.0.15, v4.19.42, v4.14.118.
> 
> v5.1.1: Build OK!
> v5.0.15: Build OK!
> v4.19.42: Failed to apply! Possible dependencies:
>     8ca2d5151e7f ("powerpc/prom_init: Move a few remaining statics to appropriate sections")
>     c886087caee7 ("powerpc/prom_init: Move prom_radix_disable to __prombss")
> 

Dependencies are:

3bad719b4954 ("powerpc/prom_init: Make of_workarounds static")
e63334e556d9 ("powerpc/prom_init: Replace __initdata with __prombss when applicable")
11fdb309341c ("powerpc/prom_init: Remove support for OPAL v2")
c886087caee7 ("powerpc/prom_init: Move prom_radix_disable to __prombss")
8ca2d5151e7f ("powerpc/prom_init: Move a few remaining statics to appropriate sections")
f1f208e54d08 ("powerpc/prom_init: Generate "phandle" instead of "linux, phandle"")
cbe46bd4f510 ("powerpc: remove CONFIG_CMDLINE #ifdef mess")
450e7dd4001f ("powerpc/prom_init: don't use string functions from lib/")

The patches apply flawlessly and allow the build to succeed.

> v4.14.118: Failed to apply! Possible dependencies:
>     028555a590d6 ("powerpc/xive: fix hcall H_INT_RESET to support long busy delays")
>     7a22d6321c3d ("powerpc/mm/radix: Update command line parsing for disable_radix")
>     8ca2d5151e7f ("powerpc/prom_init: Move a few remaining statics to appropriate sections")
>     c886087caee7 ("powerpc/prom_init: Move prom_radix_disable to __prombss")
> 

Dependencies are:

7a22d6321c3d ("powerpc/mm/radix: Update command line parsing for disable_radix")
028555a590d6 ("powerpc/xive: fix hcall H_INT_RESET to support long busy delays")
3bad719b4954 ("powerpc/prom_init: Make of_workarounds static")
e63334e556d9 ("powerpc/prom_init: Replace __initdata with __prombss when applicable")
11fdb309341c ("powerpc/prom_init: Remove support for OPAL v2")
c886087caee7 ("powerpc/prom_init: Move prom_radix_disable to __prombss")
8ca2d5151e7f ("powerpc/prom_init: Move a few remaining statics to appropriate sections")
f1f208e54d08 ("powerpc/prom_init: Generate "phandle" instead of "linux, phandle"")
cbe46bd4f510 ("powerpc: remove CONFIG_CMDLINE #ifdef mess")
450e7dd4001f ("powerpc/prom_init: don't use string functions from lib/")

The patches apply flawlessly and allow the build to succeed.

> 
> How should we proceed with this patch?
> 

xive=off allows the kernel to use the legacy XICS interrupt controller
interface on POWER9, definitely not a recommended setting. A typical
usage for this would be to work around some issue that would only pop
up when using XIVE. Note also that this only affects the pseries platform,
i.e. running under a hypervisor (KVM or pHyp).

I cannot state right now whether it is worth the pain to cherry-pick all
the dependencies to fix this or not in older kernels...

Cheers,

--
Greg

> --
> Thanks,
> Sasha



* [PATCH v4] powerpc/64s: support nospectre_v2 cmdline option
From: Christopher M. Riedl @ 2019-05-17 13:14 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Christopher M. Riedl, Andrew Donnellan

Add support for disabling the kernel implemented spectre v2 mitigation
(count cache flush on context switch) via the nospectre_v2 and
mitigations=off cmdline options.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com>
---
v3->v4:
	add support for new mitigations=off cmdline option

 arch/powerpc/kernel/security.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index e1c9cf079503..88ae6a9bdf74 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -28,7 +28,7 @@ static enum count_cache_flush_type count_cache_flush_type = COUNT_CACHE_FLUSH_NO
 bool barrier_nospec_enabled;
 static bool no_nospec;
 static bool btb_flush_enabled;
-#ifdef CONFIG_PPC_FSL_BOOK3E
+#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_BOOK3S_64)
 static bool no_spectrev2;
 #endif
 
@@ -114,7 +114,7 @@ static __init int security_feature_debugfs_init(void)
 device_initcall(security_feature_debugfs_init);
 #endif /* CONFIG_DEBUG_FS */
 
-#ifdef CONFIG_PPC_FSL_BOOK3E
+#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_BOOK3S_64)
 static int __init handle_nospectre_v2(char *p)
 {
 	no_spectrev2 = true;
@@ -122,6 +122,9 @@ static int __init handle_nospectre_v2(char *p)
 	return 0;
 }
 early_param("nospectre_v2", handle_nospectre_v2);
+#endif /* CONFIG_PPC_FSL_BOOK3E || CONFIG_PPC_BOOK3S_64 */
+
+#ifdef CONFIG_PPC_FSL_BOOK3E
 void setup_spectre_v2(void)
 {
 	if (no_spectrev2 || cpu_mitigations_off())
@@ -399,7 +402,17 @@ static void toggle_count_cache_flush(bool enable)
 
 void setup_count_cache_flush(void)
 {
-	toggle_count_cache_flush(true);
+	bool enable = true;
+
+	if (no_spectrev2 || cpu_mitigations_off()) {
+		if (security_ftr_enabled(SEC_FTR_BCCTRL_SERIALISED)
+		    || security_ftr_enabled(SEC_FTR_COUNT_CACHE_DISABLED))
+			pr_warn("Spectre v2 mitigations not under software control, can't disable\n");
+
+		enable = false;
+	}
+
+	toggle_count_cache_flush(enable);
 }
 
 #ifdef CONFIG_DEBUG_FS
-- 
2.21.0
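
[Editorial sketch] The decision made in setup_count_cache_flush() by the
patch above can be restated as a small pure function, which makes the
corner case easier to see: when the mitigation is provided by hardware or
firmware, the software flush is still turned off but a warning is due. The
struct and function names are hypothetical, and hw_mitigated stands in for
the two security_ftr_enabled() checks in the patch.

```c
#include <stdbool.h>

struct spectre_v2_decision {
	bool enable_flush;          /* value passed to toggle_count_cache_flush() */
	bool warn_not_controllable; /* whether the pr_warn() in the patch fires */
};

static struct spectre_v2_decision
decide_count_cache_flush(bool no_spectrev2, bool mitigations_off,
			 bool hw_mitigated)
{
	struct spectre_v2_decision d = {
		.enable_flush = true,
		.warn_not_controllable = false,
	};

	if (no_spectrev2 || mitigations_off) {
		/* Software cannot switch off a hardware-provided mitigation. */
		d.warn_not_controllable = hw_mitigated;
		d.enable_flush = false;
	}

	return d;
}
```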



* [PATCH v2] powerpc/mm/hash: Fix get_region_id() for invalid addresses
From: Michael Ellerman @ 2019-05-17 13:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: aneesh.kumar, npiggin

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>

Accesses by userspace to random addresses outside the user or kernel
address range will generate an SLB fault. When we handle that fault we
classify the effective address into several classes, eg. user, kernel
linear, kernel virtual etc.

For addresses that are completely outside of any valid range, we
should not insert an SLB entry at all, and instead immediately raise an
exception.

In the past this was handled in two ways. Firstly we would check the
top nibble of the address (using REGION_ID(ea)) and that would tell us
if the address was user (0), kernel linear (c), kernel virtual (d), or
vmemmap (f). If the address didn't match any of these it was invalid.

Then for each type of address we would do a secondary check. For the
user region we check against H_PGTABLE_RANGE, for kernel linear we
would mask the top nibble of the address and then check the address
against MAX_PHYSMEM_BITS.

As part of commit 0034d395f89d ("powerpc/mm/hash64: Map all the kernel
regions in the same 0xc range") we replaced REGION_ID() with
get_region_id() and changed the masking of the top nibble to only mask
the top two bits, which introduced a bug.

Addresses less than (4 << 60) are still handled correctly, they are
either less than (1 << 60) in which case they are subject to the
H_PGTABLE_RANGE check, or they are correctly checked against
MAX_PHYSMEM_BITS.

However addresses from (4 << 60) to ((0xc << 60) - 1), are incorrectly
treated as kernel linear addresses in get_region_id(). Then the top
two bits are cleared by EA_MASK in slb_allocate_kernel() and the
address is checked against MAX_PHYSMEM_BITS, which it passes due to
the masking. The end result is we incorrectly insert SLB entries for
those addresses.

That is not actually catastrophic: having inserted the SLB entry, we
will then go on to take a page fault for the address, and at that point
we detect the problem and report it as a bad fault.

Still we should not be inserting those entries, or treating them as
kernel linear addresses in the first place. So fix get_region_id() to
detect addresses in that range and return an invalid region id, which
we can use to not insert an SLB entry and directly report an
exception.

Fixes: 0034d395f89d ("powerpc/mm/hash64: Map all the kernel regions in the same 0xc range")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Drop change to EA_MASK for now, rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/include/asm/book3s/64/hash.h | 4 ++++
 1 file changed, 4 insertions(+)

v2: Drop change to EA_MASK for now, rewrite change log.

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index 5486087e64ea..2781ebf6add4 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -93,6 +93,7 @@
 #define VMALLOC_REGION_ID	NON_LINEAR_REGION_ID(H_VMALLOC_START)
 #define IO_REGION_ID		NON_LINEAR_REGION_ID(H_KERN_IO_START)
 #define VMEMMAP_REGION_ID	NON_LINEAR_REGION_ID(H_VMEMMAP_START)
+#define INVALID_REGION_ID	(VMEMMAP_REGION_ID + 1)

 /*
  * Defines the address of the vmemap area, in its own region on
@@ -119,6 +120,9 @@ static inline int get_region_id(unsigned long ea)
 	if (id == 0)
 		return USER_REGION_ID;

+	if (id != (PAGE_OFFSET >> 60))
+		return INVALID_REGION_ID;
+
 	if (ea < H_KERN_VIRT_START)
 		return LINEAR_MAP_REGION_ID;

-- 
2.20.1
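
The classification described in the change log can be tried out in userspace
with a sketch like the one below. All constants and names are illustrative
assumptions, not the kernel's definitions: the non-linear regions are collapsed
into one id, SKETCH_MAX_PHYSMEM_BITS is an arbitrary value, and SKETCH_EA_MASK
simply clears the top two bits as the change log describes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins, not the kernel's definitions. */
#define SKETCH_PAGE_OFFSET      0xc000000000000000ULL
#define SKETCH_KERN_VIRT_START  0xc008000000000000ULL
#define SKETCH_MAX_PHYSMEM_BITS 51
#define SKETCH_EA_MASK          (~(0xcULL << 60)) /* clears only the top two bits */

enum sketch_region {
	SKETCH_USER,
	SKETCH_LINEAR,
	SKETCH_NON_LINEAR,	/* vmalloc, IO and vmemmap collapsed into one id */
	SKETCH_INVALID,
};

/* with_fix selects between the behaviour before and after the patch. */
static enum sketch_region sketch_region_id(uint64_t ea, bool with_fix)
{
	uint64_t id = ea >> 60;

	if (id == 0)
		return SKETCH_USER;

	/* The check added by the patch: only the 0xc nibble is a kernel address. */
	if (with_fix && id != (SKETCH_PAGE_OFFSET >> 60))
		return SKETCH_INVALID;

	if (ea < SKETCH_KERN_VIRT_START)
		return SKETCH_LINEAR;

	return SKETCH_NON_LINEAR;
}

/* Secondary check applied to addresses classified as kernel linear. */
static bool sketch_passes_physmem_check(uint64_t ea)
{
	return (ea & SKETCH_EA_MASK) < (1ULL << SKETCH_MAX_PHYSMEM_BITS);
}
```

With these assumptions, an address such as 0x4000000000001000 is classified as
kernel linear without the fix and also passes the masked range check, which is
the case the patch closes; with the fix it is reported as invalid up front.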


* [PATCH] ocxl: Fix potential memory leak on context creation
From: Frederic Barrat @ 2019-05-17 14:20 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, alastair; +Cc: clombard

If we couldn't fully init a context, we were leaking memory.

Fixes: b9721d275cc2 ("ocxl: Allow external drivers to use OpenCAPI contexts")
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 drivers/misc/ocxl/context.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/misc/ocxl/context.c b/drivers/misc/ocxl/context.c
index bab9c9364184..ab93156aa83e 100644
--- a/drivers/misc/ocxl/context.c
+++ b/drivers/misc/ocxl/context.c
@@ -22,6 +22,7 @@ int ocxl_context_alloc(struct ocxl_context **context, struct ocxl_afu *afu,
 			afu->pasid_base + afu->pasid_max, GFP_KERNEL);
 	if (pasid < 0) {
 		mutex_unlock(&afu->contexts_lock);
+		kfree(*context);
 		return pasid;
 	}
 	afu->pasid_count++;
-- 
2.21.0
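
The pattern being fixed is the usual "free what you allocated before returning
an error". A minimal userspace sketch, assuming the context is allocated at the
top of the function (all names here are hypothetical, and pasid_result stands
in for the return value of the PASID allocation):

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_ctx {
	int pasid;
};

/* Counts outstanding allocations so a leak is observable in the demo. */
static int sketch_live_allocations;

static int sketch_context_alloc(struct sketch_ctx **out, int pasid_result)
{
	assert(out != NULL);

	*out = malloc(sizeof(**out));
	if (!*out)
		return -12;	/* -ENOMEM */
	sketch_live_allocations++;

	if (pasid_result < 0) {
		/* Error path: release the context instead of leaking it. */
		free(*out);
		*out = NULL;	/* extra hygiene, not part of the patch */
		sketch_live_allocations--;
		return pasid_result;
	}

	(*out)->pasid = pasid_result;
	return 0;
}
```

After a failed call the allocation counter is back at zero, which is the
property the one-line kfree() restores.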



* Re: [PATCH] mm/nvdimm: Pick the right alignment default when creating dax devices
From: Vaibhav Jain @ 2019-05-17 14:49 UTC (permalink / raw)
  To: Aneesh Kumar K.V, dan.j.williams
  Cc: linux-mm, linux-nvdimm, linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <20190514025449.9416-1-aneesh.kumar@linux.ibm.com>

Hi Aneesh,

Apart from a minor review comment for changes to nd_pfn_validate() the
patch looks good to me.

Also, I tested this patch on a PPC64 qemu guest with virtual nvdimm and
verified that the default alignment of a newly created devdax namespace was
64KiB instead of 16MiB. Below are the test results:

* Without the patch, creating a devdax namespace results in a namespace
  with 16MiB default alignment. Using daxio to zero out the dax device
  results in a SIGBUS and a hashing failure.

  $ sudo ndctl create-namespace --mode=devdax  | grep align
    "align":16777216,
  "align":16777216

  $ sudo cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments
  65536 16777216

  $ sudo daxio.static-debug  -z -o /dev/dax0.0
  Bus error (core dumped)

  $ dmesg | tail
  [  438.738958] lpar: Failed hash pte insert with error -4
  [  438.739412] hash-mmu: mm: Hashing failure ! EA=0x7fff17000000 access=0x8000000000000006 current=daxio
  [  438.739760] hash-mmu:     trap=0x300 vsid=0x22cb7a3 ssize=1 base psize=2 psize 10 pte=0xc000000501002b86
  [  438.740143] daxio[3860]: bus error (7) at 7fff17000000 nip 7fff973c007c lr 7fff973bff34 code 2 in libpmem.so.1.0.0[7fff973b0000+20000]
  [  438.740634] daxio[3860]: code: 792945e4 7d494b78 e95f0098 7d494b78 f93f00a0 4800012c e93f0088 f93f0120 
  [  438.741015] daxio[3860]: code: e93f00a0 f93f0128 e93f0120 e95f0128 <f9490000> e93f0088 39290008 f93f0110 

* With the patch, creating a devdax namespace results in a namespace
  with 64KiB default alignment. Using daxio to zero out the dax device
  succeeds:
  
  $ sudo ndctl create-namespace --mode=devdax  | grep align
    "align":65536,
  "align":65536

  $ sudo cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments
  65536

  $ daxio -z -o /dev/dax0.0
  daxio: copied 2130706432 bytes to device "/dev/dax0.0"

Hence,

Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> Allow arch to provide the supported alignments and use hugepage alignment only
> if we support hugepage. Right now we depend on compile time configs whereas this
> patch switch this to runtime discovery.
>
> Architectures like ppc64 can have THP enabled in code, but then can have
> hugepage size disabled by the hypervisor. This allows us to create dax devices
> with PAGE_SIZE alignment in this case.
>
> Existing dax namespace with alignment larger than PAGE_SIZE will fail to
> initialize in this specific case. We still allow fsdax namespace initialization.
>
> With respect to identifying whether to enable hugepage fault for a dax device,
> if THP is enabled during compile, we default to taking hugepage fault and in dax
> fault handler if we find the fault size > alignment we retry with PAGE_SIZE
> fault size.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/libnvdimm.h |  9 ++++++++
>  arch/powerpc/mm/Makefile             |  1 +
>  arch/powerpc/mm/nvdimm.c             | 34 ++++++++++++++++++++++++++++
>  arch/x86/include/asm/libnvdimm.h     | 19 ++++++++++++++++
>  drivers/nvdimm/nd.h                  |  6 -----
>  drivers/nvdimm/pfn_devs.c            | 32 +++++++++++++++++++++++++-
>  include/linux/huge_mm.h              |  7 +++++-
>  7 files changed, 100 insertions(+), 8 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/libnvdimm.h
>  create mode 100644 arch/powerpc/mm/nvdimm.c
>  create mode 100644 arch/x86/include/asm/libnvdimm.h
>
> diff --git a/arch/powerpc/include/asm/libnvdimm.h b/arch/powerpc/include/asm/libnvdimm.h
> new file mode 100644
> index 000000000000..d35fd7f48603
> --- /dev/null
> +++ b/arch/powerpc/include/asm/libnvdimm.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_POWERPC_LIBNVDIMM_H
> +#define _ASM_POWERPC_LIBNVDIMM_H
> +
> +#define nd_pfn_supported_alignments nd_pfn_supported_alignments
> +extern unsigned long *nd_pfn_supported_alignments(void);
> +extern unsigned long nd_pfn_default_alignment(void);
> +
> +#endif
> diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
> index 0f499db315d6..42e4a399ba5d 100644
> --- a/arch/powerpc/mm/Makefile
> +++ b/arch/powerpc/mm/Makefile
> @@ -20,3 +20,4 @@ obj-$(CONFIG_HIGHMEM)		+= highmem.o
>  obj-$(CONFIG_PPC_COPRO_BASE)	+= copro_fault.o
>  obj-$(CONFIG_PPC_PTDUMP)	+= ptdump/
>  obj-$(CONFIG_KASAN)		+= kasan/
> +obj-$(CONFIG_NVDIMM_PFN)		+= nvdimm.o
> diff --git a/arch/powerpc/mm/nvdimm.c b/arch/powerpc/mm/nvdimm.c
> new file mode 100644
> index 000000000000..a29a4510715e
> --- /dev/null
> +++ b/arch/powerpc/mm/nvdimm.c
> @@ -0,0 +1,34 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <asm/pgtable.h>
> +#include <asm/page.h>
> +
> +#include <linux/mm.h>
> +/*
> + * We support only pte and pmd mappings for now.
> + */
> +const unsigned long *nd_pfn_supported_alignments(void)
> +{
> +	static unsigned long supported_alignments[3];
> +
> +	supported_alignments[0] = PAGE_SIZE;
> +
> +	if (has_transparent_hugepage())
> +		supported_alignments[1] = HPAGE_PMD_SIZE;
> +	else
> +		supported_alignments[1] = 0;
> +
> +	supported_alignments[2] = 0;
> +	return supported_alignments;
> +}
> +
> +/*
> + * Use pmd mapping if supported as default alignment
> + */
> +unsigned long nd_pfn_default_alignment(void)
> +{
> +
> +	if (has_transparent_hugepage())
> +		return HPAGE_PMD_SIZE;
> +	return PAGE_SIZE;
> +}
> diff --git a/arch/x86/include/asm/libnvdimm.h b/arch/x86/include/asm/libnvdimm.h
> new file mode 100644
> index 000000000000..3d5361db9164
> --- /dev/null
> +++ b/arch/x86/include/asm/libnvdimm.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_LIBNVDIMM_H
> +#define _ASM_X86_LIBNVDIMM_H
> +
> +static inline unsigned long nd_pfn_default_alignment(void)
> +{
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	return HPAGE_PMD_SIZE;
> +#else
> +	return PAGE_SIZE;
> +#endif
> +}
> +
> +static inline unsigned long nd_altmap_align_size(unsigned long nd_align)
> +{
> +	return PMD_SIZE;
> +}
> +
> +#endif
> diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
> index a5ac3b240293..44fe923b2ee3 100644
> --- a/drivers/nvdimm/nd.h
> +++ b/drivers/nvdimm/nd.h
> @@ -292,12 +292,6 @@ static inline struct device *nd_btt_create(struct nd_region *nd_region)
>  struct nd_pfn *to_nd_pfn(struct device *dev);
>  #if IS_ENABLED(CONFIG_NVDIMM_PFN)
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define PFN_DEFAULT_ALIGNMENT HPAGE_PMD_SIZE
> -#else
> -#define PFN_DEFAULT_ALIGNMENT PAGE_SIZE
> -#endif
> -
>  int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns);
>  bool is_nd_pfn(struct device *dev);
>  struct device *nd_pfn_create(struct nd_region *nd_region);
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index 01f40672507f..347cab166376 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -18,6 +18,7 @@
>  #include <linux/slab.h>
>  #include <linux/fs.h>
>  #include <linux/mm.h>
> +#include <asm/libnvdimm.h>
>  #include "nd-core.h"
>  #include "pfn.h"
>  #include "nd.h"
> @@ -111,6 +112,8 @@ static ssize_t align_show(struct device *dev,
>  	return sprintf(buf, "%ld\n", nd_pfn->align);
>  }
>
> +#ifndef nd_pfn_supported_alignments
> +#define nd_pfn_supported_alignments nd_pfn_supported_alignments
>  static const unsigned long *nd_pfn_supported_alignments(void)
>  {
>  	/*
> @@ -133,6 +136,7 @@ static const unsigned long *nd_pfn_supported_alignments(void)
>
>  	return data;
>  }
> +#endif
>
>  static ssize_t align_store(struct device *dev,
>  		struct device_attribute *attr, const char *buf, size_t len)
> @@ -310,7 +314,7 @@ struct device *nd_pfn_devinit(struct nd_pfn *nd_pfn,
>  		return NULL;
>
>  	nd_pfn->mode = PFN_MODE_NONE;
> -	nd_pfn->align = PFN_DEFAULT_ALIGNMENT;
> +	nd_pfn->align = nd_pfn_default_alignment();
>  	dev = &nd_pfn->dev;
>  	device_initialize(&nd_pfn->dev);
>  	if (ndns && !__nd_attach_ndns(&nd_pfn->dev, ndns, &nd_pfn->ndns)) {
> @@ -420,6 +424,20 @@ static int nd_pfn_clear_memmap_errors(struct nd_pfn *nd_pfn)
>  	return 0;
>  }
>
> +static bool nd_supported_alignment(unsigned long align)
> +{
> +	int i;
> +	const unsigned long *supported = nd_pfn_supported_alignments();
> +
> +	if (align == 0)
> +		return false;
> +
> +	for (i = 0; supported[i]; i++)
> +		if (align == supported[i])
> +			return true;
> +	return false;
> +}
> +
>  int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
>  {
>  	u64 checksum, offset;
> @@ -474,6 +492,18 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
>  		align = 1UL << ilog2(offset);
>  	mode = le32_to_cpu(pfn_sb->mode);
>
> +	/*
> +	 * Check whether the we support the alignment. For Dax if the
> +	 * superblock alignment is not matching, we won't initialize
> +	 * the device.
> +	 */
> +	if (!nd_supported_alignment(align) &&
> +	    memcmp(pfn_sb->signature, DAX_SIG, PFN_SIG_LEN)) {
Suggestion to change this check to:

if (memcmp(pfn_sb->signature, DAX_SIG, PFN_SIG_LEN) &&
   !nd_supported_alignment(align))

It would read a bit more naturally, i.e. "if the device has a dax signature and the
alignment is not supported".


> +		dev_err(&nd_pfn->dev, "init failed, settings mismatch\n");
> +		dev_dbg(&nd_pfn->dev, "align: %lx:%lx\n", nd_pfn->align, align);
> +		return -EINVAL;
> +	}
> +
>  	if (!nd_pfn->uuid) {
>  		/*
>  		 * When probing a namepace via nd_pfn_probe() the uuid
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 381e872bfde0..d5cfea3d8b86 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -110,7 +110,12 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
>
>  	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
>  		return true;
> -
> +	/*
> +	 * For dax let's try to do hugepage fault always. If we don't support
> +	 * hugepages we will not have enabled namespaces with hugepage alignment.
> +	 * This also means we try to handle hugepage fault on device with
> +	 * smaller alignment. But for then we will return with VM_FAULT_FALLBACK
> +	 */
>  	if (vma_is_dax(vma))
>  		return true;
>
> -- 
> 2.21.0
>

-- 
Vaibhav Jain <vaibhav@linux.ibm.com>
Linux Technology Center, IBM India Pvt. Ltd.



* Re: [PATCH] mm/nvdimm: Pick the right alignment default when creating dax devices
From: Aneesh Kumar K.V @ 2019-05-17 15:17 UTC (permalink / raw)
  To: Vaibhav Jain, dan.j.williams; +Cc: linux-mm, linuxppc-dev, linux-nvdimm
In-Reply-To: <875zq9m8zx.fsf@vajain21.in.ibm.com>

On 5/17/19 8:19 PM, Vaibhav Jain wrote:
> Hi Aneesh,
> 
> Apart from a minor review comment for changes to nd_pfn_validate() the
> patch looks good to me.
> 
> Also, I Tested this patch on a PPC64 qemu guest with virtual nvdimm and
> verified that default alignment of newly created devdax namespace was
> 64KiB instead of 16MiB. Below are the test results:
> 
> * Without the patch creating a devdax namespace results in namespace
>    with 16MiB default alignment. Using daxio to zero out the dax device
>    results in a SIGBUS and a hashing failure.
> 
>    $ sudo ndctl create-namespace --mode=devdax  | grep align
>      "align":16777216,
>    "align":16777216
> 
>    $ sudo cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments
>    65536 16777216
> 
>    $ sudo daxio.static-debug  -z -o /dev/dax0.0
>    Bus error (core dumped)
> 
>    $ dmesg | tail
>    [  438.738958] lpar: Failed hash pte insert with error -4
>    [  438.739412] hash-mmu: mm: Hashing failure ! EA=0x7fff17000000 access=0x8000000000000006 current=daxio
>    [  438.739760] hash-mmu:     trap=0x300 vsid=0x22cb7a3 ssize=1 base psize=2 psize 10 pte=0xc000000501002b86
>    [  438.740143] daxio[3860]: bus error (7) at 7fff17000000 nip 7fff973c007c lr 7fff973bff34 code 2 in libpmem.so.1.0.0[7fff973b0000+20000]
>    [  438.740634] daxio[3860]: code: 792945e4 7d494b78 e95f0098 7d494b78 f93f00a0 4800012c e93f0088 f93f0120
>    [  438.741015] daxio[3860]: code: e93f00a0 f93f0128 e93f0120 e95f0128 <f9490000> e93f0088 39290008 f93f0110
> 
> * With the patch creating a devdax namespace results in namespace
>    with 64KiB default alignment. Using daxio to zero out the dax device
>    succeeds:
>    
>    $ sudo ndctl create-namespace --mode=devdax  | grep align
>      "align":65536,
>    "align":65536
> 
>    $ sudo cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments
>    65536
> 
>    $ daxio -z -o /dev/dax0.0
>    daxio: copied 2130706432 bytes to device "/dev/dax0.0"
> 
> Hence,
> 
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> 
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> 
>> Allow arch to provide the supported alignments and use hugepage alignment only
>> if we support hugepage. Right now we depend on compile time configs whereas this
>> patch switch this to runtime discovery.
>>
>> Architectures like ppc64 can have THP enabled in code, but then can have
>> hugepage size disabled by the hypervisor. This allows us to create dax devices
>> with PAGE_SIZE alignment in this case.
>>
>> Existing dax namespace with alignment larger than PAGE_SIZE will fail to
>> initialize in this specific case. We still allow fsdax namespace initialization.
>>
>> With respect to identifying whether to enable hugepage fault for a dax device,
>> if THP is enabled during compile, we default to taking hugepage fault and in dax
>> fault handler if we find the fault size > alignment we retry with PAGE_SIZE
>> fault size.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   arch/powerpc/include/asm/libnvdimm.h |  9 ++++++++
>>   arch/powerpc/mm/Makefile             |  1 +
>>   arch/powerpc/mm/nvdimm.c             | 34 ++++++++++++++++++++++++++++
>>   arch/x86/include/asm/libnvdimm.h     | 19 ++++++++++++++++
>>   drivers/nvdimm/nd.h                  |  6 -----
>>   drivers/nvdimm/pfn_devs.c            | 32 +++++++++++++++++++++++++-
>>   include/linux/huge_mm.h              |  7 +++++-
>>   7 files changed, 100 insertions(+), 8 deletions(-)
>>   create mode 100644 arch/powerpc/include/asm/libnvdimm.h
>>   create mode 100644 arch/powerpc/mm/nvdimm.c
>>   create mode 100644 arch/x86/include/asm/libnvdimm.h
>>
>> diff --git a/arch/powerpc/include/asm/libnvdimm.h b/arch/powerpc/include/asm/libnvdimm.h
>> new file mode 100644
>> index 000000000000..d35fd7f48603
>> --- /dev/null
>> +++ b/arch/powerpc/include/asm/libnvdimm.h
>> @@ -0,0 +1,9 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _ASM_POWERPC_LIBNVDIMM_H
>> +#define _ASM_POWERPC_LIBNVDIMM_H
>> +
>> +#define nd_pfn_supported_alignments nd_pfn_supported_alignments
>> +extern unsigned long *nd_pfn_supported_alignments(void);
>> +extern unsigned long nd_pfn_default_alignment(void);
>> +
>> +#endif
>> diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
>> index 0f499db315d6..42e4a399ba5d 100644
>> --- a/arch/powerpc/mm/Makefile
>> +++ b/arch/powerpc/mm/Makefile
>> @@ -20,3 +20,4 @@ obj-$(CONFIG_HIGHMEM)		+= highmem.o
>>   obj-$(CONFIG_PPC_COPRO_BASE)	+= copro_fault.o
>>   obj-$(CONFIG_PPC_PTDUMP)	+= ptdump/
>>   obj-$(CONFIG_KASAN)		+= kasan/
>> +obj-$(CONFIG_NVDIMM_PFN)		+= nvdimm.o
>> diff --git a/arch/powerpc/mm/nvdimm.c b/arch/powerpc/mm/nvdimm.c
>> new file mode 100644
>> index 000000000000..a29a4510715e
>> --- /dev/null
>> +++ b/arch/powerpc/mm/nvdimm.c
>> @@ -0,0 +1,34 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include <asm/pgtable.h>
>> +#include <asm/page.h>
>> +
>> +#include <linux/mm.h>
>> +/*
>> + * We support only pte and pmd mappings for now.
>> + */
>> +const unsigned long *nd_pfn_supported_alignments(void)
>> +{
>> +	static unsigned long supported_alignments[3];
>> +
>> +	supported_alignments[0] = PAGE_SIZE;
>> +
>> +	if (has_transparent_hugepage())
>> +		supported_alignments[1] = HPAGE_PMD_SIZE;
>> +	else
>> +		supported_alignments[1] = 0;
>> +
>> +	supported_alignments[2] = 0;
>> +	return supported_alignments;
>> +}
>> +
>> +/*
>> + * Use pmd mapping if supported as default alignment
>> + */
>> +unsigned long nd_pfn_default_alignment(void)
>> +{
>> +
>> +	if (has_transparent_hugepage())
>> +		return HPAGE_PMD_SIZE;
>> +	return PAGE_SIZE;
>> +}
>> diff --git a/arch/x86/include/asm/libnvdimm.h b/arch/x86/include/asm/libnvdimm.h
>> new file mode 100644
>> index 000000000000..3d5361db9164
>> --- /dev/null
>> +++ b/arch/x86/include/asm/libnvdimm.h
>> @@ -0,0 +1,19 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _ASM_X86_LIBNVDIMM_H
>> +#define _ASM_X86_LIBNVDIMM_H
>> +
>> +static inline unsigned long nd_pfn_default_alignment(void)
>> +{
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	return HPAGE_PMD_SIZE;
>> +#else
>> +	return PAGE_SIZE;
>> +#endif
>> +}
>> +
>> +static inline unsigned long nd_altmap_align_size(unsigned long nd_align)
>> +{
>> +	return PMD_SIZE;
>> +}
>> +
>> +#endif
>> diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
>> index a5ac3b240293..44fe923b2ee3 100644
>> --- a/drivers/nvdimm/nd.h
>> +++ b/drivers/nvdimm/nd.h
>> @@ -292,12 +292,6 @@ static inline struct device *nd_btt_create(struct nd_region *nd_region)
>>   struct nd_pfn *to_nd_pfn(struct device *dev);
>>   #if IS_ENABLED(CONFIG_NVDIMM_PFN)
>>
>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -#define PFN_DEFAULT_ALIGNMENT HPAGE_PMD_SIZE
>> -#else
>> -#define PFN_DEFAULT_ALIGNMENT PAGE_SIZE
>> -#endif
>> -
>>   int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns);
>>   bool is_nd_pfn(struct device *dev);
>>   struct device *nd_pfn_create(struct nd_region *nd_region);
>> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
>> index 01f40672507f..347cab166376 100644
>> --- a/drivers/nvdimm/pfn_devs.c
>> +++ b/drivers/nvdimm/pfn_devs.c
>> @@ -18,6 +18,7 @@
>>   #include <linux/slab.h>
>>   #include <linux/fs.h>
>>   #include <linux/mm.h>
>> +#include <asm/libnvdimm.h>
>>   #include "nd-core.h"
>>   #include "pfn.h"
>>   #include "nd.h"
>> @@ -111,6 +112,8 @@ static ssize_t align_show(struct device *dev,
>>   	return sprintf(buf, "%ld\n", nd_pfn->align);
>>   }
>>
>> +#ifndef nd_pfn_supported_alignments
>> +#define nd_pfn_supported_alignments nd_pfn_supported_alignments
>>   static const unsigned long *nd_pfn_supported_alignments(void)
>>   {
>>   	/*
>> @@ -133,6 +136,7 @@ static const unsigned long *nd_pfn_supported_alignments(void)
>>
>>   	return data;
>>   }
>> +#endif
>>
>>   static ssize_t align_store(struct device *dev,
>>   		struct device_attribute *attr, const char *buf, size_t len)
>> @@ -310,7 +314,7 @@ struct device *nd_pfn_devinit(struct nd_pfn *nd_pfn,
>>   		return NULL;
>>
>>   	nd_pfn->mode = PFN_MODE_NONE;
>> -	nd_pfn->align = PFN_DEFAULT_ALIGNMENT;
>> +	nd_pfn->align = nd_pfn_default_alignment();
>>   	dev = &nd_pfn->dev;
>>   	device_initialize(&nd_pfn->dev);
>>   	if (ndns && !__nd_attach_ndns(&nd_pfn->dev, ndns, &nd_pfn->ndns)) {
>> @@ -420,6 +424,20 @@ static int nd_pfn_clear_memmap_errors(struct nd_pfn *nd_pfn)
>>   	return 0;
>>   }
>>
>> +static bool nd_supported_alignment(unsigned long align)
>> +{
>> +	int i;
>> +	const unsigned long *supported = nd_pfn_supported_alignments();
>> +
>> +	if (align == 0)
>> +		return false;
>> +
>> +	for (i = 0; supported[i]; i++)
>> +		if (align == supported[i])
>> +			return true;
>> +	return false;
>> +}
>> +
>>   int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
>>   {
>>   	u64 checksum, offset;
>> @@ -474,6 +492,18 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
>>   		align = 1UL << ilog2(offset);
>>   	mode = le32_to_cpu(pfn_sb->mode);
>>
>> +	/*
>> +	 * Check whether the we support the alignment. For Dax if the
>> +	 * superblock alignment is not matching, we won't initialize
>> +	 * the device.
>> +	 */
>> +	if (!nd_supported_alignment(align) &&
>> +	    memcmp(pfn_sb->signature, DAX_SIG, PFN_SIG_LEN)) {
> Suggestion to change this check to:
> 
> if (memcmp(pfn_sb->signature, DAX_SIG, PFN_SIG_LEN) &&
>     !nd_supported_alignment(align))
> 
> It would look  a bit more natural i.e. "If the device has dax signature and alignment is
> not supported".
> 

I guess that should be !memcmp()? I will send an updated patch with
the hash failure details in the commit message.

-aneesh
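
The point about !memcmp() follows from memcmp() returning 0 when the buffers
are equal, so a bare memcmp() in a condition is true when the signature does
NOT match. A small userspace sketch of the two conditions, with hypothetical
names and stand-in signature strings (the real values live in the nvdimm
headers):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_SIG_LEN 16

/* Stand-in signatures for illustration only. */
static const char sketch_dax_sig[SKETCH_SIG_LEN] = "NVDIMM_DAX_INFO";
static const char sketch_pfn_sig[SKETCH_SIG_LEN] = "NVDIMM_PFN_INFO";

/* Condition as posted: true only when the signature is not the dax one. */
static bool sketch_reject_as_posted(const char *sig, bool align_supported)
{
	assert(sig != NULL);
	return !align_supported &&
	       memcmp(sig, sketch_dax_sig, SKETCH_SIG_LEN) != 0;
}

/* Condition with the negation: "has dax signature and alignment unsupported". */
static bool sketch_reject_with_negation(const char *sig, bool align_supported)
{
	assert(sig != NULL);
	return memcmp(sig, sketch_dax_sig, SKETCH_SIG_LEN) == 0 &&
	       !align_supported;
}
```

Only the negated form rejects a dax device with an unsupported alignment while
still letting a pfn (fsdax) superblock through, which is what the comment in
the patch and the change log describe as the intent.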



* [Bug 203517] WARNING: inconsistent lock state. inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
From: bugzilla-daemon @ 2019-05-17 15:48 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <bug-203517-206035@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=203517

--- Comment #6 from Erhard F. (erhard_f@mailbox.org) ---
Created attachment 282807
  --> https://bugzilla.kernel.org/attachment.cgi?id=282807&action=edit
kernel .config (5.1.3, Talos II)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


* [Bug 203517] WARNING: inconsistent lock state. inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
From: bugzilla-daemon @ 2019-05-17 15:49 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <bug-203517-206035@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=203517

Erhard F. (erhard_f@mailbox.org) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #282807|0                           |1
        is obsolete|                            |

--- Comment #7 from Erhard F. (erhard_f@mailbox.org) ---
Created attachment 282809
  --> https://bugzilla.kernel.org/attachment.cgi?id=282809&action=edit
dmesg (5.1.3, Talos II)

Still around in 5.1.3.



* Re: [RFC PATCH] powerpc/64/ftrace: mprofile-kernel patch out mflr
From: Naveen N. Rao @ 2019-05-17 17:25 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman, Nicholas Piggin
In-Reply-To: <1558084179.fz93ts375u.astroid@bobo.none>

Nicholas Piggin wrote:
> Naveen N. Rao's on May 17, 2019 4:22 am:
> 
>> While enabling ftrace, we will first need to patch the preceding 'mflr 
>> r0' (which would now be a 'nop') with 'b +8', then use 
>> synchronize_rcu_tasks() and finally patch in 'bl _mcount()' followed by 
>> 'mflr r0'.
>> 
>> I think that's what you meant, just that my reference to patching 'mflr 
>> r0' with a 'b +8' should have called out that the mflr would have been 
>> nop'ed out.
> 
> I meant that we don't need the b +8 anywhere, because the mflr r0
> is harmless. Enabling ftrace just needs to patch in 'mflr r0', and
> then synchronize_rcu_tasks(), and then patch in 'bl _mcount'. I think?

Ah, that's a good point! That should be enough and it simplifies this 
further.

Thanks,
Naveen
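
The ordering discussed above can be illustrated with a toy model of one
patch site. This is only a sketch: the two-slot site, the thread struct and
all names are hypothetical, and "running the old thread to the end of the
site" is a crude stand-in for what synchronize_rcu_tasks() guarantees.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_insn { SK_NOP, SK_MFLR, SK_BL_MCOUNT };

struct sketch_site   { enum sketch_insn slot[2]; };
struct sketch_thread { int pc; bool r0_holds_lr; };

/* Execute one instruction; false means _mcount was entered without LR in r0. */
static bool sketch_step(struct sketch_thread *t, const struct sketch_site *s)
{
	enum sketch_insn insn;

	assert(t->pc >= 0 && t->pc < 2);
	insn = s->slot[t->pc++];
	if (insn == SK_MFLR)
		t->r0_holds_lr = true;
	return !(insn == SK_BL_MCOUNT && !t->r0_holds_lr);
}

/* Let a thread run off the end of the site. */
static bool sketch_run_to_end(struct sketch_thread *t, const struct sketch_site *s)
{
	bool ok = true;

	while (t->pc < 2)
		ok = sketch_step(t, s) && ok;
	return ok;
}

/* Patch both instructions at once while a thread sits between them. */
static bool sketch_enable_in_one_go(void)
{
	struct sketch_site site = { { SK_NOP, SK_NOP } };
	struct sketch_thread t = { 0, false };
	bool ok = sketch_step(&t, &site);	/* runs the nop, then is interrupted */

	site.slot[0] = SK_MFLR;
	site.slot[1] = SK_BL_MCOUNT;
	return sketch_run_to_end(&t, &site) && ok;
}

/* Patch in mflr, wait for old threads to move on, then patch in the branch. */
static bool sketch_enable_in_two_steps(void)
{
	struct sketch_site site = { { SK_NOP, SK_NOP } };
	struct sketch_thread old = { 0, false };
	struct sketch_thread fresh = { 0, false };
	bool ok = sketch_step(&old, &site);	/* runs the nop, then is interrupted */

	site.slot[0] = SK_MFLR;
	ok = sketch_run_to_end(&old, &site) && ok;	/* grace period stand-in */
	site.slot[1] = SK_BL_MCOUNT;
	ok = sketch_run_to_end(&fresh, &site) && ok;
	return ok;
}
```

In this model the one-shot variant reports the unsafe entry into _mcount,
while the two-step variant does not, and no 'b +8' is needed because an
extra 'mflr r0' is harmless.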


* [RFC PATCH 0/4] Nop out the preceding mflr with -mprofile-kernel
From: Naveen N. Rao @ 2019-05-17 19:02 UTC (permalink / raw)
  To: Steven Rostedt, Michael Ellerman, Nicholas Piggin
  Cc: linuxppc-dev, linux-kernel

On powerpc64, -mprofile-kernel results in two instructions being 
emitted: 'mflr r0' and 'bl _mcount'. So far, we were only nop'ing out 
the branch to _mcount(). This series implements an approach to also nop 
out the preceding mflr.

Patches 1-3 are generic changes. Patch 2 is a fix for x86, but has not 
been tested. Patch 4 implements the changes for powerpc64.

- Naveen

Naveen N. Rao (4):
  ftrace: Expose flags used for ftrace_replace_code()
  x86/ftrace: Fix use of flags in ftrace_replace_code()
  ftrace: Expose __ftrace_replace_code()
  powerpc/ftrace: Additionally nop out the preceding mflr with
    -mprofile-kernel

 arch/powerpc/kernel/trace/ftrace.c | 188 +++++++++++++++++++++++++----
 arch/x86/kernel/ftrace.c           |   3 +-
 include/linux/ftrace.h             |   6 +
 kernel/trace/ftrace.c              |  13 +-
 4 files changed, 178 insertions(+), 32 deletions(-)

-- 
2.21.0


* [RFC PATCH 1/4] ftrace: Expose flags used for ftrace_replace_code()
From: Naveen N. Rao @ 2019-05-17 19:02 UTC (permalink / raw)
  To: Steven Rostedt, Michael Ellerman, Nicholas Piggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <cover.1558115654.git.naveen.n.rao@linux.vnet.ibm.com>

Since ftrace_replace_code() is a __weak function and can be overridden,
we need to expose the flags that can be set. So, move the flags enum to
the header file.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
 include/linux/ftrace.h | 5 +++++
 kernel/trace/ftrace.c  | 5 -----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 20899919ead8..835e761f63b0 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -162,6 +162,11 @@ enum {
 	FTRACE_OPS_FL_TRACE_ARRAY		= 1 << 15,
 };
 
+enum {
+	FTRACE_MODIFY_ENABLE_FL		= (1 << 0),
+	FTRACE_MODIFY_MAY_SLEEP_FL	= (1 << 1),
+};
+
 #ifdef CONFIG_DYNAMIC_FTRACE
 /* The hash used to know what functions callbacks trace */
 struct ftrace_ops_hash {
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index b920358dd8f7..38c15cd27fc4 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -78,11 +78,6 @@
 #define ASSIGN_OPS_HASH(opsname, val)
 #endif
 
-enum {
-	FTRACE_MODIFY_ENABLE_FL		= (1 << 0),
-	FTRACE_MODIFY_MAY_SLEEP_FL	= (1 << 1),
-};
-
 struct ftrace_ops ftrace_list_end __read_mostly = {
 	.func		= ftrace_stub,
 	.flags		= FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_STUB,
-- 
2.21.0



* [RFC PATCH 3/4] ftrace: Expose __ftrace_replace_code()
From: Naveen N. Rao @ 2019-05-17 19:02 UTC (permalink / raw)
  To: Steven Rostedt, Michael Ellerman, Nicholas Piggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <cover.1558115654.git.naveen.n.rao@linux.vnet.ibm.com>

While overriding ftrace_replace_code(), we still want to reuse the
existing __ftrace_replace_code() function. Rename the function and
make it available for other kernel code.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
 include/linux/ftrace.h | 1 +
 kernel/trace/ftrace.c  | 8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 835e761f63b0..3f4ceb982214 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -456,6 +456,7 @@ ftrace_set_early_filter(struct ftrace_ops *ops, char *buf, int enable);
 /* defined in arch */
 extern int ftrace_ip_converted(unsigned long ip);
 extern int ftrace_dyn_arch_init(void);
+extern int ftrace_do_replace_code(struct dyn_ftrace *rec, int enable);
 extern void ftrace_replace_code(int enable);
 extern int ftrace_update_ftrace_func(ftrace_func_t func);
 extern void ftrace_caller(void);
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 38c15cd27fc4..d94f7e526c33 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -2354,8 +2354,8 @@ unsigned long ftrace_get_addr_curr(struct dyn_ftrace *rec)
 		return (unsigned long)FTRACE_ADDR;
 }
 
-static int
-__ftrace_replace_code(struct dyn_ftrace *rec, int enable)
+int
+ftrace_do_replace_code(struct dyn_ftrace *rec, int enable)
 {
 	unsigned long ftrace_old_addr;
 	unsigned long ftrace_addr;
@@ -2406,7 +2406,7 @@ void __weak ftrace_replace_code(int mod_flags)
 		if (rec->flags & FTRACE_FL_DISABLED)
 			continue;
 
-		failed = __ftrace_replace_code(rec, enable);
+		failed = ftrace_do_replace_code(rec, enable);
 		if (failed) {
 			ftrace_bug(failed, rec);
 			/* Stop processing */
@@ -5822,7 +5822,7 @@ void ftrace_module_enable(struct module *mod)
 		rec->flags = cnt;
 
 		if (ftrace_start_up && cnt) {
-			int failed = __ftrace_replace_code(rec, 1);
+			int failed = ftrace_do_replace_code(rec, 1);
 			if (failed) {
 				ftrace_bug(failed, rec);
 				goto out_loop;
-- 
2.21.0



* [RFC PATCH 4/4] powerpc/ftrace: Additionally nop out the preceding mflr with -mprofile-kernel
From: Naveen N. Rao @ 2019-05-17 19:02 UTC (permalink / raw)
  To: Steven Rostedt, Michael Ellerman, Nicholas Piggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <cover.1558115654.git.naveen.n.rao@linux.vnet.ibm.com>

With -mprofile-kernel, gcc emits 'mflr r0', followed by 'bl _mcount' to
enable function tracing and profiling. So far, with dynamic ftrace, we
used to only patch out the branch to _mcount(). However, Nick Piggin
points out that "mflr is executed by the branch unit that can only
execute one per cycle on POWER9 and shared with branches, so it would be
nice to avoid it where possible."

We cannot simply nop out the mflr either. Michael Ellerman pointed out
that when enabling function tracing, there can be a race if tracing is
enabled when some thread was interrupted after executing a nop'ed out
mflr. In this case, the thread would execute the now-patched-in branch
to _mcount() without having executed the preceding mflr.

To solve this, we now enable function tracing in 2 steps: patch in the
mflr instruction, use synchronize_rcu_tasks() to ensure all existing
threads make progress, and then patch in the branch to _mcount(). We
override ftrace_replace_code() with a powerpc64 variant for this
purpose.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
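The two-step enable described above can be modelled in user space. This
is only a minimal sketch, not the kernel code: the enum values are labels
rather than real PowerPC encodings, wait_for_all_tasks() stands in for
synchronize_rcu_tasks(), and every name below is hypothetical. It also
ignores the older call sequence where the preceding instruction is a
std r0,LRSAVE(r1) and is left untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Labels for the instructions of interest; not real opcodes. */
enum insn { INSN_NOP, INSN_MFLR, INSN_BL_MCOUNT };

/* One traced call site: the instruction at ip - 4 and the one at ip. */
struct call_site {
	enum insn prev;	/* slot at ip - 4 */
	enum insn call;	/* slot at ip */
};

/* Stand-in for synchronize_rcu_tasks(); a no-op in this model. */
static void wait_for_all_tasks(void)
{
}

/* Step 1: put the mflr back while the branch is still patched out. */
static void enable_prep(struct call_site *cs)
{
	assert(cs->call == INSN_NOP);
	cs->prev = INSN_MFLR;
}

/* Step 2: patch in the branch; refuse if the mflr is not in place. */
static int enable_finish(struct call_site *cs)
{
	if (cs->prev != INSN_MFLR)
		return -1;
	cs->call = INSN_BL_MCOUNT;
	return 0;
}

/* Run step 1 on every site, wait, then run step 2 on every site. */
static int enable_tracing(struct call_site *sites, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		enable_prep(&sites[i]);

	wait_for_all_tasks();

	for (i = 0; i < n; i++)
		if (enable_finish(&sites[i]))
			return -1;

	return 0;
}
```

The point of the ordering is that no task can reach a patched-in branch
without also having executed the mflr in front of it; a lone mflr r0 with
the branch still nop'ed out only clobbers r0.
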
 arch/powerpc/kernel/trace/ftrace.c | 188 +++++++++++++++++++++++++----
 1 file changed, 166 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index 517662a56bdc..5c3523c3b259 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -125,7 +125,7 @@ __ftrace_make_nop(struct module *mod,
 {
 	unsigned long entry, ptr, tramp;
 	unsigned long ip = rec->ip;
-	unsigned int op, pop;
+	unsigned int op;
 
 	/* read where this goes */
 	if (probe_kernel_read(&op, (void *)ip, sizeof(int))) {
@@ -160,8 +160,6 @@ __ftrace_make_nop(struct module *mod,
 
 #ifdef CONFIG_MPROFILE_KERNEL
 	/* When using -mkernel_profile there is no load to jump over */
-	pop = PPC_INST_NOP;
-
 	if (probe_kernel_read(&op, (void *)(ip - 4), 4)) {
 		pr_err("Fetching instruction at %lx failed.\n", ip - 4);
 		return -EFAULT;
@@ -169,26 +167,22 @@ __ftrace_make_nop(struct module *mod,
 
 	/* We expect either a mflr r0, or a std r0, LRSAVE(r1) */
 	if (op != PPC_INST_MFLR && op != PPC_INST_STD_LR) {
-		pr_err("Unexpected instruction %08x around bl _mcount\n", op);
+		pr_err("Unexpected instruction %08x before bl _mcount\n", op);
 		return -EINVAL;
 	}
-#else
-	/*
-	 * Our original call site looks like:
-	 *
-	 * bl <tramp>
-	 * ld r2,XX(r1)
-	 *
-	 * Milton Miller pointed out that we can not simply nop the branch.
-	 * If a task was preempted when calling a trace function, the nops
-	 * will remove the way to restore the TOC in r2 and the r2 TOC will
-	 * get corrupted.
-	 *
-	 * Use a b +8 to jump over the load.
-	 */
 
-	pop = PPC_INST_BRANCH | 8;	/* b +8 */
+	/* We should patch out the bl to _mcount first */
+	if (patch_instruction((unsigned int *)ip, PPC_INST_NOP)) {
+		pr_err("Patching NOP failed.\n");
+		return -EPERM;
+	}
 
+	if (op == PPC_INST_MFLR &&
+		patch_instruction((unsigned int *)(ip - 4), PPC_INST_NOP)) {
+		pr_err("Patching NOP failed.\n");
+		return -EPERM;
+	}
+#else
 	/*
 	 * Check what is in the next instruction. We can see ld r2,40(r1), but
 	 * on first pass after boot we will see mflr r0.
@@ -202,12 +196,25 @@ __ftrace_make_nop(struct module *mod,
 		pr_err("Expected %08x found %08x\n", PPC_INST_LD_TOC, op);
 		return -EINVAL;
 	}
-#endif /* CONFIG_MPROFILE_KERNEL */
 
-	if (patch_instruction((unsigned int *)ip, pop)) {
+	/*
+	 * Our original call site looks like:
+	 *
+	 * bl <tramp>
+	 * ld r2,XX(r1)
+	 *
+	 * Milton Miller pointed out that we can not simply nop the branch.
+	 * If a task was preempted when calling a trace function, the nops
+	 * will remove the way to restore the TOC in r2 and the r2 TOC will
+	 * get corrupted.
+	 *
+	 * Use a b +8 to jump over the load.
+	 */
+	if (patch_instruction((unsigned int *)ip, PPC_INST_BRANCH | 8)) {
 		pr_err("Patching NOP failed.\n");
 		return -EPERM;
 	}
+#endif /* CONFIG_MPROFILE_KERNEL */
 
 	return 0;
 }
@@ -421,6 +428,25 @@ static int __ftrace_make_nop_kernel(struct dyn_ftrace *rec, unsigned long addr)
 		return -EPERM;
 	}
 
+#ifdef CONFIG_MPROFILE_KERNEL
+	if (probe_kernel_read(&op, (void *)(ip - 4), 4)) {
+		pr_err("Fetching instruction at %lx failed.\n", ip - 4);
+		return -EFAULT;
+	}
+
+	/* We expect either a mflr r0, or a std r0, LRSAVE(r1) */
+	if (op != PPC_INST_MFLR && op != PPC_INST_STD_LR) {
+		pr_err("Unexpected instruction %08x before bl _mcount\n", op);
+		return -EINVAL;
+	}
+
+	if (op == PPC_INST_MFLR &&
+		patch_instruction((unsigned int *)(ip - 4), PPC_INST_NOP)) {
+		pr_err("Patching NOP failed.\n");
+		return -EPERM;
+	}
+#endif
+
 	return 0;
 }
 
@@ -429,6 +455,7 @@ int ftrace_make_nop(struct module *mod,
 {
 	unsigned long ip = rec->ip;
 	unsigned int old, new;
+	int rc;
 
 	/*
 	 * If the calling address is more that 24 bits away,
@@ -439,7 +466,27 @@ int ftrace_make_nop(struct module *mod,
 		/* within range */
 		old = ftrace_call_replace(ip, addr, 1);
 		new = PPC_INST_NOP;
-		return ftrace_modify_code(ip, old, new);
+		rc = ftrace_modify_code(ip, old, new);
+#ifdef CONFIG_MPROFILE_KERNEL
+		if (rc)
+			return rc;
+
+		if (probe_kernel_read(&old, (void *)(ip - 4), 4)) {
+			pr_err("Fetching instruction at %lx failed.\n", ip - 4);
+			return -EFAULT;
+		}
+
+		/* We expect either a mflr r0, or a std r0, LRSAVE(r1) */
+		if (old != PPC_INST_MFLR && old != PPC_INST_STD_LR) {
+			pr_err("Unexpected instruction %08x before bl _mcount\n", old);
+			return -EINVAL;
+		}
+
+		if (old == PPC_INST_MFLR)
+			rc = patch_instruction((unsigned int *)(ip - 4),
+					PPC_INST_NOP);
+#endif
+		return rc;
 	} else if (core_kernel_text(ip))
 		return __ftrace_make_nop_kernel(rec, addr);
 
@@ -863,6 +910,103 @@ void arch_ftrace_update_code(int command)
 	ftrace_modify_all_code(command);
 }
 
+#ifdef CONFIG_MPROFILE_KERNEL
+static int
+__ftrace_make_call_prep(struct dyn_ftrace *rec)
+{
+	void *ip = (void *)rec->ip - MCOUNT_INSN_SIZE;
+	unsigned int op[2], pop;
+
+	/* read where this goes */
+	if (probe_kernel_read(op, ip, sizeof(op)))
+		return -EFAULT;
+
+	if (op[1] != PPC_INST_NOP) {
+		pr_err("Unexpected call sequence at %p: %x %x\n",
+							ip, op[0], op[1]);
+		return -EINVAL;
+	}
+
+	/*
+	 * nothing to do if this is using the older -mprofile-kernel
+	 * instruction sequence
+	 */
+	if (op[0] != PPC_INST_NOP)
+		return 0;
+
+	pop = PPC_INST_MFLR;
+
+	if (patch_instruction((unsigned int *)ip, pop)) {
+		pr_err("Patching MFLR failed.\n");
+		return -EPERM;
+	}
+
+	return 0;
+}
+
+void ftrace_replace_code(int mod_flags)
+{
+	int enable = mod_flags & FTRACE_MODIFY_ENABLE_FL;
+	int schedulable = mod_flags & FTRACE_MODIFY_MAY_SLEEP_FL;
+	int ret, failed, make_call = 0;
+	struct ftrace_rec_iter *iter;
+	struct dyn_ftrace *rec;
+
+	if (unlikely(!ftrace_enabled))
+		return;
+
+	for_ftrace_rec_iter(iter) {
+		rec = ftrace_rec_iter_record(iter);
+
+		if (rec->flags & FTRACE_FL_DISABLED)
+			continue;
+
+		ret = ftrace_test_record(rec, enable);
+		if (ret == FTRACE_UPDATE_MAKE_CALL) {
+			make_call++;
+			failed = __ftrace_make_call_prep(rec);
+		} else {
+			failed = ftrace_do_replace_code(rec, enable);
+		}
+
+		if (failed) {
+			ftrace_bug(failed, rec);
+			/* Stop processing */
+			return;
+		}
+
+		if (schedulable)
+			cond_resched();
+	}
+
+	if (!make_call)
+		return;
+
+	synchronize_rcu_tasks();
+
+	for_ftrace_rec_iter(iter) {
+		rec = ftrace_rec_iter_record(iter);
+
+		if (rec->flags & FTRACE_FL_DISABLED)
+			continue;
+
+		ret = ftrace_test_record(rec, enable);
+		if (ret == FTRACE_UPDATE_MAKE_CALL)
+			failed = ftrace_do_replace_code(rec, enable);
+
+		if (failed) {
+			ftrace_bug(failed, rec);
+			/* Stop processing */
+			return;
+		}
+
+		if (schedulable)
+			cond_resched();
+	}
+
+}
+#endif
+
 #ifdef CONFIG_PPC64
 #define PACATOC offsetof(struct paca_struct, kernel_toc)
 
-- 
2.21.0



* [RFC PATCH 2/4] x86/ftrace: Fix use of flags in ftrace_replace_code()
From: Naveen N. Rao @ 2019-05-17 19:02 UTC (permalink / raw)
  To: Steven Rostedt, Michael Ellerman, Nicholas Piggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <cover.1558115654.git.naveen.n.rao@linux.vnet.ibm.com>

In commit a0572f687fb3c ("ftrace: Allow ftrace_replace_code() to be
schedulable"), the generic ftrace_replace_code() function was modified to
accept a flags argument in place of a single 'enable' flag. However, the
x86 version of this function was not updated. Fix the same.

Fixes: a0572f687fb3c ("ftrace: Allow ftrace_replace_code() to be schedulable")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
I haven't yet tested this patch on x86, but this looked wrong, so I am
sending this as an RFC.

- Naveen
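
To illustrate why the mask matters, here is a small user-space sketch.
The two flag values are assumed to mirror the definitions in
include/linux/ftrace.h; the helper names are made up for this example.

```c
#include <stdbool.h>

/* Assumed to match the flag bits used by the generic ftrace code. */
enum {
	FTRACE_MODIFY_ENABLE_FL		= (1 << 0),
	FTRACE_MODIFY_MAY_SLEEP_FL	= (1 << 1),
};

/* Without the fix: the whole flags word is read as the 'enable' value. */
static bool enable_without_mask(int mod_flags)
{
	return mod_flags != 0;
}

/* With the fix: only the enable bit decides. */
static bool enable_with_mask(int mod_flags)
{
	return (mod_flags & FTRACE_MODIFY_ENABLE_FL) != 0;
}
```

A caller that passes only FTRACE_MODIFY_MAY_SLEEP_FL (disable, but allowed
to sleep) is read as "enable" by the first helper and as "disable" by the
second.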

 arch/x86/kernel/ftrace.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 0caf8122d680..0c01b344ba16 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -554,8 +554,9 @@ static void run_sync(void)
 		local_irq_disable();
 }

-void ftrace_replace_code(int enable)
+void ftrace_replace_code(int mod_flags)
 {
+	int enable = mod_flags & FTRACE_MODIFY_ENABLE_FL;
 	struct ftrace_rec_iter *iter;
 	struct dyn_ftrace *rec;
 	const char *report = "adding breakpoints";
-- 
2.21.0


* Re: [PATCH] ASoC: fsl_esai: fix the channel swap issue after xrun
From: Nicolin Chen @ 2019-05-17 20:01 UTC (permalink / raw)
  To: S.j. Wang
  Cc: alsa-devel@alsa-project.org, timur@kernel.org,
	Xiubo.Lee@gmail.com, festevam@gmail.com,
	linux-kernel@vger.kernel.org, broonie@kernel.org,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20190517030903.25731-1-shengjiu.wang@nxp.com>

On Fri, May 17, 2019 at 03:09:22AM +0000, S.j. Wang wrote:
> There is chip errata ERR008000, the reference doc is
> (https://www.nxp.com/docs/en/errata/IMX6DQCE.pdf),
> 
> The issue is "While using ESAI transmit or receive and
> an underrun/overrun happens, channel swap may occur.
> The only recovery mechanism is to reset the ESAI."
> 
> In this commit add a tasklet to handle reset of ESAI
> after xrun happens
> 
> Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
> ---
>  sound/soc/fsl/fsl_esai.c | 166 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 166 insertions(+)
> 
> diff --git a/sound/soc/fsl/fsl_esai.c b/sound/soc/fsl/fsl_esai.c
> index 10d2210c91ef..149972894c95 100644
> --- a/sound/soc/fsl/fsl_esai.c
> +++ b/sound/soc/fsl/fsl_esai.c
> @@ -52,17 +52,20 @@ struct fsl_esai {
>  	struct clk *extalclk;
>  	struct clk *fsysclk;
>  	struct clk *spbaclk;
> +	struct tasklet_struct task;
[...]
> +	u32 tx_channels;
[...]
> +	bool reset_at_xrun;

Please add descriptions for them in the comments of the struct.

  
> @@ -71,8 +74,14 @@ static irqreturn_t esai_isr(int irq, void *devid)
>  	struct fsl_esai *esai_priv = (struct fsl_esai *)devid;
>  	struct platform_device *pdev = esai_priv->pdev;
>  	u32 esr;
> +	u32 saisr;
>  
>  	regmap_read(esai_priv->regmap, REG_ESAI_ESR, &esr);
> +	regmap_read(esai_priv->regmap, REG_ESAI_SAISR, &saisr);
> +
> +	if ((saisr & (ESAI_SAISR_TUE | ESAI_SAISR_ROE))
> +		&& esai_priv->reset_at_xrun)

Please follow the coding style:
+	if ((saisr & (ESAI_SAISR_TUE | ESAI_SAISR_ROE)) &&
+	    esai_priv->reset_at_xrun)

> +		tasklet_schedule(&esai_priv->task);

And maybe a dev_dbg also to inform people it's recovering.

> @@ -552,6 +561,9 @@ static int fsl_esai_trigger(struct snd_pcm_substream *substream, int cmd,
>  	u32 pins = DIV_ROUND_UP(channels, esai_priv->slots);
>  	u32 mask;
>  
> +	if (tx)
> +		esai_priv->tx_channels = channels;
> +
>  	switch (cmd) {
>  	case SNDRV_PCM_TRIGGER_START:
>  	case SNDRV_PCM_TRIGGER_RESUME:
> @@ -585,10 +597,16 @@ static int fsl_esai_trigger(struct snd_pcm_substream *substream, int cmd,
>  		regmap_update_bits(esai_priv->regmap, REG_ESAI_xSMA(tx),
>  				   ESAI_xSMA_xS_MASK, ESAI_xSMA_xS(mask));
>  
> +		regmap_update_bits(esai_priv->regmap, REG_ESAI_xCR(tx),
> +				   ESAI_xCR_xEIE_MASK, ESAI_xCR_xEIE);

A line of comments please.

> +static void fsl_esai_reset(unsigned long arg)
> +{
> +	struct fsl_esai *esai_priv = (struct fsl_esai *)arg;

> +	u32 saisr;
> +	u32 tsma, tsmb, rsma, rsmb, tcr, rcr, tfcr, rfcr;

Could we merge these two lines?

> +	/*
> +	 * stop the tx & rx
> +	 */

Single-line style please.

> +	regmap_read(esai_priv->regmap, REG_ESAI_TSMA, &tsma);
> +	regmap_read(esai_priv->regmap, REG_ESAI_TSMB, &tsmb);
> +	regmap_read(esai_priv->regmap, REG_ESAI_RSMA, &rsma);
> +	regmap_read(esai_priv->regmap, REG_ESAI_RSMB, &rsmb);
> +
> +	regmap_read(esai_priv->regmap, REG_ESAI_TCR, &tcr);
> +	regmap_read(esai_priv->regmap, REG_ESAI_RCR, &rcr);
> +
> +	regmap_read(esai_priv->regmap, REG_ESAI_TFCR, &tfcr);
> +	regmap_read(esai_priv->regmap, REG_ESAI_RFCR, &rfcr);

I think this chunk is saving register values rather than doing the "stop".

> +
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_TCR,
> +				ESAI_xCR_xEIE_MASK | ESAI_xCR_TE_MASK, 0);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_RCR,
> +				ESAI_xCR_xEIE_MASK | ESAI_xCR_RE_MASK, 0);

Indentation:
+	regmap_update_bits(esai_priv->regmap, REG_ESAI_RCR,
+			   ESAI_xCR_xEIE_MASK | ESAI_xCR_RE_MASK, 0);

> +
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_TSMA,
> +				ESAI_xSMA_xS_MASK, 0);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_TSMB,
> +				ESAI_xSMB_xS_MASK, 0);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_RSMA,
> +				ESAI_xSMA_xS_MASK, 0);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_RSMB,
> +				ESAI_xSMB_xS_MASK, 0);
> +
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_TFCR,
> +				ESAI_xFCR_xFR | ESAI_xFCR_xFEN, ESAI_xFCR_xFR);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_TFCR,
> +				ESAI_xFCR_xFR, 0);
> +
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_RFCR,
> +				ESAI_xFCR_xFR | ESAI_xFCR_xFEN, ESAI_xFCR_xFR);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_RFCR,
> +				ESAI_xFCR_xFR, 0);

Just a thought that I'd like to discuss: since these operations
are exactly the same as TRIGGER_STOP(tx) + TRIGGER_STOP(rx), can
we abstract them into a function like fsl_esai_trigger_stop(.., bool tx)?

Benefits would be A) easier to read B) Won't miss an operation,
as we might add something new to one of the stop routines while
forgetting the other side.
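
A rough user-space sketch of what such a shared helper could look like.
This is not driver code: the register indices, the bit masks and
update_bits() are hypothetical stand-ins for the real REG_ESAI_* offsets,
ESAI_* masks and regmap_update_bits().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register indices, for illustration only. */
enum {
	REG_TCR, REG_RCR, REG_TSMA, REG_RSMA,
	REG_TSMB, REG_RSMB, REG_TFCR, REG_RFCR,
	REG_COUNT
};

/* Hypothetical bit masks. */
#define XCR_XEIE	(1u << 0)	/* exception interrupt enable */
#define XCR_EN		(1u << 1)	/* stands in for TE / RE */
#define XSM_MASK	0xffffu		/* slot mask bits */
#define XFCR_XFR	(1u << 0)	/* FIFO reset */
#define XFCR_XFEN	(1u << 1)	/* FIFO enable */

struct fake_esai {
	uint32_t regs[REG_COUNT];
};

/* Read-modify-write of a register, in the spirit of regmap_update_bits(). */
static void update_bits(struct fake_esai *e, int reg, uint32_t mask,
			uint32_t val)
{
	assert(reg >= 0 && reg < REG_COUNT);
	e->regs[reg] = (e->regs[reg] & ~mask) | (val & mask);
}

/* One stop routine for both directions, selected by 'tx'. */
static void trigger_stop(struct fake_esai *e, bool tx)
{
	int cr = tx ? REG_TCR : REG_RCR;
	int sma = tx ? REG_TSMA : REG_RSMA;
	int smb = tx ? REG_TSMB : REG_RSMB;
	int fcr = tx ? REG_TFCR : REG_RFCR;

	/* Disable the interrupt and the transmitter/receiver. */
	update_bits(e, cr, XCR_XEIE | XCR_EN, 0);

	/* Clear the slot masks. */
	update_bits(e, sma, XSM_MASK, 0);
	update_bits(e, smb, XSM_MASK, 0);

	/* Disable the FIFO and pulse its reset bit. */
	update_bits(e, fcr, XFCR_XFR | XFCR_XFEN, XFCR_XFR);
	update_bits(e, fcr, XFCR_XFR, 0);
}
```

With something along these lines, both the TRIGGER_STOP case and the
reset path could call trigger_stop(.., true) and trigger_stop(.., false).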

> +	/*
> +	 * reset the esai, and restore the registers
> +	 */
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_ECR,
> +				ESAI_ECR_ESAIEN_MASK | ESAI_ECR_ERST_MASK,
> +				ESAI_ECR_ESAIEN | ESAI_ECR_ERST);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_ECR,
> +				ESAI_ECR_ESAIEN_MASK | ESAI_ECR_ERST_MASK,
> +				ESAI_ECR_ESAIEN);
> +
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_TCR,
> +				ESAI_xCR_xPR_MASK,
> +				ESAI_xCR_xPR);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_RCR,
> +				ESAI_xCR_xPR_MASK,
> +				ESAI_xCR_xPR);
> +
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_PRRC,
> +				ESAI_PRRC_PDC_MASK, 0);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_PCRC,
> +				ESAI_PCRC_PC_MASK, 0);

And this could be abstracted too by sharing with probe().

> +	/*
> +	 * Add fifo reset here, because the regcache_sync will
> +	 * write one more data to ETDR.
> +	 * Which will cause channel shift.

Sounds like a bug to me...should fix it first by marking the
data registers as volatile.
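
As a sketch of the idea only: regmap lets a driver supply a volatile_reg
callback in its struct regmap_config, and registers reported as volatile
are not kept in the register cache, so a cache sync should have nothing
to write back for them. The offsets below are placeholders and the
function name is hypothetical; the struct device argument of the real
callback is dropped to keep this self-contained.

```c
#include <stdbool.h>

/* Placeholder offsets; the real values live in the driver header. */
#define REG_ESAI_ETDR	0x00	/* transmit data register */
#define REG_ESAI_ERDR	0x04	/* receive data register */
#define REG_ESAI_ECR	0x08	/* control register, safe to cache */

/* Report the FIFO data registers as volatile, everything else as cacheable. */
static bool esai_reg_is_volatile(unsigned int reg)
{
	switch (reg) {
	case REG_ESAI_ETDR:
	case REG_ESAI_ERDR:
		return true;
	default:
		return false;
	}
}
```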

> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_TFCR,
> +				ESAI_xFCR_xFR_MASK, ESAI_xFCR_xFR);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_RFCR,
> +				ESAI_xFCR_xFR_MASK, ESAI_xFCR_xFR);
> +
> +	regcache_mark_dirty(esai_priv->regmap);
> +	regcache_sync(esai_priv->regmap);
> +
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_TFCR,
> +				ESAI_xFCR_xFR_MASK, 0);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_RFCR,
> +				ESAI_xFCR_xFR_MASK, 0);
> +
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_TCR,
> +				ESAI_xCR_xPR_MASK, 0);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_RCR,
> +				ESAI_xCR_xPR_MASK, 0);

Also same as suspend()-resume().

> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_PRRC,
> +				ESAI_PRRC_PDC_MASK,
> +				ESAI_PRRC_PDC(ESAI_GPIO));
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_PCRC,
> +				ESAI_PCRC_PC_MASK,
> +				ESAI_PCRC_PC(ESAI_GPIO));
> +
> +	regmap_read(esai_priv->regmap, REG_ESAI_SAISR, &saisr);
> +
> +	/*
> +	 * restart tx / rx, if they already enabled
> +	 */
> +
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_TFCR,
> +				ESAI_xFCR_xFEN_MASK, tfcr & ESAI_xFCR_xFEN);
> +
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_RFCR,
> +				ESAI_xFCR_xFEN_MASK, rfcr & ESAI_xFCR_xFEN);

Btw, this xFEN should be xFE...a typo in the driver itself...

> +
> +	/* Write initial words reqiured by ESAI as normal procedure */
> +	for (i = 0; i < esai_priv->tx_channels; i++)
> +		regmap_write(esai_priv->regmap, REG_ESAI_ETDR, 0x0);
> +
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_TCR,
> +				ESAI_xCR_TE_MASK,
> +				ESAI_xCR_TE_MASK & tcr);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_RCR,
> +				ESAI_xCR_RE_MASK,
> +				ESAI_xCR_RE_MASK & rcr);
> +
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_TSMB,
> +				ESAI_xSMB_xS_MASK,
> +				ESAI_xSMB_xS_MASK & tsmb);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_TSMA,
> +				ESAI_xSMA_xS_MASK,
> +				ESAI_xSMA_xS_MASK & tsma);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_RSMB,
> +				ESAI_xSMB_xS_MASK,
> +				ESAI_xSMB_xS_MASK & rsmb);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_RSMA,
> +				ESAI_xSMA_xS_MASK,
> +				ESAI_xSMA_xS_MASK & rsma);
> +
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_TCR,
> +			   ESAI_xCR_xEIE_MASK, ESAI_xCR_xEIE & tcr);
> +	regmap_update_bits(esai_priv->regmap, REG_ESAI_RCR,
> +			   ESAI_xCR_xEIE_MASK, ESAI_xCR_xEIE & rcr);

Similarly having an fsl_esai_trigger_start() could do:
	if (tfcr & ESAI_xFCR_xFE)
		fsl_esai_trigger_start(tx);
	if (rfcr & ESAI_xFCR_xFE)
		fsl_esai_trigger_start(rx);

Thank you


* Re: [PATCH 1/3] powerpc/pseries: Simplify cpu readd to use drc_index
From: Tyrel Datwyler @ 2019-05-17 22:58 UTC (permalink / raw)
  To: Nathan Lynch, Tyrel Datwyler; +Cc: mingming.cao, linuxppc-dev
In-Reply-To: <8736leky3x.fsf@linux.ibm.com>

On 05/16/2019 12:17 PM, Nathan Lynch wrote:
> Tyrel Datwyler <tyreld@linux.ibm.com> writes:
>> The current dlpar_cpu_readd() takes in a cpu_id and uses that to look up
>> the cpus device_node so that we can get at the ibm,my-drc-index
>> property. The only user of cpu readd is an OF notifier call back. This
>> call back already has a reference to the device_node and therefore can
>> retrieve the drc_index from the device_node.
> 
> dlpar_cpu_readd is a hack to try to change the CPU-node relationship at
> runtime without destabilizing the system. It doesn't accomplish that and
> it should just be removed (and I'm working on that).
> 

I will politely disagree. We've done exactly this from userspace for years. My
experience still suggests that memory affinity is the problem area, and that the
work to push this all into the kernel originally was poorly tested.

-Tyrel



* Re: [PATCH v2] powerpc/mm/hash: Fix get_region_id() for invalid addresses
From: Nicholas Piggin @ 2019-05-18  1:56 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman; +Cc: aneesh.kumar
In-Reply-To: <20190517132958.22299-1-mpe@ellerman.id.au>

Michael Ellerman's on May 17, 2019 11:29 pm:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
> 
> Accesses by userspace to random addresses outside the user or kernel
> address range will generate an SLB fault. When we handle that fault we
> classify the effective address into several classes, eg. user, kernel
> linear, kernel virtual etc.
> 
> For addresses that are completely outside of any valid range, we
> should not insert an SLB entry at all, and instead immediately raise an
> exception.
> 
> In the past this was handled in two ways. Firstly we would check the
> top nibble of the address (using REGION_ID(ea)) and that would tell us
> if the address was user (0), kernel linear (c), kernel virtual (d), or
> vmemmap (f). If the address didn't match any of these it was invalid.
> 
> Then for each type of address we would do a secondary check. For the
> user region we check against H_PGTABLE_RANGE, for kernel linear we
> would mask the top nibble of the address and then check the address
> against MAX_PHYSMEM_BITS.
> 
> As part of commit 0034d395f89d ("powerpc/mm/hash64: Map all the kernel
> regions in the same 0xc range") we replaced REGION_ID() with
> get_region_id() and changed the masking of the top nibble to only mask
> the top two bits, which introduced a bug.
> 
> Addresses less than (4 << 60) are still handled correctly, they are
> either less than (1 << 60) in which case they are subject to the
> H_PGTABLE_RANGE check, or they are correctly checked against
> MAX_PHYSMEM_BITS.
> 
> However addresses from (4 << 60) to ((0xc << 60) - 1), are incorrectly
> treated as kernel linear addresses in get_region_id(). Then the top
> two bits are cleared by EA_MASK in slb_allocate_kernel() and the
> address is checked against MAX_PHYSMEM_BITS, which it passes due to
> the masking. The end result is we incorrectly insert SLB entries for
> those addresses.
> 
> That is not actually catastrophic, having inserted the SLB entry we
> will then go on to take a page fault for the address and at that point
> we detect the problem and report it as a bad fault.
> 
> Still we should not be inserting those entries, or treating them as
> kernel linear addresses in the first place. So fix get_region_id() to
> detect addresses in that range and return an invalid region id, which
> we can use to not insert an SLB entry and directly report an
> exception.
> 
> Fixes: 0034d395f89d ("powerpc/mm/hash64: Map all the kernel regions in the same 0xc range")
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> [mpe: Drop change to EA_MASK for now, rewrite change log]
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

Looks good to me.
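
For anyone following along, the top-nibble check can be tried out in user
space. This is a simplified sketch, not the kernel function: it assumes
PAGE_OFFSET is 0xc000000000000000, lumps all kernel regions together, and
uses made-up names.

```c
#include <stdint.h>

/* Assumed value of PAGE_OFFSET on 64-bit Book3S hash. */
#define ASSUMED_PAGE_OFFSET	0xc000000000000000ULL

/* Clears the top two bits, like the masking the changelog describes. */
#define TOP_TWO_BITS_MASK	(~(0xcULL << 60))

enum region { REGION_USER, REGION_KERNEL, REGION_INVALID };

/* Classify an effective address by its top nibble. */
static enum region classify_ea(uint64_t ea)
{
	uint64_t id = ea >> 60;

	if (id == 0)
		return REGION_USER;

	/* Anything that is neither 0x0 nor 0xc in the top nibble is bogus. */
	if (id != (ASSUMED_PAGE_OFFSET >> 60))
		return REGION_INVALID;

	return REGION_KERNEL;
}

/* What the address looks like once the top two bits are masked off. */
static uint64_t mask_top_two_bits(uint64_t ea)
{
	return ea & TOP_TWO_BITS_MASK;
}
```

An address such as 0x4000000000001000 masks down to 0x1000, which is why
it used to pass the MAX_PHYSMEM_BITS check; with the extra nibble test it
is classified as invalid before any masking happens.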

> ---
>  arch/powerpc/include/asm/book3s/64/hash.h | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> v2: Drop change to EA_MASK for now, rewrite change log.
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> index 5486087e64ea..2781ebf6add4 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> @@ -93,6 +93,7 @@
>  #define VMALLOC_REGION_ID	NON_LINEAR_REGION_ID(H_VMALLOC_START)
>  #define IO_REGION_ID		NON_LINEAR_REGION_ID(H_KERN_IO_START)
>  #define VMEMMAP_REGION_ID	NON_LINEAR_REGION_ID(H_VMEMMAP_START)
> +#define INVALID_REGION_ID	(VMEMMAP_REGION_ID + 1)
>  
>  /*
>   * Defines the address of the vmemap area, in its own region on
> @@ -119,6 +120,9 @@ static inline int get_region_id(unsigned long ea)
>  	if (id == 0)
>  		return USER_REGION_ID;
>  
> +	if (id != (PAGE_OFFSET >> 60))
> +		return INVALID_REGION_ID;
> +
>  	if (ea < H_KERN_VIRT_START)
>  		return LINEAR_MAP_REGION_ID;
>  
> -- 
> 2.20.1
> 
> 


* Re: [RFC PATCH 4/4] powerpc/ftrace: Additionally nop out the preceding mflr with -mprofile-kernel
From: Nicholas Piggin @ 2019-05-18  2:08 UTC (permalink / raw)
  To: Michael Ellerman, Naveen N. Rao, Steven Rostedt
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <ee2a5457d98850f51bf96eb17389b375e6955bbf.1558115654.git.naveen.n.rao@linux.vnet.ibm.com>

Naveen N. Rao's on May 18, 2019 5:02 am:
> With -mprofile-kernel, gcc emits 'mflr r0', followed by 'bl _mcount' to
> enable function tracing and profiling. So far, with dynamic ftrace, we
> used to only patch out the branch to _mcount(). However, Nick Piggin
> points out that "mflr is executed by the branch unit that can only
> execute one per cycle on POWER9 and shared with branches, so it would be
> nice to avoid it where possible."
> 
> We cannot simply nop out the mflr either. Michael Ellerman pointed out
> that when enabling function tracing, there can be a race if tracing is
> enabled when some thread was interrupted after executing a nop'ed out
> mflr. In this case, the thread would execute the now-patched-in branch
> to _mcount() without having executed the preceding mflr.
> 
> To solve this, we now enable function tracing in 2 steps: patch in the
> mflr instruction, use synchronize_rcu_tasks() to ensure all existing
> threads make progress, and then patch in the branch to _mcount(). We
> override ftrace_replace_code() with a powerpc64 variant for this
> purpose.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>

Nice! Thanks for doing a real patch. You needn't add my SOB there: my
hack was obviously garbage :) Suggested-by if anything, then for
clarity of changelog you can write the motivation directly rather than
quote me.

I don't know the ftrace subsystem well, but the powerpc instructions
and patching sequence appear to match what we agreed is the right way
to go.

As a suggestion, I would perhaps add most of the information from the
second and third paragraphs of the changelog into comments
(and also explain that the lone mflr r0 is harmless).

But otherwise it looks good

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

> ---
>  arch/powerpc/kernel/trace/ftrace.c | 188 +++++++++++++++++++++++++----
>  1 file changed, 166 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
> index 517662a56bdc..5c3523c3b259 100644
> --- a/arch/powerpc/kernel/trace/ftrace.c
> +++ b/arch/powerpc/kernel/trace/ftrace.c
> @@ -125,7 +125,7 @@ __ftrace_make_nop(struct module *mod,
>  {
>  	unsigned long entry, ptr, tramp;
>  	unsigned long ip = rec->ip;
> -	unsigned int op, pop;
> +	unsigned int op;
>  
>  	/* read where this goes */
>  	if (probe_kernel_read(&op, (void *)ip, sizeof(int))) {
> @@ -160,8 +160,6 @@ __ftrace_make_nop(struct module *mod,
>  
>  #ifdef CONFIG_MPROFILE_KERNEL
>  	/* When using -mkernel_profile there is no load to jump over */
> -	pop = PPC_INST_NOP;
> -
>  	if (probe_kernel_read(&op, (void *)(ip - 4), 4)) {
>  		pr_err("Fetching instruction at %lx failed.\n", ip - 4);
>  		return -EFAULT;
> @@ -169,26 +167,22 @@ __ftrace_make_nop(struct module *mod,
>  
>  	/* We expect either a mflr r0, or a std r0, LRSAVE(r1) */
>  	if (op != PPC_INST_MFLR && op != PPC_INST_STD_LR) {
> -		pr_err("Unexpected instruction %08x around bl _mcount\n", op);
> +		pr_err("Unexpected instruction %08x before bl _mcount\n", op);
>  		return -EINVAL;
>  	}
> -#else
> -	/*
> -	 * Our original call site looks like:
> -	 *
> -	 * bl <tramp>
> -	 * ld r2,XX(r1)
> -	 *
> -	 * Milton Miller pointed out that we can not simply nop the branch.
> -	 * If a task was preempted when calling a trace function, the nops
> -	 * will remove the way to restore the TOC in r2 and the r2 TOC will
> -	 * get corrupted.
> -	 *
> -	 * Use a b +8 to jump over the load.
> -	 */
>  
> -	pop = PPC_INST_BRANCH | 8;	/* b +8 */
> +	/* We should patch out the bl to _mcount first */
> +	if (patch_instruction((unsigned int *)ip, PPC_INST_NOP)) {
> +		pr_err("Patching NOP failed.\n");
> +		return -EPERM;
> +	}
>  
> +	if (op == PPC_INST_MFLR &&
> +		patch_instruction((unsigned int *)(ip - 4), PPC_INST_NOP)) {
> +		pr_err("Patching NOP failed.\n");
> +		return -EPERM;
> +	}
> +#else
>  	/*
>  	 * Check what is in the next instruction. We can see ld r2,40(r1), but
>  	 * on first pass after boot we will see mflr r0.
> @@ -202,12 +196,25 @@ __ftrace_make_nop(struct module *mod,
>  		pr_err("Expected %08x found %08x\n", PPC_INST_LD_TOC, op);
>  		return -EINVAL;
>  	}
> -#endif /* CONFIG_MPROFILE_KERNEL */
>  
> -	if (patch_instruction((unsigned int *)ip, pop)) {
> +	/*
> +	 * Our original call site looks like:
> +	 *
> +	 * bl <tramp>
> +	 * ld r2,XX(r1)
> +	 *
> +	 * Milton Miller pointed out that we can not simply nop the branch.
> +	 * If a task was preempted when calling a trace function, the nops
> +	 * will remove the way to restore the TOC in r2 and the r2 TOC will
> +	 * get corrupted.
> +	 *
> +	 * Use a b +8 to jump over the load.
> +	 */
> +	if (patch_instruction((unsigned int *)ip, PPC_INST_BRANCH | 8)) {
>  		pr_err("Patching NOP failed.\n");
>  		return -EPERM;
>  	}
> +#endif /* CONFIG_MPROFILE_KERNEL */
>  
>  	return 0;
>  }
> @@ -421,6 +428,25 @@ static int __ftrace_make_nop_kernel(struct dyn_ftrace *rec, unsigned long addr)
>  		return -EPERM;
>  	}
>  
> +#ifdef CONFIG_MPROFILE_KERNEL
> +	if (probe_kernel_read(&op, (void *)(ip - 4), 4)) {
> +		pr_err("Fetching instruction at %lx failed.\n", ip - 4);
> +		return -EFAULT;
> +	}
> +
> +	/* We expect either a mflr r0, or a std r0, LRSAVE(r1) */
> +	if (op != PPC_INST_MFLR && op != PPC_INST_STD_LR) {
> +		pr_err("Unexpected instruction %08x before bl _mcount\n", op);
> +		return -EINVAL;
> +	}
> +
> +	if (op == PPC_INST_MFLR &&
> +		patch_instruction((unsigned int *)(ip - 4), PPC_INST_NOP)) {
> +		pr_err("Patching NOP failed.\n");
> +		return -EPERM;
> +	}
> +#endif
> +
>  	return 0;
>  }
>  
> @@ -429,6 +455,7 @@ int ftrace_make_nop(struct module *mod,
>  {
>  	unsigned long ip = rec->ip;
>  	unsigned int old, new;
> +	int rc;
>  
>  	/*
>  	 * If the calling address is more that 24 bits away,
> @@ -439,7 +466,27 @@ int ftrace_make_nop(struct module *mod,
>  		/* within range */
>  		old = ftrace_call_replace(ip, addr, 1);
>  		new = PPC_INST_NOP;
> -		return ftrace_modify_code(ip, old, new);
> +		rc = ftrace_modify_code(ip, old, new);
> +#ifdef CONFIG_MPROFILE_KERNEL
> +		if (rc)
> +			return rc;
> +
> +		if (probe_kernel_read(&old, (void *)(ip - 4), 4)) {
> +			pr_err("Fetching instruction at %lx failed.\n", ip - 4);
> +			return -EFAULT;
> +		}
> +
> +		/* We expect either a mflr r0, or a std r0, LRSAVE(r1) */
> +		if (old != PPC_INST_MFLR && old != PPC_INST_STD_LR) {
> +			pr_err("Unexpected instruction %08x before bl _mcount\n", old);
> +			return -EINVAL;
> +		}
> +
> +		if (old == PPC_INST_MFLR)
> +			rc = patch_instruction((unsigned int *)(ip - 4),
> +					PPC_INST_NOP);
> +#endif
> +		return rc;
>  	} else if (core_kernel_text(ip))
>  		return __ftrace_make_nop_kernel(rec, addr);
>  
> @@ -863,6 +910,103 @@ void arch_ftrace_update_code(int command)
>  	ftrace_modify_all_code(command);
>  }
>  
> +#ifdef CONFIG_MPROFILE_KERNEL
> +static int
> +__ftrace_make_call_prep(struct dyn_ftrace *rec)
> +{
> +	void *ip = (void *)rec->ip - MCOUNT_INSN_SIZE;
> +	unsigned int op[2], pop;
> +
> +	/* read where this goes */
> +	if (probe_kernel_read(op, ip, sizeof(op)))
> +		return -EFAULT;
> +
> +	if (op[1] != PPC_INST_NOP) {
> +		pr_err("Unexpected call sequence at %p: %x %x\n",
> +							ip, op[0], op[1]);
> +		return -EINVAL;
> +	}
> +
> +	/*
> +	 * nothing to do if this is using the older -mprofile-kernel
> +	 * instruction sequence
> +	 */
> +	if (op[0] != PPC_INST_NOP)
> +		return 0;
> +
> +	pop = PPC_INST_MFLR;
> +
> +	if (patch_instruction((unsigned int *)ip, pop)) {
> +		pr_err("Patching MFLR failed.\n");
> +		return -EPERM;
> +	}
> +
> +	return 0;
> +}
> +
> +void ftrace_replace_code(int mod_flags)
> +{
> +	int enable = mod_flags & FTRACE_MODIFY_ENABLE_FL;
> +	int schedulable = mod_flags & FTRACE_MODIFY_MAY_SLEEP_FL;
> +	int ret, failed, make_call = 0;
> +	struct ftrace_rec_iter *iter;
> +	struct dyn_ftrace *rec;
> +
> +	if (unlikely(!ftrace_enabled))
> +		return;
> +
> +	for_ftrace_rec_iter(iter) {
> +		rec = ftrace_rec_iter_record(iter);
> +
> +		if (rec->flags & FTRACE_FL_DISABLED)
> +			continue;
> +
> +		ret = ftrace_test_record(rec, enable);
> +		if (ret == FTRACE_UPDATE_MAKE_CALL) {
> +			make_call++;
> +			failed = __ftrace_make_call_prep(rec);
> +		} else {
> +			failed = ftrace_do_replace_code(rec, enable);
> +		}
> +
> +		if (failed) {
> +			ftrace_bug(failed, rec);
> +			/* Stop processing */
> +			return;
> +		}
> +
> +		if (schedulable)
> +			cond_resched();
> +	}
> +
> +	if (!make_call)
> +		return;
> +
> +	synchronize_rcu_tasks();
> +
> +	for_ftrace_rec_iter(iter) {
> +		rec = ftrace_rec_iter_record(iter);
> +
> +		if (rec->flags & FTRACE_FL_DISABLED)
> +			continue;
> +
> +		ret = ftrace_test_record(rec, enable);
> +		if (ret == FTRACE_UPDATE_MAKE_CALL)
> +			failed = ftrace_do_replace_code(rec, enable);
> +
> +		if (failed) {
> +			ftrace_bug(failed, rec);
> +			/* Stop processing */
> +			return;
> +		}
> +
> +		if (schedulable)
> +			cond_resched();
> +	}
> +
> +}
> +#endif
> +
>  #ifdef CONFIG_PPC64
>  #define PACATOC offsetof(struct paca_struct, kernel_toc)
>  
> -- 
> 2.21.0
> 
> 

