LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v2] powerpc: Fix compile issue with force DAWR
From: Michael Neuling @ 2019-05-14  4:47 UTC (permalink / raw)
  To: Christophe Leroy, mpe; +Cc: linuxppc-dev
In-Reply-To: <f1015de7-faf1-ae6d-d1f9-9c904f19c58b@c-s.fr>

On Mon, 2019-05-13 at 11:08 +0200, Christophe Leroy wrote:
> 
> Le 13/05/2019 à 09:17, Michael Neuling a écrit :
> > If you compile with KVM but without CONFIG_HAVE_HW_BREAKPOINT you fail
> > at linking with:
> >    arch/powerpc/kvm/book3s_hv_rmhandlers.o:(.text+0x708): undefined
> > reference to `dawr_force_enable'
> > 
> > This was caused by commit c1fe190c0672 ("powerpc: Add force enable of
> > DAWR on P9 option").
> > 
> > This puts more of the dawr infrastructure in a new file.
> 
> I think you are doing a bit more than that. I think you should explain 
> that you define a new CONFIG_ option, when it is selected, etc ...
> 
> The commit you are referring to is talking about P9. It looks like your 
> patch covers a lot more, so it should be mentionned her I guess.

Not really. It looks like I'm doing a lot but I'm really just moving code around
to deal with the ugliness of a bunch of config options and dependencies. 

> > Signed-off-by: Michael Neuling <mikey@neuling.org>
> 
> You should add a Fixes: tag, ie:
> 
> Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option")

Ok

> 
> > --
> > v2:
> >    Fixes based on Christophe Leroy's comments:
> >    - Fix commit message formatting
> >    - Move more DAWR code into dawr.c
> > ---
> >   arch/powerpc/Kconfig                     |  5 ++
> >   arch/powerpc/include/asm/hw_breakpoint.h | 20 ++++---
> >   arch/powerpc/kernel/Makefile             |  1 +
> >   arch/powerpc/kernel/dawr.c               | 75 ++++++++++++++++++++++++
> >   arch/powerpc/kernel/hw_breakpoint.c      | 56 ------------------
> >   arch/powerpc/kvm/Kconfig                 |  1 +
> >   6 files changed, 94 insertions(+), 64 deletions(-)
> >   create mode 100644 arch/powerpc/kernel/dawr.c
> > 
> > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> > index 2711aac246..f4b19e48cc 100644
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -242,6 +242,7 @@ config PPC
> >   	select SYSCTL_EXCEPTION_TRACE
> >   	select THREAD_INFO_IN_TASK
> >   	select VIRT_TO_BUS			if !PPC64
> > +	select PPC_DAWR_FORCE_ENABLE		if PPC64 || PERF
> 
> What's PERF ? Did you mean PERF_EVENTS ?
> 
> Then what you mean is:
> - Selected all the time for PPC64
> - Selected for PPC32 when PERF is also selected.
> 
> Is that what you want ? At first I thought it was linked to P9.

This is wrong.  I think we just want PPC64. PERF is a red herring.

> And ... did you read below statement ?

Clearly not :-)

> 
> >   	#
> >   	# Please keep this list sorted alphabetically.
> >   	#
> > @@ -369,6 +370,10 @@ config PPC_ADV_DEBUG_DAC_RANGE
> >   	depends on PPC_ADV_DEBUG_REGS && 44x
> >   	default y
> >   
> > +config PPC_DAWR_FORCE_ENABLE
> > +	bool
> > +	default y
> > +
> 
> Why defaulting it to y. Then how is it set to n ?

Good point.

> 
> >   config ZONE_DMA
> >   	bool
> >   	default y if PPC_BOOK3E_64
> > diff --git a/arch/powerpc/include/asm/hw_breakpoint.h
> > b/arch/powerpc/include/asm/hw_breakpoint.h
> > index 0fe8c1e46b..ffbc8eab41 100644
> > --- a/arch/powerpc/include/asm/hw_breakpoint.h
> > +++ b/arch/powerpc/include/asm/hw_breakpoint.h
> > @@ -47,6 +47,8 @@ struct arch_hw_breakpoint {
> >   #define HW_BRK_TYPE_PRIV_ALL	(HW_BRK_TYPE_USER | HW_BRK_TYPE_KERNEL |
> > \
> >   				 HW_BRK_TYPE_HYP)
> >   
> > +extern int set_dawr(struct arch_hw_breakpoint *brk);
> > +
> >   #ifdef CONFIG_HAVE_HW_BREAKPOINT
> >   #include <linux/kdebug.h>
> >   #include <asm/reg.h>
> > @@ -90,18 +92,20 @@ static inline void hw_breakpoint_disable(void)
> >   extern void thread_change_pc(struct task_struct *tsk, struct pt_regs
> > *regs);
> >   int hw_breakpoint_handler(struct die_args *args);
> >   
> > -extern int set_dawr(struct arch_hw_breakpoint *brk);
> > -extern bool dawr_force_enable;
> > -static inline bool dawr_enabled(void)
> > -{
> > -	return dawr_force_enable;
> > -}
> > -
> 
> That's a very simple function, why not keep it here (or in another .h) 
> as 'static inline' ?

Sure.

> >   #else	/* CONFIG_HAVE_HW_BREAKPOINT */
> >   static inline void hw_breakpoint_disable(void) { }
> >   static inline void thread_change_pc(struct task_struct *tsk,
> >   					struct pt_regs *regs) { }
> > -static inline bool dawr_enabled(void) { return false; }
> > +
> >   #endif	/* CONFIG_HAVE_HW_BREAKPOINT */
> > +
> > +extern bool dawr_force_enable;
> > +
> > +#ifdef CONFIG_PPC_DAWR_FORCE_ENABLE
> > +extern bool dawr_enabled(void);
> 
> Functions should not be 'extern'. I'm sure checkpatch --strict will tell 
> you.

That's not what's currently being done in this header file.  I'm keeping with
the style of that file.

> > +#else
> > +#define dawr_enabled() true
> 
> true by default ?
> Previously it was false by default.

Thanks, yeah that's wrong but I need to rethink the config option to make it
CONFIG_PPC_DAWR. 

This patch is far more difficult than it should be due to the mess that
CONFIG_HAVE_HW_BREAKPOINT and CONFIG_PPC_ADV_DEBUG_REGS creates in ptrace.c and
process.c. We really need to fix up 
https://github.com/linuxppc/issues/issues/128

> And why a #define ? That's better to keep a static inline.

Changed.

> 
> > +#endif
> > +
> >   #endif	/* __KERNEL__ */
> >   #endif	/* _PPC_BOOK3S_64_HW_BREAKPOINT_H */
> > diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> > index 0ea6c4aa3a..a9c497c34f 100644
> > --- a/arch/powerpc/kernel/Makefile
> > +++ b/arch/powerpc/kernel/Makefile
> > @@ -56,6 +56,7 @@ obj-$(CONFIG_PPC64)		+= setup_64.o
> > sys_ppc32.o \
> >   obj-$(CONFIG_VDSO32)		+= vdso32/
> >   obj-$(CONFIG_PPC_WATCHDOG)	+= watchdog.o
> >   obj-$(CONFIG_HAVE_HW_BREAKPOINT)	+= hw_breakpoint.o
> > +obj-$(CONFIG_PPC_DAWR_FORCE_ENABLE)	+= dawr.o
> >   obj-$(CONFIG_PPC_BOOK3S_64)	+= cpu_setup_ppc970.o cpu_setup_pa6t.o
> >   obj-$(CONFIG_PPC_BOOK3S_64)	+= cpu_setup_power.o
> >   obj-$(CONFIG_PPC_BOOK3S_64)	+= mce.o mce_power.o
> > diff --git a/arch/powerpc/kernel/dawr.c b/arch/powerpc/kernel/dawr.c
> > new file mode 100644
> > index 0000000000..cf1d02fe1e
> > --- /dev/null
> > +++ b/arch/powerpc/kernel/dawr.c
> > @@ -0,0 +1,75 @@
> > +// SPDX-License-Identifier: GPL-2.0+
> > +//
> > +// DAWR infrastructure
> > +//
> > +// Copyright 2019, Michael Neuling, IBM Corporation.
> > +
> > +#include <linux/types.h>
> > +#include <linux/export.h>
> > +#include <linux/fs.h>
> > +#include <linux/debugfs.h>
> > +#include <asm/debugfs.h>
> > +#include <asm/machdep.h>
> > +#include <asm/hvcall.h>
> > +
> > +bool dawr_force_enable;
> > +EXPORT_SYMBOL_GPL(dawr_force_enable);
> > +
> > +extern bool dawr_enabled(void)
> 
> extern ????

oops
> 
> > +{
> > +	return dawr_force_enable;
> > +}
> > +EXPORT_SYMBOL_GPL(dawr_enabled);
> 
> Since dawr_force_enable is also exported, I see no point in having such 
> a tiny function as an exported function, was better as a 'static inline'.

Yep, changed to static inline.

> > +
> > +static ssize_t dawr_write_file_bool(struct file *file,
> > +				    const char __user *user_buf,
> > +				    size_t count, loff_t *ppos)
> > +{
> > +	struct arch_hw_breakpoint null_brk = {0, 0, 0};
> > +	size_t rc;
> > +
> > +	/* Send error to user if they hypervisor won't allow us to write DAWR */
> > +	if ((!dawr_force_enable) &&
> > +	    (firmware_has_feature(FW_FEATURE_LPAR)) &&
> > +	    (set_dawr(&null_brk) != H_SUCCESS))
> 
> The above is not real clear.
> set_dabr() returns 0, H_SUCCESS is not used there.

It pseries_set_dawr() will return a hcall number.

This code hasn't changed. I'm just moving it.

> 
> > +		return -1;
> > +
> > +	rc = debugfs_write_file_bool(file, user_buf, count, ppos);
> > +	if (rc)
> > +		return rc;
> > +
> > +	/* If we are clearing, make sure all CPUs have the DAWR cleared */
> > +	if (!dawr_force_enable)
> > +		smp_call_function((smp_call_func_t)set_dawr, &null_brk, 0);
> > +
> > +	return rc;
> > +}
> > +
> > +static const struct file_operations dawr_enable_fops = {
> > +	.read =		debugfs_read_file_bool,
> > +	.write =	dawr_write_file_bool,
> > +	.open =		simple_open,
> > +	.llseek =	default_llseek,
> > +};
> > +
> > +static int __init dawr_force_setup(void)
> > +{
> > +	dawr_force_enable = false;
> 
> The above is not needed, the BSS is zeroised at kernel startup.
> 
> > +
> > +	if (cpu_has_feature(CPU_FTR_DAWR)) {
> > +		/* Don't setup sysfs file for user control on P8 */
> > +		dawr_force_enable = true;
> 
> Strange comment, word "don't" doesn't really fit with a 'true'
> 
> > +		return 0;
> > +	}
> > +
> > +	if (PVR_VER(mfspr(SPRN_PVR)) == PVR_POWER9) {
> 
> You could use pvr_version_is(PVR_POWER9) instead of open codiing.

All this code hasn't changed. I'm just moving it.

Feel free to clean it up but lets fix a real problem here.

> 
> > +		/* Turn DAWR off by default, but allow admin to turn it on */
> > +		dawr_force_enable = false;
> > +		debugfs_create_file_unsafe("dawr_enable_dangerous", 0600,
> > +					   powerpc_debugfs_root,
> > +					   &dawr_force_enable,
> > +					   &dawr_enable_fops);
> > +	}
> > +	return 0;
> > +}
> > +arch_initcall(dawr_force_setup);
> 
> Wouldn't it also make sense to move set_dawr() from process.c to here ?

Yep, done.

> 
> > diff --git a/arch/powerpc/kernel/hw_breakpoint.c
> > b/arch/powerpc/kernel/hw_breakpoint.c
> > index da307dd93e..95605a9c9a 100644
> > --- a/arch/powerpc/kernel/hw_breakpoint.c
> > +++ b/arch/powerpc/kernel/hw_breakpoint.c
> > @@ -380,59 +380,3 @@ void hw_breakpoint_pmu_read(struct perf_event *bp)
> >   {
> >   	/* TODO */
> >   }
> > -
> > -bool dawr_force_enable;
> > -EXPORT_SYMBOL_GPL(dawr_force_enable);
> > -
> > -static ssize_t dawr_write_file_bool(struct file *file,
> > -				    const char __user *user_buf,
> > -				    size_t count, loff_t *ppos)
> > -{
> > -	struct arch_hw_breakpoint null_brk = {0, 0, 0};
> > -	size_t rc;
> > -
> > -	/* Send error to user if they hypervisor won't allow us to write DAWR */
> > -	if ((!dawr_force_enable) &&
> > -	    (firmware_has_feature(FW_FEATURE_LPAR)) &&
> > -	    (set_dawr(&null_brk) != H_SUCCESS))
> > -		return -1;
> > -
> > -	rc = debugfs_write_file_bool(file, user_buf, count, ppos);
> > -	if (rc)
> > -		return rc;
> > -
> > -	/* If we are clearing, make sure all CPUs have the DAWR cleared */
> > -	if (!dawr_force_enable)
> > -		smp_call_function((smp_call_func_t)set_dawr, &null_brk, 0);
> > -
> > -	return rc;
> > -}
> > -
> > -static const struct file_operations dawr_enable_fops = {
> > -	.read =		debugfs_read_file_bool,
> > -	.write =	dawr_write_file_bool,
> > -	.open =		simple_open,
> > -	.llseek =	default_llseek,
> > -};
> > -
> > -static int __init dawr_force_setup(void)
> > -{
> > -	dawr_force_enable = false;
> > -
> > -	if (cpu_has_feature(CPU_FTR_DAWR)) {
> > -		/* Don't setup sysfs file for user control on P8 */
> > -		dawr_force_enable = true;
> > -		return 0;
> > -	}
> > -
> > -	if (PVR_VER(mfspr(SPRN_PVR)) == PVR_POWER9) {
> > -		/* Turn DAWR off by default, but allow admin to turn it on */
> > -		dawr_force_enable = false;
> > -		debugfs_create_file_unsafe("dawr_enable_dangerous", 0600,
> > -					   powerpc_debugfs_root,
> > -					   &dawr_force_enable,
> > -					   &dawr_enable_fops);
> > -	}
> > -	return 0;
> > -}
> > -arch_initcall(dawr_force_setup);
> > diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> > index bfdde04e49..9c0d315108 100644
> > --- a/arch/powerpc/kvm/Kconfig
> > +++ b/arch/powerpc/kvm/Kconfig
> > @@ -39,6 +39,7 @@ config KVM_BOOK3S_32_HANDLER
> >   config KVM_BOOK3S_64_HANDLER
> >   	bool
> >   	select KVM_BOOK3S_HANDLER
> > +	select PPC_DAWR_FORCE_ENABLE
> >   
> >   config KVM_BOOK3S_PR_POSSIBLE
> >   	bool
> > 
> 
> Christophe
> 


^ permalink raw reply

* Re: [PATCH] mm/nvdimm: Use correct #defines instead of opencoding
From: Aneesh Kumar K.V @ 2019-05-14  4:46 UTC (permalink / raw)
  To: Dan Williams; +Cc: Linux MM, linuxppc-dev, linux-nvdimm
In-Reply-To: <CAPcyv4gOr8SFbdtBbWhMOU-wdYuMCQ4Jn2SznGRsv6Vku97Xnw@mail.gmail.com>

On 5/14/19 9:42 AM, Dan Williams wrote:
> On Mon, May 13, 2019 at 9:05 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> On 5/14/19 9:28 AM, Dan Williams wrote:
>>> On Mon, May 13, 2019 at 7:56 PM Aneesh Kumar K.V
>>> <aneesh.kumar@linux.ibm.com> wrote:
>>>>
>>>> The nfpn related change is needed to fix the kernel message
>>>>
>>>> "number of pfns truncated from 2617344 to 163584"
>>>>
>>>> The change makes sure the nfpns stored in the superblock is right value.
>>>>
>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>> ---
>>>>    drivers/nvdimm/pfn_devs.c    | 6 +++---
>>>>    drivers/nvdimm/region_devs.c | 8 ++++----
>>>>    2 files changed, 7 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
>>>> index 347cab166376..6751ff0296ef 100644
>>>> --- a/drivers/nvdimm/pfn_devs.c
>>>> +++ b/drivers/nvdimm/pfn_devs.c
>>>> @@ -777,8 +777,8 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>>>>                    * when populating the vmemmap. This *should* be equal to
>>>>                    * PMD_SIZE for most architectures.
>>>>                    */
>>>> -               offset = ALIGN(start + reserve + 64 * npfns,
>>>> -                               max(nd_pfn->align, PMD_SIZE)) - start;
>>>> +               offset = ALIGN(start + reserve + sizeof(struct page) * npfns,
>>>> +                              max(nd_pfn->align, PMD_SIZE)) - start;
>>>
>>> No, I think we need to record the page-size into the superblock format
>>> otherwise this breaks in debug builds where the struct-page size is
>>> extended.
>>>
>>>>           } else if (nd_pfn->mode == PFN_MODE_RAM)
>>>>                   offset = ALIGN(start + reserve, nd_pfn->align) - start;
>>>>           else
>>>> @@ -790,7 +790,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>>>>                   return -ENXIO;
>>>>           }
>>>>
>>>> -       npfns = (size - offset - start_pad - end_trunc) / SZ_4K;
>>>> +       npfns = (size - offset - start_pad - end_trunc) / PAGE_SIZE;
>>>
>>> Similar comment, if the page size is variable then the superblock
>>> needs to explicitly account for it.
>>>
>>
>> PAGE_SIZE is not really variable. What we can run into is the issue you
>> mentioned above. The size of struct page can change which means the
>> reserved space for keeping vmemmap in device may not be sufficient for
>> certain kernel builds.
>>
>> I was planning to add another patch that fails namespace init if we
>> don't have enough space to keep the struct page.
>>
>> Why do you suggest we need to have PAGE_SIZE as part of pfn superblock?
> 
> So that the kernel has a chance to identify cases where the superblock
> it is handling was created on a system with different PAGE_SIZE
> assumptions.
> 

The reason to do that is we don't have enough space to keep struct page 
backing the total number of pfns? If so, what i suggested above should 
handle that.

or are you finding any other reason why we should fail a namespace init 
with a different PAGE_SIZE value?

My another patch handle the details w.r.t devdax alignment for which 
devdax got created with PAGE_SIZE 4K but we are now trying to load that 
in a kernel with PAGE_SIZE 64k.

-aneesh


^ permalink raw reply

* [Bug 203597] kernel 4.9.175 fails to boot on a PowerMac G4 3,6 at early stage
From: bugzilla-daemon @ 2019-05-14  4:43 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <bug-203597-206035@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=203597

Christophe Leroy (christophe.leroy@c-s.fr) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |christophe.leroy@c-s.fr

--- Comment #2 from Christophe Leroy (christophe.leroy@c-s.fr) ---
You are missing following commit:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b45ba4a51cd

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply

* Re: [PATCH] mm/nvdimm: Use correct alignment when looking at first pfn from a region
From: Aneesh Kumar K.V @ 2019-05-14  4:41 UTC (permalink / raw)
  To: Dan Williams; +Cc: Linux MM, linuxppc-dev, linux-nvdimm
In-Reply-To: <CAPcyv4hgNUDxjgYNkxOXJ9hfLb6z2+E1yasNoZNDKFUxkCzWLA@mail.gmail.com>

On 5/14/19 9:59 AM, Dan Williams wrote:
> On Mon, May 13, 2019 at 7:55 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> We already add the start_pad to the resource->start but fails to section
>> align the start. This make sure with altmap we compute the right first
>> pfn when start_pad is zero and we are doing an align down of start address.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   kernel/memremap.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/memremap.c b/kernel/memremap.c
>> index a856cb5ff192..23d77b60e728 100644
>> --- a/kernel/memremap.c
>> +++ b/kernel/memremap.c
>> @@ -59,9 +59,9 @@ static unsigned long pfn_first(struct dev_pagemap *pgmap)
>>   {
>>          const struct resource *res = &pgmap->res;
>>          struct vmem_altmap *altmap = &pgmap->altmap;
>> -       unsigned long pfn;
>> +       unsigned long pfn = PHYS_PFN(res->start);
>>
>> -       pfn = res->start >> PAGE_SHIFT;
>> +       pfn = SECTION_ALIGN_DOWN(pfn);
> 
> This does not seem right to me it breaks the assumptions of where the
> first expected valid pfn occurs in the passed in range.
> 

How do we define the first valid pfn? Isn't that at pfn_sb->dataoff ?

-aneesh


^ permalink raw reply

* Re: [RFC PATCH] mm/nvdimm: Fix kernel crash on devm_mremap_pages_release
From: Aneesh Kumar K.V @ 2019-05-14  4:40 UTC (permalink / raw)
  To: Dan Williams; +Cc: Keith Busch, Linux MM, linuxppc-dev, linux-nvdimm
In-Reply-To: <CAPcyv4hsTvyRnLGr3y4JB6zPzdxb7WGQgaWs=5vRqf=L1DYynQ@mail.gmail.com>

On 5/14/19 9:45 AM, Dan Williams wrote:
> [ add Keith who was looking at something similar ]
> 
> On Mon, May 13, 2019 at 7:54 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> When we initialize the namespace, if we support altmap, we don't initialize all the
>> backing struct page where as while releasing the namespace we look at some of
>> these uninitilized struct page. This results in a kernel crash as below.
>>
>> kernel BUG at include/linux/mm.h:1034!
>> cpu 0x2: Vector: 700 (Program Check) at [c00000024146b870]
>>      pc: c0000000003788f8: devm_memremap_pages_release+0x258/0x3a0
>>      lr: c0000000003788f4: devm_memremap_pages_release+0x254/0x3a0
>>      sp: c00000024146bb00
>>     msr: 800000000282b033
>>    current = 0xc000000241382f00
>>    paca    = 0xc00000003fffd680   irqmask: 0x03   irq_happened: 0x01
>>      pid   = 4114, comm = ndctl
>>   c0000000009bf8c0 devm_action_release+0x30/0x50
>>   c0000000009c0938 release_nodes+0x268/0x2d0
>>   c0000000009b95b4 device_release_driver_internal+0x164/0x230
>>   c0000000009b638c unbind_store+0x13c/0x190
>>   c0000000009b4f44 drv_attr_store+0x44/0x60
>>   c00000000058ccc0 sysfs_kf_write+0x70/0xa0
>>   c00000000058b52c kernfs_fop_write+0x1ac/0x290
>>   c0000000004a415c __vfs_write+0x3c/0x70
>>   c0000000004a85ac vfs_write+0xec/0x200
>>   c0000000004a8920 ksys_write+0x80/0x130
>>   c00000000000bee4 system_call+0x5c/0x70
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   mm/page_alloc.c | 5 +----
>>   1 file changed, 1 insertion(+), 4 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 59661106da16..892eabe1ec13 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5740,8 +5740,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>
>>   #ifdef CONFIG_ZONE_DEVICE
>>          /*
>> -        * Honor reservation requested by the driver for this ZONE_DEVICE
>> -        * memory. We limit the total number of pages to initialize to just
>> +        * We limit the total number of pages to initialize to just
>>           * those that might contain the memory mapping. We will defer the
>>           * ZONE_DEVICE page initialization until after we have released
>>           * the hotplug lock.
>> @@ -5750,8 +5749,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>                  if (!altmap)
>>                          return;
>>
>> -               if (start_pfn == altmap->base_pfn)
>> -                       start_pfn += altmap->reserve;
> 
> If it's reserved then we should not be accessing, even if the above
> works in practice. Isn't the fix something more like this to fix up
> the assumptions at release time?
> 
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index a856cb5ff192..9074ba14572c 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -90,6 +90,7 @@ static void devm_memremap_pages_release(void *data)
>    struct device *dev = pgmap->dev;
>    struct resource *res = &pgmap->res;
>    resource_size_t align_start, align_size;
> + struct vmem_altmap *altmap = pgmap->altmap_valid ? &pgmap->altmap : NULL;
>    unsigned long pfn;
>    int nid;
> 
> @@ -102,7 +103,10 @@ static void devm_memremap_pages_release(void *data)
>    align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
>    - align_start;
> 
> - nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT));
> + pfn = align_start >> PAGE_SHIFT;
> + if (altmap)
> + pfn += vmem_altmap_offset(altmap);
> + nid = page_to_nid(pfn_to_page(pfn));
> 
>    mem_hotplug_begin();
>    if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> @@ -110,8 +114,7 @@ static void devm_memremap_pages_release(void *data)
>    __remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>    align_size >> PAGE_SHIFT, NULL);
>    } else {
> - arch_remove_memory(nid, align_start, align_size,
> - pgmap->altmap_valid ? &pgmap->altmap : NULL);
> + arch_remove_memory(nid, align_start, align_size, altmap);
>    kasan_remove_zero_shadow(__va(align_start), align_size);
>    }
>    mem_hotplug_done();
> 
I did try that first. I was not sure about that. From the memory add vs 
remove perspective.

devm_memremap_pages:

align_start = res->start & ~(SECTION_SIZE - 1);
align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
		- align_start;
align_end = align_start + align_size - 1;

error = arch_add_memory(nid, align_start, align_size, altmap,
				false);


devm_memremap_pages_release:

/* pages are dead and unused, undo the arch mapping */
align_start = res->start & ~(SECTION_SIZE - 1);
align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
		- align_start;

arch_remove_memory(nid, align_start, align_size,
		pgmap->altmap_valid ? &pgmap->altmap : NULL);


Now if we are fixing the memremap_pages_release, shouldn't we adjust 
alig_start w.r.t memremap_pages too? and I was not sure what that means 
w.r.t add/remove alignment requirements.

What is the intended usage of reserve area? I guess we want that part to 
be added? if so shouldn't we remove them?


-aneesh


^ permalink raw reply

* Re: [PATCH] mm/nvdimm: Use correct alignment when looking at first pfn from a region
From: Dan Williams @ 2019-05-14  4:29 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Linux MM, linuxppc-dev, linux-nvdimm
In-Reply-To: <20190514025512.9670-1-aneesh.kumar@linux.ibm.com>

On Mon, May 13, 2019 at 7:55 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> We already add the start_pad to the resource->start but fails to section
> align the start. This make sure with altmap we compute the right first
> pfn when start_pad is zero and we are doing an align down of start address.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  kernel/memremap.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index a856cb5ff192..23d77b60e728 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -59,9 +59,9 @@ static unsigned long pfn_first(struct dev_pagemap *pgmap)
>  {
>         const struct resource *res = &pgmap->res;
>         struct vmem_altmap *altmap = &pgmap->altmap;
> -       unsigned long pfn;
> +       unsigned long pfn = PHYS_PFN(res->start);
>
> -       pfn = res->start >> PAGE_SHIFT;
> +       pfn = SECTION_ALIGN_DOWN(pfn);

This does not seem right to me it breaks the assumptions of where the
first expected valid pfn occurs in the passed in range.

^ permalink raw reply

* Re: [RFC PATCH] mm/nvdimm: Fix kernel crash on devm_mremap_pages_release
From: Dan Williams @ 2019-05-14  4:15 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Keith Busch, Linux MM, linuxppc-dev, linux-nvdimm
In-Reply-To: <20190514025354.9108-1-aneesh.kumar@linux.ibm.com>

[ add Keith who was looking at something similar ]

On Mon, May 13, 2019 at 7:54 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> When we initialize the namespace, if we support altmap, we don't initialize all the
> backing struct page where as while releasing the namespace we look at some of
> these uninitilized struct page. This results in a kernel crash as below.
>
> kernel BUG at include/linux/mm.h:1034!
> cpu 0x2: Vector: 700 (Program Check) at [c00000024146b870]
>     pc: c0000000003788f8: devm_memremap_pages_release+0x258/0x3a0
>     lr: c0000000003788f4: devm_memremap_pages_release+0x254/0x3a0
>     sp: c00000024146bb00
>    msr: 800000000282b033
>   current = 0xc000000241382f00
>   paca    = 0xc00000003fffd680   irqmask: 0x03   irq_happened: 0x01
>     pid   = 4114, comm = ndctl
>  c0000000009bf8c0 devm_action_release+0x30/0x50
>  c0000000009c0938 release_nodes+0x268/0x2d0
>  c0000000009b95b4 device_release_driver_internal+0x164/0x230
>  c0000000009b638c unbind_store+0x13c/0x190
>  c0000000009b4f44 drv_attr_store+0x44/0x60
>  c00000000058ccc0 sysfs_kf_write+0x70/0xa0
>  c00000000058b52c kernfs_fop_write+0x1ac/0x290
>  c0000000004a415c __vfs_write+0x3c/0x70
>  c0000000004a85ac vfs_write+0xec/0x200
>  c0000000004a8920 ksys_write+0x80/0x130
>  c00000000000bee4 system_call+0x5c/0x70
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  mm/page_alloc.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 59661106da16..892eabe1ec13 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5740,8 +5740,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>
>  #ifdef CONFIG_ZONE_DEVICE
>         /*
> -        * Honor reservation requested by the driver for this ZONE_DEVICE
> -        * memory. We limit the total number of pages to initialize to just
> +        * We limit the total number of pages to initialize to just
>          * those that might contain the memory mapping. We will defer the
>          * ZONE_DEVICE page initialization until after we have released
>          * the hotplug lock.
> @@ -5750,8 +5749,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>                 if (!altmap)
>                         return;
>
> -               if (start_pfn == altmap->base_pfn)
> -                       start_pfn += altmap->reserve;

If it's reserved then we should not be accessing, even if the above
works in practice. Isn't the fix something more like this to fix up
the assumptions at release time?

diff --git a/kernel/memremap.c b/kernel/memremap.c
index a856cb5ff192..9074ba14572c 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -90,6 +90,7 @@ static void devm_memremap_pages_release(void *data)
  struct device *dev = pgmap->dev;
  struct resource *res = &pgmap->res;
  resource_size_t align_start, align_size;
+ struct vmem_altmap *altmap = pgmap->altmap_valid ? &pgmap->altmap : NULL;
  unsigned long pfn;
  int nid;

@@ -102,7 +103,10 @@ static void devm_memremap_pages_release(void *data)
  align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
  - align_start;

- nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT));
+ pfn = align_start >> PAGE_SHIFT;
+ if (altmap)
+ pfn += vmem_altmap_offset(altmap);
+ nid = page_to_nid(pfn_to_page(pfn));

  mem_hotplug_begin();
  if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
@@ -110,8 +114,7 @@ static void devm_memremap_pages_release(void *data)
  __remove_pages(page_zone(pfn_to_page(pfn)), pfn,
  align_size >> PAGE_SHIFT, NULL);
  } else {
- arch_remove_memory(nid, align_start, align_size,
- pgmap->altmap_valid ? &pgmap->altmap : NULL);
+ arch_remove_memory(nid, align_start, align_size, altmap);
  kasan_remove_zero_shadow(__va(align_start), align_size);
  }
  mem_hotplug_done();

^ permalink raw reply related

* Re: [PATCH] mm/nvdimm: Use correct #defines instead of opencoding
From: Dan Williams @ 2019-05-14  4:12 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Linux MM, linuxppc-dev, linux-nvdimm
In-Reply-To: <f99c4f11-a43d-c2d3-ab4f-b7072d090351@linux.ibm.com>

On Mon, May 13, 2019 at 9:05 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 5/14/19 9:28 AM, Dan Williams wrote:
> > On Mon, May 13, 2019 at 7:56 PM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> The nfpn related change is needed to fix the kernel message
> >>
> >> "number of pfns truncated from 2617344 to 163584"
> >>
> >> The change makes sure the nfpns stored in the superblock is right value.
> >>
> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> >> ---
> >>   drivers/nvdimm/pfn_devs.c    | 6 +++---
> >>   drivers/nvdimm/region_devs.c | 8 ++++----
> >>   2 files changed, 7 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> >> index 347cab166376..6751ff0296ef 100644
> >> --- a/drivers/nvdimm/pfn_devs.c
> >> +++ b/drivers/nvdimm/pfn_devs.c
> >> @@ -777,8 +777,8 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
> >>                   * when populating the vmemmap. This *should* be equal to
> >>                   * PMD_SIZE for most architectures.
> >>                   */
> >> -               offset = ALIGN(start + reserve + 64 * npfns,
> >> -                               max(nd_pfn->align, PMD_SIZE)) - start;
> >> +               offset = ALIGN(start + reserve + sizeof(struct page) * npfns,
> >> +                              max(nd_pfn->align, PMD_SIZE)) - start;
> >
> > No, I think we need to record the page-size into the superblock format
> > otherwise this breaks in debug builds where the struct-page size is
> > extended.
> >
> >>          } else if (nd_pfn->mode == PFN_MODE_RAM)
> >>                  offset = ALIGN(start + reserve, nd_pfn->align) - start;
> >>          else
> >> @@ -790,7 +790,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
> >>                  return -ENXIO;
> >>          }
> >>
> >> -       npfns = (size - offset - start_pad - end_trunc) / SZ_4K;
> >> +       npfns = (size - offset - start_pad - end_trunc) / PAGE_SIZE;
> >
> > Similar comment, if the page size is variable then the superblock
> > needs to explicitly account for it.
> >
>
> PAGE_SIZE is not really variable. What we can run into is the issue you
> mentioned above. The size of struct page can change which means the
> reserved space for keeping vmemmap in device may not be sufficient for
> certain kernel builds.
>
> I was planning to add another patch that fails namespace init if we
> don't have enough space to keep the struct page.
>
> Why do you suggest we need to have PAGE_SIZE as part of pfn superblock?

So that the kernel has a chance to identify cases where the superblock
it is handling was created on a system with different PAGE_SIZE
assumptions.

^ permalink raw reply

* Re: [PATCH] mm/nvdimm: Use correct #defines instead of opencoding
From: Aneesh Kumar K.V @ 2019-05-14  4:05 UTC (permalink / raw)
  To: Dan Williams; +Cc: Linux MM, linuxppc-dev, linux-nvdimm
In-Reply-To: <CAPcyv4iNgFbSq0Hqb+CStRhGWMHfXx7tL3vrDaQ95DcBBY8QCQ@mail.gmail.com>

On 5/14/19 9:28 AM, Dan Williams wrote:
> On Mon, May 13, 2019 at 7:56 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> The nfpn related change is needed to fix the kernel message
>>
>> "number of pfns truncated from 2617344 to 163584"
>>
>> The change makes sure the nfpns stored in the superblock is right value.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   drivers/nvdimm/pfn_devs.c    | 6 +++---
>>   drivers/nvdimm/region_devs.c | 8 ++++----
>>   2 files changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
>> index 347cab166376..6751ff0296ef 100644
>> --- a/drivers/nvdimm/pfn_devs.c
>> +++ b/drivers/nvdimm/pfn_devs.c
>> @@ -777,8 +777,8 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>>                   * when populating the vmemmap. This *should* be equal to
>>                   * PMD_SIZE for most architectures.
>>                   */
>> -               offset = ALIGN(start + reserve + 64 * npfns,
>> -                               max(nd_pfn->align, PMD_SIZE)) - start;
>> +               offset = ALIGN(start + reserve + sizeof(struct page) * npfns,
>> +                              max(nd_pfn->align, PMD_SIZE)) - start;
> 
> No, I think we need to record the page-size into the superblock format
> otherwise this breaks in debug builds where the struct-page size is
> extended.
> 
>>          } else if (nd_pfn->mode == PFN_MODE_RAM)
>>                  offset = ALIGN(start + reserve, nd_pfn->align) - start;
>>          else
>> @@ -790,7 +790,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>>                  return -ENXIO;
>>          }
>>
>> -       npfns = (size - offset - start_pad - end_trunc) / SZ_4K;
>> +       npfns = (size - offset - start_pad - end_trunc) / PAGE_SIZE;
> 
> Similar comment, if the page size is variable then the superblock
> needs to explicitly account for it.
> 

PAGE_SIZE is not really variable. What we can run into is the issue you 
mentioned above. The size of struct page can change which means the 
reserved space for keeping vmemmap in device may not be sufficient for 
certain kernel builds.

I was planning to add another patch that fails namespace init if we 
don't have enough space to keep the struct page.

Why do you suggest we need to have PAGE_SIZE as part of pfn superblock?

-aneesh


^ permalink raw reply

* Re: [RFC PATCH] mm/nvdimm: Fix kernel crash on devm_mremap_pages_release
From: Anshuman Khandual @ 2019-05-14  4:05 UTC (permalink / raw)
  To: Aneesh Kumar K.V, dan.j.williams; +Cc: linux-mm, linuxppc-dev, linux-nvdimm
In-Reply-To: <20190514025354.9108-1-aneesh.kumar@linux.ibm.com>

On 05/14/2019 08:23 AM, Aneesh Kumar K.V wrote:
> When we initialize the namespace, if we support altmap, we don't initialize all the
> backing struct page where as while releasing the namespace we look at some of
> these uninitilized struct page. This results in a kernel crash as below.
Yes this has been problematic which I have also previously encountered but in a bit
different way (while searching memory resources).

> 
> kernel BUG at include/linux/mm.h:1034!
What that would be ? Did not see a corresponding BUG_ON() line in the file.

> cpu 0x2: Vector: 700 (Program Check) at [c00000024146b870]
>     pc: c0000000003788f8: devm_memremap_pages_release+0x258/0x3a0
>     lr: c0000000003788f4: devm_memremap_pages_release+0x254/0x3a0
>     sp: c00000024146bb00
>    msr: 800000000282b033
>   current = 0xc000000241382f00
>   paca    = 0xc00000003fffd680   irqmask: 0x03   irq_happened: 0x01
>     pid   = 4114, comm = ndctl
>  c0000000009bf8c0 devm_action_release+0x30/0x50
>  c0000000009c0938 release_nodes+0x268/0x2d0
>  c0000000009b95b4 device_release_driver_internal+0x164/0x230
>  c0000000009b638c unbind_store+0x13c/0x190
>  c0000000009b4f44 drv_attr_store+0x44/0x60
>  c00000000058ccc0 sysfs_kf_write+0x70/0xa0
>  c00000000058b52c kernfs_fop_write+0x1ac/0x290
>  c0000000004a415c __vfs_write+0x3c/0x70
>  c0000000004a85ac vfs_write+0xec/0x200
>  c0000000004a8920 ksys_write+0x80/0x130
>  c00000000000bee4 system_call+0x5c/0x70

I saw this as memory hotplug problem with respect to ZONE_DEVICE based device memory.
Hence a bit different explanation which I never posted. I guess parts of the commit
message here can be used for a better comprehensive explanation of the problem.

mm/hotplug: Initialize struct pages for vmem_altmap reserved areas

The following ZONE_DEVICE ranges (altmap) have valid struct pages allocated
from within device memory memmap range.

A. Driver reserved area	[BASE -> BASE + RESV)
B. Device mmap area	[BASE + RESV -> BASE + RESV + FREE]
C. Device usable area	[BASE + RESV + FREE -> END]

BASE - pgmap->altmap.base_pfn (pgmap->res.start >> PAGE_SHIFT)
RESV - pgmap->altmap.reserve
FREE - pgmap->altmap.free
END  - pgmap->res->end >> PAGE_SHIFT

Struct page init for all areas happens in two phases which detects altmap
use case and init parts of the device range in each phase.

1. memmap_init_zone		(Device mmap area)
2. memmap_init_zone_device	(Device usable area)

memmap_init_zone() skips driver reserved area and does not init the
struct pages. This is problematic primarily for two reasons.

Though NODE_DATA(device_node(dev))->node_zones[ZONE_DEVICE] contains the
device memory range in it's entirety (in zone->spanned_pages) parts of this
range does not have zone set to ZONE_DEVICE in their struct page.

__remove_pages() called directly or from within arch_remove_memory() during
ZONE_DEVICE tear down procedure (devm_memremap_pages_release) hits an error
(like below) if there are reserved pages. This is because the first pfn of
the device range (invariably also the first pfn from reserved area) cannot
be identified belonging to ZONE_DEVICE. This erroneously leads range search
within iomem_resource region which never had this device memory region. So
this eventually ends up flashing the following error.

Unable to release resource <0x0000000680000000-0x00000006bfffffff> (-22)

Initialize struct pages for the driver reserved range while still staying
clear from it's contents.

> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  mm/page_alloc.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 59661106da16..892eabe1ec13 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5740,8 +5740,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>  
>  #ifdef CONFIG_ZONE_DEVICE
>  	/*
> -	 * Honor reservation requested by the driver for this ZONE_DEVICE
> -	 * memory. We limit the total number of pages to initialize to just
> +	 * We limit the total number of pages to initialize to just
Comment needs bit change to reflect on the fact that both driver reserved as
well as mapped area (containing altmap struct pages) needs init here.

^ permalink raw reply

* Re: [PATCH] mm/nvdimm: Use correct #defines instead of opencoding
From: Dan Williams @ 2019-05-14  3:58 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Linux MM, linuxppc-dev, linux-nvdimm
In-Reply-To: <20190514025604.9997-1-aneesh.kumar@linux.ibm.com>

On Mon, May 13, 2019 at 7:56 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> The nfpn related change is needed to fix the kernel message
>
> "number of pfns truncated from 2617344 to 163584"
>
> The change makes sure the nfpns stored in the superblock is right value.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  drivers/nvdimm/pfn_devs.c    | 6 +++---
>  drivers/nvdimm/region_devs.c | 8 ++++----
>  2 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index 347cab166376..6751ff0296ef 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -777,8 +777,8 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>                  * when populating the vmemmap. This *should* be equal to
>                  * PMD_SIZE for most architectures.
>                  */
> -               offset = ALIGN(start + reserve + 64 * npfns,
> -                               max(nd_pfn->align, PMD_SIZE)) - start;
> +               offset = ALIGN(start + reserve + sizeof(struct page) * npfns,
> +                              max(nd_pfn->align, PMD_SIZE)) - start;

No, I think we need to record the page-size into the superblock format
otherwise this breaks in debug builds where the struct-page size is
extended.

>         } else if (nd_pfn->mode == PFN_MODE_RAM)
>                 offset = ALIGN(start + reserve, nd_pfn->align) - start;
>         else
> @@ -790,7 +790,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>                 return -ENXIO;
>         }
>
> -       npfns = (size - offset - start_pad - end_trunc) / SZ_4K;
> +       npfns = (size - offset - start_pad - end_trunc) / PAGE_SIZE;

Similar comment, if the page size is variable then the superblock
needs to explicitly account for it.

^ permalink raw reply

* [PATCH] mm/nvdimm: Use correct #defines instead of opencoding
From: Aneesh Kumar K.V @ 2019-05-14  2:56 UTC (permalink / raw)
  To: dan.j.williams; +Cc: linux-mm, linuxppc-dev, Aneesh Kumar K.V, linux-nvdimm

The nfpn related change is needed to fix the kernel message

"number of pfns truncated from 2617344 to 163584"

The change makes sure the nfpns stored in the superblock is right value.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/nvdimm/pfn_devs.c    | 6 +++---
 drivers/nvdimm/region_devs.c | 8 ++++----
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 347cab166376..6751ff0296ef 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -777,8 +777,8 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 		 * when populating the vmemmap. This *should* be equal to
 		 * PMD_SIZE for most architectures.
 		 */
-		offset = ALIGN(start + reserve + 64 * npfns,
-				max(nd_pfn->align, PMD_SIZE)) - start;
+		offset = ALIGN(start + reserve + sizeof(struct page) * npfns,
+			       max(nd_pfn->align, PMD_SIZE)) - start;
 	} else if (nd_pfn->mode == PFN_MODE_RAM)
 		offset = ALIGN(start + reserve, nd_pfn->align) - start;
 	else
@@ -790,7 +790,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 		return -ENXIO;
 	}
 
-	npfns = (size - offset - start_pad - end_trunc) / SZ_4K;
+	npfns = (size - offset - start_pad - end_trunc) / PAGE_SIZE;
 	pfn_sb->mode = cpu_to_le32(nd_pfn->mode);
 	pfn_sb->dataoff = cpu_to_le64(offset);
 	pfn_sb->npfns = cpu_to_le64(npfns);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b4ef7d9ff22e..2d8facea5a03 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -994,10 +994,10 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 		struct nd_mapping_desc *mapping = &ndr_desc->mapping[i];
 		struct nvdimm *nvdimm = mapping->nvdimm;
 
-		if ((mapping->start | mapping->size) % SZ_4K) {
-			dev_err(&nvdimm_bus->dev, "%s: %s mapping%d is not 4K aligned\n",
-					caller, dev_name(&nvdimm->dev), i);
-
+		if ((mapping->start | mapping->size) % PAGE_SIZE) {
+			dev_err(&nvdimm_bus->dev,
+				"%s: %s mapping%d is not 4K aligned\n",
+				caller, dev_name(&nvdimm->dev), i);
 			return NULL;
 		}
 
-- 
2.21.0


^ permalink raw reply related

* [PATCH] mm/nvdimm: Use correct alignment when looking at first pfn from a region
From: Aneesh Kumar K.V @ 2019-05-14  2:55 UTC (permalink / raw)
  To: dan.j.williams; +Cc: linux-mm, linuxppc-dev, Aneesh Kumar K.V, linux-nvdimm

We already add the start_pad to the resource->start but fails to section
align the start. This make sure with altmap we compute the right first
pfn when start_pad is zero and we are doing an align down of start address.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 kernel/memremap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index a856cb5ff192..23d77b60e728 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -59,9 +59,9 @@ static unsigned long pfn_first(struct dev_pagemap *pgmap)
 {
 	const struct resource *res = &pgmap->res;
 	struct vmem_altmap *altmap = &pgmap->altmap;
-	unsigned long pfn;
+	unsigned long pfn = PHYS_PFN(res->start);
 
-	pfn = res->start >> PAGE_SHIFT;
+	pfn = SECTION_ALIGN_DOWN(pfn);
 	if (pgmap->altmap_valid)
 		pfn += vmem_altmap_offset(altmap);
 	return pfn;
-- 
2.21.0


^ permalink raw reply related

* [PATCH] mm/nvdimm: Pick the right alignment default when creating dax devices
From: Aneesh Kumar K.V @ 2019-05-14  2:54 UTC (permalink / raw)
  To: dan.j.williams; +Cc: linux-mm, linuxppc-dev, Aneesh Kumar K.V, linux-nvdimm

Allow arch to provide the supported alignments and use hugepage alignment only
if we support hugepage. Right now we depend on compile time configs whereas this
patch switch this to runtime discovery.

Architectures like ppc64 can have THP enabled in code, but then can have
hugepage size disabled by the hypervisor. This allows us to create dax devices
with PAGE_SIZE alignment in this case.

Existing dax namespace with alignment larger than PAGE_SIZE will fail to
initialize in this specific case. We still allow fsdax namespace initialization.

With respect to identifying whether to enable hugepage fault for a dax device,
if THP is enabled during compile, we default to taking hugepage fault and in dax
fault handler if we find the fault size > alignment we retry with PAGE_SIZE
fault size.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/libnvdimm.h |  9 ++++++++
 arch/powerpc/mm/Makefile             |  1 +
 arch/powerpc/mm/nvdimm.c             | 34 ++++++++++++++++++++++++++++
 arch/x86/include/asm/libnvdimm.h     | 19 ++++++++++++++++
 drivers/nvdimm/nd.h                  |  6 -----
 drivers/nvdimm/pfn_devs.c            | 32 +++++++++++++++++++++++++-
 include/linux/huge_mm.h              |  7 +++++-
 7 files changed, 100 insertions(+), 8 deletions(-)
 create mode 100644 arch/powerpc/include/asm/libnvdimm.h
 create mode 100644 arch/powerpc/mm/nvdimm.c
 create mode 100644 arch/x86/include/asm/libnvdimm.h

diff --git a/arch/powerpc/include/asm/libnvdimm.h b/arch/powerpc/include/asm/libnvdimm.h
new file mode 100644
index 000000000000..d35fd7f48603
--- /dev/null
+++ b/arch/powerpc/include/asm/libnvdimm.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_LIBNVDIMM_H
+#define _ASM_POWERPC_LIBNVDIMM_H
+
+#define nd_pfn_supported_alignments nd_pfn_supported_alignments
+extern unsigned long *nd_pfn_supported_alignments(void);
+extern unsigned long nd_pfn_default_alignment(void);
+
+#endif
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 0f499db315d6..42e4a399ba5d 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -20,3 +20,4 @@ obj-$(CONFIG_HIGHMEM)		+= highmem.o
 obj-$(CONFIG_PPC_COPRO_BASE)	+= copro_fault.o
 obj-$(CONFIG_PPC_PTDUMP)	+= ptdump/
 obj-$(CONFIG_KASAN)		+= kasan/
+obj-$(CONFIG_NVDIMM_PFN)		+= nvdimm.o
diff --git a/arch/powerpc/mm/nvdimm.c b/arch/powerpc/mm/nvdimm.c
new file mode 100644
index 000000000000..a29a4510715e
--- /dev/null
+++ b/arch/powerpc/mm/nvdimm.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <asm/pgtable.h>
+#include <asm/page.h>
+
+#include <linux/mm.h>
+/*
+ * We support only pte and pmd mappings for now.
+ */
+const unsigned long *nd_pfn_supported_alignments(void)
+{
+	static unsigned long supported_alignments[3];
+
+	supported_alignments[0] = PAGE_SIZE;
+
+	if (has_transparent_hugepage())
+		supported_alignments[1] = HPAGE_PMD_SIZE;
+	else
+		supported_alignments[1] = 0;
+
+	supported_alignments[2] = 0;
+	return supported_alignments;
+}
+
+/*
+ * Use pmd mapping if supported as default alignment
+ */
+unsigned long nd_pfn_default_alignment(void)
+{
+
+	if (has_transparent_hugepage())
+		return HPAGE_PMD_SIZE;
+	return PAGE_SIZE;
+}
diff --git a/arch/x86/include/asm/libnvdimm.h b/arch/x86/include/asm/libnvdimm.h
new file mode 100644
index 000000000000..3d5361db9164
--- /dev/null
+++ b/arch/x86/include/asm/libnvdimm.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_LIBNVDIMM_H
+#define _ASM_X86_LIBNVDIMM_H
+
+static inline unsigned long nd_pfn_default_alignment(void)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	return HPAGE_PMD_SIZE;
+#else
+	return PAGE_SIZE;
+#endif
+}
+
+static inline unsigned long nd_altmap_align_size(unsigned long nd_align)
+{
+	return PMD_SIZE;
+}
+
+#endif
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index a5ac3b240293..44fe923b2ee3 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -292,12 +292,6 @@ static inline struct device *nd_btt_create(struct nd_region *nd_region)
 struct nd_pfn *to_nd_pfn(struct device *dev);
 #if IS_ENABLED(CONFIG_NVDIMM_PFN)
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define PFN_DEFAULT_ALIGNMENT HPAGE_PMD_SIZE
-#else
-#define PFN_DEFAULT_ALIGNMENT PAGE_SIZE
-#endif
-
 int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns);
 bool is_nd_pfn(struct device *dev);
 struct device *nd_pfn_create(struct nd_region *nd_region);
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 01f40672507f..347cab166376 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -18,6 +18,7 @@
 #include <linux/slab.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <asm/libnvdimm.h>
 #include "nd-core.h"
 #include "pfn.h"
 #include "nd.h"
@@ -111,6 +112,8 @@ static ssize_t align_show(struct device *dev,
 	return sprintf(buf, "%ld\n", nd_pfn->align);
 }
 
+#ifndef nd_pfn_supported_alignments
+#define nd_pfn_supported_alignments nd_pfn_supported_alignments
 static const unsigned long *nd_pfn_supported_alignments(void)
 {
 	/*
@@ -133,6 +136,7 @@ static const unsigned long *nd_pfn_supported_alignments(void)
 
 	return data;
 }
+#endif
 
 static ssize_t align_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t len)
@@ -310,7 +314,7 @@ struct device *nd_pfn_devinit(struct nd_pfn *nd_pfn,
 		return NULL;
 
 	nd_pfn->mode = PFN_MODE_NONE;
-	nd_pfn->align = PFN_DEFAULT_ALIGNMENT;
+	nd_pfn->align = nd_pfn_default_alignment();
 	dev = &nd_pfn->dev;
 	device_initialize(&nd_pfn->dev);
 	if (ndns && !__nd_attach_ndns(&nd_pfn->dev, ndns, &nd_pfn->ndns)) {
@@ -420,6 +424,20 @@ static int nd_pfn_clear_memmap_errors(struct nd_pfn *nd_pfn)
 	return 0;
 }
 
+static bool nd_supported_alignment(unsigned long align)
+{
+	int i;
+	const unsigned long *supported = nd_pfn_supported_alignments();
+
+	if (align == 0)
+		return false;
+
+	for (i = 0; supported[i]; i++)
+		if (align == supported[i])
+			return true;
+	return false;
+}
+
 int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
 {
 	u64 checksum, offset;
@@ -474,6 +492,18 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
 		align = 1UL << ilog2(offset);
 	mode = le32_to_cpu(pfn_sb->mode);
 
+	/*
+	 * Check whether the we support the alignment. For Dax if the
+	 * superblock alignment is not matching, we won't initialize
+	 * the device.
+	 */
+	if (!nd_supported_alignment(align) &&
+	    memcmp(pfn_sb->signature, DAX_SIG, PFN_SIG_LEN)) {
+		dev_err(&nd_pfn->dev, "init failed, settings mismatch\n");
+		dev_dbg(&nd_pfn->dev, "align: %lx:%lx\n", nd_pfn->align, align);
+		return -EINVAL;
+	}
+
 	if (!nd_pfn->uuid) {
 		/*
 		 * When probing a namepace via nd_pfn_probe() the uuid
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 381e872bfde0..d5cfea3d8b86 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -110,7 +110,12 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 
 	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
 		return true;
-
+	/*
+	 * For dax let's try to do hugepage fault always. If we don't support
+	 * hugepages we will not have enabled namespaces with hugepage alignment.
+	 * This also means we try to handle hugepage fault on device with
+	 * smaller alignment. But for then we will return with VM_FAULT_FALLBACK
+	 */
 	if (vma_is_dax(vma))
 		return true;
 
-- 
2.21.0


^ permalink raw reply related

* [RFC PATCH] mm/nvdimm: Fix kernel crash on devm_mremap_pages_release
From: Aneesh Kumar K.V @ 2019-05-14  2:53 UTC (permalink / raw)
  To: dan.j.williams; +Cc: linux-mm, linuxppc-dev, Aneesh Kumar K.V, linux-nvdimm

When we initialize the namespace, if we support altmap, we don't initialize all the
backing struct page where as while releasing the namespace we look at some of
these uninitilized struct page. This results in a kernel crash as below.

kernel BUG at include/linux/mm.h:1034!
cpu 0x2: Vector: 700 (Program Check) at [c00000024146b870]
    pc: c0000000003788f8: devm_memremap_pages_release+0x258/0x3a0
    lr: c0000000003788f4: devm_memremap_pages_release+0x254/0x3a0
    sp: c00000024146bb00
   msr: 800000000282b033
  current = 0xc000000241382f00
  paca    = 0xc00000003fffd680   irqmask: 0x03   irq_happened: 0x01
    pid   = 4114, comm = ndctl
 c0000000009bf8c0 devm_action_release+0x30/0x50
 c0000000009c0938 release_nodes+0x268/0x2d0
 c0000000009b95b4 device_release_driver_internal+0x164/0x230
 c0000000009b638c unbind_store+0x13c/0x190
 c0000000009b4f44 drv_attr_store+0x44/0x60
 c00000000058ccc0 sysfs_kf_write+0x70/0xa0
 c00000000058b52c kernfs_fop_write+0x1ac/0x290
 c0000000004a415c __vfs_write+0x3c/0x70
 c0000000004a85ac vfs_write+0xec/0x200
 c0000000004a8920 ksys_write+0x80/0x130
 c00000000000bee4 system_call+0x5c/0x70

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/page_alloc.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 59661106da16..892eabe1ec13 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5740,8 +5740,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 #ifdef CONFIG_ZONE_DEVICE
 	/*
-	 * Honor reservation requested by the driver for this ZONE_DEVICE
-	 * memory. We limit the total number of pages to initialize to just
+	 * We limit the total number of pages to initialize to just
 	 * those that might contain the memory mapping. We will defer the
 	 * ZONE_DEVICE page initialization until after we have released
 	 * the hotplug lock.
@@ -5750,8 +5749,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		if (!altmap)
 			return;
 
-		if (start_pfn == altmap->base_pfn)
-			start_pfn += altmap->reserve;
 		end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
 	}
 #endif
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH] vsprintf: Do not break early boot with probing addresses
From: Sergey Senozhatsky @ 2019-05-14  2:25 UTC (permalink / raw)
  To: Petr Mladek
  Cc: linux-arch@vger.kernel.org, Sergey Senozhatsky, Heiko Carstens,
	linux-s390@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	Rasmus Villemoes, linux-kernel@vger.kernel.org, Steven Rostedt,
	Michal Hocko, Sergey Senozhatsky, David Laight, Stephen Rothwell,
	Andy Shevchenko, Linus Torvalds, Martin Schwidefsky,
	Tobin C . Harding
In-Reply-To: <20190514020730.GA651@jagdpanzerIV>

On (05/14/19 11:07), Sergey Senozhatsky wrote:
> How about this:
> 
> 	if ptr < PAGE_SIZE		-> "(null)"

No, this is totally stupid. Forget about it. Sorry.


> 	if IS_ERR_VALUE(ptr)		-> "(fault)"

But Steven's "(fault)" is nice.

	-ss

^ permalink raw reply

* [v2 2/2] [PowerPC] Allow use of SIMD in interrupts from kernel code
From: Shawn Landden @ 2019-05-14  2:23 UTC (permalink / raw)
  Cc: Shawn Landden, Paul Mackerras, linuxppc-dev
In-Reply-To: <20190514022308.32363-1-shawn@git.icu>

This second patch is separate because it could be wrong,
like I am not sure about how kernel thread migration works,
and it is even allowing simd in preemptible kernel code.

Signed-off-by: Shawn Landden <shawn@git.icu>
---
 arch/powerpc/include/asm/switch_to.h | 15 ++-----
 arch/powerpc/kernel/process.c        | 60 ++++++++++++++++++++++++++--
 2 files changed, 59 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/switch_to.h b/arch/powerpc/include/asm/switch_to.h
index 5b03d8a82..c79f7d24a 100644
--- a/arch/powerpc/include/asm/switch_to.h
+++ b/arch/powerpc/include/asm/switch_to.h
@@ -30,10 +30,7 @@ extern void enable_kernel_fp(void);
 extern void flush_fp_to_thread(struct task_struct *);
 extern void giveup_fpu(struct task_struct *);
 extern void save_fpu(struct task_struct *);
-static inline void disable_kernel_fp(void)
-{
-	msr_check_and_clear(MSR_FP);
-}
+extern void disable_kernel_fp(void);
 #else
 static inline void save_fpu(struct task_struct *t) { }
 static inline void flush_fp_to_thread(struct task_struct *t) { }
@@ -44,10 +41,7 @@ extern void enable_kernel_altivec(void);
 extern void flush_altivec_to_thread(struct task_struct *);
 extern void giveup_altivec(struct task_struct *);
 extern void save_altivec(struct task_struct *);
-static inline void disable_kernel_altivec(void)
-{
-	msr_check_and_clear(MSR_VEC);
-}
+extern void disable_kernel_altivec(void);
 #else
 static inline void save_altivec(struct task_struct *t) { }
 static inline void __giveup_altivec(struct task_struct *t) { }
@@ -56,10 +50,7 @@ static inline void __giveup_altivec(struct task_struct *t) { }
 #ifdef CONFIG_VSX
 extern void enable_kernel_vsx(void);
 extern void flush_vsx_to_thread(struct task_struct *);
-static inline void disable_kernel_vsx(void)
-{
-	msr_check_and_clear(MSR_FP|MSR_VEC|MSR_VSX);
-}
+extern void disable_kernel_vsx(void);
 #endif
 
 #ifdef CONFIG_SPE
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index ef534831f..d15f2cb51 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -170,6 +170,29 @@ void __msr_check_and_clear(unsigned long bits)
 EXPORT_SYMBOL(__msr_check_and_clear);
 
 #ifdef CONFIG_PPC_FPU
+/*
+ * Track whether the kernel is using the FPU state
+ * currently.
+ *
+ * This flag is used:
+ *
+ *   - by IRQ context code to potentially use the FPU
+ *     if it's unused.
+ *
+ *   - to debug kernel_fpu/altivec/vsx_begin()/end() correctness
+ */
+static DEFINE_PER_CPU(bool, in_kernel_fpu);
+
+static bool kernel_fpu_disabled(void)
+{
+        return this_cpu_read(in_kernel_fpu);
+}
+
+static bool interrupted_kernel_fpu_idle(void)
+{
+        return !kernel_fpu_disabled();
+}
+
 static void __giveup_fpu(struct task_struct *tsk)
 {
 	unsigned long msr;
@@ -230,7 +253,8 @@ void enable_kernel_fp(void)
 {
 	unsigned long cpumsr;
 
-	WARN_ON(preemptible());
+        WARN_ON_ONCE(this_cpu_read(in_kernel_fpu));
+        this_cpu_write(in_kernel_fpu, true);
 
 	cpumsr = msr_check_and_set(MSR_FP);
 
@@ -251,6 +275,15 @@ void enable_kernel_fp(void)
 }
 EXPORT_SYMBOL(enable_kernel_fp);
 
+void disable_kernel_fp(void)
+{
+        WARN_ON_ONCE(!this_cpu_read(in_kernel_fpu));
+        this_cpu_write(in_kernel_fpu, false);
+
+        msr_check_and_clear(MSR_FP);
+}
+EXPORT_SYMBOL(disable_kernel_fp);
+
 static int restore_fp(struct task_struct *tsk)
 {
 	if (tsk->thread.load_fp || tm_active_with_fp(tsk)) {
@@ -295,7 +328,8 @@ void enable_kernel_altivec(void)
 {
 	unsigned long cpumsr;
 
-	WARN_ON(preemptible());
+	WARN_ON_ONCE(this_cpu_read(in_kernel_fpu));
+	this_cpu_write(in_kernel_fpu, true);
 
 	cpumsr = msr_check_and_set(MSR_VEC);
 
@@ -316,6 +350,14 @@ void enable_kernel_altivec(void)
 }
 EXPORT_SYMBOL(enable_kernel_altivec);
 
+extern void disable_kernel_altivec(void)
+{
+	WARN_ON_ONCE(!this_cpu_read(in_kernel_fpu));
+	this_cpu_write(in_kernel_fpu, false);
+	msr_check_and_clear(MSR_VEC);
+}
+EXPORT_SYMBOL(disable_kernel_altivec);
+
 /*
  * Make sure the VMX/Altivec register state in the
  * the thread_struct is up to date for task tsk.
@@ -371,7 +413,8 @@ static bool interrupted_user_mode(void)
 bool may_use_simd(void)
 {
 	return !in_interrupt() ||
-		interrupted_user_mode();
+		interrupted_user_mode() ||
+		interrupted_kernel_fpu_idle();
 }
 EXPORT_SYMBOL(may_use_simd);
 
@@ -411,7 +454,8 @@ void enable_kernel_vsx(void)
 {
 	unsigned long cpumsr;
 
-	WARN_ON(preemptible());
+	WARN_ON_ONCE(this_cpu_read(in_kernel_fpu));
+	this_cpu_write(in_kernel_fpu, true);
 
 	cpumsr = msr_check_and_set(MSR_FP|MSR_VEC|MSR_VSX);
 
@@ -433,6 +477,14 @@ void enable_kernel_vsx(void)
 }
 EXPORT_SYMBOL(enable_kernel_vsx);
 
+void disable_kernel_vsx(void)
+{
+	WARN_ON_ONCE(!this_cpu_read(in_kernel_fpu));
+	this_cpu_write(in_kernel_fpu, false);
+	msr_check_and_clear(MSR_FP|MSR_VEC|MSR_VSX);
+}
+EXPORT_SYMBOL(disable_kernel_vsx);
+
 void flush_vsx_to_thread(struct task_struct *tsk)
 {
 	if (tsk->thread.regs) {
-- 
2.21.0.1020.gf2820cf01a


^ permalink raw reply related

* [v2 1/2] [PowerPC] Add simd.h implementation
From: Shawn Landden @ 2019-05-14  2:23 UTC (permalink / raw)
  Cc: Shawn Landden, Paul Mackerras, linuxppc-dev
In-Reply-To: <20190513005104.20140-1-shawn@git.icu>

Based off the x86 one.

WireGuard really wants to be able to do SIMD in interrupts,
so it can accelerate its in-bound path.

Signed-off-by: Shawn Landden <shawn@git.icu>
---
 arch/powerpc/include/asm/simd.h | 10 ++++++++++
 arch/powerpc/kernel/process.c   | 30 ++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)
 create mode 100644 arch/powerpc/include/asm/simd.h

diff --git a/arch/powerpc/include/asm/simd.h b/arch/powerpc/include/asm/simd.h
new file mode 100644
index 000000000..9b066d633
--- /dev/null
+++ b/arch/powerpc/include/asm/simd.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+/*
+ * may_use_simd - whether it is allowable at this time to issue SIMD
+ *                instructions or access the SIMD register file
+ *
+ * It's always ok in process context (ie "not interrupt")
+ * but it is sometimes ok even from an irq.
+ */
+extern bool may_use_simd(void);
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index dd9e0d538..ef534831f 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -345,6 +345,36 @@ static int restore_altivec(struct task_struct *tsk)
 	}
 	return 0;
 }
+
+/*
+ * Were we in user mode when we were
+ * interrupted?
+ *
+ * Doing kernel_altivec/vsx_begin/end() is ok if we are running
+ * in an interrupt context from user mode - we'll just
+ * save the FPU state as required.
+ */
+static bool interrupted_user_mode(void)
+{
+	struct pt_regs *regs = get_irq_regs();
+
+	return regs && user_mode(regs);
+}
+
+/*
+ * Can we use FPU in kernel mode with the
+ * whole "kernel_fpu/altivec/vsx_begin/end()" sequence?
+ *
+ * It's always ok in process context (ie "not interrupt")
+ * but it is sometimes ok even from an irq.
+ */
+bool may_use_simd(void)
+{
+	return !in_interrupt() ||
+		interrupted_user_mode();
+}
+EXPORT_SYMBOL(may_use_simd);
+
 #else
 #define loadvec(thr) 0
 static inline int restore_altivec(struct task_struct *tsk) { return 0; }
-- 
2.21.0.1020.gf2820cf01a


^ permalink raw reply related

* Re: [PATCH] vsprintf: Do not break early boot with probing addresses
From: Sergey Senozhatsky @ 2019-05-14  2:07 UTC (permalink / raw)
  To: Petr Mladek
  Cc: linux-arch@vger.kernel.org, Sergey Senozhatsky, Heiko Carstens,
	linux-s390@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	Rasmus Villemoes, linux-kernel@vger.kernel.org, Steven Rostedt,
	Michal Hocko, Sergey Senozhatsky, David Laight, Stephen Rothwell,
	Andy Shevchenko, Linus Torvalds, Martin Schwidefsky,
	Tobin C . Harding
In-Reply-To: <20190513124220.wty2qbnz4wo52h3x@pathway.suse.cz>

On (05/13/19 14:42), Petr Mladek wrote:
> > The "(null)" is good enough by itself and already an established
> > practice..
> 
> (efault) made more sense with the probe_kernel_read() that
> checked wide range of addresses. Well, I still think that
> it makes sense to distinguish a pure NULL. And it still
> used also for IS_ERR_VALUE().

Wouldn't anything within first PAGE_SIZE bytes be reported as
a NULL deref?

       char *p = (char *)(PAGE_SIZE - 2);
       *p = 'a';

gives

 kernel: BUG: kernel NULL pointer dereference, address = 0000000000000ffe
 kernel: #PF: supervisor-privileged write access from kernel code
 kernel: #PF: error_code(0x0002) - not-present page


And I like Steven's "(fault)" idea.
How about this:

	if ptr < PAGE_SIZE		-> "(null)"
	if IS_ERR_VALUE(ptr)		-> "(fault)"

	-ss

^ permalink raw reply

* [PATCH 2/2] [PowerPC] Allow use of SIMD in interrupts from kernel code
From: Shawn Landden @ 2019-05-14  1:44 UTC (permalink / raw)
  Cc: Shawn Landden, Paul Mackerras, linuxppc-dev
In-Reply-To: <20190514014412.25373-1-shawn@git.icu>

This second patch is separate because it could be wrong,
like I am not sure about how kernel thread migration works,
and it is even allowing simd in preemptible kernel code.

Signed-off-by: Shawn Landden <shawn@git.icu>
---
 arch/powerpc/include/asm/simd.h      |  8 +++++
 arch/powerpc/include/asm/switch_to.h | 10 ++----
 arch/powerpc/kernel/process.c        | 50 ++++++++++++++++++++++++++--
 3 files changed, 57 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/simd.h b/arch/powerpc/include/asm/simd.h
index 2c02ad531..7b582b07e 100644
--- a/arch/powerpc/include/asm/simd.h
+++ b/arch/powerpc/include/asm/simd.h
@@ -7,7 +7,15 @@
  * It's always ok in process context (ie "not interrupt")
  * but it is sometimes ok even from an irq.
  */
+#ifdef CONFIG_ALTIVEC
+extern bool irq_simd_usable(void);
 static __must_check inline bool may_use_simd(void)
 {
 	return irq_simd_usable();
 }
+#else
+static inline bool may_use_simd(void)
+{
+	return false;
+}
+#endif
diff --git a/arch/powerpc/include/asm/switch_to.h b/arch/powerpc/include/asm/switch_to.h
index 5b03d8a82..537998997 100644
--- a/arch/powerpc/include/asm/switch_to.h
+++ b/arch/powerpc/include/asm/switch_to.h
@@ -44,10 +44,7 @@ extern void enable_kernel_altivec(void);
 extern void flush_altivec_to_thread(struct task_struct *);
 extern void giveup_altivec(struct task_struct *);
 extern void save_altivec(struct task_struct *);
-static inline void disable_kernel_altivec(void)
-{
-	msr_check_and_clear(MSR_VEC);
-}
+extern void disable_kernel_altivec(void);
 #else
 static inline void save_altivec(struct task_struct *t) { }
 static inline void __giveup_altivec(struct task_struct *t) { }
@@ -56,10 +53,7 @@ static inline void __giveup_altivec(struct task_struct *t) { }
 #ifdef CONFIG_VSX
 extern void enable_kernel_vsx(void);
 extern void flush_vsx_to_thread(struct task_struct *);
-static inline void disable_kernel_vsx(void)
-{
-	msr_check_and_clear(MSR_FP|MSR_VEC|MSR_VSX);
-}
+extern void disable_kernel_vsx(void);
 #endif
 
 #ifdef CONFIG_SPE
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index e436d708a..41a0ab500 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -267,6 +267,29 @@ static int restore_fp(struct task_struct *tsk) { return 0; }
 #ifdef CONFIG_ALTIVEC
 #define loadvec(thr) ((thr).load_vec)
 
+/*
+ * Track whether the kernel is using the SIMD state
+ * currently.
+ *
+ * This flag is used:
+ *
+ *   - by IRQ context code to potentially use the FPU
+ *     if it's unused.
+ *
+ *   - to debug kernel_altivec/vsx_begin()/end() correctness
+ */
+static DEFINE_PER_CPU(bool, in_kernel_simd);
+
+static bool kernel_simd_disabled(void)
+{
+	return this_cpu_read(in_kernel_simd);
+}
+
+static bool interrupted_kernel_simd_idle(void)
+{
+	return !kernel_simd_disabled();
+}
+
 static void __giveup_altivec(struct task_struct *tsk)
 {
 	unsigned long msr;
@@ -295,7 +318,9 @@ void enable_kernel_altivec(void)
 {
 	unsigned long cpumsr;
 
-	WARN_ON(preemptible());
+	WARN_ON_ONCE(preemptible());
+	WARN_ON_ONCE(this_cpu_read(in_kernel_simd));
+	this_cpu_write(in_kernel_simd, true);
 
 	cpumsr = msr_check_and_set(MSR_VEC);
 
@@ -316,6 +341,14 @@ void enable_kernel_altivec(void)
 }
 EXPORT_SYMBOL(enable_kernel_altivec);
 
+extern void disable_kernel_altivec(void)
+{
+	WARN_ON_ONCE(!this_cpu_read(in_kernel_simd));
+	this_cpu_write(in_kernel_simd, false);
+	msr_check_and_clear(MSR_VEC);
+}
+EXPORT_SYMBOL(disable_kernel_altivec);
+
 /*
  * Make sure the VMX/Altivec register state in the
  * the thread_struct is up to date for task tsk.
@@ -371,7 +404,8 @@ static bool interrupted_user_mode(void)
 bool irq_simd_usable(void)
 {
 	return !in_interrupt() ||
-		interrupted_user_mode();
+		interrupted_user_mode() ||
+		interrupted_kernel_simd_idle();
 }
 EXPORT_SYMBOL(irq_simd_usable);
 
@@ -411,7 +445,9 @@ void enable_kernel_vsx(void)
 {
 	unsigned long cpumsr;
 
-	WARN_ON(preemptible());
+	WARN_ON_ONCE(preemptible());
+	WARN_ON_ONCE(this_cpu_read(in_kernel_simd));
+	this_cpu_write(in_kernel_simd, true);
 
 	cpumsr = msr_check_and_set(MSR_FP|MSR_VEC|MSR_VSX);
 
@@ -433,6 +469,14 @@ void enable_kernel_vsx(void)
 }
 EXPORT_SYMBOL(enable_kernel_vsx);
 
+void disable_kernel_vsx(void)
+{
+	WARN_ON_ONCE(!this_cpu_read(in_kernel_simd));
+	this_cpu_write(in_kernel_simd, false);
+	msr_check_and_clear(MSR_FP|MSR_VEC|MSR_VSX);
+}
+EXPORT_SYMBOL(disable_kernel_vsx);
+
 void flush_vsx_to_thread(struct task_struct *tsk)
 {
 	if (tsk->thread.regs) {
-- 
2.21.0.1020.gf2820cf01a


^ permalink raw reply related

* [PATCH 1/2] [PowerPC] Add simd.h implementation
From: Shawn Landden @ 2019-05-14  1:44 UTC (permalink / raw)
  Cc: Shawn Landden, Paul Mackerras, linuxppc-dev
In-Reply-To: <20190513005104.20140-1-shawn@git.icu>

Based off the x86 one.

WireGuard really wants to be able to do SIMD in interrupts,
so it can accelerate its in-bound path.

Signed-off-by: Shawn Landden <shawn@git.icu>
---
 arch/powerpc/include/asm/simd.h | 13 +++++++++++++
 arch/powerpc/kernel/process.c   | 30 ++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)
 create mode 100644 arch/powerpc/include/asm/simd.h

diff --git a/arch/powerpc/include/asm/simd.h b/arch/powerpc/include/asm/simd.h
new file mode 100644
index 000000000..2c02ad531
--- /dev/null
+++ b/arch/powerpc/include/asm/simd.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+/*
+ * may_use_simd - whether it is allowable at this time to issue SIMD
+ *                instructions or access the SIMD register file
+ *
+ * It's always ok in process context (ie "not interrupt")
+ * but it is sometimes ok even from an irq.
+ */
+static __must_check inline bool may_use_simd(void)
+{
+	return irq_simd_usable();
+}
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index dd9e0d538..e436d708a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -345,6 +345,36 @@ static int restore_altivec(struct task_struct *tsk)
 	}
 	return 0;
 }
+
+/*
+ * Were we in user mode when we were
+ * interrupted?
+ *
+ * Doing kernel_altivec/vsx_begin/end() is ok if we are running
+ * in an interrupt context from user mode - we'll just
+ * save the FPU state as required.
+ */
+static bool interrupted_user_mode(void)
+{
+	struct pt_regs *regs = get_irq_regs();
+
+	return regs && user_mode(regs);
+}
+
+/*
+ * Can we use SIMD in kernel mode with the
+ * whole "kernel_altivec/vsx_begin/end()" sequence?
+ *
+ * It's always ok in process context (ie "not interrupt")
+ * but it is sometimes ok even from an irq.
+ */
+bool irq_simd_usable(void)
+{
+	return !in_interrupt() ||
+		interrupted_user_mode();
+}
+EXPORT_SYMBOL(irq_simd_usable);
+
 #else
 #define loadvec(thr) 0
 static inline int restore_altivec(struct task_struct *tsk) { return 0; }
-- 
2.21.0.1020.gf2820cf01a


^ permalink raw reply related

* Re: Kernel OOPS followed by a panic on next20190507 with 4K page size
From: Aneesh Kumar K.V @ 2019-05-14  1:30 UTC (permalink / raw)
  To: Sachin Sant, linuxppc-dev; +Cc: linux-next
In-Reply-To: <A4247410-7C78-4E52-AB56-1C33A6C27FF3@linux.vnet.ibm.com>

On 5/8/19 4:30 PM, Sachin Sant wrote:
> While running LTP tests (specifically futex_wake04) against next-20199597
> build with 4K page size on a POWER8 LPAR following crash is observed.
> 
> [ 4233.214876] BUG: Kernel NULL pointer dereference at 0x0000001c
> [ 4233.214898] Faulting instruction address: 0xc000000001d1e58c
> [ 4233.214905] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 4233.214911] LE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> [ 4233.214920] Dumping ftrace buffer:
> [ 4233.214928]    (ftrace buffer empty)
> [ 4233.214933] Modules linked in: overlay rpadlpar_io rpaphp iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge stp llc kvm iptable_filter pseries_rng rng_core vmx_crypto ip_tables x_tables autofs4 [last unloaded: dummy_del_mod]
> [ 4233.214973] CPU: 3 PID: 4635 Comm: futex_wake04 Tainted: G        W  O      5.1.0-next-20190507-autotest #1
> [ 4233.214980] NIP:  c000000001d1e58c LR: c000000001d1e54c CTR: 0000000000000000
> [ 4233.214987] REGS: c000000004937890 TRAP: 0300   Tainted: G        W  O       (5.1.0-next-20190507-autotest)
> [ 4233.214993] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22424822  XER: 00000000
> [ 4233.215005] CFAR: c00000000183e9e0 DAR: 000000000000001c DSISR: 40000000 IRQMASK: 0
> [ 4233.215005] GPR00: c000000001901a80 c000000004937b20 c000000003938700 0000000000000000
> [ 4233.215005] GPR04: 0000000000400cc0 000000000003efff 000000027966e000 c000000003ba8700
> [ 4233.215005] GPR08: c000000003ba8700 000000000d601125 c000000003ba8700 0000000080000000
> [ 4233.215005] GPR12: 0000000022424822 c00000001ecae280 0000000000000000 0000000000000000
> [ 4233.215005] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ 4233.215005] GPR20: 0000000000000018 c0000000039e2d30 c0000000039e2d28 c0000002762da460
> [ 4233.215005] GPR24: 000000000000001c 0000000000000000 0000000000000001 c000000001901a80
> [ 4233.215005] GPR28: 0000000000400cc0 0000000000000000 0000000000000000 0000000000400cc0
> [ 4233.215065] NIP [c000000001d1e58c] kmem_cache_alloc+0xbc/0x5a0
> [ 4233.215071] LR [c000000001d1e54c] kmem_cache_alloc+0x7c/0x5a0
> [ 4233.215075] Call Trace:
> [ 4233.215081] [c000000004937b20] [c000000001c91150] __pud_alloc+0x160/0x200 (unreliable)
> [ 4233.215090] [c000000004937b80] [c000000001901a80] huge_pte_alloc+0x580/0x950
> [ 4233.215098] [c000000004937c00] [c000000001cf7910] hugetlb_fault+0x9a0/0x1250
> [ 4233.215106] [c000000004937ce0] [c000000001c94a80] handle_mm_fault+0x490/0x4a0
> [ 4233.215114] [c000000004937d20] [c0000000018d529c] __do_page_fault+0x77c/0x1f00
> [ 4233.215121] [c000000004937e00] [c0000000018d6a48] do_page_fault+0x28/0x50
> [ 4233.215129] [c000000004937e20] [c00000000183b0d4] handle_page_fault+0x18/0x38
> [ 4233.215135] Instruction dump:
> [ 4233.215139] 39290001 f92ac1b0 419e009c 3ce20027 3ba00000 e927c1f0 39290001 f927c1f0
> [ 4233.215149] 3d420027 e92ac290 39290001 f92ac290 <8359001c> 83390018 60000000 3ce20027

I did send a patch to the list to handle page allocation failures in 
this patch. But i guess what we are finding here is get_current() 
crashing. Any chance to bisect this?

-aneesh


^ permalink raw reply

* [PATCH] powerpc/mm: Handle page table allocation failures
From: Aneesh Kumar K.V @ 2019-05-14  1:05 UTC (permalink / raw)
  To: npiggin, paulus, mpe; +Cc: Sachin Sant, Aneesh Kumar K.V, linuxppc-dev

This fix the below crash that arise due to not handling page table allocation
failures while allocating hugetlb page table.

 BUG: Kernel NULL pointer dereference at 0x0000001c
 Faulting instruction address: 0xc000000001d1e58c
 Oops: Kernel access of bad area, sig: 11 [#1]
 LE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries

 CPU: 3 PID: 4635 Comm: futex_wake04 Tainted: G        W  O      5.1.0-next-20190507-autotest #1
 NIP:  c000000001d1e58c LR: c000000001d1e54c CTR: 0000000000000000
 REGS: c000000004937890 TRAP: 0300   Tainted: G        W  O       (5.1.0-next-20190507-autotest)
 MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22424822  XER: 00000000
 CFAR: c00000000183e9e0 DAR: 000000000000001c DSISR: 40000000 IRQMASK: 0
 GPR00: c000000001901a80 c000000004937b20 c000000003938700 0000000000000000
 GPR04: 0000000000400cc0 000000000003efff 000000027966e000 c000000003ba8700
 GPR08: c000000003ba8700 000000000d601125 c000000003ba8700 0000000080000000
 GPR12: 0000000022424822 c00000001ecae280 0000000000000000 0000000000000000
 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 GPR20: 0000000000000018 c0000000039e2d30 c0000000039e2d28 c0000002762da460
 GPR24: 000000000000001c 0000000000000000 0000000000000001 c000000001901a80
 GPR28: 0000000000400cc0 0000000000000000 0000000000000000 0000000000400cc0
 NIP [c000000001d1e58c] kmem_cache_alloc+0xbc/0x5a0
 LR [c000000001d1e54c] kmem_cache_alloc+0x7c/0x5a0
 Call Trace:
  [c000000001c91150] __pud_alloc+0x160/0x200 (unreliable)
  [c000000001901a80] huge_pte_alloc+0x580/0x950
  [c000000001cf7910] hugetlb_fault+0x9a0/0x1250
  [c000000001c94a80] handle_mm_fault+0x490/0x4a0
  [c0000000018d529c] __do_page_fault+0x77c/0x1f00
  [c0000000018d6a48] do_page_fault+0x28/0x50
  [c00000000183b0d4] handle_page_fault+0x18/0x38

Fixes: e2b3d202d1db ("powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---

Note: I did add a recent commit for the Fixes tag. But in reality we never checked for page table
allocation failure there. If we want to go to that old commit, then we may need.

Fixes: a4fe3ce7699b ("powerpc/mm: Allow more flexible layouts for hugepage pagetables")

 arch/powerpc/mm/hugetlbpage.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index c5c9ff2d7afc..ae9d71da5219 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -130,6 +130,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 	} else {
 		pdshift = PUD_SHIFT;
 		pu = pud_alloc(mm, pg, addr);
+		if (!pu)
+			return NULL;
 		if (pshift == PUD_SHIFT)
 			return (pte_t *)pu;
 		else if (pshift > PMD_SHIFT) {
@@ -138,6 +140,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 		} else {
 			pdshift = PMD_SHIFT;
 			pm = pmd_alloc(mm, pu, addr);
+			if (!pm)
+				return NULL;
 			if (pshift == PMD_SHIFT)
 				/* 16MB hugepage */
 				return (pte_t *)pm;
@@ -154,12 +158,16 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 	} else {
 		pdshift = PUD_SHIFT;
 		pu = pud_alloc(mm, pg, addr);
+		if (!pu)
+			return NULL;
 		if (pshift >= PUD_SHIFT) {
 			ptl = pud_lockptr(mm, pu);
 			hpdp = (hugepd_t *)pu;
 		} else {
 			pdshift = PMD_SHIFT;
 			pm = pmd_alloc(mm, pu, addr);
+			if (!pm)
+				return NULL;
 			ptl = pmd_lockptr(mm, pm);
 			hpdp = (hugepd_t *)pm;
 		}
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH 0/4] Enabling secure boot on PowerNV systems
From: Matthew Garrett @ 2019-05-13 22:06 UTC (permalink / raw)
  To: Claudio Carvalho
  Cc: linux-efi, Ard Biesheuvel, Nayna Jain, Linux Kernel Mailing List,
	Matthew Garret, linuxppc-dev, Peter Jones, Paul Mackerras,
	Jeremy Kerr, linux-integrity
In-Reply-To: <e845d9f5-00bb-e68d-9d24-da802dd05549@linux.ibm.com>

On Fri, May 10, 2019 at 2:31 PM Claudio Carvalho <cclaudio@linux.ibm.com> wrote:
> On 4/10/19 2:36 PM, Matthew Garrett wrote:
> > I don't see the benefit in attempting to maintain compatibility with
> > existing tooling unless you're going to be *completely* compatible
> > with existing tooling. That means supporting dbx and dbt.

(snip)

> In OS secure boot domain (work in progress):
> - The skiroot container is verified as part of firmware secure boot.
> - Skiroot uses UEFI-like secure variables (PK, KEK and db) to verify OS
> kernels. Only X.509 certificates will be supported for these secure variables.

You don't support hashes? If so, this isn't compatible with UEFI
Secure Boot and we shouldn't try to make it look like UEFI Secure
Boot.

> How about dbx and dbt?
>
> The db keys will be used to verify only OS kernels via kexecs initiated by
> petitboot. So we only need the dbx to revoke kernel images, either via
> certs or hashes. Currently, the kernel loads certs and hashes from the dbx
> to the system blacklist keyring. The revoked certs are checked during pkcs7
> signature verification and loading of keys. However, there doesn't appear
> to be any verification against blacklisted hashes. Should kernel images be
> revoked only by keys and not hashes? We tried to find published revoked
> kernel lists but couldn't find any. How is kernel image revocation handled
> in practice?

Hash-based revocation is in active use in the UEFI world - to the best
of my knowledge, all existing dbx entries are hashes with the
exception of the invalidation of the Microsoft Windows 2010 CA.

> Also, we didn't see the shim or kernel loading anything from dbt.

dbt is currently only used for validation at the firmware level - the
way grub and kernel signatures are currently managed means it doesn't
make a huge amount of sense to use it in shim, but it would probably
be reasonable to extend shim's validation to include dbt.

> > So I do the following:
> >
> > 1) Boot
> > 2) Extend the contents of db
> > 3) Extend the contents of db again
> > 4) Read back the contents of db through efivarfs
> > 5) Reboot
> > 6) Read back the contents of db through efivarfs
> >
> > Is what I see in (4) and (6) the same? Does it contain the values form
> > both extensions?
>
> In (2) and (3) the extensions are added to the update queue, which is
> processed only in (5) by firmware. So, in (4) you should see the db content
> without the extensions.

Ok, this is not what we expect from UEFI systems. I'm strongly against
providing what looks like the same ABI on multiple platforms but
carrying subtle differences between those platforms - it's guaranteed
to break tooling in unexpected ways.

> In (5), firmware (skiboot) will process the update queue. The extensions
> will be applied only if *all* of them are valid and pass signature
> verification. Only in this case should you be able to see the extensions in
> (6). If any of the extensions fail, firmware will discard all of them,
> clear the queue, and do the proper logging.

I believe that this is also a violation of expectations.

> > Why would the intermediate level organisations not just have entries
> > in db?
>
> Because that seems to add more complexity than having three levels (PK, KEK
> and db).
>
> Typically, the intermediate level organisations (or KEK) are used to
> authorize new additions to db. However, if we also have them in the db, who
> would authorize the new additions to db. If that would be the intermediate
> level organisation entries now in the db, it seems we would need to
> implement a mechanism to determine which entries are for authorizing new
> additions and which are for kernel signature verification. If that would be
> the PK, we'd be burdening the PK owner to sign every new db addition if the
> platform is owned by a company that has intermediate level organizations.

Ok, in this scenario I don't understand why you wouldn't just want the
intermediates in PK. Or, put another way - if you have a business
justification for three layers of hierarchy, what do you do when
someone has a business justification for four? The three layer
hierarchy represents the weirdness of the PC industry where you have
Microsoft needing to be in KEK (because they need to be able to issue
updates to machines from multiple vendors) but not wanting to be in PK
(because vendors don't want Microsoft to have ultimate control over
their systems). If it weren't for this conflict, we'd just have a two
layer hierarchy, and if some other aspect of the market had evolved
over time we'd have a four layer hierarchy.

>
> >  The main reason we don't do it this way in UEFI is because we
> > need to support dbx, and if you're not supporting dbx I'm not sure I
> > see the benefit.
>
> I'm not sure I understand your question.  We would be using dbx to prevent
> kernels from being loaded. How is that related to having three levels in
> the key hierarchy (PK, KEK and db)?

dbx entries come from Microsoft, so we need the KEK layer so Microsoft
can update dbx. If Microsoft didn't need to update dbx then we'd leave
Microsoft out of KEK, and then KEK and PK would be the same and we'd
be able to get rid of KEK.

^ permalink raw reply

* [Bug 203597] kernel 4.9.175 fails to boot on a PowerMac G4 3,6 at early stage
From: bugzilla-daemon @ 2019-05-13 20:18 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <bug-203597-206035@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=203597

--- Comment #1 from Erhard F. (erhard_f@mailbox.org) ---
Created attachment 282745
  --> https://bugzilla.kernel.org/attachment.cgi?id=282745&action=edit
bisect.log

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox