* Re: 83a092cf95f28: powerpc: Link warning for orphan sections
From: Mathieu Malaterre @ 2018-07-03 7:37 UTC (permalink / raw)
To: Nicholas Piggin; +Cc: linuxppc-dev
In-Reply-To: <20180703172332.7210ec7a@roar.ozlabs.ibm.com>
On Tue, Jul 3, 2018 at 9:23 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> On Tue, 3 Jul 2018 09:03:46 +0200
> Mathieu Malaterre <malat@debian.org> wrote:
>
> > Hi Nick,
> >
> > I am building my kernel (ppc32) with both
> > CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y and CONFIG_UBSAN=y. This leads
> > to ~316428 warnings such as:
> >
> > + powerpc-linux-gnu-ld -EB -m elf32ppc -Bstatic --orphan-handling=warn
> > --build-id --gc-sections -X -o .tmp_vmlinux1 -T
> > ./arch/powerpc/kernel/vmlinux.lds --who
> > le-archive built-in.a --no-whole-archive --start-group lib/lib.a --end-group
> > powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_data393'
> > from `init/main.o' being placed in section `.data..Lubsan_data393'.
> > powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_data394'
> > from `init/main.o' being placed in section `.data..Lubsan_data394'.
> > powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_data395'
> > from `init/main.o' being placed in section `.data..Lubsan_data395'.
> > powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_data396'
> > from `init/main.o' being placed in section `.data..Lubsan_data396'.
> > ...
> >
> > What would you recommend to reduce the number of warnings produced ? I
> > tried `--warn-once` but that only affect undefined symbols.
>
> I'm not sure if the linker can be quietened. You could try putting
> *(.data..Lubsan_data*) into the .data section in
> powerpc/kernel/vmlinux.lds.S
Nice ! Will submit a patch. Thanks
> Thanks,
> Nick
^ permalink raw reply
* Re: [PATCH v2 2/3] powerpc/fsl: Add barrier_nospec implementation for NXP PowerPC Book3E
From: Michael Ellerman @ 2018-07-03 7:26 UTC (permalink / raw)
To: Diana Craciun, linuxppc-dev
Cc: oss, leoyang.li, bharat.bhushan, Diana Craciun
In-Reply-To: <1528721608-15443-3-git-send-email-diana.craciun@nxp.com>
Hi Diana,
A few comments below ...
Diana Craciun <diana.craciun@nxp.com> writes:
> Implement the barrier_nospec as a isync;sync instruction sequence.
> The implementation uses the infrastructure built for BOOK3S 64.
Do you have any details on why that sequence functions as an effective
barrier on your chips?
In a lot of places we have eg:
+#if defined(CONFIG_PPC_BOOK3S_64) || defined(CONFIG_PPC_FSL_BOOK3E)
Can you please add a Kconfig symbol to capture that. eg.
config PPC_NOSPEC
bool
default y
depends on PPC_BOOK3S_64 || PPC_FSL_BOOK3E
And then use that everywhere.
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index f67b3f6..405d572 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -86,6 +86,16 @@ do { \
> // This also acts as a compiler barrier due to the memory clobber.
> #define barrier_nospec() asm (stringify_in_c(barrier_nospec_asm) ::: "memory")
>
> +#elif defined(CONFIG_PPC_FSL_BOOK3E)
> +/*
> + * Prevent the execution of subsequent instructions speculatively using a
> + * isync;sync instruction sequence.
> + */
> +#define barrier_nospec_asm NOSPEC_BARRIER_FIXUP_SECTION; nop; nop
> +
> +// This also acts as a compiler barrier due to the memory clobber.
> +#define barrier_nospec() asm (stringify_in_c(barrier_nospec_asm) ::: "memory")
> +
> #else /* !CONFIG_PPC_BOOK3S_64 */
> #define barrier_nospec_asm
> #define barrier_nospec()
If we have CONFIG_PPC_NOSPEC this can be done something like:
#ifdef CONFIG_PPC_BOOK3S_64
#define NOSPEC_BARRIER_SLOT nop
#elif defined(CONFIG_PPC_FSL_BOOK3E)
#define NOSPEC_BARRIER_SLOT nop; nop
#endif /* CONFIG_PPC_BOOK3S_64 */
#ifdef CONFIG_PPC_NOSPEC
/*
* Prevent execution of subsequent instructions until preceding branches have
* been fully resolved and are no longer executing speculatively.
*/
#define barrier_nospec_asm NOSPEC_BARRIER_FIXUP_SECTION; NOSPEC_BARRIER_SLOT
// This also acts as a compiler barrier due to the memory clobber.
#define barrier_nospec() asm (stringify_in_c(barrier_nospec_asm) ::: "memory")
#else
#define barrier_nospec_asm
#define barrier_nospec()
#endif /* CONFIG_PPC_NOSPEC */
> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> index 2b4c40b2..d9dee43 100644
> --- a/arch/powerpc/kernel/Makefile
> +++ b/arch/powerpc/kernel/Makefile
> @@ -76,7 +76,7 @@ endif
> obj64-$(CONFIG_HIBERNATION) += swsusp_asm64.o
> obj-$(CONFIG_MODULES) += module.o module_$(BITS).o
> obj-$(CONFIG_44x) += cpu_setup_44x.o
> -obj-$(CONFIG_PPC_FSL_BOOK3E) += cpu_setup_fsl_booke.o
> +obj-$(CONFIG_PPC_FSL_BOOK3E) += cpu_setup_fsl_booke.o security.o
> obj-$(CONFIG_PPC_DOORBELL) += dbell.o
> obj-$(CONFIG_JUMP_LABEL) += jump_label.o
Can we instead do:
obj-$(CONFIG_PPC_NOSPEC) += security.o
> diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
> index c55e102..797c975 100644
> --- a/arch/powerpc/kernel/security.c
> +++ b/arch/powerpc/kernel/security.c
> @@ -13,7 +13,9 @@
> #include <asm/setup.h>
>
>
> +#ifdef CONFIG_PPC_BOOK3S_64
> unsigned long powerpc_security_features __read_mostly = SEC_FTR_DEFAULT;
> +#endif /* CONFIG_PPC_BOOK3S_64 */
Why are you making that book3s specific?
By doing that you then can't use the existing versions of
setup_barrier_nospec() and cpu_show_spectre_v1/v2().
> @@ -24,6 +26,7 @@ static void enable_barrier_nospec(bool enable)
> do_barrier_nospec_fixups(enable);
> }
>
> +#ifdef CONFIG_PPC_BOOK3S_64
> void setup_barrier_nospec(void)
> {
> bool enable;
> @@ -46,6 +49,15 @@ void setup_barrier_nospec(void)
> if (!no_nospec)
> enable_barrier_nospec(enable);
> }
> +#endif /* CONFIG_PPC_BOOK3S_64 */
> +
> +#ifdef CONFIG_PPC_FSL_BOOK3E
> +void setup_barrier_nospec(void)
> +{
> + if (!no_nospec)
> + enable_barrier_nospec(true);
> +}
> +#endif /* CONFIG_PPC_FSL_BOOK3E */
eg. that is identical to the existing version except for the feature check:
enable = security_ftr_enabled(SEC_FTR_FAVOUR_SECURITY) &&
security_ftr_enabled(SEC_FTR_BNDS_CHK_SPEC_BAR);
Both those bits are enabled by default, so you shouldn't need to elide
that check.
So basically you should be able to use the existing
setup_barrier_nospec() if you just remove the ifdef around
powerpc_security_features. Or am I missing something?
> diff --git a/arch/powerpc/kernel/setup_32.c b/arch/powerpc/kernel/setup_32.c
> index 7445748..80c1e6e 100644
> --- a/arch/powerpc/kernel/setup_32.c
> +++ b/arch/powerpc/kernel/setup_32.c
> @@ -116,6 +116,11 @@ notrace void __init machine_init(u64 dt_ptr)
> /* Do some early initialization based on the flat device tree */
> early_init_devtree(__va(dt_ptr));
>
> + /* Apply the speculation barrier fixup */
> +#ifdef CONFIG_PPC_FSL_BOOK3E
> + setup_barrier_nospec();
> +#endif
This ifdef should be handled in a header with an empty version for the
#else case.
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index 7a7ce8a..b2a644a 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -327,6 +327,12 @@ void __init early_setup(unsigned long dt_ptr)
>
> /* Apply all the dynamic patching */
> apply_feature_fixups();
> +
> + /* Apply the speculation barrier fixup */
> +#ifdef CONFIG_PPC_FSL_BOOK3E
> + setup_barrier_nospec();
> +#endif /* CONFIG_PPC_FSL_BOOK3E */
Can you call it from ppc_md->setup_arch() like the other platforms?
Failing that we could put it in setup_arch() after the call to
ppc_md->setup_arch(), so we can share it with powernv/pseries.
> diff --git a/arch/powerpc/lib/feature-fixups.c b/arch/powerpc/lib/feature-fixups.c
> index 2b9173d..bea2b87 100644
> --- a/arch/powerpc/lib/feature-fixups.c
> +++ b/arch/powerpc/lib/feature-fixups.c
> @@ -188,7 +188,40 @@ void do_barrier_nospec_fixups_range(bool enable, void *fixup_start, void *fixup_
>
> printk(KERN_DEBUG "barrier-nospec: patched %d locations\n", i);
> }
> +#endif /* CONFIG_PPC_BOOK3S_64 */
> +
> +#ifdef CONFIG_PPC_FSL_BOOK3E
> +void do_barrier_nospec_fixups_range(bool enable, void *fixup_start, void *fixup_end)
> +{
> + unsigned int instr[2], *dest;
> + long *start, *end;
> + int i;
> +
> + start = fixup_start;
> + end = fixup_end;
> +
> + instr[0] = PPC_INST_NOP;
> + instr[1] = PPC_INST_NOP;
> +
> + if (enable) {
> + pr_info("barrier_nospec; using isync; sync as a speculation barrier\n");
> + instr[0] = PPC_INST_ISYNC;
> + instr[1] = PPC_INST_SYNC;
> + }
> +
> + for (i = 0; start < end; start++, i++) {
> + dest = (void *)start + *start;
> + pr_devel("patching dest %lx\n", (unsigned long)dest);
>
> + patch_instruction(dest, instr[0]);
> + patch_instruction(dest + 1, instr[1]);
> + }
> +
> + pr_debug("barrier-nospec: patched %d locations\n", i);
> +}
> +#endif /* CONFIG_PPC_FSL_BOOK3E */
It's a bit unfortunate that we end up with two versions of that, which
are 80% the same. But merging them without ugly ifdefs would require a
bit more refactoring. So I guess it's OK for now.
cheers
^ permalink raw reply
* Re: [PATCH v2 3/3] powerpc/fsl: Implement cpu_show_spectre_v1/v2 for NXP PowerPC Book3E
From: Michael Ellerman @ 2018-07-03 7:26 UTC (permalink / raw)
To: Michal Suchánek, Bharat Bhushan
Cc: Diana Madalina Craciun, oss@buserror.net,
linuxppc-dev@lists.ozlabs.org, Leo Li
In-Reply-To: <20180612120737.31efff1c@naga.suse.cz>
Michal Such=C3=A1nek <msuchanek@suse.de> writes:
> On Tue, 12 Jun 2018 02:59:11 +0000
> Bharat Bhushan <bharat.bhushan@nxp.com> wrote:
>
>> Hi Diana,
>>=20
>> > -----Original Message-----
>> > From: Diana Craciun [mailto:diana.craciun@nxp.com]
>> > Sent: Monday, June 11, 2018 6:23 PM
>> > To: linuxppc-dev@lists.ozlabs.org
>> > Cc: mpe@ellerman.id.au; oss@buserror.net; Leo Li
>> > <leoyang.li@nxp.com>; Bharat Bhushan <bharat.bhushan@nxp.com>;
>> > Diana Madalina Craciun <diana.craciun@nxp.com>
>> > Subject: [PATCH v2 3/3] powerpc/fsl: Implement
>> > cpu_show_spectre_v1/v2 for NXP PowerPC Book3E=20=20
>>=20
>> Please add some description
>
> To me the subject is self-explanatory. It implements a kernel interface
> that was already described elsewhere.
>
> What are you missing here?
It should at least explain why it's reimplementing a function that
already exists for powerpc. ie. Why can't the existing version be used?
cheers
^ permalink raw reply
* Re: [PATCH v2 1/3] powerpc/fsl: Disable the speculation barrier from the command line
From: Michael Ellerman @ 2018-07-03 7:25 UTC (permalink / raw)
To: Diana Craciun, linuxppc-dev
Cc: oss, leoyang.li, bharat.bhushan, Diana Craciun
In-Reply-To: <1528721608-15443-2-git-send-email-diana.craciun@nxp.com>
Diana Craciun <diana.craciun@nxp.com> writes:
> The speculation barrier can be disabled from the command line
> with the parameter: "nospectre_v1".
Can you please send another patch adding it to Documentation/admin-guide/kernel-parameters.txt
You should mark it as "PPC" only for now.
cheers
^ permalink raw reply
* Re: 83a092cf95f28: powerpc: Link warning for orphan sections
From: Nicholas Piggin @ 2018-07-03 7:23 UTC (permalink / raw)
To: Mathieu Malaterre; +Cc: linuxppc-dev
In-Reply-To: <CA+7wUsz6AWY63hCwax2YSUPawRQbLfyje23qWKNGVUGM=OxM4Q@mail.gmail.com>
On Tue, 3 Jul 2018 09:03:46 +0200
Mathieu Malaterre <malat@debian.org> wrote:
> Hi Nick,
>
> I am building my kernel (ppc32) with both
> CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y and CONFIG_UBSAN=y. This leads
> to ~316428 warnings such as:
>
> + powerpc-linux-gnu-ld -EB -m elf32ppc -Bstatic --orphan-handling=warn
> --build-id --gc-sections -X -o .tmp_vmlinux1 -T
> ./arch/powerpc/kernel/vmlinux.lds --who
> le-archive built-in.a --no-whole-archive --start-group lib/lib.a --end-group
> powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_data393'
> from `init/main.o' being placed in section `.data..Lubsan_data393'.
> powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_data394'
> from `init/main.o' being placed in section `.data..Lubsan_data394'.
> powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_data395'
> from `init/main.o' being placed in section `.data..Lubsan_data395'.
> powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_data396'
> from `init/main.o' being placed in section `.data..Lubsan_data396'.
> ...
>
> What would you recommend to reduce the number of warnings produced ? I
> tried `--warn-once` but that only affect undefined symbols.
I'm not sure if the linker can be quietened. You could try putting
*(.data..Lubsan_data*) into the .data section in
powerpc/kernel/vmlinux.lds.S
Thanks,
Nick
^ permalink raw reply
* Re: [PATCH v5 5/7] powerpc/pseries: flush SLB contents on SLB MCE errors.
From: Mahesh Jagannath Salgaonkar @ 2018-07-03 7:20 UTC (permalink / raw)
To: Nicholas Piggin
Cc: linuxppc-dev, Aneesh Kumar K.V, Laurent Dufour, Michal Suchanek
In-Reply-To: <20180703080814.5a57f52b@roar.ozlabs.ibm.com>
On 07/03/2018 03:38 AM, Nicholas Piggin wrote:
> On Mon, 02 Jul 2018 11:17:06 +0530
> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:
>
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> On pseries, as of today system crashes if we get a machine check
>> exceptions due to SLB errors. These are soft errors and can be fixed by
>> flushing the SLBs so the kernel can continue to function instead of
>> system crash. We do this in real mode before turning on MMU. Otherwise
>> we would run into nested machine checks. This patch now fetches the
>> rtas error log in real mode and flushes the SLBs on SLB errors.
>>
>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> ---
>> arch/powerpc/include/asm/book3s/64/mmu-hash.h | 1
>> arch/powerpc/include/asm/machdep.h | 1
>> arch/powerpc/kernel/exceptions-64s.S | 42 +++++++++++++++++++++
>> arch/powerpc/kernel/mce.c | 16 +++++++-
>> arch/powerpc/mm/slb.c | 6 +++
>> arch/powerpc/platforms/powernv/opal.c | 1
>> arch/powerpc/platforms/pseries/pseries.h | 1
>> arch/powerpc/platforms/pseries/ras.c | 51 +++++++++++++++++++++++++
>> arch/powerpc/platforms/pseries/setup.c | 1
>> 9 files changed, 116 insertions(+), 4 deletions(-)
>>
>
>
>> +TRAMP_REAL_BEGIN(machine_check_pSeries_early)
>> +BEGIN_FTR_SECTION
>> + EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
>> + mr r10,r1 /* Save r1 */
>> + ld r1,PACAMCEMERGSP(r13) /* Use MC emergency stack */
>> + subi r1,r1,INT_FRAME_SIZE /* alloc stack frame */
>> + mfspr r11,SPRN_SRR0 /* Save SRR0 */
>> + mfspr r12,SPRN_SRR1 /* Save SRR1 */
>> + EXCEPTION_PROLOG_COMMON_1()
>> + EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
>> + EXCEPTION_PROLOG_COMMON_3(0x200)
>> + addi r3,r1,STACK_FRAME_OVERHEAD
>> + BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI */
>
> Is there any reason you can't use the existing
> machine_check_powernv_early code to do all this?
I did think about that :-). But the machine_check_powernv_early code
does bit of extra stuff which isn't required in pseries like touching ME
bit in MSR and lots of checks that are done in
machine_check_handle_early() before going to virtual mode. But on second
look I see that we can bypass all that with HVMODE FTR section. Will
rename machine_check_powernv_early to machine_check_common_early and
reuse it.
>
>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>> index efdd16a79075..221271c96a57 100644
>> --- a/arch/powerpc/kernel/mce.c
>> +++ b/arch/powerpc/kernel/mce.c
>> @@ -488,9 +488,21 @@ long machine_check_early(struct pt_regs *regs)
>> {
>> long handled = 0;
>>
>> - __this_cpu_inc(irq_stat.mce_exceptions);
>> + /*
>> + * For pSeries we count mce when we go into virtual mode machine
>> + * check handler. Hence skip it. Also, We can't access per cpu
>> + * variables in real mode for LPAR.
>> + */
>> + if (early_cpu_has_feature(CPU_FTR_HVMODE))
>> + __this_cpu_inc(irq_stat.mce_exceptions);
>>
>> - if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
>> + /*
>> + * See if platform is capable of handling machine check.
>> + * Otherwise fallthrough and allow CPU to handle this machine check.
>> + */
>> + if (ppc_md.machine_check_early)
>> + handled = ppc_md.machine_check_early(regs);
>> + else if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
>> handled = cur_cpu_spec->machine_check_early(regs);
>
> Would be good to add a powernv ppc_md handler which does the
> cur_cpu_spec->machine_check_early() call now that other platforms are
> calling this code. Because those aren't valid as a fallback call, but
> specific to powernv.
>
>> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
>> index 48fbb41af5d1..ed548d40a9e1 100644
>> --- a/arch/powerpc/platforms/powernv/opal.c
>> +++ b/arch/powerpc/platforms/powernv/opal.c
>> @@ -417,7 +417,6 @@ static int opal_recover_mce(struct pt_regs *regs,
>>
>> if (!(regs->msr & MSR_RI)) {
>> /* If MSR_RI isn't set, we cannot recover */
>> - pr_err("Machine check interrupt unrecoverable: MSR(RI=0)\n");
>
> What's the reason for this change?
Err... This is by mistake.. My bad. Thanks for catching this. Will
remove this hunk in next revision. We need a similar print for pSeries
in ras.c.
>
>> recovered = 0;
>> } else if (evt->disposition == MCE_DISPOSITION_RECOVERED) {
>> /* Platform corrected itself */
>> diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
>> index 60db2ee511fb..3611db5dd583 100644
>> --- a/arch/powerpc/platforms/pseries/pseries.h
>> +++ b/arch/powerpc/platforms/pseries/pseries.h
>> @@ -24,6 +24,7 @@ struct pt_regs;
>>
>> extern int pSeries_system_reset_exception(struct pt_regs *regs);
>> extern int pSeries_machine_check_exception(struct pt_regs *regs);
>> +extern int pSeries_machine_check_realmode(struct pt_regs *regs);
>>
>> #ifdef CONFIG_SMP
>> extern void smp_init_pseries(void);
>> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
>> index 851ce326874a..9aa7885e0148 100644
>> --- a/arch/powerpc/platforms/pseries/ras.c
>> +++ b/arch/powerpc/platforms/pseries/ras.c
>> @@ -427,6 +427,35 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
>> return 0; /* need to perform reset */
>> }
>>
>> +static int mce_handle_error(struct rtas_error_log *errp)
>> +{
>> + struct pseries_errorlog *pseries_log;
>> + struct pseries_mc_errorlog *mce_log;
>> + int disposition = rtas_error_disposition(errp);
>> + uint8_t error_type;
>> +
>> + if (!rtas_error_extended(errp))
>> + goto out;
>> +
>> + pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
>> + if (pseries_log == NULL)
>> + goto out;
>> +
>> + mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
>> + error_type = rtas_mc_error_type(mce_log);
>> +
>> + if ((disposition == RTAS_DISP_NOT_RECOVERED) &&
>> + (error_type == PSERIES_MC_ERROR_TYPE_SLB)) {
>> + /* Store the old slb content someplace. */
>> + slb_flush_and_rebolt_realmode();
>> + disposition = RTAS_DISP_FULLY_RECOVERED;
>> + rtas_set_disposition_recovered(errp);
>> + }
>> +
>> +out:
>> + return disposition;
>> +}
>> +
>> /*
>> * Process MCE rtas errlog event.
>> */
>> @@ -503,11 +532,31 @@ int pSeries_machine_check_exception(struct pt_regs *regs)
>> struct rtas_error_log *errp;
>>
>> if (fwnmi_active) {
>> - errp = fwnmi_get_errinfo(regs);
>> fwnmi_release_errinfo();
>
> Should the fwnmi_release_errinfo be done in the realmode path as well
> now, or is there some reason to leave it here?
In real mode calling fwnmi_release_errinfo() causes kernel panic.
Couldn't debug further to find out why. So decided to keep it in virtual
mode. I have mentioned that in comment below in
pSeries_machine_check_realmode().
>
>> + errp = fwnmi_get_errlog();
>> if (errp && recover_mce(regs, errp))
>> return 1;
>> }
>>
>> return 0;
>> }
>> +
>> +int pSeries_machine_check_realmode(struct pt_regs *regs)
>> +{
>> + struct rtas_error_log *errp;
>> + int disposition;
>> +
>> + if (fwnmi_active) {
>> + errp = fwnmi_get_errinfo(regs);
>> + /*
>> + * Call to fwnmi_release_errinfo() in real mode causes kernel
>> + * to panic. Hence we will call it as soon as we go into
>> + * virtual mode.
>> + */
>> + disposition = mce_handle_error(errp);
>> + if (disposition == RTAS_DISP_FULLY_RECOVERED)
>> + return 1;
>> + }
>> +
>> + return 0;
>> +}
>> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>> index 60a067a6e743..249b02bc5c41 100644
>> --- a/arch/powerpc/platforms/pseries/setup.c
>> +++ b/arch/powerpc/platforms/pseries/setup.c
>> @@ -999,6 +999,7 @@ define_machine(pseries) {
>> .calibrate_decr = generic_calibrate_decr,
>> .progress = rtas_progress,
>> .system_reset_exception = pSeries_system_reset_exception,
>> + .machine_check_early = pSeries_machine_check_realmode,
>> .machine_check_exception = pSeries_machine_check_exception,
>> #ifdef CONFIG_KEXEC_CORE
>> .machine_kexec = pSeries_machine_kexec,
>>
>
Thanks for your review.
^ permalink raw reply
* 83a092cf95f28: powerpc: Link warning for orphan sections
From: Mathieu Malaterre @ 2018-07-03 7:03 UTC (permalink / raw)
To: Nicholas Piggin; +Cc: linuxppc-dev
Hi Nick,
I am building my kernel (ppc32) with both
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y and CONFIG_UBSAN=y. This leads
to ~316428 warnings such as:
+ powerpc-linux-gnu-ld -EB -m elf32ppc -Bstatic --orphan-handling=warn
--build-id --gc-sections -X -o .tmp_vmlinux1 -T
./arch/powerpc/kernel/vmlinux.lds --who
le-archive built-in.a --no-whole-archive --start-group lib/lib.a --end-group
powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_data393'
from `init/main.o' being placed in section `.data..Lubsan_data393'.
powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_data394'
from `init/main.o' being placed in section `.data..Lubsan_data394'.
powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_data395'
from `init/main.o' being placed in section `.data..Lubsan_data395'.
powerpc-linux-gnu-ld: warning: orphan section `.data..Lubsan_data396'
from `init/main.o' being placed in section `.data..Lubsan_data396'.
...
What would you recommend to reduce the number of warnings produced ? I
tried `--warn-once` but that only affect undefined symbols.
Thanks
^ permalink raw reply
* [PATCHv3 4/4] Revert "driver core: correct device's shutdown order"
From: Pingfan Liu @ 2018-07-03 6:50 UTC (permalink / raw)
To: linux-kernel
Cc: Pingfan Liu, Greg Kroah-Hartman, Rafael J . Wysocki,
Grygorii Strashko, Christoph Hellwig, Bjorn Helgaas, Dave Young,
linux-pci, linuxppc-dev
In-Reply-To: <1530600642-25090-1-git-send-email-kernelfans@gmail.com>
This reverts commit 52cdbdd49853dfa856082edb0f4c4c0249d9df07.
Since the device_shutdown() does not rely on the order in devices_kset any
more, reverting that commit safely.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
---
drivers/base/base.h | 1 -
drivers/base/core.c | 49 -------------------------------------------------
drivers/base/dd.c | 8 --------
3 files changed, 58 deletions(-)
diff --git a/drivers/base/base.h b/drivers/base/base.h
index a75c302..5bc9064 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -135,7 +135,6 @@ extern void device_unblock_probing(void);
/* /sys/devices directory */
extern struct kset *devices_kset;
-extern void devices_kset_move_last(struct device *dev);
#if defined(CONFIG_MODULES) && defined(CONFIG_SYSFS)
extern void module_add_driver(struct module *mod, struct device_driver *drv);
diff --git a/drivers/base/core.c b/drivers/base/core.c
index db3deb8..570aeee 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1259,52 +1259,6 @@ static DEVICE_ATTR_RO(dev);
struct kset *devices_kset;
/**
- * devices_kset_move_before - Move device in the devices_kset's list.
- * @deva: Device to move.
- * @devb: Device @deva should come before.
- */
-static void devices_kset_move_before(struct device *deva, struct device *devb)
-{
- if (!devices_kset)
- return;
- pr_debug("devices_kset: Moving %s before %s\n",
- dev_name(deva), dev_name(devb));
- spin_lock(&devices_kset->list_lock);
- list_move_tail(&deva->kobj.entry, &devb->kobj.entry);
- spin_unlock(&devices_kset->list_lock);
-}
-
-/**
- * devices_kset_move_after - Move device in the devices_kset's list.
- * @deva: Device to move
- * @devb: Device @deva should come after.
- */
-static void devices_kset_move_after(struct device *deva, struct device *devb)
-{
- if (!devices_kset)
- return;
- pr_debug("devices_kset: Moving %s after %s\n",
- dev_name(deva), dev_name(devb));
- spin_lock(&devices_kset->list_lock);
- list_move(&deva->kobj.entry, &devb->kobj.entry);
- spin_unlock(&devices_kset->list_lock);
-}
-
-/**
- * devices_kset_move_last - move the device to the end of devices_kset's list.
- * @dev: device to move
- */
-void devices_kset_move_last(struct device *dev)
-{
- if (!devices_kset)
- return;
- pr_debug("devices_kset: Moving %s to end of list\n", dev_name(dev));
- spin_lock(&devices_kset->list_lock);
- list_move_tail(&dev->kobj.entry, &devices_kset->list);
- spin_unlock(&devices_kset->list_lock);
-}
-
-/**
* device_create_file - create sysfs attribute file for device.
* @dev: device.
* @attr: device attribute descriptor.
@@ -2776,15 +2730,12 @@ int device_move(struct device *dev, struct device *new_parent,
break;
case DPM_ORDER_DEV_AFTER_PARENT:
device_pm_move_after(dev, new_parent);
- devices_kset_move_after(dev, new_parent);
break;
case DPM_ORDER_PARENT_BEFORE_DEV:
device_pm_move_before(new_parent, dev);
- devices_kset_move_before(new_parent, dev);
break;
case DPM_ORDER_DEV_LAST:
device_pm_move_last(dev);
- devices_kset_move_last(dev);
break;
}
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 1435d72..6ebcd65 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -434,14 +434,6 @@ static int really_probe(struct device *dev, struct device_driver *drv)
goto probe_failed;
}
- /*
- * Ensure devices are listed in devices_kset in correct order
- * It's important to move Dev to the end of devices_kset before
- * calling .probe, because it could be recursive and parent Dev
- * should always go first
- */
- devices_kset_move_last(dev);
-
if (dev->bus->probe) {
ret = dev->bus->probe(dev);
if (ret)
--
2.7.4
^ permalink raw reply related
* [PATCHv3 3/4] drivers/base: clean up the usage of devices_kset_move_last()
From: Pingfan Liu @ 2018-07-03 6:50 UTC (permalink / raw)
To: linux-kernel
Cc: Pingfan Liu, Greg Kroah-Hartman, Rafael J . Wysocki,
Grygorii Strashko, Christoph Hellwig, Bjorn Helgaas, Dave Young,
linux-pci, linuxppc-dev
In-Reply-To: <1530600642-25090-1-git-send-email-kernelfans@gmail.com>
Clean up the referring to the code in commit 52cdbdd49853 ("driver core:
correct device's shutdown order"). So later we can revert it safely.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
---
drivers/base/core.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 684b994..db3deb8 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -127,13 +127,6 @@ static int device_reorder_to_tail(struct device *dev, void *not_used)
{
struct device_link *link;
- /*
- * Devices that have not been registered yet will be put to the ends
- * of the lists during the registration, so skip them here.
- */
- if (device_is_registered(dev))
- devices_kset_move_last(dev);
-
if (device_pm_initialized(dev))
device_pm_move_last(dev);
--
2.7.4
^ permalink raw reply related
* [PATCHv3 2/4] drivers/base: utilize device tree info to shutdown devices
From: Pingfan Liu @ 2018-07-03 6:50 UTC (permalink / raw)
To: linux-kernel
Cc: Pingfan Liu, Greg Kroah-Hartman, Rafael J . Wysocki,
Grygorii Strashko, Christoph Hellwig, Bjorn Helgaas, Dave Young,
linux-pci, linuxppc-dev
In-Reply-To: <1530600642-25090-1-git-send-email-kernelfans@gmail.com>
commit 52cdbdd49853 ("driver core: correct device's shutdown order")
places an assumption of supplier<-consumer order on the process of probe.
But it turns out to break down the parent <- child order in some scene.
E.g in pci, a bridge is enabled by pci core, and behind it, the devices
have been probed. Then comes the bridge's module, which enables extra
feature(such as hotplug) on this bridge. This will break the
parent<-children order and cause failure when "kexec -e" in some scenario.
The detailed description of the scenario:
An IBM Power9 machine on which, two drivers portdrv_pci and shpchp(a mod)
match the PCI_CLASS_BRIDGE_PCI, but neither of them success to probe due
to some issue. For this case, the bridge is moved after its children in
devices_kset. Then, when "kexec -e", a ata-disk behind the bridge can not
write back buffer in flight due to the former shutdown of the bridge which
clears the BusMaster bit.
It is a little hard to impose both "parent<-child" and "supplier<-consumer"
order on devices_kset. Take the following scene:
step0: before a consumer's probing, (note child_a is supplier of consumer_a)
[ consumer-X, child_a, ...., child_z] [... consumer_a, ..., consumer_z, ...] supplier-X
^^^^^^^^^^ affected range ^^^^^^^^^^
step1: when probing, moving consumer-X after supplier-X
[ child_a, ...., child_z] [.... consumer_a, ..., consumer_z, ...] supplier-X, consumer-X
step2: the children of consumer-X should be re-ordered to maintain the seq
[... consumer_a, ..., consumer_z, ....] supplier-X [consumer-X, child_a, ...., child_z]
step3: the consumer_a should be re-ordered to maintain the seq
[... consumer_z, ...] supplier-X [ consumer-X, child_a, consumer_a ..., child_z]
It requires two nested recursion to drain out all out-of-order item in
"affected range". To avoid such complicated code, this patch suggests
to utilize the info in device tree, instead of using the order of
devices_kset during shutdown. It iterates the device tree, and firstly
shutdown a device's children and consumers. After this patch, the buggy
commit is hollow and left to clean.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
---
drivers/base/core.c | 48 +++++++++++++++++++++++++++++++++++++++++++-----
include/linux/device.h | 1 +
2 files changed, 44 insertions(+), 5 deletions(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index a48868f..684b994 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1446,6 +1446,7 @@ void device_initialize(struct device *dev)
INIT_LIST_HEAD(&dev->links.consumers);
INIT_LIST_HEAD(&dev->links.suppliers);
dev->links.status = DL_DEV_NO_DRIVER;
+ dev->shutdown = false;
}
EXPORT_SYMBOL_GPL(device_initialize);
@@ -2811,7 +2812,6 @@ static void __device_shutdown(struct device *dev)
* lock is to be held
*/
parent = get_device(dev->parent);
- get_device(dev);
/*
* Make sure the device is off the kset list, in the
* event that dev->*->shutdown() doesn't remove it.
@@ -2842,23 +2842,60 @@ static void __device_shutdown(struct device *dev)
dev_info(dev, "shutdown\n");
dev->driver->shutdown(dev);
}
-
+ dev->shutdown = true;
device_unlock(dev);
if (parent)
device_unlock(parent);
- put_device(dev);
put_device(parent);
spin_lock(&devices_kset->list_lock);
}
+/* shutdown dev's children and consumer firstly, then itself */
+static int device_for_each_child_shutdown(struct device *dev)
+{
+ struct klist_iter i;
+ struct device *child;
+ struct device_link *link;
+
+ /* already shutdown, then skip this sub tree */
+ if (dev->shutdown)
+ return 0;
+
+ if (!dev->p)
+ goto check_consumers;
+
+ /* there is breakage of lock in __device_shutdown(), and the redundant
+ * ref++ on srcu protected consumer is harmless since shutdown is not
+ * hot path.
+ */
+ get_device(dev);
+
+ klist_iter_init(&dev->p->klist_children, &i);
+ while ((child = next_device(&i)))
+ device_for_each_child_shutdown(child);
+ klist_iter_exit(&i);
+
+check_consumers:
+ list_for_each_entry_rcu(link, &dev->links.consumers, s_node) {
+ if (!link->consumer->shutdown)
+ device_for_each_child_shutdown(link->consumer);
+ }
+
+ __device_shutdown(dev);
+ put_device(dev);
+ return 0;
+}
+
/**
* device_shutdown - call ->shutdown() on each device to shutdown.
*/
void device_shutdown(void)
{
struct device *dev;
+ int idx;
+ idx = device_links_read_lock();
spin_lock(&devices_kset->list_lock);
/*
* Walk the devices list backward, shutting down each in turn.
@@ -2866,11 +2903,12 @@ void device_shutdown(void)
* devices offline, even as the system is shutting down.
*/
while (!list_empty(&devices_kset->list)) {
- dev = list_entry(devices_kset->list.prev, struct device,
+ dev = list_entry(devices_kset->list.next, struct device,
kobj.entry);
- __device_shutdown(dev);
+ device_for_each_child_shutdown(dev);
}
spin_unlock(&devices_kset->list_lock);
+ device_links_read_unlock(idx);
}
/*
diff --git a/include/linux/device.h b/include/linux/device.h
index 055a69d..8a0f784 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1003,6 +1003,7 @@ struct device {
bool offline:1;
bool of_node_reused:1;
bool dma_32bit_limit:1;
+ bool shutdown:1; /* one direction: false->true */
};
static inline struct device *kobj_to_dev(struct kobject *kobj)
--
2.7.4
^ permalink raw reply related
* [PATCHv3 1/4] drivers/base: fold the routine of device's shutdown into a func
From: Pingfan Liu @ 2018-07-03 6:50 UTC (permalink / raw)
To: linux-kernel
Cc: Pingfan Liu, Greg Kroah-Hartman, Rafael J . Wysocki,
Grygorii Strashko, Christoph Hellwig, Bjorn Helgaas, Dave Young,
linux-pci, linuxppc-dev
In-Reply-To: <1530600642-25090-1-git-send-email-kernelfans@gmail.com>
Pack the code into a function to ease the using and reading.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
---
drivers/base/core.c | 100 +++++++++++++++++++++++++++-------------------------
1 file changed, 52 insertions(+), 48 deletions(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index df3e1a4..a48868f 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2802,12 +2802,62 @@ int device_move(struct device *dev, struct device *new_parent,
}
EXPORT_SYMBOL_GPL(device_move);
+static void __device_shutdown(struct device *dev)
+{
+ struct device *parent;
+ /*
+ * hold reference count of device's parent to
+ * prevent it from being freed because parent's
+ * lock is to be held
+ */
+ parent = get_device(dev->parent);
+ get_device(dev);
+ /*
+ * Make sure the device is off the kset list, in the
+ * event that dev->*->shutdown() doesn't remove it.
+ */
+ list_del_init(&dev->kobj.entry);
+ spin_unlock(&devices_kset->list_lock);
+
+ /* hold lock to avoid race with probe/release */
+ if (parent)
+ device_lock(parent);
+ device_lock(dev);
+
+ /* Don't allow any more runtime suspends */
+ pm_runtime_get_noresume(dev);
+ pm_runtime_barrier(dev);
+
+ if (dev->class && dev->class->shutdown_pre) {
+ if (initcall_debug)
+ dev_info(dev, "shutdown_pre\n");
+ dev->class->shutdown_pre(dev);
+ }
+ if (dev->bus && dev->bus->shutdown) {
+ if (initcall_debug)
+ dev_info(dev, "shutdown\n");
+ dev->bus->shutdown(dev);
+ } else if (dev->driver && dev->driver->shutdown) {
+ if (initcall_debug)
+ dev_info(dev, "shutdown\n");
+ dev->driver->shutdown(dev);
+ }
+
+ device_unlock(dev);
+ if (parent)
+ device_unlock(parent);
+
+ put_device(dev);
+ put_device(parent);
+ spin_lock(&devices_kset->list_lock);
+}
+
/**
* device_shutdown - call ->shutdown() on each device to shutdown.
*/
void device_shutdown(void)
{
- struct device *dev, *parent;
+ struct device *dev;
spin_lock(&devices_kset->list_lock);
/*
@@ -2818,53 +2868,7 @@ void device_shutdown(void)
while (!list_empty(&devices_kset->list)) {
dev = list_entry(devices_kset->list.prev, struct device,
kobj.entry);
-
- /*
- * hold reference count of device's parent to
- * prevent it from being freed because parent's
- * lock is to be held
- */
- parent = get_device(dev->parent);
- get_device(dev);
- /*
- * Make sure the device is off the kset list, in the
- * event that dev->*->shutdown() doesn't remove it.
- */
- list_del_init(&dev->kobj.entry);
- spin_unlock(&devices_kset->list_lock);
-
- /* hold lock to avoid race with probe/release */
- if (parent)
- device_lock(parent);
- device_lock(dev);
-
- /* Don't allow any more runtime suspends */
- pm_runtime_get_noresume(dev);
- pm_runtime_barrier(dev);
-
- if (dev->class && dev->class->shutdown_pre) {
- if (initcall_debug)
- dev_info(dev, "shutdown_pre\n");
- dev->class->shutdown_pre(dev);
- }
- if (dev->bus && dev->bus->shutdown) {
- if (initcall_debug)
- dev_info(dev, "shutdown\n");
- dev->bus->shutdown(dev);
- } else if (dev->driver && dev->driver->shutdown) {
- if (initcall_debug)
- dev_info(dev, "shutdown\n");
- dev->driver->shutdown(dev);
- }
-
- device_unlock(dev);
- if (parent)
- device_unlock(parent);
-
- put_device(dev);
- put_device(parent);
-
- spin_lock(&devices_kset->list_lock);
+ __device_shutdown(dev);
}
spin_unlock(&devices_kset->list_lock);
}
--
2.7.4
^ permalink raw reply related
* [PATCHv3 0/4] drivers/base: bugfix for supplier<-consumer ordering in device_kset
From: Pingfan Liu @ 2018-07-03 6:50 UTC (permalink / raw)
To: linux-kernel
Cc: Pingfan Liu, Greg Kroah-Hartman, Rafael J . Wysocki,
Grygorii Strashko, Christoph Hellwig, Bjorn Helgaas, Dave Young,
linux-pci, linuxppc-dev
commit 52cdbdd49853 ("driver core: correct device's shutdown order")
places an assumption of supplier<-consumer order on the process of probe.
But it turns out to break down the parent <- child order in some scene.
E.g in pci, a bridge is enabled by pci core, and behind it, the devices
have been probed. Then comes the bridge's module, which enables extra
feature(such as hotplug) on this bridge.
This will break the parent<-children order and cause failure when
"kexec -e" in some scenario.
v2 -> v3:
It is a little hard to impose both "parent<-child" and "supplier<-consumer"
on devices_kset. Hence v3 drops this method, postpones the issue to shutdown
time instead of probing, and utilizes device-tree info during shutdown
instead of the item's seq inside devices_kset.
Pingfan Liu (4):
drivers/base: fold the routine of device's shutdown into a func
drivers/base: utilize device tree info to shutdown devices
drivers/base: clean up the usage of devices_kset_move_last()
Revert "driver core: correct device's shutdown order"
drivers/base/base.h | 1 -
drivers/base/core.c | 196 +++++++++++++++++++++++--------------------------
drivers/base/dd.c | 8 --
include/linux/device.h | 1 +
4 files changed, 92 insertions(+), 114 deletions(-)
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
--
2.7.4
^ permalink raw reply
* Re: [PATCHv2 2/2] drivers/base: reorder consumer and its children behind suppliers
From: Pingfan Liu @ 2018-07-03 6:48 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: linux-kernel, Grygorii Strashko, Christoph Hellwig, Bjorn Helgaas,
Dave Young, linux-pci, linuxppc-dev, rafael.j.wysocki
In-Reply-To: <20180626115402.GB30256@kroah.com>
On Tue, Jun 26, 2018 at 7:54 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Jun 26, 2018 at 11:29:48AM +0800, Pingfan Liu wrote:
> > On Mon, Jun 25, 2018 at 6:45 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Mon, Jun 25, 2018 at 03:47:39PM +0800, Pingfan Liu wrote:
> > > > commit 52cdbdd49853 ("driver core: correct device's shutdown order")
> > > > introduces supplier<-consumer order in devices_kset. The commit tries
> > > > to cleverly maintain both parent<-child and supplier<-consumer order by
> > > > reordering a device when probing. This method makes things simple and
> > > > clean, but unfortunately, breaks parent<-child order in some case,
> > > > which is described in next patch in this series.
> > >
> > > There is no "next patch in this series" :(
> > >
> > Oh, re-arrange the patches, and forget the comment in log
> >
> > > > Here this patch tries to resolve supplier<-consumer by only reordering a
> > > > device when it has suppliers, and takes care of the following scenario:
> > > > [consumer, children] [ ... potential ... ] supplier
> > > > ^ ^
> > > > After moving the consumer and its children after the supplier, the
> > > > potentail section may contain consumers whose supplier is inside
> > > > children, and this poses the requirement to dry out all consumpers in
> > > > the section recursively.
> > > >
> > > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > > Cc: Grygorii Strashko <grygorii.strashko@ti.com>
> > > > Cc: Christoph Hellwig <hch@infradead.org>
> > > > Cc: Bjorn Helgaas <helgaas@kernel.org>
> > > > Cc: Dave Young <dyoung@redhat.com>
> > > > Cc: linux-pci@vger.kernel.org
> > > > Cc: linuxppc-dev@lists.ozlabs.org
> > > > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > > > ---
> > > > note: there is lock issue in this patch, should be fixed in next version
> > >
> > > Please send patches that you know are correct, why would I want to
> > > review this if you know it is not correct?
> > >
> > > And if the original commit is causing problems for you, why not just
> > > revert that instead of adding this much-increased complexity?
> > >
> > Revert the original commit, then it will expose the error order
> > "consumer <- supplier" again.
> > This patch tries to resolve the error and fix the following scenario:
> > step0: before the consumer device's probing, (note child_a is a
> > supplier of consumer_a, etc)
> > [ consumer-X, child_a, ...., child_z] [.... consumer_a, ...,
> > consumer_z, ....] supplier-X
> > ^^^
> > affected range during moving^^^
> > step1: When probing, moving consumer-X after supplier-X
> > [ child_a, ...., child_z] [.... consumer_a, ..., consumer_z,
> > ....] supplier-X, consumer-X
> > But it breaks "parent <-child" seq now, and should be fixed like:
> > step2:
> > [.... consumer_a, ..., consumer_z, ....] supplier-X [
> > consumer-X, child_a, ...., child_z] <---
> > descendants_reorder_after_pos() does it.
> > Again, the seq "consumer_a <- child_a" breaks the "supplier<-consumer"
> > order, should be fixed like:
> > step3:
> > [.... consumer_z, .....] supplier-X [ consumer-X, child_a,
> > consumer_a ...., child_z] <--- __device_reorder_consumer() does it.
> > ^^ affected range^^
> > The moving of consumer_a brings us to face the same scenario of step1,
> > hence we need an external recursion.
>
> Something really got messed up here, and this all does not make any
> sense :(
>
> Can you try again?
>
> Also, please cc: Rafael on all of this, as he wrote all of this
> consumer/supplier logic and I am not that familiar with it at all.
>
Cc Rafael J. Wysocki for the context. I will send out V3 soon.
Regards,
Pingfan
^ permalink raw reply
* Re: [PATCH 2/3] powerpc/powernv: DMA operations for discontiguous allocation
From: Russell Currey @ 2018-07-03 6:09 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: linuxppc-dev, benh, alistair, tpearson
In-Reply-To: <20180702184756.72cfde38@aik.ozlabs.ibm.com>
On Mon, 2018-07-02 at 18:47 +1000, Alexey Kardashevskiy wrote:
> On Fri, 29 Jun 2018 17:34:36 +1000
> Russell Currey <ruscur@russell.cc> wrote:
>
> > DMA pseudo-bypass is a new set of DMA operations that solve some
> > issues for
> > devices that want to address more than 32 bits but can't address
> > the 59
> > bits required to enable direct DMA.
> >
> > The previous implementation for POWER8/PHB3 worked around this by
> > configuring a bypass from the default 32-bit address space into 64-
> > bit
> > address space. This approach does not work for POWER9/PHB4 because
> > regions of memory are discontiguous and many devices will be unable
> > to
> > address memory beyond the first node.
> >
> > Instead, implement a new set of DMA operations that allocate TCEs
> > as DMA
> > mappings are requested so that all memory is addressable even when
> > a
> > one-to-one mapping between real addresses and DMA addresses isn't
> > possible. These TCEs are the maximum size available on the
> > platform,
> > which is 256M on PHB3 and 1G on PHB4.
> >
> > Devices can now map any region of memory up to the maximum amount
> > they can
> > address according to the DMA mask set, in chunks of the largest
> > available
> > TCE size.
> >
> > This implementation replaces the need for the existing PHB3
> > solution and
> > should be compatible with future PHB versions.
> >
> > It is, however, rather naive. There is no unmapping, and as a
> > result
> > devices can eventually run out of space if they address their
> > entire
> > DMA mask worth of TCEs. An implementation with unmap() will come
> > in
> > future (and requires a much more complex implementation), but this
> > is a
> > good start due to the drastic performance improvement.
>
>
> Why does not dma_iommu_ops work in this case? I keep asking and yet
> no
> comment in the commit log or mails...
Yes, I should cover this in the commit message.
So the primary reason that the IOMMU doesn't work for this case is the
TCE allocation - the IOMMU doesn't have a refcount and will allocate
(in this case on P9) 1GB TCEs to each map which will quickly fail.
This isn't intended to be a replacement for the IOMMU, it's a
roundabout way of achieving what the direct ops do (like NVIDIA devices
on P8 can do today).
^ permalink raw reply
* Re: [PATCH 1/3] powerpc/powernv/pci: Track largest available TCE order per PHB
From: Russell Currey @ 2018-07-03 5:49 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: linuxppc-dev, benh, alistair, tpearson
In-Reply-To: <20180702173426.1254ecd0@aik.ozlabs.ibm.com>
On Mon, 2018-07-02 at 17:34 +1000, Alexey Kardashevskiy wrote:
> On Mon, 2 Jul 2018 17:32:56 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> > On Fri, 29 Jun 2018 17:34:35 +1000
> > Russell Currey <ruscur@russell.cc> wrote:
> >
> > > Knowing the largest possible TCE size of a PHB is useful, so get
> > > it
> > > out of the device tree. This relies on the property being added
> > > in
> > > OPAL.
> > >
> > > It is assumed that any PHB4 or later machine would be running
> > > firmware that implemented this property, and otherwise assumed to
> > > be PHB3, which has a maximum TCE order of 28 bits or 256MB TCEs.
> > >
> > > This is used later in the series.
> > >
> > > Signed-off-by: Russell Currey <ruscur@russell.cc>
> > > ---
> > > arch/powerpc/platforms/powernv/pci-ioda.c | 16 ++++++++++++++++
> > > arch/powerpc/platforms/powernv/pci.h | 3 +++
> > > 2 files changed, 19 insertions(+)
> > >
> > > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c
> > > b/arch/powerpc/platforms/powernv/pci-ioda.c index
> > > 5bd0eb6681bc..17c590087279 100644 ---
> > > a/arch/powerpc/platforms/powernv/pci-ioda.c +++
> > > b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -3873,11 +3873,13
> > > @@
> > > static void __init pnv_pci_init_ioda_phb(struct device_node *np,
> > > struct resource r; const __be64 *prop64;
> > > const __be32 *prop32;
> > > + struct property *prop;
> > > int len;
> > > unsigned int segno;
> > > u64 phb_id;
> > > void *aux;
> > > long rc;
> > > + u32 val;
> > >
> > > if (!of_device_is_available(np))
> > > return;
> > > @@ -4016,6 +4018,20 @@ static void __init
> > > pnv_pci_init_ioda_phb(struct device_node *np, }
> > > phb->ioda.pe_array = aux + pemap_off;
> > >
> > > + phb->ioda.max_tce_order = 0;
> > > + /* Get TCE order from the DT. If it's not present,
> > > assume
> > > P8 */
> > > + if (!of_get_property(np, "ibm,supported-tce-sizes",
> > > NULL))
> > > {
> > > + phb->ioda.max_tce_order = 28; /* assume P8 256mb
> > > TCEs */
> > > + } else {
> > > + of_property_for_each_u32(np,
> > > "ibm,supported-tce-sizes", prop,
> > > + prop32, val) {
> > > + if (val > phb->ioda.max_tce_order)
> > > + phb->ioda.max_tce_order = val;
> > > + }
> > > + pr_debug("PHB%llx Found max TCE order of %d
> > > bits\n",
> > > + phb->opal_id, phb->ioda.max_tce_order);
> > > + }
> >
> >
> > pnv_ioda_parse_tce_sizes() does this, use it. It even reports 256MB
> > pages for P8 as in v4.18-rc3.
>
>
> ah, not, not in rc3, my bad. I'll post it soon.
Sure, whatever works, no need for duplication
>
>
> --
> Alexey
^ permalink raw reply
* Re: [PATCH v2 1/2] powernv/cpuidle: Parse dt idle properties into global structure
From: kbuild test robot @ 2018-07-03 4:19 UTC (permalink / raw)
To: Akshay Adiga
Cc: kbuild-all, linux-kernel, linuxppc-dev, linux-pm, rjw, stewart,
benh, svaidy, ego, npiggin, mpe, Akshay Adiga
In-Reply-To: <1530541401-19726-2-git-send-email-akshay.adiga@linux.vnet.ibm.com>
[-- Attachment #1: Type: text/plain, Size: 9226 bytes --]
Hi Akshay,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.18-rc3 next-20180702]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Akshay-Adiga/powernv-cpuidle-Device-tree-parsing-cleanup/20180703-024607
base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=powerpc
Note: the linux-review/Akshay-Adiga/powernv-cpuidle-Device-tree-parsing-cleanup/20180703-024607 HEAD 4beae9263d77036dae7f43905823867c6d982690 builds fine.
It only hurts bisectibility.
All errors (new ones prefixed by >>):
drivers/cpuidle/cpuidle-powernv.c: In function 'powernv_add_idle_states':
>> drivers/cpuidle/cpuidle-powernv.c:417:28: error: 'state' undeclared (first use in this function); did you mean 'statx'?
if (has_stop_states && !(state->valid))
^~~~~
statx
drivers/cpuidle/cpuidle-powernv.c:417:28: note: each undeclared identifier is reported only once for each function it appears in
vim +417 drivers/cpuidle/cpuidle-powernv.c
262
263 extern u32 pnv_get_supported_cpuidle_states(void);
264 static int powernv_add_idle_states(void)
265 {
266 struct device_node *power_mgt;
267 int nr_idle_states = 1; /* Snooze */
268 int dt_idle_states, count;
269 u32 latency_ns[CPUIDLE_STATE_MAX];
270 u32 residency_ns[CPUIDLE_STATE_MAX];
271 u32 flags[CPUIDLE_STATE_MAX];
272 u64 psscr_val[CPUIDLE_STATE_MAX];
273 u64 psscr_mask[CPUIDLE_STATE_MAX];
274 const char *names[CPUIDLE_STATE_MAX];
275 u32 has_stop_states = 0;
276 int i, rc;
277 u32 supported_flags = pnv_get_supported_cpuidle_states();
278
279
280 /* Currently we have snooze statically defined */
281
282 power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
283 if (!power_mgt) {
284 pr_warn("opal: PowerMgmt Node not found\n");
285 goto out;
286 }
287
288 /* Read values of any property to determine the num of idle states */
289 dt_idle_states = of_property_count_u32_elems(power_mgt, "ibm,cpu-idle-state-flags");
290 if (dt_idle_states < 0) {
291 pr_warn("cpuidle-powernv: no idle states found in the DT\n");
292 goto out;
293 }
294
295 count = of_property_count_u32_elems(power_mgt,
296 "ibm,cpu-idle-state-latencies-ns");
297
298 if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags", dt_idle_states,
299 "ibm,cpu-idle-state-latencies-ns",
300 count) != 0)
301 goto out;
302
303 count = of_property_count_strings(power_mgt,
304 "ibm,cpu-idle-state-names");
305 if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags", dt_idle_states,
306 "ibm,cpu-idle-state-names",
307 count) != 0)
308 goto out;
309
310 /*
311 * Since snooze is used as first idle state, max idle states allowed is
312 * CPUIDLE_STATE_MAX -1
313 */
314 if (dt_idle_states > CPUIDLE_STATE_MAX - 1) {
315 pr_warn("cpuidle-powernv: discovered idle states more than allowed");
316 dt_idle_states = CPUIDLE_STATE_MAX - 1;
317 }
318
319 if (of_property_read_u32_array(power_mgt,
320 "ibm,cpu-idle-state-flags", flags, dt_idle_states)) {
321 pr_warn("cpuidle-powernv : missing ibm,cpu-idle-state-flags in DT\n");
322 goto out;
323 }
324
325 if (of_property_read_u32_array(power_mgt,
326 "ibm,cpu-idle-state-latencies-ns", latency_ns,
327 dt_idle_states)) {
328 pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-latencies-ns in DT\n");
329 goto out;
330 }
331 if (of_property_read_string_array(power_mgt,
332 "ibm,cpu-idle-state-names", names, dt_idle_states) < 0) {
333 pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-names in DT\n");
334 goto out;
335 }
336
337 /*
338 * If the idle states use stop instruction, probe for psscr values
339 * and psscr mask which are necessary to specify required stop level.
340 */
341 has_stop_states = (flags[0] &
342 (OPAL_PM_STOP_INST_FAST | OPAL_PM_STOP_INST_DEEP));
343 if (has_stop_states) {
344 count = of_property_count_u64_elems(power_mgt,
345 "ibm,cpu-idle-state-psscr");
346 if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags",
347 dt_idle_states,
348 "ibm,cpu-idle-state-psscr",
349 count) != 0)
350 goto out;
351
352 count = of_property_count_u64_elems(power_mgt,
353 "ibm,cpu-idle-state-psscr-mask");
354 if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags",
355 dt_idle_states,
356 "ibm,cpu-idle-state-psscr-mask",
357 count) != 0)
358 goto out;
359
360 if (of_property_read_u64_array(power_mgt,
361 "ibm,cpu-idle-state-psscr", psscr_val, dt_idle_states)) {
362 pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-psscr in DT\n");
363 goto out;
364 }
365
366 if (of_property_read_u64_array(power_mgt,
367 "ibm,cpu-idle-state-psscr-mask",
368 psscr_mask, dt_idle_states)) {
369 pr_warn("cpuidle-powernv:Missing ibm,cpu-idle-state-psscr-mask in DT\n");
370 goto out;
371 }
372 }
373
374 count = of_property_count_u32_elems(power_mgt,
375 "ibm,cpu-idle-state-residency-ns");
376
377 if (count < 0) {
378 rc = count;
379 } else if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags",
380 dt_idle_states,
381 "ibm,cpu-idle-state-residency-ns",
382 count) != 0) {
383 goto out;
384 } else {
385 rc = of_property_read_u32_array(power_mgt,
386 "ibm,cpu-idle-state-residency-ns",
387 residency_ns, dt_idle_states);
388 }
389
390 for (i = 0; i < dt_idle_states; i++) {
391 unsigned int exit_latency, target_residency;
392 bool stops_timebase = false;
393
394 /*
395 * Skip the platform idle state whose flag isn't in
396 * the supported_cpuidle_states flag mask.
397 */
398 if ((flags[i] & supported_flags) != flags[i])
399 continue;
400 /*
401 * If an idle state has exit latency beyond
402 * POWERNV_THRESHOLD_LATENCY_NS then don't use it
403 * in cpu-idle.
404 */
405 if (latency_ns[i] > POWERNV_THRESHOLD_LATENCY_NS)
406 continue;
407 /*
408 * Firmware passes residency and latency values in ns.
409 * cpuidle expects it in us.
410 */
411 exit_latency = DIV_ROUND_UP(latency_ns[i], 1000);
412 if (!rc)
413 target_residency = DIV_ROUND_UP(residency_ns[i], 1000);
414 else
415 target_residency = 0;
416
> 417 if (has_stop_states && !(state->valid))
418 continue;
419
420 if (flags[i] & OPAL_PM_TIMEBASE_STOP)
421 stops_timebase = true;
422
423 /*
424 * For nap and fastsleep, use default target_residency
425 * values if f/w does not expose it.
426 */
427 if (flags[i] & OPAL_PM_NAP_ENABLED) {
428 if (!rc)
429 target_residency = 100;
430 /* Add NAP state */
431 add_powernv_state(nr_idle_states, "Nap",
432 CPUIDLE_FLAG_NONE, nap_loop,
433 target_residency, exit_latency, 0, 0);
434 } else if (has_stop_states && !stops_timebase) {
435 add_powernv_state(nr_idle_states, names[i],
436 CPUIDLE_FLAG_NONE, stop_loop,
437 target_residency, exit_latency,
438 psscr_val[i], psscr_mask[i]);
439 }
440
441 /*
442 * All cpuidle states with CPUIDLE_FLAG_TIMER_STOP set must come
443 * within this config dependency check.
444 */
445 #ifdef CONFIG_TICK_ONESHOT
446 else if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
447 flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
448 if (!rc)
449 target_residency = 300000;
450 /* Add FASTSLEEP state */
451 add_powernv_state(nr_idle_states, "FastSleep",
452 CPUIDLE_FLAG_TIMER_STOP,
453 fastsleep_loop,
454 target_residency, exit_latency, 0, 0);
455 } else if (has_stop_states && stops_timebase) {
456 add_powernv_state(nr_idle_states, names[i],
457 CPUIDLE_FLAG_TIMER_STOP, stop_loop,
458 target_residency, exit_latency,
459 psscr_val[i], psscr_mask[i]);
460 }
461 #endif
462 else
463 continue;
464 nr_idle_states++;
465 }
466 out:
467 return nr_idle_states;
468 }
469
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 23378 bytes --]
^ permalink raw reply
* Re: [PATCH] kbuild: move bin2c back to scripts/ from scripts/basic/
From: Masahiro Yamada @ 2018-07-03 4:02 UTC (permalink / raw)
To: Linux Kbuild mailing list
Cc: Masahiro Yamada, linux-s390, H. Peter Anvin, Kentaro Takeda,
Michael Ellerman, Heiko Carstens, X86 ML, Benjamin Herrenschmidt,
Linux Kernel Mailing List, Thomas Gleixner, Michal Marek,
Paul Mackerras, Tetsuo Handa, Serge E. Hallyn, James Morris,
Ingo Molnar, linuxppc-dev, linux-security-module,
Martin Schwidefsky
In-Reply-To: <1529944823-22228-1-git-send-email-yamada.masahiro@socionext.com>
2018-06-26 1:40 GMT+09:00 Masahiro Yamada <yamada.masahiro@socionext.com>:
> Commit 8370edea81e3 ("bin2c: move bin2c in scripts/basic") moved bin2c
> to the scripts/basic/ directory, incorrectly stating "Kexec wants to
> use bin2c and it wants to use it really early in the build process.
> See arch/x86/purgatory/ code in later patches."
>
> Commit bdab125c9301 ("Revert "kexec/purgatory: Add clean-up for
> purgatory directory"") and commit d6605b6bbee8 ("x86/build: Remove
> unnecessary preparation for purgatory") removed the redundant
> purgatory build magic entirely.
>
> That means that the move of bin2c was unnecessary in the first place.
>
> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Applied to linux-kbuild.
>
> arch/powerpc/purgatory/Makefile | 3 +--
> arch/s390/purgatory/Makefile | 3 +--
> arch/x86/purgatory/Makefile | 3 +--
> kernel/Makefile | 2 +-
> scripts/.gitignore | 1 +
> scripts/Makefile | 1 +
> scripts/basic/.gitignore | 1 -
> scripts/basic/Makefile | 1 -
> scripts/{basic => }/bin2c.c | 0
> security/tomoyo/Makefile | 2 +-
> 10 files changed, 7 insertions(+), 10 deletions(-)
> rename scripts/{basic => }/bin2c.c (100%)
>
> diff --git a/arch/powerpc/purgatory/Makefile b/arch/powerpc/purgatory/Makefile
> index 30e05de..4314ba5 100644
> --- a/arch/powerpc/purgatory/Makefile
> +++ b/arch/powerpc/purgatory/Makefile
> @@ -6,9 +6,8 @@ LDFLAGS_purgatory.ro := -e purgatory_start -r --no-undefined
> $(obj)/purgatory.ro: $(obj)/trampoline.o FORCE
> $(call if_changed,ld)
>
> -CMD_BIN2C = $(objtree)/scripts/basic/bin2c
> quiet_cmd_bin2c = BIN2C $@
> - cmd_bin2c = $(CMD_BIN2C) kexec_purgatory < $< > $@
> + cmd_bin2c = $(objtree)/scripts/bin2c kexec_purgatory < $< > $@
>
> $(obj)/kexec-purgatory.c: $(obj)/purgatory.ro FORCE
> $(call if_changed,bin2c)
> diff --git a/arch/s390/purgatory/Makefile b/arch/s390/purgatory/Makefile
> index 1ace023..445c460 100644
> --- a/arch/s390/purgatory/Makefile
> +++ b/arch/s390/purgatory/Makefile
> @@ -27,9 +27,8 @@ KBUILD_CFLAGS += $(call cc-option,-fno-PIE)
> $(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE
> $(call if_changed,ld)
>
> -CMD_BIN2C = $(objtree)/scripts/basic/bin2c
> quiet_cmd_bin2c = BIN2C $@
> - cmd_bin2c = $(CMD_BIN2C) kexec_purgatory < $< > $@
> + cmd_bin2c = $(objtree)/scripts/bin2c kexec_purgatory < $< > $@
>
> $(obj)/kexec-purgatory.c: $(obj)/purgatory.ro FORCE
> $(call if_changed,bin2c)
> diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
> index 2e9ee02..d6ac098 100644
> --- a/arch/x86/purgatory/Makefile
> +++ b/arch/x86/purgatory/Makefile
> @@ -28,9 +28,8 @@ $(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE
>
> targets += kexec-purgatory.c
>
> -CMD_BIN2C = $(objtree)/scripts/basic/bin2c
> quiet_cmd_bin2c = BIN2C $@
> - cmd_bin2c = $(CMD_BIN2C) kexec_purgatory < $< > $@
> + cmd_bin2c = $(objtree)/scripts/bin2c kexec_purgatory < $< > $@
>
> $(obj)/kexec-purgatory.c: $(obj)/purgatory.ro FORCE
> $(call if_changed,bin2c)
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 04bc07c..7a63d56 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -123,7 +123,7 @@ targets += config_data.gz
> $(obj)/config_data.gz: $(KCONFIG_CONFIG) FORCE
> $(call if_changed,gzip)
>
> - filechk_ikconfiggz = (echo "static const char kernel_config_data[] __used = MAGIC_START"; cat $< | scripts/basic/bin2c; echo "MAGIC_END;")
> + filechk_ikconfiggz = (echo "static const char kernel_config_data[] __used = MAGIC_START"; cat $< | scripts/bin2c; echo "MAGIC_END;")
> targets += config_data.h
> $(obj)/config_data.h: $(obj)/config_data.gz FORCE
> $(call filechk,ikconfiggz)
> diff --git a/scripts/.gitignore b/scripts/.gitignore
> index 0442c06..12d302d 100644
> --- a/scripts/.gitignore
> +++ b/scripts/.gitignore
> @@ -1,6 +1,7 @@
> #
> # Generated files
> #
> +bin2c
> conmakehash
> kallsyms
> pnmtologo
> diff --git a/scripts/Makefile b/scripts/Makefile
> index 25ab143..59c21ec 100644
> --- a/scripts/Makefile
> +++ b/scripts/Makefile
> @@ -10,6 +10,7 @@
>
> HOST_EXTRACFLAGS += -I$(srctree)/tools/include
>
> +hostprogs-$(CONFIG_BUILD_BIN2C) += bin2c
> hostprogs-$(CONFIG_KALLSYMS) += kallsyms
> hostprogs-$(CONFIG_LOGO) += pnmtologo
> hostprogs-$(CONFIG_VT) += conmakehash
> diff --git a/scripts/basic/.gitignore b/scripts/basic/.gitignore
> index 9528ec9..a776371 100644
> --- a/scripts/basic/.gitignore
> +++ b/scripts/basic/.gitignore
> @@ -1,2 +1 @@
> fixdep
> -bin2c
> diff --git a/scripts/basic/Makefile b/scripts/basic/Makefile
> index 0372b33..af49b44 100644
> --- a/scripts/basic/Makefile
> +++ b/scripts/basic/Makefile
> @@ -9,7 +9,6 @@
> # fixdep: Used to generate dependency information during build process
>
> hostprogs-y := fixdep
> -hostprogs-$(CONFIG_BUILD_BIN2C) += bin2c
> always := $(hostprogs-y)
>
> # fixdep is needed to compile other host programs
> diff --git a/scripts/basic/bin2c.c b/scripts/bin2c.c
> similarity index 100%
> rename from scripts/basic/bin2c.c
> rename to scripts/bin2c.c
> diff --git a/security/tomoyo/Makefile b/security/tomoyo/Makefile
> index b7c6a7f..cca5a30 100644
> --- a/security/tomoyo/Makefile
> +++ b/security/tomoyo/Makefile
> @@ -4,7 +4,7 @@ obj-y = audit.o common.o condition.o domain.o environ.o file.o gc.o group.o load
> targets += builtin-policy.h
> define do_policy
> echo "static char tomoyo_builtin_$(1)[] __initdata ="; \
> -$(objtree)/scripts/basic/bin2c <$(firstword $(wildcard $(obj)/policy/$(1).conf $(srctree)/$(src)/policy/$(1).conf.default) /dev/null); \
> +$(objtree)/scripts/bin2c <$(firstword $(wildcard $(obj)/policy/$(1).conf $(srctree)/$(src)/policy/$(1).conf.default) /dev/null); \
> echo ";"
> endef
> quiet_cmd_policy = POLICY $@
> --
> 2.7.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kbuild" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Best Regards
Masahiro Yamada
^ permalink raw reply
* Re: [PATCH v2 2/2] powernv/cpuidle: Use parsed device tree values for cpuidle_init
From: Nicholas Piggin @ 2018-07-03 3:31 UTC (permalink / raw)
To: Akshay Adiga
Cc: linux-kernel, linuxppc-dev, linux-pm, rjw, stewart, benh, svaidy,
ego, mpe
In-Reply-To: <1530541401-19726-3-git-send-email-akshay.adiga@linux.vnet.ibm.com>
On Mon, 2 Jul 2018 19:53:21 +0530
Akshay Adiga <akshay.adiga@linux.vnet.ibm.com> wrote:
> Export pnv_idle_states and nr_pnv_idle_states so that its accessible to
> cpuidle driver. Use properties from pnv_idle_states structure for powernv
> cpuidle_init.
>
> Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
^ permalink raw reply
* Re: [PATCH 2/2] powerpc/ptrace-pkeys: execute-permission on keys are disabled by default
From: Thiago Jung Bauermann @ 2018-07-03 3:30 UTC (permalink / raw)
To: Ram Pai
Cc: mpe, linuxppc-dev, hbabu, mhocko, bauerman, Ulrich.Weigand,
fweimer, msuchanek
In-Reply-To: <1529979376-7292-4-git-send-email-linuxram@us.ibm.com>
Ram Pai <linuxram@us.ibm.com> writes:
> The test case assumes execute-permissions of unallocated keys are
> enabled by default.
>
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
> .../testing/selftests/powerpc/ptrace/ptrace-pkey.c | 5 +++++
> 1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/tools/testing/selftests/powerpc/ptrace/ptrace-pkey.c b/tools/testing/selftests/powerpc/ptrace/ptrace-pkey.c
> index 5cf631f..559c6cb 100644
> --- a/tools/testing/selftests/powerpc/ptrace/ptrace-pkey.c
> +++ b/tools/testing/selftests/powerpc/ptrace/ptrace-pkey.c
> @@ -104,6 +104,11 @@ static int child(struct shared_info *info)
>
> if (disable_execute)
> info->expected_iamr |= 1ul << pkeyshift(pkey1);
> + else
> + info->expected_iamr &= ~(1ul << pkeyshift(pkey1));
> + info->expected_iamr &= ~(1ul << pkeyshift(pkey2) | 1ul << pkeyshift(pkey3));
> +
> +
>
> info->expected_uamor |= 3ul << pkeyshift(pkey1) |
> 3ul << pkeyshift(pkey2);
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [PATCH v2 1/2] powernv/cpuidle: Parse dt idle properties into global structure
From: Nicholas Piggin @ 2018-07-03 3:30 UTC (permalink / raw)
To: Akshay Adiga
Cc: linux-kernel, linuxppc-dev, linux-pm, rjw, stewart, benh, svaidy,
ego, mpe
In-Reply-To: <1530541401-19726-2-git-send-email-akshay.adiga@linux.vnet.ibm.com>
On Mon, 2 Jul 2018 19:53:20 +0530
Akshay Adiga <akshay.adiga@linux.vnet.ibm.com> wrote:
> Device-tree parsing happens twice, once while deciding idle state to be
> used for hotplug and once during cpuidle init. Hence, parsing the device
> tree and caching it will reduce code duplication. Parsing code has been
> moved to pnv_parse_cpuidle_dt() from pnv_probe_idle_states(). In addition
> to the properties in the device tree the number of available states is
> also required.
>
> Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
> ---
> arch/powerpc/include/asm/cpuidle.h | 11 ++
> arch/powerpc/platforms/powernv/idle.c | 216 ++++++++++++++++----------
> drivers/cpuidle/cpuidle-powernv.c | 11 +-
> 3 files changed, 151 insertions(+), 87 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
> index e210a83eb196..574b0ce1d671 100644
> --- a/arch/powerpc/include/asm/cpuidle.h
> +++ b/arch/powerpc/include/asm/cpuidle.h
> @@ -79,6 +79,17 @@ struct stop_sprs {
> u64 mmcra;
> };
>
> +#define PNV_IDLE_NAME_LEN 16
> +struct pnv_idle_states_t {
> + char name[PNV_IDLE_NAME_LEN];
> + u32 latency_ns;
> + u32 residency_ns;
> + u64 psscr_val;
> + u64 psscr_mask;
> + u32 flags;
> + bool valid;
> +};
This is a nice looking cleanup.
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
^ permalink raw reply
* Re: [PATCH 1/2] powerpc/core-pkeys: execute-permission on keys are disabled by default
From: Thiago Jung Bauermann @ 2018-07-03 3:30 UTC (permalink / raw)
To: Ram Pai
Cc: mpe, linuxppc-dev, hbabu, mhocko, bauerman, Ulrich.Weigand,
fweimer, msuchanek
In-Reply-To: <1529979376-7292-3-git-send-email-linuxram@us.ibm.com>
Ram Pai <linuxram@us.ibm.com> writes:
> Only when the key is allocated, its permission are enabled.
>
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
> tools/testing/selftests/powerpc/ptrace/core-pkey.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/tools/testing/selftests/powerpc/ptrace/core-pkey.c b/tools/testing/selftests/powerpc/ptrace/core-pkey.c
> index 36bc312..b353d86 100644
> --- a/tools/testing/selftests/powerpc/ptrace/core-pkey.c
> +++ b/tools/testing/selftests/powerpc/ptrace/core-pkey.c
> @@ -140,6 +140,10 @@ static int child(struct shared_info *info)
>
> if (disable_execute)
> info->iamr |= 1ul << pkeyshift(pkey1);
> + else
> + info->iamr &= ~(1ul << pkeyshift(pkey1));
> + info->iamr &= ~(1ul << pkeyshift(pkey2) | 1ul << pkeyshift(pkey3));
> +
>
> info->uamor |= 3ul << pkeyshift(pkey1) | 3ul << pkeyshift(pkey2);
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [PATCH v5 2/7] powerpc/pseries: Defer the logging of rtas error to irq work queue.
From: Nicholas Piggin @ 2018-07-03 3:25 UTC (permalink / raw)
To: Mahesh J Salgaonkar
Cc: linuxppc-dev, stable, Aneesh Kumar K.V, Laurent Dufour,
Michal Suchanek
In-Reply-To: <153051038591.30541.12824711468905210952.stgit@jupiter.in.ibm.com>
On Mon, 02 Jul 2018 11:16:29 +0530
Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> rtas_log_buf is a buffer to hold RTAS event data that are communicated
> to kernel by hypervisor. This buffer is then used to pass RTAS event
> data to user through proc fs. This buffer is allocated from vmalloc
> (non-linear mapping) area.
>
> On Machine check interrupt, register r3 points to RTAS extended event
> log passed by hypervisor that contains the MCE event. The pseries
> machine check handler then logs this error into rtas_log_buf. The
> rtas_log_buf is a vmalloc-ed (non-linear) buffer we end up taking up a
> page fault (vector 0x300) while accessing it. Since machine check
> interrupt handler runs in NMI context we can not afford to take any
> page fault. Page faults are not honored in NMI context and causes
> kernel panic. Apart from that, as Nick pointed out, pSeries_log_error()
> also takes a spin_lock while logging error which is not safe in NMI
> context. It may endup in deadlock if we get another MCE before releasing
> the lock. Fix this by deferring the logging of rtas error to irq work queue.
>
> Current implementation uses two different buffers to hold rtas error log
> depending on whether extended log is provided or not. This makes bit
> difficult to identify which buffer has valid data that needs to logged
> later in irq work. Simplify this using single buffer, one per paca, and
> copy rtas log to it irrespective of whether extended log is provided or
> not. Allocate this buffer below RMA region so that it can be accessed
> in real mode mce handler.
>
> Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt")
> Cc: stable@vger.kernel.org
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
I think this looks reasonable. It doesn't fix that commit so much as
fixes the problem that's apparent after it's applied. I don't know if
we should backport this to a wider set of stable kernels? Aside from
that,
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Thanks,
Nick
> ---
> arch/powerpc/include/asm/paca.h | 3 ++
> arch/powerpc/platforms/pseries/ras.c | 47 ++++++++++++++++++++++----------
> arch/powerpc/platforms/pseries/setup.c | 16 +++++++++++
> 3 files changed, 51 insertions(+), 15 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index 3f109a3e3edb..b441fef53077 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -251,6 +251,9 @@ struct paca_struct {
> void *rfi_flush_fallback_area;
> u64 l1d_flush_size;
> #endif
> +#ifdef CONFIG_PPC_PSERIES
> + u8 *mce_data_buf; /* buffer to hold per cpu rtas errlog */
> +#endif /* CONFIG_PPC_PSERIES */
> } ____cacheline_aligned;
>
> extern void copy_mm_to_paca(struct mm_struct *mm);
> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
> index ef104144d4bc..14a46b07ab2f 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -22,6 +22,7 @@
> #include <linux/of.h>
> #include <linux/fs.h>
> #include <linux/reboot.h>
> +#include <linux/irq_work.h>
>
> #include <asm/machdep.h>
> #include <asm/rtas.h>
> @@ -32,11 +33,13 @@
> static unsigned char ras_log_buf[RTAS_ERROR_LOG_MAX];
> static DEFINE_SPINLOCK(ras_log_buf_lock);
>
> -static char global_mce_data_buf[RTAS_ERROR_LOG_MAX];
> -static DEFINE_PER_CPU(__u64, mce_data_buf);
> -
> static int ras_check_exception_token;
>
> +static void mce_process_errlog_event(struct irq_work *work);
> +static struct irq_work mce_errlog_process_work = {
> + .func = mce_process_errlog_event,
> +};
> +
> #define EPOW_SENSOR_TOKEN 9
> #define EPOW_SENSOR_INDEX 0
>
> @@ -330,16 +333,20 @@ static irqreturn_t ras_error_interrupt(int irq, void *dev_id)
> ((((A) >= 0x7000) && ((A) < 0x7ff0)) || \
> (((A) >= rtas.base) && ((A) < (rtas.base + rtas.size - 16))))
>
> +static inline struct rtas_error_log *fwnmi_get_errlog(void)
> +{
> + return (struct rtas_error_log *)local_paca->mce_data_buf;
> +}
> +
> /*
> * Get the error information for errors coming through the
> * FWNMI vectors. The pt_regs' r3 will be updated to reflect
> * the actual r3 if possible, and a ptr to the error log entry
> * will be returned if found.
> *
> - * If the RTAS error is not of the extended type, then we put it in a per
> - * cpu 64bit buffer. If it is the extended type we use global_mce_data_buf.
> + * Use one buffer mce_data_buf per cpu to store RTAS error.
> *
> - * The global_mce_data_buf does not have any locks or protection around it,
> + * The mce_data_buf does not have any locks or protection around it,
> * if a second machine check comes in, or a system reset is done
> * before we have logged the error, then we will get corruption in the
> * error log. This is preferable over holding off on calling
> @@ -349,7 +356,7 @@ static irqreturn_t ras_error_interrupt(int irq, void *dev_id)
> static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
> {
> unsigned long *savep;
> - struct rtas_error_log *h, *errhdr = NULL;
> + struct rtas_error_log *h;
>
> /* Mask top two bits */
> regs->gpr[3] &= ~(0x3UL << 62);
> @@ -362,22 +369,20 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
> savep = __va(regs->gpr[3]);
> regs->gpr[3] = savep[0]; /* restore original r3 */
>
> - /* If it isn't an extended log we can use the per cpu 64bit buffer */
> h = (struct rtas_error_log *)&savep[1];
> + /* Use the per cpu buffer from paca to store rtas error log */
> + memset(local_paca->mce_data_buf, 0, RTAS_ERROR_LOG_MAX);
> if (!rtas_error_extended(h)) {
> - memcpy(this_cpu_ptr(&mce_data_buf), h, sizeof(__u64));
> - errhdr = (struct rtas_error_log *)this_cpu_ptr(&mce_data_buf);
> + memcpy(local_paca->mce_data_buf, h, sizeof(__u64));
> } else {
> int len, error_log_length;
>
> error_log_length = 8 + rtas_error_extended_log_length(h);
> len = min_t(int, error_log_length, RTAS_ERROR_LOG_MAX);
> - memset(global_mce_data_buf, 0, RTAS_ERROR_LOG_MAX);
> - memcpy(global_mce_data_buf, h, len);
> - errhdr = (struct rtas_error_log *)global_mce_data_buf;
> + memcpy(local_paca->mce_data_buf, h, len);
> }
>
> - return errhdr;
> + return (struct rtas_error_log *)local_paca->mce_data_buf;
> }
>
> /* Call this when done with the data returned by FWNMI_get_errinfo.
> @@ -422,6 +427,17 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
> return 0; /* need to perform reset */
> }
>
> +/*
> + * Process MCE rtas errlog event.
> + */
> +static void mce_process_errlog_event(struct irq_work *work)
> +{
> + struct rtas_error_log *err;
> +
> + err = fwnmi_get_errlog();
> + log_error((char *)err, ERR_TYPE_RTAS_LOG, 0);
> +}
> +
> /*
> * See if we can recover from a machine check exception.
> * This is only called on power4 (or above) and only via
> @@ -466,7 +482,8 @@ static int recover_mce(struct pt_regs *regs, struct rtas_error_log *err)
> recovered = 1;
> }
>
> - log_error((char *)err, ERR_TYPE_RTAS_LOG, 0);
> + /* Queue irq work to log this rtas event later. */
> + irq_work_queue(&mce_errlog_process_work);
>
> return recovered;
> }
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index fdb32e056ef4..60a067a6e743 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -41,6 +41,7 @@
> #include <linux/root_dev.h>
> #include <linux/of.h>
> #include <linux/of_pci.h>
> +#include <linux/memblock.h>
>
> #include <asm/mmu.h>
> #include <asm/processor.h>
> @@ -101,6 +102,9 @@ static void pSeries_show_cpuinfo(struct seq_file *m)
> static void __init fwnmi_init(void)
> {
> unsigned long system_reset_addr, machine_check_addr;
> + u8 *mce_data_buf;
> + unsigned int i;
> + int nr_cpus = num_possible_cpus();
>
> int ibm_nmi_register = rtas_token("ibm,nmi-register");
> if (ibm_nmi_register == RTAS_UNKNOWN_SERVICE)
> @@ -114,6 +118,18 @@ static void __init fwnmi_init(void)
> if (0 == rtas_call(ibm_nmi_register, 2, 1, NULL, system_reset_addr,
> machine_check_addr))
> fwnmi_active = 1;
> +
> + /*
> + * Allocate a chunk for per cpu buffer to hold rtas errorlog.
> + * It will be used in real mode mce handler, hence it needs to be
> + * below RMA.
> + */
> + mce_data_buf = __va(memblock_alloc_base(RTAS_ERROR_LOG_MAX * nr_cpus,
> + RTAS_ERROR_LOG_MAX, ppc64_rma_size));
> + for_each_possible_cpu(i) {
> + paca_ptrs[i]->mce_data_buf = mce_data_buf +
> + (RTAS_ERROR_LOG_MAX * i);
> + }
> }
>
> static void pseries_8259_cascade(struct irq_desc *desc)
>
^ permalink raw reply
* Re: [PATCH] powerpc: mpc5200: Remove VLA usage
From: Michael Ellerman @ 2018-07-03 1:41 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Kees Cook, Arnd Bergmann, Paul Mackerras, Anatolij Gustschin,
linuxppc-dev, Linux Kernel Mailing List
In-Reply-To: <20180702171631.GA16221@gate.crashing.org>
Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Mon, Jul 02, 2018 at 11:33:32AM +1000, Michael Ellerman wrote:
>> What if we write it:
>>
>> char saved_0x500[0x600 - 0x500];
>>
>> Hopefully the compiler is smart enough not to generate a VLA for that :)
>
> It is a VLA if the array size is not an integer constant expression. This
> is defined by C; the compiler has nothing to do with it. 0x600-0x500 is
> an integer constant expression, so this is not a VLA.
Thanks.
That wasn't meant as a dig at GCC. Kees had an epic struggle with the
kernel's min/max() macros which were causing expressions that looked
like they should be constant to generate VLAs.
> But if you meant if GCC will ever do a dynamic stack allocation for a fixed
> size local variable: yes indeed, I hope not!
Hey that would be cool, just-in-time local variable allocation :)
cheers
^ permalink raw reply
* Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: David Gibson @ 2018-07-03 1:36 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: linuxppc-dev, kvm-ppc, Alex Williamson, Paul Mackerras
In-Reply-To: <20180702163227.0a4f0ea8@aik.ozlabs.ibm.com>
[-- Attachment #1: Type: text/plain, Size: 6731 bytes --]
On Mon, Jul 02, 2018 at 04:32:27PM +1000, Alexey Kardashevskiy wrote:
> On Mon, 2 Jul 2018 14:52:43 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Mon, Jul 02, 2018 at 02:33:30PM +1000, Alexey Kardashevskiy wrote:
> > > On Mon, 2 Jul 2018 14:08:52 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Fri, Jun 29, 2018 at 05:07:47PM +1000, Alexey Kardashevskiy wrote:
> > > > > On Fri, 29 Jun 2018 15:18:20 +1000
> > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > >
> > > > > > On Fri, 29 Jun 2018 14:57:02 +1000
> > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > >
> > > > > > > On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote:
> > > > > > > > On Fri, 29 Jun 2018 14:12:41 +1000
> > > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > > >
> > > > > > > > > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote:
> > > > > > > > > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> > > > > > > > > > an IOMMU page is contained in the physical page so the PCI hardware won't
> > > > > > > > > > get access to unassigned host memory.
> > > > > > > > > >
> > > > > > > > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated
> > > > > > > > > > code) so the user space can pin memory backed with 64k pages and create
> > > > > > > > > > a hardware TCE table with a bigger page size. We were lucky so far and
> > > > > > > > > > did not hit this yet as the very first time the mapping happens
> > > > > > > > > > we do not have tbl::it_userspace allocated yet and fall back to
> > > > > > > > > > the userspace which in turn calls VFIO IOMMU driver and that fails
> > > > > > > > > > because of the check in vfio_iommu_spapr_tce.c which is really
> > > > > > > > > > sustainable solution.
> > > > > > > > > >
> > > > > > > > > > This stores the smallest preregistered page size in the preregistered
> > > > > > > > > > region descriptor and changes the mm_iommu_xxx API to check this against
> > > > > > > > > > the IOMMU page size.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > > > > > > > > ---
> > > > > > > > > > Changes:
> > > > > > > > > > v2:
> > > > > > > > > > * explicitly check for compound pages before calling compound_order()
> > > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> > > > > > > > > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> > > > > > > > > > for IOMMU pages without checking the mmu pagesize and this will fail
> > > > > > > > > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> > > > > > > > > >
> > > > > > > > > > With the change, mapping will fail in KVM and the guest will print:
> > > > > > > > > >
> > > > > > > > > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
> > > > > > > > > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
> > > > > > > > > > mlx5_core 0000:00:00.0: failed to map direct window for
> > > > > > > > > > /pci@800000020000000/ethernet@0: -1
> > > > > > > > >
> > > > > > > > > [snip]
> > > > > > > > > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > > > > > > > > > struct mm_iommu_table_group_mem_t **pmem)
> > > > > > > > > > {
> > > > > > > > > > struct mm_iommu_table_group_mem_t *mem;
> > > > > > > > > > - long i, j, ret = 0, locked_entries = 0;
> > > > > > > > > > + long i, j, ret = 0, locked_entries = 0, pageshift;
> > > > > > > > > > struct page *page = NULL;
> > > > > > > > > >
> > > > > > > > > > mutex_lock(&mem_list_mutex);
> > > > > > > > > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > > > > > > > > > goto unlock_exit;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > + mem->pageshift = 30; /* start from 1G pages - the biggest we have */
> > > > > > > > >
> > > > > > > > > What about 16G pages on an HPT system?
> > > > > > > >
> > > > > > > >
> > > > > > > > Below in the loop mem->pageshift will reduce to the biggest actual size
> > > > > > > > which will be 16mb/64k/4k. Or remain 1GB if no memory is actually
> > > > > > > > pinned, no loss there.
> > > > > > >
> > > > > > > Are you saying that 16G IOMMU pages aren't supported? Or that there's
> > > > > > > some reason a guest can never use them?
> > > > > >
> > > > > >
> > > > > > ah, 16_G_, not _M_. My bad. I just never tried such huge pages, I will
> > > > > > lift the limit up to 64 then, easier this way.
> > > > >
> > > > >
> > > > > Ah, no, rather this as the upper limit:
> > > > >
> > > > > mem->pageshift = ilog2(entries) + PAGE_SHIFT;
> > > >
> > > > I can't make sense of this comment in context. I see how you're
> > > > computing the minimum page size in the reserved region.
> > > >
> > > > My question is about what the is - the starting
> > > > value from which you calculate. Currently it's 1G, but I can't
> > > > immediately see a reason that 16G is impossible here.
> > >
> > >
> > > 16GB is impossible if the chunk we are preregistering here is smaller
> > > than that, for example, the entire guest ram is 4GB.
> >
> > Of course. Just like it was for 1GiB if you had a 512MiB guest, for
> > example. I'm talking about a case where you have a guest that's
> > >=16GiB and you *have* allocated 16GiB hugepages to back it.
>
>
> Then, assuming we are preregistering entire RAM as a single chunk, the
> "maximum minimum" will be initialized as ">=16GiB" (but floor-aligned
> to power of two) before the pinning loop and then reduce to the actual
> page size, inside the loop. I feel like I am missing something in the
> question, what is that?
I wish I knew...
By the "maximum minimum" I'm talking about the shift value you start
with before you start looking at the actual pages. Currently that is
exactly 30 (1 GiB). AFAICT that prevents using 16GiB pages, and I
can't see a good reason for that.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 2/2] powerpc/pkeys: key allocation/deallocation must not change pkey registers
From: Thiago Jung Bauermann @ 2018-07-03 1:35 UTC (permalink / raw)
To: Ram Pai
Cc: mpe, fweimer, Ulrich.Weigand, mhocko, bauerman, msuchanek,
linuxppc-dev
In-Reply-To: <1529979376-7292-2-git-send-email-linuxram@us.ibm.com>
Ram Pai <linuxram@us.ibm.com> writes:
> Key allocation and deallocation has the side effect of programming the
> UAMOR/AMR/IAMR registers. This is wrong, since its the responsibility of
> the application and not that of the kernel, to modify the permission on
> the key.
>
> Do not modify the pkey registers at key allocation/deallocation.
>
> This patch also fixes a bug where a sys_pkey_free() resets the UAMOR
> bits of the key, thus making its permissions unmodifiable from user
> space. Latter if the same key gets reallocated from a different thread
> this thread will no longer be able to change the permissions on the key.
>
> Problem noticed/reported by Michael Ellermen while running
> selftests/core-pkeys
>
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
> arch/powerpc/include/asm/pkeys.h | 11 -----------
> arch/powerpc/mm/pkeys.c | 27 ---------------------------
> 2 files changed, 0 insertions(+), 38 deletions(-)
LGTM.
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox