LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 2/2] powerpc/pkeys: key allocation/deallocation must not change pkey registers
From: Thiago Jung Bauermann @ 2018-07-03  1:35 UTC (permalink / raw)
  To: Ram Pai
  Cc: mpe, fweimer, Ulrich.Weigand, mhocko, bauerman, msuchanek,
	linuxppc-dev
In-Reply-To: <1529979376-7292-2-git-send-email-linuxram@us.ibm.com>


Ram Pai <linuxram@us.ibm.com> writes:

> Key allocation and deallocation has the side effect of programming the
> UAMOR/AMR/IAMR registers. This is wrong, since its the responsibility of
> the application and not that of the kernel, to modify the permission on
> the key.
>
> Do not modify the pkey registers at key allocation/deallocation.
>
> This patch also fixes a bug where a sys_pkey_free() resets the UAMOR
> bits of the key, thus making its permissions unmodifiable from user
> space.  Latter if the same key gets reallocated from a different thread
> this thread will no longer be able to change the permissions on the key.
>
> Problem noticed/reported by Michael Ellermen while running
> selftests/core-pkeys
>
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
>  arch/powerpc/include/asm/pkeys.h |   11 -----------
>  arch/powerpc/mm/pkeys.c          |   27 ---------------------------
>  2 files changed, 0 insertions(+), 38 deletions(-)

LGTM.

Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>

-- 
Thiago Jung Bauermann
IBM Linux Technology Center

^ permalink raw reply

* [PATCH] mm: allow arch to supply p??_free_tlb functions
From: Nicholas Piggin @ 2018-07-03  1:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicholas Piggin, linux-mm, ppc-dev, linux-arch, Aneesh Kumar K. V,
	Minchan Kim, Mel Gorman, Nadav Amit, Andrew Morton

The mmu_gather APIs keep track of the invalidated address range
including the span covered by invalidated page table pages. Ranges
covered by page tables but not ptes (and therefore no TLBs) still need
to be invalidated because some architectures (x86) can cache
intermediate page table entries, and invalidate those with normal TLB
invalidation instructions to be almost-backward-compatible.

Architectures which don't cache intermediate page table entries, or
which invalidate these caches separately from TLB invalidation, do not
require TLB invalidation range expanded over page tables.

Allow architectures to supply their own p??_free_tlb functions, which
can avoid the __tlb_adjust_range.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
Just wanted your ack/nack on this approach, I just tidied the patch
and re-did the changelog. We left off with you wondering if overriding
__tlb_adjust_range for page tables would be the better option, but I
couldn't see any real benefit over this way. Actually I think this is
cleaner, powerpc will simply switch the name of its function from
__pte_free_tlb to pte_free_tlb to take over the tlb management for it.

And is this something that you'd merge at this point of the cycle, so
that arch changes for next window won't include generic code changes or
have cross tree dependencies?

Thanks,
Nick

 include/asm-generic/tlb.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index faddde44de8c..3063125197ad 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -265,33 +265,41 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
  * For now w.r.t page table cache, mark the range_size as PAGE_SIZE
  */
 
+#ifndef pte_free_tlb
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
+#endif
 
+#ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);		\
 		__pmd_free_tlb(tlb, pmdp, address);		\
 	} while (0)
+#endif
 
 #ifndef __ARCH_HAS_4LEVEL_HACK
+#ifndef pud_free_tlb
 #define pud_free_tlb(tlb, pudp, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
 		__pud_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
+#endif
 
 #ifndef __ARCH_HAS_5LEVEL_HACK
+#ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);		\
 		__p4d_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
+#endif
 
 #define tlb_migrate_finish(mm) do {} while (0)
 
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH v5 5/7] powerpc/pseries: flush SLB contents on SLB MCE errors.
From: Nicholas Piggin @ 2018-07-02 22:08 UTC (permalink / raw)
  To: Mahesh J Salgaonkar
  Cc: linuxppc-dev, Aneesh Kumar K.V, Laurent Dufour, Michal Suchanek
In-Reply-To: <153051042206.30541.2156877677180900261.stgit@jupiter.in.ibm.com>

On Mon, 02 Jul 2018 11:17:06 +0530
Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:

> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> 
> On pseries, as of today system crashes if we get a machine check
> exceptions due to SLB errors. These are soft errors and can be fixed by
> flushing the SLBs so the kernel can continue to function instead of
> system crash. We do this in real mode before turning on MMU. Otherwise
> we would run into nested machine checks. This patch now fetches the
> rtas error log in real mode and flushes the SLBs on SLB errors.
> 
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |    1 
>  arch/powerpc/include/asm/machdep.h            |    1 
>  arch/powerpc/kernel/exceptions-64s.S          |   42 +++++++++++++++++++++
>  arch/powerpc/kernel/mce.c                     |   16 +++++++-
>  arch/powerpc/mm/slb.c                         |    6 +++
>  arch/powerpc/platforms/powernv/opal.c         |    1 
>  arch/powerpc/platforms/pseries/pseries.h      |    1 
>  arch/powerpc/platforms/pseries/ras.c          |   51 +++++++++++++++++++++++++
>  arch/powerpc/platforms/pseries/setup.c        |    1 
>  9 files changed, 116 insertions(+), 4 deletions(-)
> 


> +TRAMP_REAL_BEGIN(machine_check_pSeries_early)
> +BEGIN_FTR_SECTION
> +	EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
> +	mr	r10,r1			/* Save r1 */
> +	ld	r1,PACAMCEMERGSP(r13)	/* Use MC emergency stack */
> +	subi	r1,r1,INT_FRAME_SIZE	/* alloc stack frame		*/
> +	mfspr	r11,SPRN_SRR0		/* Save SRR0 */
> +	mfspr	r12,SPRN_SRR1		/* Save SRR1 */
> +	EXCEPTION_PROLOG_COMMON_1()
> +	EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
> +	EXCEPTION_PROLOG_COMMON_3(0x200)
> +	addi	r3,r1,STACK_FRAME_OVERHEAD
> +	BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI */

Is there any reason you can't use the existing
machine_check_powernv_early code to do all this?

> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index efdd16a79075..221271c96a57 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -488,9 +488,21 @@ long machine_check_early(struct pt_regs *regs)
>  {
>  	long handled = 0;
>  
> -	__this_cpu_inc(irq_stat.mce_exceptions);
> +	/*
> +	 * For pSeries we count mce when we go into virtual mode machine
> +	 * check handler. Hence skip it. Also, We can't access per cpu
> +	 * variables in real mode for LPAR.
> +	 */
> +	if (early_cpu_has_feature(CPU_FTR_HVMODE))
> +		__this_cpu_inc(irq_stat.mce_exceptions);
>  
> -	if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
> +	/*
> +	 * See if platform is capable of handling machine check.
> +	 * Otherwise fallthrough and allow CPU to handle this machine check.
> +	 */
> +	if (ppc_md.machine_check_early)
> +		handled = ppc_md.machine_check_early(regs);
> +	else if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
>  		handled = cur_cpu_spec->machine_check_early(regs);

Would be good to add a powernv ppc_md handler which does the
cur_cpu_spec->machine_check_early() call now that other platforms are
calling this code. Because those aren't valid as a fallback call, but
specific to powernv.

> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index 48fbb41af5d1..ed548d40a9e1 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -417,7 +417,6 @@ static int opal_recover_mce(struct pt_regs *regs,
>  
>  	if (!(regs->msr & MSR_RI)) {
>  		/* If MSR_RI isn't set, we cannot recover */
> -		pr_err("Machine check interrupt unrecoverable: MSR(RI=0)\n");

What's the reason for this change?

>  		recovered = 0;
>  	} else if (evt->disposition == MCE_DISPOSITION_RECOVERED) {
>  		/* Platform corrected itself */
> diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
> index 60db2ee511fb..3611db5dd583 100644
> --- a/arch/powerpc/platforms/pseries/pseries.h
> +++ b/arch/powerpc/platforms/pseries/pseries.h
> @@ -24,6 +24,7 @@ struct pt_regs;
>  
>  extern int pSeries_system_reset_exception(struct pt_regs *regs);
>  extern int pSeries_machine_check_exception(struct pt_regs *regs);
> +extern int pSeries_machine_check_realmode(struct pt_regs *regs);
>  
>  #ifdef CONFIG_SMP
>  extern void smp_init_pseries(void);
> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
> index 851ce326874a..9aa7885e0148 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -427,6 +427,35 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
>  	return 0; /* need to perform reset */
>  }
>  
> +static int mce_handle_error(struct rtas_error_log *errp)
> +{
> +	struct pseries_errorlog *pseries_log;
> +	struct pseries_mc_errorlog *mce_log;
> +	int disposition = rtas_error_disposition(errp);
> +	uint8_t error_type;
> +
> +	if (!rtas_error_extended(errp))
> +		goto out;
> +
> +	pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
> +	if (pseries_log == NULL)
> +		goto out;
> +
> +	mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
> +	error_type = rtas_mc_error_type(mce_log);
> +
> +	if ((disposition == RTAS_DISP_NOT_RECOVERED) &&
> +			(error_type == PSERIES_MC_ERROR_TYPE_SLB)) {
> +		/* Store the old slb content someplace. */
> +		slb_flush_and_rebolt_realmode();
> +		disposition = RTAS_DISP_FULLY_RECOVERED;
> +		rtas_set_disposition_recovered(errp);
> +	}
> +
> +out:
> +	return disposition;
> +}
> +
>  /*
>   * Process MCE rtas errlog event.
>   */
> @@ -503,11 +532,31 @@ int pSeries_machine_check_exception(struct pt_regs *regs)
>  	struct rtas_error_log *errp;
>  
>  	if (fwnmi_active) {
> -		errp = fwnmi_get_errinfo(regs);
>  		fwnmi_release_errinfo();

Should the fwnmi_release_errinfo be done in the realmode path as well
now, or is there some reason to leave it here?

> +		errp = fwnmi_get_errlog();
>  		if (errp && recover_mce(regs, errp))
>  			return 1;
>  	}
>  
>  	return 0;
>  }
> +
> +int pSeries_machine_check_realmode(struct pt_regs *regs)
> +{
> +	struct rtas_error_log *errp;
> +	int disposition;
> +
> +	if (fwnmi_active) {
> +		errp = fwnmi_get_errinfo(regs);
> +		/*
> +		 * Call to fwnmi_release_errinfo() in real mode causes kernel
> +		 * to panic. Hence we will call it as soon as we go into
> +		 * virtual mode.
> +		 */
> +		disposition = mce_handle_error(errp);
> +		if (disposition == RTAS_DISP_FULLY_RECOVERED)
> +			return 1;
> +	}
> +
> +	return 0;
> +}
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index 60a067a6e743..249b02bc5c41 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -999,6 +999,7 @@ define_machine(pseries) {
>  	.calibrate_decr		= generic_calibrate_decr,
>  	.progress		= rtas_progress,
>  	.system_reset_exception = pSeries_system_reset_exception,
> +	.machine_check_early	= pSeries_machine_check_realmode,
>  	.machine_check_exception = pSeries_machine_check_exception,
>  #ifdef CONFIG_KEXEC_CORE
>  	.machine_kexec          = pSeries_machine_kexec,
> 

^ permalink raw reply

* Re: [Update] Regression in 4.18 - 32-bit PowerPC crashes on boot - bisected to commit 1d40a5ea01d5
From: Larry Finger @ 2018-07-02 20:51 UTC (permalink / raw)
  To: Michael Ellerman, Linus Torvalds
  Cc: Matthew Wilcox, Kirill A. Shutemov, Vlastimil Babka,
	Christoph Lameter, Dave Hansen, Jerome Glisse, Lai Jiangshan,
	Martin Schwidefsky, Pekka Enberg, Randy Dunlap, Andrey Ryabinin,
	Andrew Morton, Benjamin Herrenschmidt, Paul Mackerras, ppc-dev,
	Linux Kernel Mailing List
In-Reply-To: <87d0w6mfvi.fsf@concordia.ellerman.id.au>

On 07/01/2018 11:16 PM, Michael Ellerman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>> On Fri, Jun 29, 2018 at 1:42 PM Larry Finger <Larry.Finger@lwfinger.net> wrote:
>>>
>>> I have more information regarding this BUG. Line 700 of page-flags.h is the
>>> macro PAGE_TYPE_OPS(Table, table). For further debugging, I manually expanded
>>> the macro, and found that the bug line is VM_BUG_ON_PAGE(!PageTable(page), page)
>>> in routine __ClearPageTable(), which is called from pgtable_page_dtor() in
>>> include/linux/mm.h. I also added a printk call to PageTable() that logs
>>> page->page_type. The routine was called twice. The first had page_type of
>>> 0xfffffbff, which would have been expected for a . The second call had
>>> 0xffffffff, which led to the BUG.
>>
>> So it looks to me like the tear-down of the page tables first found a
>> page that is indeed a page table, and cleared the page table bit
>> (well, it set it - the bits are reversed).
> ...
>>
>> That said, can some ppc person who knows the 32-bit ppc code and maybe
>> knows what that "interrupt: 700" means talk about that oddity in the
>> trace, please?
> 
> I think everyone else answered your questions here, and it should be
> fixed now in your tree.
> 
> Larry let me know if you're still seeing a crash with 4.18-rc3.

The problem is fixed in 4.18-rc3. Thanks to all that helped.

Larry

^ permalink raw reply

* Re: [v3 PATCH 5/5] powerpc/pseries: Display machine check error details.
From: Michal Suchánek @ 2018-07-02 18:01 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Mahesh J Salgaonkar, linuxppc-dev, Laurent Dufour,
	Aneesh Kumar K.V
In-Reply-To: <20180608115136.7a6db415@roar.ozlabs.ibm.com>

On Fri, 8 Jun 2018 11:51:36 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> On Thu, 07 Jun 2018 22:59:04 +0530
> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> wrote:
> 
> > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> > 
> > Extract the MCE error details from RTAS extended log and display it
> > to console.
> > 
> > With this patch you should now see mce logs like below:
> > 
> > [  142.371818] Severe Machine check interrupt [Recovered]
> > [  142.371822]   NIP [d00000000ca301b8]: init_module+0x1b8/0x338
> > [bork_kernel] [  142.371822]   Initiator: CPU
> > [  142.371823]   Error type: SLB [Multihit]
> > [  142.371824]     Effective address: d00000000ca70000
> > 
> > Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/include/asm/rtas.h      |    5 +
> >  arch/powerpc/platforms/pseries/ras.c |  128
> > +++++++++++++++++++++++++++++++++- 2 files changed, 131
> > insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/rtas.h
> > b/arch/powerpc/include/asm/rtas.h index 3f2fba7ef23b..8100a95c133a
> > 100644 --- a/arch/powerpc/include/asm/rtas.h
> > +++ b/arch/powerpc/include/asm/rtas.h
> > @@ -190,6 +190,11 @@ static inline uint8_t
> > rtas_error_extended(const struct rtas_error_log *elog) return
> > (elog->byte1 & 0x04) >> 2; }
> >  
> > +static inline uint8_t rtas_error_initiator(const struct
> > rtas_error_log *elog) +{
> > +	return (elog->byte2 & 0xf0) >> 4;
> > +}
> > +
> >  #define rtas_error_type(x)	((x)->byte3)
> >  
> >  static inline
> > diff --git a/arch/powerpc/platforms/pseries/ras.c
> > b/arch/powerpc/platforms/pseries/ras.c index
> > e56759d92356..cd9446980092 100644 ---
> > a/arch/powerpc/platforms/pseries/ras.c +++
> > b/arch/powerpc/platforms/pseries/ras.c @@ -422,7 +422,130 @@ int
> > pSeries_system_reset_exception(struct pt_regs *regs) return 0; /*
> > need to perform reset */ }
> >  
> > -static int mce_handle_error(struct rtas_error_log *errp)
> > +#define VAL_TO_STRING(ar, val)	((val < ARRAY_SIZE(ar)) ?
> > ar[val] : "Unknown") +
> > +static void pseries_print_mce_info(struct pt_regs *regs,
> > +				struct rtas_error_log *errp, int
> > disposition) +{
> > +	const char *level, *sevstr;
> > +	struct pseries_errorlog *pseries_log;
> > +	struct pseries_mc_errorlog *mce_log;
> > +	uint8_t error_type, err_sub_type;
> > +	uint8_t initiator = rtas_error_initiator(errp);
> > +	uint64_t addr;
> > +
> > +	static const char * const initiators[] = {
> > +		"Unknown",
> > +		"CPU",
> > +		"PCI",
> > +		"ISA",
> > +		"Memory",
> > +		"Power Mgmt",
> > +	};
> > +	static const char * const mc_err_types[] = {
> > +		"UE",
> > +		"SLB",
> > +		"ERAT",
> > +		"TLB",
> > +		"D-Cache",
> > +		"Unknown",
> > +		"I-Cache",
> > +	};
> > +	static const char * const mc_ue_types[] = {
> > +		"Indeterminate",
> > +		"Instruction fetch",
> > +		"Page table walk ifetch",
> > +		"Load/Store",
> > +		"Page table walk Load/Store",
> > +	};
> > +
> > +	/* SLB sub errors valid values are 0x0, 0x1, 0x2 */
> > +	static const char * const mc_slb_types[] = {
> > +		"Parity",
> > +		"Multihit",
> > +		"Indeterminate",
> > +	};
> > +
> > +	/* TLB and ERAT sub errors valid values are 0x1, 0x2, 0x3
> > */
> > +	static const char * const mc_soft_types[] = {
> > +		"Unknown",
> > +		"Parity",
> > +		"Multihit",
> > +		"Indeterminate",
> > +	};
> > +
> > +	pseries_log = get_pseries_errorlog(errp,
> > PSERIES_ELOG_SECT_ID_MCE);
> > +	if (pseries_log == NULL)
> > +		return;
> > +
> > +	mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
> > +
> > +	error_type = rtas_mc_error_type(mce_log);
> > +	err_sub_type = rtas_mc_error_sub_type(mce_log);
> > +
> > +	switch (rtas_error_severity(errp)) {
> > +	case RTAS_SEVERITY_NO_ERROR:
> > +		level = KERN_INFO;
> > +		sevstr = "Harmless";
> > +		break;
> > +	case RTAS_SEVERITY_WARNING:
> > +		level = KERN_WARNING;
> > +		sevstr = "";
> > +		break;
> > +	case RTAS_SEVERITY_ERROR:
> > +	case RTAS_SEVERITY_ERROR_SYNC:
> > +		level = KERN_ERR;
> > +		sevstr = "Severe";
> > +		break;
> > +	case RTAS_SEVERITY_FATAL:
> > +	default:
> > +		level = KERN_ERR;
> > +		sevstr = "Fatal";
> > +		break;
> > +	}
> > +
> > +	printk("%s%s Machine check interrupt [%s]\n", level,
> > sevstr,
> > +		disposition == RTAS_DISP_FULLY_RECOVERED ?
> > +		"Recovered" : "Not recovered");
> > +	if (user_mode(regs)) {
> > +		printk("%s  NIP: [%016lx] PID: %d Comm: %s\n",
> > level,
> > +			regs->nip, current->pid, current->comm);
> > +	} else {
> > +		printk("%s  NIP [%016lx]: %pS\n", level, regs->nip,
> > +			(void *)regs->nip);
> > +	}  
> 
> I think it's probably still useful to print pid/comm for kernel mode
> faults if !in_interrupt()... I see you're basically taking
> kernel/mce.c and doing the same thing.
> 
> Is there any reasonable way to share code here?
> 

I don't think so. In commit 36df96f8acaf ("powerpc/book3s: Decode and
save machine check event.") these enums are added:

enum MCE_ErrorType {
        MCE_ERROR_TYPE_UNKNOWN = 0,
        MCE_ERROR_TYPE_UE = 1,
        MCE_ERROR_TYPE_SLB = 2,
        MCE_ERROR_TYPE_ERAT = 3,
        MCE_ERROR_TYPE_TLB = 4,
};

enum MCE_UeErrorType {
        MCE_UE_ERROR_INDETERMINATE = 0,
        MCE_UE_ERROR_IFETCH = 1,
        MCE_UE_ERROR_PAGE_TABLE_WALK_IFETCH = 2,
        MCE_UE_ERROR_LOAD_STORE = 3,
        MCE_UE_ERROR_PAGE_TABLE_WALK_LOAD_STORE = 4,
};

enum MCE_SlbErrorType {
        MCE_SLB_ERROR_INDETERMINATE = 0,
        MCE_SLB_ERROR_PARITY = 1,
        MCE_SLB_ERROR_MULTIHIT = 2,
};

enum MCE_EratErrorType {
        MCE_ERAT_ERROR_INDETERMINATE = 0,
        MCE_ERAT_ERROR_PARITY = 1,
        MCE_ERAT_ERROR_MULTIHIT = 2,
};

enum MCE_TlbErrorType {
        MCE_TLB_ERROR_INDETERMINATE = 0,
        MCE_TLB_ERROR_PARITY = 1,
        MCE_TLB_ERROR_MULTIHIT = 2,
};

And the patch in the series adds slightly different definitions:

/* RTAS pseries MCE error types */
#define PSERIES_MC_ERROR_TYPE_UE                0x00
#define PSERIES_MC_ERROR_TYPE_SLB               0x01
#define PSERIES_MC_ERROR_TYPE_ERAT              0x02
#define PSERIES_MC_ERROR_TYPE_TLB               0x04
#define PSERIES_MC_ERROR_TYPE_D_CACHE           0x05
#define PSERIES_MC_ERROR_TYPE_I_CACHE           0x07

/* RTAS pseries MCE error sub types */
#define PSERIES_MC_ERROR_UE_INDETERMINATE               0
#define PSERIES_MC_ERROR_UE_IFETCH                      1
#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_IFETCH      2
#define PSERIES_MC_ERROR_UE_LOAD_STORE                  3
#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_LOAD_STORE  4

#define PSERIES_MC_ERROR_SLB_PARITY             0
#define PSERIES_MC_ERROR_SLB_MULTIHIT           1
#define PSERIES_MC_ERROR_SLB_INDETERMINATE      2

#define PSERIES_MC_ERROR_ERAT_PARITY            1
#define PSERIES_MC_ERROR_ERAT_MULTIHIT          2
#define PSERIES_MC_ERROR_ERAT_INDETERMINATE     3

#define PSERIES_MC_ERROR_TLB_PARITY             1
#define PSERIES_MC_ERROR_TLB_MULTIHIT           2
#define PSERIES_MC_ERROR_TLB_INDETERMINATE      3


If the MCEs are indeed intentionally different between pSeries and
powernv it might be worth mentioning somewhere.

Thanks

Michal

^ permalink raw reply

* Re: [PATCH 0/9] Fix references for some missing documentation files
From: Jonathan Corbet @ 2018-07-02 17:27 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Linux Doc Mailing List, Mauro Carvalho Chehab, linux-kernel,
	Jacek Anaszewski, devicetree, Ingo Molnar, linux-kernel,
	Andrew Morton, linux-leds, intel-wired-lan, Mark Rutland,
	linux-gpio, David S. Miller, James Morris, Jeff Kirsher,
	Changbin Du, Masami Hiramatsu, netdev, Steven Rostedt,
	linux-input, linuxppc-dev, linux-scsi, kvm, virtualization,
	Andy Whitcroft, Joe Perches
In-Reply-To: <cover.1530005114.git.mchehab+samsung@kernel.org>

On Tue, 26 Jun 2018 06:49:02 -0300
Mauro Carvalho Chehab <mchehab+samsung@kernel.org> wrote:

> Having nothing to do while waiting for my plane to arrive while
> returning back from Japan, I ended by writing a small series of 
> patches meant to reduce the number of bad Documentation/* 
> links that are detected by:
> 	./scripts/documentation-file-ref-check

I've applied everything except the two networking patches, since I expect
those to go through Dave's tree.

Thanks,

jon

^ permalink raw reply

* Re: [PATCH] powerpc: mpc5200: Remove VLA usage
From: Segher Boessenkool @ 2018-07-02 17:16 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Kees Cook, Arnd Bergmann, Paul Mackerras, Anatolij Gustschin,
	linuxppc-dev, Linux Kernel Mailing List
In-Reply-To: <87fu12mnf7.fsf@concordia.ellerman.id.au>

On Mon, Jul 02, 2018 at 11:33:32AM +1000, Michael Ellerman wrote:
> What if we write it:
> 
>        char saved_0x500[0x600 - 0x500];
> 
> Hopefully the compiler is smart enough not to generate a VLA for that :)

It is a VLA if the array size is not an integer constant expression.  This
is defined by C; the compiler has nothing to do with it.  0x600-0x500 is
an integer constant expression, so this is not a VLA.

But if you meant if GCC will ever do a dynamic stack allocation for a fixed
size local variable: yes indeed, I hope not!

(Sometimes GCC can avoid this even with VLAs; but in this example we do
not even have a VLA, so it's easier than that :-) )

Segher

^ permalink raw reply

* [PATCH v2] powerpc: mpc5200: Remove VLA usage
From: Kees Cook @ 2018-07-02 15:56 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Arnd Bergmann, Anatolij Gustschin, Benjamin Herrenschmidt,
	Paul Mackerras, linux-kernel, linuxppc-dev

In the quest to remove all stack VLA usage from the kernel[1], this
switches to using a stack size large enough for the saved routine and
adds a sanity check making sure the routine doesn't overflow into the
0x600 exception handler.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
---
v2: use "0x600-0x500" for size calculation to illustrate handler sizes
---
 arch/powerpc/platforms/52xx/mpc52xx_pm.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/52xx/mpc52xx_pm.c b/arch/powerpc/platforms/52xx/mpc52xx_pm.c
index 31d3515672f3..b1d208ded981 100644
--- a/arch/powerpc/platforms/52xx/mpc52xx_pm.c
+++ b/arch/powerpc/platforms/52xx/mpc52xx_pm.c
@@ -117,7 +117,10 @@ int mpc52xx_pm_enter(suspend_state_t state)
 	u32 intr_main_mask;
 	void __iomem * irq_0x500 = (void __iomem *)CONFIG_KERNEL_START + 0x500;
 	unsigned long irq_0x500_stop = (unsigned long)irq_0x500 + mpc52xx_ds_cached_size;
-	char saved_0x500[mpc52xx_ds_cached_size];
+	char saved_0x500[0x600-0x500];
+
+	if (WARN_ON(mpc52xx_ds_cached_size > sizeof(saved_0x500)))
+		return -ENOMEM;
 
 	/* disable all interrupts in PIC */
 	intr_main_mask = in_be32(&intr->main_mask);
-- 
2.17.1


-- 
Kees Cook
Pixel Security

^ permalink raw reply related

* Re: [PATCH] powerpc: mpc5200: Remove VLA usage
From: Kees Cook @ 2018-07-02 15:52 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Arnd Bergmann, linuxppc-dev, Anatolij Gustschin, Paul Mackerras,
	Linux Kernel Mailing List
In-Reply-To: <87fu12mnf7.fsf@concordia.ellerman.id.au>

On Sun, Jul 1, 2018 at 6:33 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> Kees Cook <keescook@chromium.org> writes:
>
>> On Fri, Jun 29, 2018 at 2:02 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>>> On Fri, Jun 29, 2018 at 8:53 PM, Kees Cook <keescook@chromium.org> wrote:
>>>> In the quest to remove all stack VLA usage from the kernel[1], this
>>>> switches to using a stack size large enough for the saved routine and
>>>> adds a sanity check.
>>>>
>>>> [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
>>>>
>>>> Signed-off-by: Kees Cook <keescook@chromium.org>
>>>
>>> This seems particularly nice, not only avoids it the dynamic stack
>>> allocation, it
>>> also makes sure the new 0x500 handler doesn't overflow into the 0x600
>>> exception handler.
>>>
>>> It would help to explain how you arrived at that '256 byte' number in
>>> the changelog though.
>>
>> Honestly, I just counted instructions, multiplied by 8 and rounded up
>> to the next nearest power of 2, and the result felt right for giving
>> some level of flexibility for code growth before tripping the WARN. :P
>>
>> I'm happy to adjust, of course. :)
>
> What if we write it:
>
>        char saved_0x500[0x600 - 0x500];
>
> Hopefully the compiler is smart enough not to generate a VLA for that :)

Sure, that's fine. I'll send an updated patch.

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply

* [PATCH v2 2/2] powernv/cpuidle: Use parsed device tree values for cpuidle_init
From: Akshay Adiga @ 2018-07-02 14:23 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, linux-pm
  Cc: rjw, stewart, benh, svaidy, ego, npiggin, mpe, Akshay Adiga
In-Reply-To: <1530541401-19726-1-git-send-email-akshay.adiga@linux.vnet.ibm.com>

Export pnv_idle_states and nr_pnv_idle_states so that its accessible to
cpuidle driver. Use properties from pnv_idle_states structure for powernv
cpuidle_init.

Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/cpuidle.h |   2 +
 drivers/cpuidle/cpuidle-powernv.c  | 143 +++++------------------------
 2 files changed, 26 insertions(+), 119 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index 574b0ce1d671..43e5f31fe64d 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -90,6 +90,8 @@ struct pnv_idle_states_t {
 	bool valid;
 };
 
+extern struct pnv_idle_states_t *pnv_idle_states;
+extern int nr_pnv_idle_states;
 extern u32 pnv_fastsleep_workaround_at_entry[];
 extern u32 pnv_fastsleep_workaround_at_exit[];
 
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 7ab613d4dca1..ec93b2ae7b17 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -242,6 +242,7 @@ static inline void add_powernv_state(int index, const char *name,
 	powernv_states[index].target_residency = target_residency;
 	powernv_states[index].exit_latency = exit_latency;
 	powernv_states[index].enter = idle_fn;
+	/* For power8 and below psscr_* will be 0 */
 	stop_psscr_table[index].val = psscr_val;
 	stop_psscr_table[index].mask = psscr_mask;
 }
@@ -263,179 +264,84 @@ static inline int validate_dt_prop_sizes(const char *prop1, int prop1_len,
 extern u32 pnv_get_supported_cpuidle_states(void);
 static int powernv_add_idle_states(void)
 {
-	struct device_node *power_mgt;
 	int nr_idle_states = 1; /* Snooze */
-	int dt_idle_states, count;
-	u32 latency_ns[CPUIDLE_STATE_MAX];
-	u32 residency_ns[CPUIDLE_STATE_MAX];
-	u32 flags[CPUIDLE_STATE_MAX];
-	u64 psscr_val[CPUIDLE_STATE_MAX];
-	u64 psscr_mask[CPUIDLE_STATE_MAX];
-	const char *names[CPUIDLE_STATE_MAX];
+	int dt_idle_states;
 	u32 has_stop_states = 0;
-	int i, rc;
+	int i;
 	u32 supported_flags = pnv_get_supported_cpuidle_states();
 
 
 	/* Currently we have snooze statically defined */
-
-	power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
-	if (!power_mgt) {
-		pr_warn("opal: PowerMgmt Node not found\n");
+	if (nr_pnv_idle_states <= 0) {
+		pr_warn("cpuidle-powernv : Only Snooze is available\n");
 		goto out;
 	}
 
-	/* Read values of any property to determine the num of idle states */
-	dt_idle_states = of_property_count_u32_elems(power_mgt, "ibm,cpu-idle-state-flags");
-	if (dt_idle_states < 0) {
-		pr_warn("cpuidle-powernv: no idle states found in the DT\n");
-		goto out;
-	}
-
-	count = of_property_count_u32_elems(power_mgt,
-					    "ibm,cpu-idle-state-latencies-ns");
-
-	if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags", dt_idle_states,
-				   "ibm,cpu-idle-state-latencies-ns",
-				   count) != 0)
-		goto out;
-
-	count = of_property_count_strings(power_mgt,
-					  "ibm,cpu-idle-state-names");
-	if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags", dt_idle_states,
-				   "ibm,cpu-idle-state-names",
-				   count) != 0)
-		goto out;
+	/* TODO: Count only states which are eligible for cpuidle */
+	dt_idle_states = nr_pnv_idle_states;
 
 	/*
 	 * Since snooze is used as first idle state, max idle states allowed is
 	 * CPUIDLE_STATE_MAX -1
 	 */
-	if (dt_idle_states > CPUIDLE_STATE_MAX - 1) {
+	if (nr_pnv_idle_states > CPUIDLE_STATE_MAX - 1) {
 		pr_warn("cpuidle-powernv: discovered idle states more than allowed");
 		dt_idle_states = CPUIDLE_STATE_MAX - 1;
 	}
 
-	if (of_property_read_u32_array(power_mgt,
-			"ibm,cpu-idle-state-flags", flags, dt_idle_states)) {
-		pr_warn("cpuidle-powernv : missing ibm,cpu-idle-state-flags in DT\n");
-		goto out;
-	}
-
-	if (of_property_read_u32_array(power_mgt,
-		"ibm,cpu-idle-state-latencies-ns", latency_ns,
-		dt_idle_states)) {
-		pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-latencies-ns in DT\n");
-		goto out;
-	}
-	if (of_property_read_string_array(power_mgt,
-		"ibm,cpu-idle-state-names", names, dt_idle_states) < 0) {
-		pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-names in DT\n");
-		goto out;
-	}
-
 	/*
 	 * If the idle states use stop instruction, probe for psscr values
 	 * and psscr mask which are necessary to specify required stop level.
 	 */
-	has_stop_states = (flags[0] &
+	has_stop_states = (pnv_idle_states[0].flags &
 			   (OPAL_PM_STOP_INST_FAST | OPAL_PM_STOP_INST_DEEP));
-	if (has_stop_states) {
-		count = of_property_count_u64_elems(power_mgt,
-						    "ibm,cpu-idle-state-psscr");
-		if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags",
-					   dt_idle_states,
-					   "ibm,cpu-idle-state-psscr",
-					   count) != 0)
-			goto out;
-
-		count = of_property_count_u64_elems(power_mgt,
-						    "ibm,cpu-idle-state-psscr-mask");
-		if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags",
-					   dt_idle_states,
-					   "ibm,cpu-idle-state-psscr-mask",
-					   count) != 0)
-			goto out;
-
-		if (of_property_read_u64_array(power_mgt,
-		    "ibm,cpu-idle-state-psscr", psscr_val, dt_idle_states)) {
-			pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-psscr in DT\n");
-			goto out;
-		}
-
-		if (of_property_read_u64_array(power_mgt,
-					       "ibm,cpu-idle-state-psscr-mask",
-						psscr_mask, dt_idle_states)) {
-			pr_warn("cpuidle-powernv:Missing ibm,cpu-idle-state-psscr-mask in DT\n");
-			goto out;
-		}
-	}
-
-	count = of_property_count_u32_elems(power_mgt,
-					    "ibm,cpu-idle-state-residency-ns");
-
-	if (count < 0) {
-		rc = count;
-	} else if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags",
-					  dt_idle_states,
-					  "ibm,cpu-idle-state-residency-ns",
-					  count) != 0) {
-		goto out;
-	} else {
-		rc = of_property_read_u32_array(power_mgt,
-						"ibm,cpu-idle-state-residency-ns",
-						residency_ns, dt_idle_states);
-	}
 
 	for (i = 0; i < dt_idle_states; i++) {
 		unsigned int exit_latency, target_residency;
 		bool stops_timebase = false;
+		struct pnv_idle_states_t *state = &pnv_idle_states[i];
 
 		/*
 		 * Skip the platform idle state whose flag isn't in
 		 * the supported_cpuidle_states flag mask.
 		 */
-		if ((flags[i] & supported_flags) != flags[i])
+		if ((state->flags & supported_flags) != state->flags)
 			continue;
 		/*
 		 * If an idle state has exit latency beyond
 		 * POWERNV_THRESHOLD_LATENCY_NS then don't use it
 		 * in cpu-idle.
 		 */
-		if (latency_ns[i] > POWERNV_THRESHOLD_LATENCY_NS)
+		if (state->latency_ns > POWERNV_THRESHOLD_LATENCY_NS)
 			continue;
 		/*
 		 * Firmware passes residency and latency values in ns.
 		 * cpuidle expects it in us.
 		 */
-		exit_latency = DIV_ROUND_UP(latency_ns[i], 1000);
-		if (!rc)
-			target_residency = DIV_ROUND_UP(residency_ns[i], 1000);
-		else
-			target_residency = 0;
+		exit_latency = DIV_ROUND_UP(state->latency_ns, 1000);
+		target_residency = DIV_ROUND_UP(state->residency_ns, 1000);
 
 		if (has_stop_states && !(state->valid))
 			continue;
 
-		if (flags[i] & OPAL_PM_TIMEBASE_STOP)
+		if (state->flags & OPAL_PM_TIMEBASE_STOP)
 			stops_timebase = true;
 
 		/*
 		 * For nap and fastsleep, use default target_residency
 		 * values if f/w does not expose it.
 		 */
-		if (flags[i] & OPAL_PM_NAP_ENABLED) {
-			if (!rc)
-				target_residency = 100;
+		if (state->flags & OPAL_PM_NAP_ENABLED) {
 			/* Add NAP state */
 			add_powernv_state(nr_idle_states, "Nap",
 					  CPUIDLE_FLAG_NONE, nap_loop,
 					  target_residency, exit_latency, 0, 0);
 		} else if (has_stop_states && !stops_timebase) {
-			add_powernv_state(nr_idle_states, names[i],
+			add_powernv_state(nr_idle_states, state->name,
 					  CPUIDLE_FLAG_NONE, stop_loop,
 					  target_residency, exit_latency,
-					  psscr_val[i], psscr_mask[i]);
+					  state->psscr_val,
+					  state->psscr_mask);
 		}
 
 		/*
@@ -443,20 +349,19 @@ static int powernv_add_idle_states(void)
 		 * within this config dependency check.
 		 */
 #ifdef CONFIG_TICK_ONESHOT
-		else if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
-			 flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
-			if (!rc)
-				target_residency = 300000;
+		else if (state->flags & OPAL_PM_SLEEP_ENABLED ||
+			 state->flags & OPAL_PM_SLEEP_ENABLED_ER1) {
 			/* Add FASTSLEEP state */
 			add_powernv_state(nr_idle_states, "FastSleep",
 					  CPUIDLE_FLAG_TIMER_STOP,
 					  fastsleep_loop,
 					  target_residency, exit_latency, 0, 0);
 		} else if (has_stop_states && stops_timebase) {
-			add_powernv_state(nr_idle_states, names[i],
+			add_powernv_state(nr_idle_states, state->name,
 					  CPUIDLE_FLAG_TIMER_STOP, stop_loop,
 					  target_residency, exit_latency,
-					  psscr_val[i], psscr_mask[i]);
+					  state->psscr_val,
+					  state->psscr_mask);
 		}
 #endif
 		else
-- 
2.18.0.rc2.85.g1fb9df7

^ permalink raw reply related

* [PATCH v2 1/2] powernv/cpuidle: Parse dt idle properties into global structure
From: Akshay Adiga @ 2018-07-02 14:23 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, linux-pm
  Cc: rjw, stewart, benh, svaidy, ego, npiggin, mpe, Akshay Adiga
In-Reply-To: <1530541401-19726-1-git-send-email-akshay.adiga@linux.vnet.ibm.com>

Device-tree parsing happens twice, once while deciding idle state to be
used for hotplug and once during cpuidle init. Hence, parsing the device
tree and caching it will reduce code duplication. Parsing code has been
moved to pnv_parse_cpuidle_dt() from pnv_probe_idle_states(). In addition
to the properties in the device tree the number of available states is
also required.

Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/cpuidle.h    |  11 ++
 arch/powerpc/platforms/powernv/idle.c | 216 ++++++++++++++++----------
 drivers/cpuidle/cpuidle-powernv.c     |  11 +-
 3 files changed, 151 insertions(+), 87 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index e210a83eb196..574b0ce1d671 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -79,6 +79,17 @@ struct stop_sprs {
 	u64 mmcra;
 };
 
+#define PNV_IDLE_NAME_LEN    16
+struct pnv_idle_states_t {
+	char name[PNV_IDLE_NAME_LEN];
+	u32 latency_ns;
+	u32 residency_ns;
+	u64 psscr_val;
+	u64 psscr_mask;
+	u32 flags;
+	bool valid;
+};
+
 extern u32 pnv_fastsleep_workaround_at_entry[];
 extern u32 pnv_fastsleep_workaround_at_exit[];
 
diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 1c5d0675b43c..7cf71b3e03a1 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -36,6 +36,8 @@
 #define P9_STOP_SPR_PSSCR      855
 
 static u32 supported_cpuidle_states;
+struct pnv_idle_states_t *pnv_idle_states;
+int nr_pnv_idle_states;
 
 /*
  * The default stop state that will be used by ppc_md.power_save
@@ -622,48 +624,10 @@ int validate_psscr_val_mask(u64 *psscr_val, u64 *psscr_mask, u32 flags)
  * @dt_idle_states: Number of idle state entries
  * Returns 0 on success
  */
-static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
-					int dt_idle_states)
+static int __init pnv_power9_idle_init(void)
 {
-	u64 *psscr_val = NULL;
-	u64 *psscr_mask = NULL;
-	u32 *residency_ns = NULL;
 	u64 max_residency_ns = 0;
-	int rc = 0, i;
-
-	psscr_val = kcalloc(dt_idle_states, sizeof(*psscr_val), GFP_KERNEL);
-	psscr_mask = kcalloc(dt_idle_states, sizeof(*psscr_mask), GFP_KERNEL);
-	residency_ns = kcalloc(dt_idle_states, sizeof(*residency_ns),
-			       GFP_KERNEL);
-
-	if (!psscr_val || !psscr_mask || !residency_ns) {
-		rc = -1;
-		goto out;
-	}
-
-	if (of_property_read_u64_array(np,
-		"ibm,cpu-idle-state-psscr",
-		psscr_val, dt_idle_states)) {
-		pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-psscr in DT\n");
-		rc = -1;
-		goto out;
-	}
-
-	if (of_property_read_u64_array(np,
-				       "ibm,cpu-idle-state-psscr-mask",
-				       psscr_mask, dt_idle_states)) {
-		pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-psscr-mask in DT\n");
-		rc = -1;
-		goto out;
-	}
-
-	if (of_property_read_u32_array(np,
-				       "ibm,cpu-idle-state-residency-ns",
-					residency_ns, dt_idle_states)) {
-		pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-residency-ns in DT\n");
-		rc = -1;
-		goto out;
-	}
+	int i;
 
 	/*
 	 * Set pnv_first_deep_stop_state, pnv_deepest_stop_psscr_{val,mask},
@@ -679,33 +643,36 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
 	 * the shallowest (OPAL_PM_STOP_INST_FAST) loss-less stop state.
 	 */
 	pnv_first_deep_stop_state = MAX_STOP_STATE;
-	for (i = 0; i < dt_idle_states; i++) {
+	for (i = 0; i < nr_pnv_idle_states; i++) {
 		int err;
-		u64 psscr_rl = psscr_val[i] & PSSCR_RL_MASK;
+		struct pnv_idle_states_t *state = &pnv_idle_states[i];
+		u64 psscr_rl = state->psscr_val & PSSCR_RL_MASK;
 
-		if ((flags[i] & OPAL_PM_LOSE_FULL_CONTEXT) &&
-		     (pnv_first_deep_stop_state > psscr_rl))
+		if ((state->flags & OPAL_PM_LOSE_FULL_CONTEXT) &&
+		    pnv_first_deep_stop_state > psscr_rl)
 			pnv_first_deep_stop_state = psscr_rl;
 
-		err = validate_psscr_val_mask(&psscr_val[i], &psscr_mask[i],
-					      flags[i]);
+		err = validate_psscr_val_mask(&state->psscr_val,
+					      &state->psscr_mask,
+					      state->flags);
 		if (err) {
-			report_invalid_psscr_val(psscr_val[i], err);
+			state->valid = false;
+			report_invalid_psscr_val(state->psscr_val, err);
 			continue;
 		}
 
-		if (max_residency_ns < residency_ns[i]) {
-			max_residency_ns = residency_ns[i];
-			pnv_deepest_stop_psscr_val = psscr_val[i];
-			pnv_deepest_stop_psscr_mask = psscr_mask[i];
-			pnv_deepest_stop_flag = flags[i];
+		if (max_residency_ns < state->residency_ns) {
+			max_residency_ns = state->residency_ns;
+			pnv_deepest_stop_psscr_val = state->psscr_val;
+			pnv_deepest_stop_psscr_mask = state->psscr_mask;
+			pnv_deepest_stop_flag = state->flags;
 			deepest_stop_found = true;
 		}
 
 		if (!default_stop_found &&
-		    (flags[i] & OPAL_PM_STOP_INST_FAST)) {
-			pnv_default_stop_val = psscr_val[i];
-			pnv_default_stop_mask = psscr_mask[i];
+		    (state->flags & OPAL_PM_STOP_INST_FAST)) {
+			pnv_default_stop_val = state->psscr_val;
+			pnv_default_stop_mask = state->psscr_mask;
 			default_stop_found = true;
 		}
 	}
@@ -728,11 +695,8 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
 
 	pr_info("cpuidle-powernv: Requested Level (RL) value of first deep stop = 0x%llx\n",
 		pnv_first_deep_stop_state);
-out:
-	kfree(psscr_val);
-	kfree(psscr_mask);
-	kfree(residency_ns);
-	return rc;
+
+	return 0;
 }
 
 /*
@@ -740,50 +704,146 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
  */
 static void __init pnv_probe_idle_states(void)
 {
-	struct device_node *np;
-	int dt_idle_states;
-	u32 *flags = NULL;
 	int i;
 
+	if (nr_pnv_idle_states < 0) {
+		pr_warn("cpuidle-powernv: no idle states found in the DT\n");
+		return;
+	}
+
+	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+		if (pnv_power9_idle_init())
+			return;
+	}
+
+	for (i = 0; i < nr_pnv_idle_states; i++)
+		supported_cpuidle_states |= pnv_idle_states[i].flags;
+}
+
+/*
+ * This function parses device-tree and populates all the information
+ * into pnv_idle_states structure. It also sets up nr_pnv_idle_states
+ * which is the number of cpuidle states discovered through device-tree.
+ */
+
+static int pnv_parse_cpuidle_dt(void)
+{
+	struct device_node *np;
+	int nr_idle_states, i;
+	int rc = 0;
+	u32 *temp_u32;
+	u64 *temp_u64;
+	const char **temp_string;
+
 	np = of_find_node_by_path("/ibm,opal/power-mgt");
 	if (!np) {
 		pr_warn("opal: PowerMgmt Node not found\n");
-		goto out;
+		return -ENODEV;
 	}
-	dt_idle_states = of_property_count_u32_elems(np,
-			"ibm,cpu-idle-state-flags");
-	if (dt_idle_states < 0) {
-		pr_warn("cpuidle-powernv: no idle states found in the DT\n");
+	nr_idle_states = of_property_count_u32_elems(np,
+						"ibm,cpu-idle-state-flags");
+
+	pnv_idle_states = kcalloc(nr_idle_states, sizeof(*pnv_idle_states),
+				  GFP_KERNEL);
+	temp_u32 = kcalloc(nr_idle_states, sizeof(u32),  GFP_KERNEL);
+	temp_u64 = kcalloc(nr_idle_states, sizeof(u64),  GFP_KERNEL);
+	temp_string = kcalloc(nr_idle_states, sizeof(char *),  GFP_KERNEL);
+
+	if (!(pnv_idle_states && temp_u32 && temp_u64 && temp_string)) {
+		pr_err("Could not allocate memory for dt parsing\n");
+		rc = -ENOMEM;
 		goto out;
 	}
 
-	flags = kcalloc(dt_idle_states, sizeof(*flags),  GFP_KERNEL);
-
-	if (of_property_read_u32_array(np,
-			"ibm,cpu-idle-state-flags", flags, dt_idle_states)) {
+	/* Read flags */
+	if (of_property_read_u32_array(np, "ibm,cpu-idle-state-flags",
+				       temp_u32, nr_idle_states)) {
 		pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-flags in DT\n");
+		rc = -EINVAL;
 		goto out;
 	}
+	for (i = 0; i < nr_idle_states; i++)
+		pnv_idle_states[i].flags = temp_u32[i];
+
+	/* Read latencies */
+	if (of_property_read_u32_array(np, "ibm,cpu-idle-state-latencies-ns",
+				       temp_u32, nr_idle_states)) {
+		pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-latencies-ns in DT\n");
+		rc = -EINVAL;
+		goto out;
+	}
+	for (i = 0; i < nr_idle_states; i++)
+		pnv_idle_states[i].latency_ns = temp_u32[i];
+
+	/* Read residencies */
+	if (of_property_read_u32_array(np, "ibm,cpu-idle-state-residency-ns",
+				       temp_u32, nr_idle_states)) {
+		pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-latencies-ns in DT\n");
+		rc = -EINVAL;
+		goto out;
+	}
+	for (i = 0; i < nr_idle_states; i++)
+		pnv_idle_states[i].residency_ns = temp_u32[i];
 
+	/* For power9 */
 	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
-		if (pnv_power9_idle_init(np, flags, dt_idle_states))
+		/* Read pm_crtl_val */
+		if (of_property_read_u64_array(np, "ibm,cpu-idle-state-psscr",
+					       temp_u64, nr_idle_states)) {
+			pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-psscr in DT\n");
+			rc = -EINVAL;
+			goto out;
+		}
+		for (i = 0; i < nr_idle_states; i++)
+			pnv_idle_states[i].psscr_val = temp_u64[i];
+
+		/* Read pm_crtl_mask */
+		if (of_property_read_u64_array(np, "ibm,cpu-idle-state-psscr-mask",
+					       temp_u64, nr_idle_states)) {
+			pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-psscr-mask in DT\n");
+			rc = -EINVAL;
 			goto out;
+		}
+		for (i = 0; i < nr_idle_states; i++)
+			pnv_idle_states[i].psscr_mask = temp_u64[i];
 	}
 
-	for (i = 0; i < dt_idle_states; i++)
-		supported_cpuidle_states |= flags[i];
+	/*
+	 * power8 specific properties ibm,cpu-idle-state-pmicr-mask and
+	 * ibm,cpu-idle-state-pmicr-val were never used and there is no
+	 * plan to use it in near future. Hence, not parsing these properties
+	 */
 
+	if (of_property_read_string_array(np, "ibm,cpu-idle-state-names",
+					  temp_string, nr_idle_states) < 0) {
+		pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-names in DT\n");
+		rc = -EINVAL;
+		goto out;
+	}
+	for (i = 0; i < nr_idle_states; i++)
+		strncpy(pnv_idle_states[i].name, temp_string[i],
+			PNV_IDLE_NAME_LEN);
+	nr_pnv_idle_states = nr_idle_states;
+	rc = 0;
 out:
-	kfree(flags);
+	kfree(temp_u32);
+	kfree(temp_u64);
+	kfree(temp_string);
+	return rc;
 }
+
 static int __init pnv_init_idle_states(void)
 {
-
+	int rc = 0;
 	supported_cpuidle_states = 0;
 
+	/* In case we error out nr_pnv_idle_states will be zero */
+	nr_pnv_idle_states = 0;
 	if (cpuidle_disable != IDLE_NO_OVERRIDE)
 		goto out;
-
+	rc = pnv_parse_cpuidle_dt();
+	if (rc)
+		return rc;
 	pnv_probe_idle_states();
 
 	if (!(supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index d29e4f041efe..7ab613d4dca1 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -414,15 +414,8 @@ static int powernv_add_idle_states(void)
 		else
 			target_residency = 0;
 
-		if (has_stop_states) {
-			int err = validate_psscr_val_mask(&psscr_val[i],
-							  &psscr_mask[i],
-							  flags[i]);
-			if (err) {
-				report_invalid_psscr_val(psscr_val[i], err);
-				continue;
-			}
-		}
+		if (has_stop_states && !(state->valid))
+			continue;
 
 		if (flags[i] & OPAL_PM_TIMEBASE_STOP)
 			stops_timebase = true;
-- 
2.18.0.rc2.85.g1fb9df7

^ permalink raw reply related

* [PATCH v2 0/2] powernv/cpuidle Device-tree parsing cleanup
From: Akshay Adiga @ 2018-07-02 14:23 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, linux-pm
  Cc: rjw, stewart, benh, svaidy, ego, npiggin, mpe, Akshay Adiga

Device-tree parsed multiple time in powernv cpuidle and powernv
hotplug code. 

First to identify supported flags. Second time, to identify deepest_state
and first deep state. Third time, during cpuidle init to find the available
idle states. Any change in device-tree format will lead to make changes in
these 3 places. Errors in device-tree can be handled in a better manner.

This series adds code to parse device tree once and save in global structure.

Changes from v1 :                                                          
 - folded first 2 patches into 1                                             
 - rename pm_ctrl_reg_* as psscr_*                                         
 - added comment stating removal of pmicr parsing code                     
 - removed parsing code for pmicr                                          
 - add member valid in pnv_idle_states_t to indicate if the psscr-mask/val 
are valid combination,                                                     
 - Change function description of pnv_parse_cpuidle_dt                     
 - Added error handling code.                      

Akshay Adiga (2):
  powernv/cpuidle: Parse dt idle properties into global structure
  powernv/cpuidle: Use parsed device tree values for cpuidle_init

 arch/powerpc/include/asm/cpuidle.h    |  13 ++
 arch/powerpc/platforms/powernv/idle.c | 216 ++++++++++++++++----------
 drivers/cpuidle/cpuidle-powernv.c     | 156 ++++---------------
 3 files changed, 178 insertions(+), 207 deletions(-)

-- 
2.18.0.rc2.85.g1fb9df7

^ permalink raw reply

* Re: [PATCH v4 01/11] macintosh/via-pmu: Fix section mismatch warning
From: Finn Thain @ 2018-07-02 12:07 UTC (permalink / raw)
  To: Mathieu Malaterre
  Cc: Benjamin Herrenschmidt, Michael Schmitz, linux-m68k, linuxppc-dev,
	LKML
In-Reply-To: <CA+7wUsyfkC3FvbFyt9JRdBas3s+52tp7OH8h2jiJM9z+ynoyyw@mail.gmail.com>

On Mon, 2 Jul 2018, Mathieu Malaterre wrote:

> On Mon, Jul 2, 2018 at 10:25 AM Finn Thain <fthain@telegraphics.com.au> 
> wrote:
> >
> > The pmu_init() function has the __init qualifier, but the ops struct 
> > that holds a pointer to it does not. This causes a build warning. The 
> > driver works fine because the pointer is only dereferenced early.
> >
> > The function is so small that there's negligible benefit from using 
> > the __init qualifier. Remove it to fix the warning, consistent with 
> > the other ADB drivers.
> 
> Would you mind copy/pasting the warning you are seeing.
> 
> Make sure you have:
> 
> 58935176ad17 powerpc/via-pmu: Fix section mismatch warning
> 
> Thanks
> 

It's true, the section mismatch warning from 'make' has disappeared since 
I wrote this patch, but that doesn't mean it is wrong.

Before this patch:

$ powerpc-linux-gnu-objdump -xda vmlinux |egrep -w "via_pmu_driver|pmu_init"
c0711c84 l     F .init.text     0000001c pmu_init
c05eb408 g     O .rodata        00000028 via_pmu_driver
c0711c84 <pmu_init>:
$ 

After:

$ powerpc-linux-gnu-objdump -xda vmlinux |egrep -w "via_pmu_driver|pmu_init"
c038e42c l     F .text  0000001c pmu_init
c05e1e58 g     O .rodata        00000028 via_pmu_driver
c038e42c <pmu_init>:
$

I gather that commit 58935176ad17 ("powerpc/via-pmu: Fix section mismatch 
warning") has moved via_pmu_driver from .data to .rodata, but I'm afraid I 
don't see the point of that change. The commit log entry doesn't explain 
it either.

If .rodata is not discarded then the dangling pointer remains, right?

-- 

^ permalink raw reply

* Re: [PATCH v4 01/11] macintosh/via-pmu: Fix section mismatch warning
From: Mathieu Malaterre @ 2018-07-02 10:35 UTC (permalink / raw)
  To: fthain
  Cc: Benjamin Herrenschmidt, Michael Schmitz, linux-m68k, linuxppc-dev,
	LKML
In-Reply-To: <720e9899186bed184fb9b5cce070e2dc519f9665.1530519301.git.fthain@telegraphics.com.au>

On Mon, Jul 2, 2018 at 10:25 AM Finn Thain <fthain@telegraphics.com.au> wrote:
>
> The pmu_init() function has the __init qualifier, but the ops struct
> that holds a pointer to it does not. This causes a build warning.
> The driver works fine because the pointer is only dereferenced early.
>
> The function is so small that there's negligible benefit from using
> the __init qualifier. Remove it to fix the warning, consistent with
> the other ADB drivers.

Would you mind copy/pasting the warning you are seeing.

Make sure you have:

58935176ad17 powerpc/via-pmu: Fix section mismatch warning

Thanks

> Tested-by: Stan Johnson <userm57@yahoo.com>
> Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
> ---
>  drivers/macintosh/via-pmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
> index 25c1ce811053..f8a2c917201f 100644
> --- a/drivers/macintosh/via-pmu.c
> +++ b/drivers/macintosh/via-pmu.c
> @@ -378,7 +378,7 @@ static int pmu_probe(void)
>         return vias == NULL? -ENODEV: 0;
>  }
>
> -static int __init pmu_init(void)
> +static int pmu_init(void)
>  {
>         if (vias == NULL)
>                 return -ENODEV;
> --
> 2.16.4
>

^ permalink raw reply

* [PATCH] powerpc: hwrng; fix missing of_node_put()
From: Nicholas Mc Guire @ 2018-07-02  9:08 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev,
	linux-kernel, Nicholas Mc Guire

 The call to of_find_compatible_node() returns a node pointer with refcount
incremented thus it must be explicitly decremented here before returning.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Fixes: commit a489043f4626 ("powerpc/pseries: Implement arch_get_random_long() based on H_RANDOM")
---
Problem found with experimental coccinelle script

Patch was compiletested with: ppc64_defconfig (implies CONFIG_PPC_PSERIES=y)
with some unrelated sparse warnings (which I did not understand)
./arch/powerpc/include/asm/head-64.h:13:36: warning: Unknown escape '('
./arch/powerpc/include/asm/head-64.h:16:36: warning: Unknown escape '('
./arch/powerpc/include/asm/head-64.h:19:36: warning: Unknown escape '('

Patch is aginst 4.18-rc2 (localversion-next is next-20180702)

 arch/powerpc/platforms/pseries/rng.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/platforms/pseries/rng.c b/arch/powerpc/platforms/pseries/rng.c
index 31ca557..262b8c5 100644
--- a/arch/powerpc/platforms/pseries/rng.c
+++ b/arch/powerpc/platforms/pseries/rng.c
@@ -40,6 +40,7 @@ static __init int rng_init(void)
 
 	ppc_md.get_random_seed = pseries_get_random_long;
 
+	of_node_put(dn);
 	return 0;
 }
 machine_subsys_initcall(pseries, rng_init);
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH v11 00/26] Speculative page faults
From: Laurent Dufour @ 2018-07-02  8:59 UTC (permalink / raw)
  To: Song, HaiyanX
  Cc: akpm@linux-foundation.org, mhocko@kernel.org,
	peterz@infradead.org, kirill@shutemov.name, ak@linux.intel.com,
	dave@stgolabs.net, jack@suse.cz, Matthew Wilcox,
	khandual@linux.vnet.ibm.com, aneesh.kumar@linux.vnet.ibm.com,
	benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org,
	Thomas Gleixner, Ingo Molnar, hpa@zytor.com, Will Deacon,
	Sergey Senozhatsky, sergey.senozhatsky.work@gmail.com,
	Andrea Arcangeli, Alexei Starovoitov, Wang, Kemi, Daniel Jordan,
	David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
	Punit Agrawal, vinayak menon, Yang Shi,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	haren@linux.vnet.ibm.com, npiggin@gmail.com,
	bsingharora@gmail.com, paulmck@linux.vnet.ibm.com, Tim Chen,
	linuxppc-dev@lists.ozlabs.org, x86@kernel.org
In-Reply-To: <9FE19350E8A7EE45B64D8D63D368C8966B847F54@SHSMSX101.ccr.corp.intel.com>

On 11/06/2018 09:49, Song, HaiyanX wrote:
> Hi Laurent,
> 
> Regression test for v11 patch serials have been run, some regression is found by LKP-tools (linux kernel performance)
> tested on Intel 4s skylake platform. This time only test the cases which have been run and found regressions on
> V9 patch serials.
> 
> The regression result is sorted by the metric will-it-scale.per_thread_ops.
> branch: Laurent-Dufour/Speculative-page-faults/20180520-045126
> commit id:
>   head commit : a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12
>   base commit : ba98a1cdad71d259a194461b3a61471b49b14df1
> Benchmark: will-it-scale
> Download link: https://github.com/antonblanchard/will-it-scale/tree/master
> 
> Metrics:
>   will-it-scale.per_process_ops=processes/nr_cpu
>   will-it-scale.per_thread_ops=threads/nr_cpu
>   test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
> THP: enable / disable
> nr_task:100%
> 
> 1. Regressions:
> 
> a). Enable THP
> testcase                          base           change      head           metric
> page_fault3/enable THP           10519          -20.5%        836      will-it-scale.per_thread_ops
> page_fault2/enalbe THP            8281          -18.8%       6728      will-it-scale.per_thread_ops
> brk1/eanble THP                 998475           -2.2%     976893      will-it-scale.per_process_ops
> context_switch1/enable THP      223910           -1.3%     220930      will-it-scale.per_process_ops
> context_switch1/enable THP      233722           -1.0%     231288      will-it-scale.per_thread_ops
> 
> b). Disable THP
> page_fault3/disable THP          10856          -23.1%       8344      will-it-scale.per_thread_ops
> page_fault2/disable THP           8147          -18.8%       6613      will-it-scale.per_thread_ops
> brk1/disable THP                   957           -7.9%        881      will-it-scale.per_thread_ops
> context_switch1/disable THP     237006           -2.2%     231907      will-it-scale.per_thread_ops
> brk1/disable THP                997317           -2.0%     977778      will-it-scale.per_process_ops
> page_fault3/disable THP         467454           -1.8%     459251      will-it-scale.per_process_ops
> context_switch1/disable THP     224431           -1.3%     221567      will-it-scale.per_process_ops
> 
> Notes: for the above  values of test result, the higher is better.

I tried the same tests on my PowerPC victim VM (1024 CPUs, 11TB) and I can't
get reproducible results. The results have huge variation, even on the vanilla
kernel, and I can't state on any changes due to that.

I tried on smaller node (80 CPUs, 32G), and the tests ran better, but I didn't
measure any changes between the vanilla and the SPF patched ones:

test THP enabled		4.17.0-rc4-mm1	spf		delta
page_fault3_threads		2697.7		2683.5		-0.53%
page_fault2_threads		170660.6	169574.1	-0.64%
context_switch1_threads		6915269.2	6877507.3	-0.55%
context_switch1_processes	6478076.2	6529493.5	0.79%
brk1				243391.2	238527.5	-2.00%

Tests were run 10 times, no high variation detected.

Did you see high variation on your side ? How many times the test were run to
compute the average values ?

Thanks,
Laurent.


> 
> 2. Improvement: not found improvement based on the selected test cases.
> 
> 
> Best regards
> Haiyan Song
> ________________________________________
> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
> Sent: Monday, May 28, 2018 4:54 PM
> To: Song, HaiyanX
> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
> Subject: Re: [PATCH v11 00/26] Speculative page faults
> 
> On 28/05/2018 10:22, Haiyan Song wrote:
>> Hi Laurent,
>>
>> Yes, these tests are done on V9 patch.
> 
> Do you plan to give this V11 a run ?
> 
>>
>>
>> Best regards,
>> Haiyan Song
>>
>> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote:
>>> On 28/05/2018 07:23, Song, HaiyanX wrote:
>>>>
>>>> Some regression and improvements is found by LKP-tools(linux kernel performance) on V9 patch series
>>>> tested on Intel 4s Skylake platform.
>>>
>>> Hi,
>>>
>>> Thanks for reporting this benchmark results, but you mentioned the "V9 patch
>>> series" while responding to the v11 header series...
>>> Were these tests done on v9 or v11 ?
>>>
>>> Cheers,
>>> Laurent.
>>>
>>>>
>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series)
>>>> Commit id:
>>>>     base commit: d55f34411b1b126429a823d06c3124c16283231f
>>>>     head commit: 0355322b3577eeab7669066df42c550a56801110
>>>> Benchmark suite: will-it-scale
>>>> Download link:
>>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests
>>>> Metrics:
>>>>     will-it-scale.per_process_ops=processes/nr_cpu
>>>>     will-it-scale.per_thread_ops=threads/nr_cpu
>>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>>> THP: enable / disable
>>>> nr_task: 100%
>>>>
>>>> 1. Regressions:
>>>> a) THP enabled:
>>>> testcase                        base            change          head       metric
>>>> page_fault3/ enable THP         10092           -17.5%          8323       will-it-scale.per_thread_ops
>>>> page_fault2/ enable THP          8300           -17.2%          6869       will-it-scale.per_thread_ops
>>>> brk1/ enable THP                  957.67         -7.6%           885       will-it-scale.per_thread_ops
>>>> page_fault3/ enable THP        172821            -5.3%        163692       will-it-scale.per_process_ops
>>>> signal1/ enable THP              9125            -3.2%          8834       will-it-scale.per_process_ops
>>>>
>>>> b) THP disabled:
>>>> testcase                        base            change          head       metric
>>>> page_fault3/ disable THP        10107           -19.1%          8180       will-it-scale.per_thread_ops
>>>> page_fault2/ disable THP         8432           -17.8%          6931       will-it-scale.per_thread_ops
>>>> context_switch1/ disable THP   215389            -6.8%        200776       will-it-scale.per_thread_ops
>>>> brk1/ disable THP                 939.67         -6.6%           877.33    will-it-scale.per_thread_ops
>>>> page_fault3/ disable THP       173145            -4.7%        165064       will-it-scale.per_process_ops
>>>> signal1/ disable THP             9162            -3.9%          8802       will-it-scale.per_process_ops
>>>>
>>>> 2. Improvements:
>>>> a) THP enabled:
>>>> testcase                        base            change          head       metric
>>>> malloc1/ enable THP               66.33        +469.8%           383.67    will-it-scale.per_thread_ops
>>>> writeseek3/ enable THP          2531             +4.5%          2646       will-it-scale.per_thread_ops
>>>> signal1/ enable THP              989.33          +2.8%          1016       will-it-scale.per_thread_ops
>>>>
>>>> b) THP disabled:
>>>> testcase                        base            change          head       metric
>>>> malloc1/ disable THP              90.33        +417.3%           467.33    will-it-scale.per_thread_ops
>>>> read2/ disable THP             58934            +39.2%         82060       will-it-scale.per_thread_ops
>>>> page_fault1/ disable THP        8607            +36.4%         11736       will-it-scale.per_thread_ops
>>>> read1/ disable THP            314063            +12.7%        353934       will-it-scale.per_thread_ops
>>>> writeseek3/ disable THP         2452            +12.5%          2759       will-it-scale.per_thread_ops
>>>> signal1/ disable THP             971.33          +5.5%          1024       will-it-scale.per_thread_ops
>>>>
>>>> Notes: for above values in column "change", the higher value means that the related testcase result
>>>> on head commit is better than that on base commit for this benchmark.
>>>>
>>>>
>>>> Best regards
>>>> Haiyan Song
>>>>
>>>> ________________________________________
>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>>> Sent: Thursday, May 17, 2018 7:06 PM
>>>> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi
>>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>>> Subject: [PATCH v11 00/26] Speculative page faults
>>>>
>>>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle
>>>> page fault without holding the mm semaphore [1].
>>>>
>>>> The idea is to try to handle user space page faults without holding the
>>>> mmap_sem. This should allow better concurrency for massively threaded
>>>> process since the page fault handler will not wait for other threads memory
>>>> layout change to be done, assuming that this change is done in another part
>>>> of the process's memory space. This type page fault is named speculative
>>>> page fault. If the speculative page fault fails because of a concurrency is
>>>> detected or because underlying PMD or PTE tables are not yet allocating, it
>>>> is failing its processing and a classic page fault is then tried.
>>>>
>>>> The speculative page fault (SPF) has to look for the VMA matching the fault
>>>> address without holding the mmap_sem, this is done by introducing a rwlock
>>>> which protects the access to the mm_rb tree. Previously this was done using
>>>> SRCU but it was introducing a lot of scheduling to process the VMA's
>>>> freeing operation which was hitting the performance by 20% as reported by
>>>> Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree is
>>>> limiting the locking contention to these operations which are expected to
>>>> be in a O(log n) order. In addition to ensure that the VMA is not freed in
>>>> our back a reference count is added and 2 services (get_vma() and
>>>> put_vma()) are introduced to handle the reference count. Once a VMA is
>>>> fetched from the RB tree using get_vma(), it must be later freed using
>>>> put_vma(). I can't see anymore the overhead I got while will-it-scale
>>>> benchmark anymore.
>>>>
>>>> The VMA's attributes checked during the speculative page fault processing
>>>> have to be protected against parallel changes. This is done by using a per
>>>> VMA sequence lock. This sequence lock allows the speculative page fault
>>>> handler to fast check for parallel changes in progress and to abort the
>>>> speculative page fault in that case.
>>>>
>>>> Once the VMA has been found, the speculative page fault handler would check
>>>> for the VMA's attributes to verify that the page fault has to be handled
>>>> correctly or not. Thus, the VMA is protected through a sequence lock which
>>>> allows fast detection of concurrent VMA changes. If such a change is
>>>> detected, the speculative page fault is aborted and a *classic* page fault
>>>> is tried.  VMA sequence lockings are added when VMA attributes which are
>>>> checked during the page fault are modified.
>>>>
>>>> When the PTE is fetched, the VMA is checked to see if it has been changed,
>>>> so once the page table is locked, the VMA is valid, so any other changes
>>>> leading to touching this PTE will need to lock the page table, so no
>>>> parallel change is possible at this time.
>>>>
>>>> The locking of the PTE is done with interrupts disabled, this allows
>>>> checking for the PMD to ensure that there is not an ongoing collapsing
>>>> operation. Since khugepaged is firstly set the PMD to pmd_none and then is
>>>> waiting for the other CPU to have caught the IPI interrupt, if the pmd is
>>>> valid at the time the PTE is locked, we have the guarantee that the
>>>> collapsing operation will have to wait on the PTE lock to move forward.
>>>> This allows the SPF handler to map the PTE safely. If the PMD value is
>>>> different from the one recorded at the beginning of the SPF operation, the
>>>> classic page fault handler will be called to handle the operation while
>>>> holding the mmap_sem. As the PTE lock is done with the interrupts disabled,
>>>> the lock is done using spin_trylock() to avoid dead lock when handling a
>>>> page fault while a TLB invalidate is requested by another CPU holding the
>>>> PTE.
>>>>
>>>> In pseudo code, this could be seen as:
>>>>     speculative_page_fault()
>>>>     {
>>>>             vma = get_vma()
>>>>             check vma sequence count
>>>>             check vma's support
>>>>             disable interrupt
>>>>                   check pgd,p4d,...,pte
>>>>                   save pmd and pte in vmf
>>>>                   save vma sequence counter in vmf
>>>>             enable interrupt
>>>>             check vma sequence count
>>>>             handle_pte_fault(vma)
>>>>                     ..
>>>>                     page = alloc_page()
>>>>                     pte_map_lock()
>>>>                             disable interrupt
>>>>                                     abort if sequence counter has changed
>>>>                                     abort if pmd or pte has changed
>>>>                                     pte map and lock
>>>>                             enable interrupt
>>>>                     if abort
>>>>                        free page
>>>>                        abort
>>>>                     ...
>>>>     }
>>>>
>>>>     arch_fault_handler()
>>>>     {
>>>>             if (speculative_page_fault(&vma))
>>>>                goto done
>>>>     again:
>>>>             lock(mmap_sem)
>>>>             vma = find_vma();
>>>>             handle_pte_fault(vma);
>>>>             if retry
>>>>                unlock(mmap_sem)
>>>>                goto again;
>>>>     done:
>>>>             handle fault error
>>>>     }
>>>>
>>>> Support for THP is not done because when checking for the PMD, we can be
>>>> confused by an in progress collapsing operation done by khugepaged. The
>>>> issue is that pmd_none() could be true either if the PMD is not already
>>>> populated or if the underlying PTE are in the way to be collapsed. So we
>>>> cannot safely allocate a PMD if pmd_none() is true.
>>>>
>>>> This series add a new software performance event named 'speculative-faults'
>>>> or 'spf'. It counts the number of successful page fault event handled
>>>> speculatively. When recording 'faults,spf' events, the faults one is
>>>> counting the total number of page fault events while 'spf' is only counting
>>>> the part of the faults processed speculatively.
>>>>
>>>> There are some trace events introduced by this series. They allow
>>>> identifying why the page faults were not processed speculatively. This
>>>> doesn't take in account the faults generated by a monothreaded process
>>>> which directly processed while holding the mmap_sem. This trace events are
>>>> grouped in a system named 'pagefault', they are:
>>>>  - pagefault:spf_vma_changed : if the VMA has been changed in our back
>>>>  - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set.
>>>>  - pagefault:spf_vma_notsup : the VMA's type is not supported
>>>>  - pagefault:spf_vma_access : the VMA's access right are not respected
>>>>  - pagefault:spf_pmd_changed : the upper PMD pointer has changed in our
>>>>    back.
>>>>
>>>> To record all the related events, the easier is to run perf with the
>>>> following arguments :
>>>> $ perf stat -e 'faults,spf,pagefault:*' <command>
>>>>
>>>> There is also a dedicated vmstat counter showing the number of successful
>>>> page fault handled speculatively. I can be seen this way:
>>>> $ grep speculative_pgfault /proc/vmstat
>>>>
>>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functional
>>>> on x86, PowerPC and arm64.
>>>>
>>>> ---------------------
>>>> Real Workload results
>>>>
>>>> As mentioned in previous email, we did non official runs using a "popular
>>>> in memory multithreaded database product" on 176 cores SMT8 Power system
>>>> which showed a 30% improvements in the number of transaction processed per
>>>> second. This run has been done on the v6 series, but changes introduced in
>>>> this new version should not impact the performance boost seen.
>>>>
>>>> Here are the perf data captured during 2 of these runs on top of the v8
>>>> series:
>>>>                 vanilla         spf
>>>> faults          89.418          101.364         +13%
>>>> spf                n/a           97.989
>>>>
>>>> With the SPF kernel, most of the page fault were processed in a speculative
>>>> way.
>>>>
>>>> Ganesh Mahendran had backported the series on top of a 4.9 kernel and gave
>>>> it a try on an android device. He reported that the application launch time
>>>> was improved in average by 6%, and for large applications (~100 threads) by
>>>> 20%.
>>>>
>>>> Here are the launch time Ganesh mesured on Android 8.0 on top of a Qcom
>>>> MSM845 (8 cores) with 6GB (the less is better):
>>>>
>>>> Application                             4.9     4.9+spf delta
>>>> com.tencent.mm                          416     389     -7%
>>>> com.eg.android.AlipayGphone             1135    986     -13%
>>>> com.tencent.mtt                         455     454     0%
>>>> com.qqgame.hlddz                        1497    1409    -6%
>>>> com.autonavi.minimap                    711     701     -1%
>>>> com.tencent.tmgp.sgame                  788     748     -5%
>>>> com.immomo.momo                         501     487     -3%
>>>> com.tencent.peng                        2145    2112    -2%
>>>> com.smile.gifmaker                      491     461     -6%
>>>> com.baidu.BaiduMap                      479     366     -23%
>>>> com.taobao.taobao                       1341    1198    -11%
>>>> com.baidu.searchbox                     333     314     -6%
>>>> com.tencent.mobileqq                    394     384     -3%
>>>> com.sina.weibo                          907     906     0%
>>>> com.youku.phone                         816     731     -11%
>>>> com.happyelements.AndroidAnimal.qq      763     717     -6%
>>>> com.UCMobile                            415     411     -1%
>>>> com.tencent.tmgp.ak                     1464    1431    -2%
>>>> com.tencent.qqmusic                     336     329     -2%
>>>> com.sankuai.meituan                     1661    1302    -22%
>>>> com.netease.cloudmusic                  1193    1200    1%
>>>> air.tv.douyu.android                    4257    4152    -2%
>>>>
>>>> ------------------
>>>> Benchmarks results
>>>>
>>>> Base kernel is v4.17.0-rc4-mm1
>>>> SPF is BASE + this series
>>>>
>>>> Kernbench:
>>>> ----------
>>>> Here are the results on a 16 CPUs X86 guest using kernbench on a 4.15
>>>> kernel (kernel is build 5 times):
>>>>
>>>> Average Half load -j 8
>>>>                  Run    (std deviation)
>>>>                  BASE                   SPF
>>>> Elapsed Time     1448.65 (5.72312)      1455.84 (4.84951)       0.50%
>>>> User    Time     10135.4 (30.3699)      10148.8 (31.1252)       0.13%
>>>> System  Time     900.47  (2.81131)      923.28  (7.52779)       2.53%
>>>> Percent CPU      761.4   (1.14018)      760.2   (0.447214)      -0.16%
>>>> Context Switches 85380   (3419.52)      84748   (1904.44)       -0.74%
>>>> Sleeps           105064  (1240.96)      105074  (337.612)       0.01%
>>>>
>>>> Average Optimal load -j 16
>>>>                  Run    (std deviation)
>>>>                  BASE                   SPF
>>>> Elapsed Time     920.528 (10.1212)      927.404 (8.91789)       0.75%
>>>> User    Time     11064.8 (981.142)      11085   (990.897)       0.18%
>>>> System  Time     979.904 (84.0615)      1001.14 (82.5523)       2.17%
>>>> Percent CPU      1089.5  (345.894)      1086.1  (343.545)       -0.31%
>>>> Context Switches 159488  (78156.4)      158223  (77472.1)       -0.79%
>>>> Sleeps           110566  (5877.49)      110388  (5617.75)       -0.16%
>>>>
>>>>
>>>> During a run on the SPF, perf events were captured:
>>>>  Performance counter stats for '../kernbench -M':
>>>>          526743764      faults
>>>>                210      spf
>>>>                  3      pagefault:spf_vma_changed
>>>>                  0      pagefault:spf_vma_noanon
>>>>               2278      pagefault:spf_vma_notsup
>>>>                  0      pagefault:spf_vma_access
>>>>                  0      pagefault:spf_pmd_changed
>>>>
>>>> Very few speculative page faults were recorded as most of the processes
>>>> involved are monothreaded (sounds that on this architecture some threads
>>>> were created during the kernel build processing).
>>>>
>>>> Here are the kerbench results on a 80 CPUs Power8 system:
>>>>
>>>> Average Half load -j 40
>>>>                  Run    (std deviation)
>>>>                  BASE                   SPF
>>>> Elapsed Time     117.152 (0.774642)     117.166 (0.476057)      0.01%
>>>> User    Time     4478.52 (24.7688)      4479.76 (9.08555)       0.03%
>>>> System  Time     131.104 (0.720056)     134.04  (0.708414)      2.24%
>>>> Percent CPU      3934    (19.7104)      3937.2  (19.0184)       0.08%
>>>> Context Switches 92125.4 (576.787)      92581.6 (198.622)       0.50%
>>>> Sleeps           317923  (652.499)      318469  (1255.59)       0.17%
>>>>
>>>> Average Optimal load -j 80
>>>>                  Run    (std deviation)
>>>>                  BASE                   SPF
>>>> Elapsed Time     107.73  (0.632416)     107.31  (0.584936)      -0.39%
>>>> User    Time     5869.86 (1466.72)      5871.71 (1467.27)       0.03%
>>>> System  Time     153.728 (23.8573)      157.153 (24.3704)       2.23%
>>>> Percent CPU      5418.6  (1565.17)      5436.7  (1580.91)       0.33%
>>>> Context Switches 223861  (138865)       225032  (139632)        0.52%
>>>> Sleeps           330529  (13495.1)      332001  (14746.2)       0.45%
>>>>
>>>> During a run on the SPF, perf events were captured:
>>>>  Performance counter stats for '../kernbench -M':
>>>>          116730856      faults
>>>>                  0      spf
>>>>                  3      pagefault:spf_vma_changed
>>>>                  0      pagefault:spf_vma_noanon
>>>>                476      pagefault:spf_vma_notsup
>>>>                  0      pagefault:spf_vma_access
>>>>                  0      pagefault:spf_pmd_changed
>>>>
>>>> Most of the processes involved are monothreaded so SPF is not activated but
>>>> there is no impact on the performance.
>>>>
>>>> Ebizzy:
>>>> -------
>>>> The test is counting the number of records per second it can manage, the
>>>> higher is the best. I run it like this 'ebizzy -mTt <nrcpus>'. To get
>>>> consistent result I repeated the test 100 times and measure the average
>>>> result. The number is the record processes per second, the higher is the
>>>> best.
>>>>
>>>>                 BASE            SPF             delta
>>>> 16 CPUs x86 VM  742.57          1490.24         100.69%
>>>> 80 CPUs P8 node 13105.4         24174.23        84.46%
>>>>
>>>> Here are the performance counter read during a run on a 16 CPUs x86 VM:
>>>>  Performance counter stats for './ebizzy -mTt 16':
>>>>            1706379      faults
>>>>            1674599      spf
>>>>              30588      pagefault:spf_vma_changed
>>>>                  0      pagefault:spf_vma_noanon
>>>>                363      pagefault:spf_vma_notsup
>>>>                  0      pagefault:spf_vma_access
>>>>                  0      pagefault:spf_pmd_changed
>>>>
>>>> And the ones captured during a run on a 80 CPUs Power node:
>>>>  Performance counter stats for './ebizzy -mTt 80':
>>>>            1874773      faults
>>>>            1461153      spf
>>>>             413293      pagefault:spf_vma_changed
>>>>                  0      pagefault:spf_vma_noanon
>>>>                200      pagefault:spf_vma_notsup
>>>>                  0      pagefault:spf_vma_access
>>>>                  0      pagefault:spf_pmd_changed
>>>>
>>>> In ebizzy's case most of the page fault were handled in a speculative way,
>>>> leading the ebizzy performance boost.
>>>>
>>>> ------------------
>>>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572):
>>>>  - Accounted for all review feedbacks from Punit Agrawal, Ganesh Mahendran
>>>>    and Minchan Kim, hopefully.
>>>>  - Remove unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in
>>>>    __do_page_fault().
>>>>  - Loop in pte_spinlock() and pte_map_lock() when pte try lock fails
>>>>    instead
>>>>    of aborting the speculative page fault handling. Dropping the now
>>>> useless
>>>>    trace event pagefault:spf_pte_lock.
>>>>  - No more try to reuse the fetched VMA during the speculative page fault
>>>>    handling when retrying is needed. This adds a lot of complexity and
>>>>    additional tests done didn't show a significant performance improvement.
>>>>  - Convert IS_ENABLED(CONFIG_NUMA) back to #ifdef due to build error.
>>>>
>>>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
>>>> [2] https://patchwork.kernel.org/patch/9999687/
>>>>
>>>>
>>>> Laurent Dufour (20):
>>>>   mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
>>>>   x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>   powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>   mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>>>>   mm: make pte_unmap_same compatible with SPF
>>>>   mm: introduce INIT_VMA()
>>>>   mm: protect VMA modifications using VMA sequence count
>>>>   mm: protect mremap() against SPF hanlder
>>>>   mm: protect SPF handler against anon_vma changes
>>>>   mm: cache some VMA fields in the vm_fault structure
>>>>   mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
>>>>   mm: introduce __lru_cache_add_active_or_unevictable
>>>>   mm: introduce __vm_normal_page()
>>>>   mm: introduce __page_add_new_anon_rmap()
>>>>   mm: protect mm_rb tree with a rwlock
>>>>   mm: adding speculative page fault failure trace events
>>>>   perf: add a speculative page fault sw event
>>>>   perf tools: add support for the SPF perf event
>>>>   mm: add speculative page fault vmstats
>>>>   powerpc/mm: add speculative page fault
>>>>
>>>> Mahendran Ganesh (2):
>>>>   arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>   arm64/mm: add speculative page fault
>>>>
>>>> Peter Zijlstra (4):
>>>>   mm: prepare for FAULT_FLAG_SPECULATIVE
>>>>   mm: VMA sequence count
>>>>   mm: provide speculative fault infrastructure
>>>>   x86/mm: add speculative pagefault handling
>>>>
>>>>  arch/arm64/Kconfig                    |   1 +
>>>>  arch/arm64/mm/fault.c                 |  12 +
>>>>  arch/powerpc/Kconfig                  |   1 +
>>>>  arch/powerpc/mm/fault.c               |  16 +
>>>>  arch/x86/Kconfig                      |   1 +
>>>>  arch/x86/mm/fault.c                   |  27 +-
>>>>  fs/exec.c                             |   2 +-
>>>>  fs/proc/task_mmu.c                    |   5 +-
>>>>  fs/userfaultfd.c                      |  17 +-
>>>>  include/linux/hugetlb_inline.h        |   2 +-
>>>>  include/linux/migrate.h               |   4 +-
>>>>  include/linux/mm.h                    | 136 +++++++-
>>>>  include/linux/mm_types.h              |   7 +
>>>>  include/linux/pagemap.h               |   4 +-
>>>>  include/linux/rmap.h                  |  12 +-
>>>>  include/linux/swap.h                  |  10 +-
>>>>  include/linux/vm_event_item.h         |   3 +
>>>>  include/trace/events/pagefault.h      |  80 +++++
>>>>  include/uapi/linux/perf_event.h       |   1 +
>>>>  kernel/fork.c                         |   5 +-
>>>>  mm/Kconfig                            |  22 ++
>>>>  mm/huge_memory.c                      |   6 +-
>>>>  mm/hugetlb.c                          |   2 +
>>>>  mm/init-mm.c                          |   3 +
>>>>  mm/internal.h                         |  20 ++
>>>>  mm/khugepaged.c                       |   5 +
>>>>  mm/madvise.c                          |   6 +-
>>>>  mm/memory.c                           | 612 +++++++++++++++++++++++++++++-----
>>>>  mm/mempolicy.c                        |  51 ++-
>>>>  mm/migrate.c                          |   6 +-
>>>>  mm/mlock.c                            |  13 +-
>>>>  mm/mmap.c                             | 229 ++++++++++---
>>>>  mm/mprotect.c                         |   4 +-
>>>>  mm/mremap.c                           |  13 +
>>>>  mm/nommu.c                            |   2 +-
>>>>  mm/rmap.c                             |   5 +-
>>>>  mm/swap.c                             |   6 +-
>>>>  mm/swap_state.c                       |   8 +-
>>>>  mm/vmstat.c                           |   5 +-
>>>>  tools/include/uapi/linux/perf_event.h |   1 +
>>>>  tools/perf/util/evsel.c               |   1 +
>>>>  tools/perf/util/parse-events.c        |   4 +
>>>>  tools/perf/util/parse-events.l        |   1 +
>>>>  tools/perf/util/python.c              |   1 +
>>>>  44 files changed, 1161 insertions(+), 211 deletions(-)
>>>>  create mode 100644 include/trace/events/pagefault.h
>>>>
>>>> --
>>>> 2.7.4
>>>>
>>>>
>>>
>>
> 

^ permalink raw reply

* Re: [PATCH kernel 1/2] powerpc/powernv: Reuse existing TCE code for sketchy bypass
From: Alexey Kardashevskiy @ 2018-07-02  8:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, Alistair Popple, Russell Currey
In-Reply-To: <615a926cedadfe528cd72e4f29a4183b0a7f8c08.camel@kernel.crashing.org>

On Sat, 16 Jun 2018 11:04:32 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Fri, 2018-06-01 at 18:10 +1000, Alexey Kardashevskiy wrote:
> > The existing sketchy bypass ignores the existing default 32bit TCE table
> > (created by default for every PE at boot time or after being used by
> > VFIO) and it allocates another table instead without updating PE DMA
> > config (pe->table_group). So if we decide to use such device for VFIO
> > later, this new table will also leak memory.
> > 
> > This replaces adhoc table allocation and programming with the existing
> > API which handles memory leaks.
> > 
> > This programs the default 32bit table back to TVE#0 if configuring
> > the new table failed for some reason.
> > 
> > While we are at it, switch from the hardcoded 256MB TCEs to the biggest
> > size supported by the hardware and reported by the firmware. This allows
> > the sketchy bypass (originally made for POWER8 only) to work on POWER9
> > too assuming that PHB4 type is defined and pnv_pci_ioda_dma_64bit_bypass()
> > is called (coming next).  
> 
> It won't work on POWER9 for other reasons. Mostly memory isn't
> contiguous there.


This simply allocates a table using the existing API which makes
things more accurate and manageable. "[PATCH 0/3] PCI DMA pseudo-bypass
for powernv" still allocates a table and programs it via OPAL, why does
it have to replicate all the helpers I put there? Do they all suck and
the new ones are better? If so, how exactly are they better? And let's
fix them then.



> 
> What we shuld look into rather than removing the 32-bit table is
> exploit DD2 P9 feature allowing to overlay TVE0 and TVE1 so that 0..4G
> is in control of TVE0 and the rest goes to TVE1.
>  
> > This does not call iommu_init_table() for the new table as the caller
> > will use &dma_nommu_ops and therefore ::it_map is not needed.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> > 
> > Tested with:
> >  	if (pe->tce_bypass_enabled) {
> >  		top = pe->tce_bypass_base + memblock_end_of_DRAM() - 1;
> > -		bypass = (dma_mask >= top);
> > +		bypass = false;//(dma_mask >= top);
> >  	}
> > ---
> >  arch/powerpc/platforms/powernv/pci-ioda.c | 71 +++++++++++++++++--------------
> >  1 file changed, 39 insertions(+), 32 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> > index ceb7e64..9239142 100644
> > --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> > @@ -1791,54 +1791,61 @@ static bool pnv_pci_ioda_pe_single_vendor(struct pnv_ioda_pe *pe)
> >   *
> >   * Currently this will only work on PHB3 (POWER8).
> >   */
> > +static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
> > +		int num, __u32 page_shift, __u64 window_size, __u32 levels,
> > +		struct iommu_table **ptbl);
> > +
> > +static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> > +		int num, struct iommu_table *tbl);
> > +
> > +static unsigned long pnv_ioda_parse_tce_sizes(struct pnv_phb *phb);
> > +
> >  static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe)
> >  {
> > -	u64 window_size, table_size, tce_count, addr;
> > -	struct page *table_pages;
> > -	u64 tce_order = 28; /* 256MB TCEs */
> > -	__be64 *tces;
> > +	u64 window_size;
> >  	s64 rc;
> > +	struct iommu_table *tbl, *oldtbl = NULL;
> > +	unsigned long shift, offset;
> >  
> >  	/*
> >  	 * Window size needs to be a power of two, but needs to account for
> >  	 * shifting memory by the 4GB offset required to skip 32bit space.
> >  	 */
> > -	window_size = roundup_pow_of_two(memory_hotplug_max() + (1ULL << 32));
> > -	tce_count = window_size >> tce_order;
> > -	table_size = tce_count << 3;
> > -
> > -	if (table_size < PAGE_SIZE)
> > -		table_size = PAGE_SIZE;
> > +	window_size = roundup_pow_of_two(memory_hotplug_max() + SZ_4G);
> > +	shift = ilog2(pnv_ioda_parse_tce_sizes(pe->phb));
> > +	rc = pnv_pci_ioda2_create_table(&pe->table_group, 0, shift, window_size,
> > +			POWERNV_IOMMU_DEFAULT_LEVELS, &tbl);
> > +	if (rc) {
> > +		pe_err(pe, "Failed to create 64-bypass TCE table, err %ld", rc);
> > +		return rc;
> > +	}
> >  
> > -	table_pages = alloc_pages_node(pe->phb->hose->node, GFP_KERNEL,
> > -				       get_order(table_size));
> > -	if (!table_pages)
> > +	offset = SZ_4G >> shift;
> > +	rc = tbl->it_ops->set(tbl, offset, tbl->it_size - offset,
> > +			0 /* uaddr */, DMA_BIDIRECTIONAL, 0 /* attrs */);
> > +	if (rc)
> >  		goto err;
> >  
> > -	tces = page_address(table_pages);
> > -	if (!tces)
> > +	if (pe->table_group.tables[0]) {
> > +		oldtbl = pe->table_group.tables[0];
> > +		pnv_pci_ioda2_unset_window(&pe->table_group, 0);
> > +	}
> > +
> > +	rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
> > +	if (rc != OPAL_SUCCESS) {
> > +		rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, oldtbl);
> >  		goto err;
> > +	}
> >  
> > -	memset(tces, 0, table_size);
> > +	if (oldtbl)
> > +		iommu_tce_table_put(oldtbl);
> >  
> > -	for (addr = 0; addr < memory_hotplug_max(); addr += (1 << tce_order)) {
> > -		tces[(addr + (1ULL << 32)) >> tce_order] =
> > -			cpu_to_be64(addr | TCE_PCI_READ | TCE_PCI_WRITE);
> > -	}
> > +	pe_info(pe, "Using 64-bit DMA iommu bypass (through TVE#0)\n");
> > +	return 0;
> >  
> > -	rc = opal_pci_map_pe_dma_window(pe->phb->opal_id,
> > -					pe->pe_number,
> > -					/* reconfigure window 0 */
> > -					(pe->pe_number << 1) + 0,
> > -					1,
> > -					__pa(tces),
> > -					table_size,
> > -					1 << tce_order);
> > -	if (rc == OPAL_SUCCESS) {
> > -		pe_info(pe, "Using 64-bit DMA iommu bypass (through TVE#0)\n");
> > -		return 0;
> > -	}
> >  err:
> > +	iommu_tce_table_put(tbl);
> > +
> >  	pe_err(pe, "Error configuring 64-bit DMA bypass\n");
> >  	return -EIO;
> >  }  



--
Alexey

^ permalink raw reply

* Re: [PATCH 2/3] powerpc/powernv: DMA operations for discontiguous allocation
From: Alexey Kardashevskiy @ 2018-07-02  8:47 UTC (permalink / raw)
  To: Russell Currey; +Cc: linuxppc-dev, benh, alistair, tpearson
In-Reply-To: <20180629073437.4060-3-ruscur@russell.cc>

On Fri, 29 Jun 2018 17:34:36 +1000
Russell Currey <ruscur@russell.cc> wrote:

> DMA pseudo-bypass is a new set of DMA operations that solve some issues for
> devices that want to address more than 32 bits but can't address the 59
> bits required to enable direct DMA.
> 
> The previous implementation for POWER8/PHB3 worked around this by
> configuring a bypass from the default 32-bit address space into 64-bit
> address space.  This approach does not work for POWER9/PHB4 because
> regions of memory are discontiguous and many devices will be unable to
> address memory beyond the first node.
> 
> Instead, implement a new set of DMA operations that allocate TCEs as DMA
> mappings are requested so that all memory is addressable even when a
> one-to-one mapping between real addresses and DMA addresses isn't
> possible.  These TCEs are the maximum size available on the platform,
> which is 256M on PHB3 and 1G on PHB4.
> 
> Devices can now map any region of memory up to the maximum amount they can
> address according to the DMA mask set, in chunks of the largest available
> TCE size.
> 
> This implementation replaces the need for the existing PHB3 solution and
> should be compatible with future PHB versions.
> 
> It is, however, rather naive.  There is no unmapping, and as a result
> devices can eventually run out of space if they address their entire
> DMA mask worth of TCEs.  An implementation with unmap() will come in
> future (and requires a much more complex implementation), but this is a
> good start due to the drastic performance improvement.


Why does not dma_iommu_ops work in this case? I keep asking and yet no
comment in the commit log or mails...


> 
> Signed-off-by: Russell Currey <ruscur@russell.cc>
> ---
>  arch/powerpc/include/asm/dma-mapping.h    |   1 +
>  arch/powerpc/platforms/powernv/Makefile   |   2 +-
>  arch/powerpc/platforms/powernv/pci-dma.c  | 243 ++++++++++++++++++++++
>  arch/powerpc/platforms/powernv/pci-ioda.c |  82 +++-----
>  arch/powerpc/platforms/powernv/pci.h      |   7 +
>  5 files changed, 281 insertions(+), 54 deletions(-)
>  create mode 100644 arch/powerpc/platforms/powernv/pci-dma.c

Too generic file name for such a hack.


> 
> diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> index 8fa394520af6..354f435160f3 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -74,6 +74,7 @@ static inline unsigned long device_to_mask(struct device *dev)
>  extern struct dma_map_ops dma_iommu_ops;
>  #endif
>  extern const struct dma_map_ops dma_nommu_ops;
> +extern const struct dma_map_ops dma_pseudo_bypass_ops;
>  
>  static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus)
>  {
> diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
> index 703a350a7f4e..2467bdab3c13 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -6,7 +6,7 @@ obj-y			+= opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
>  obj-y			+= opal-kmsg.o opal-powercap.o opal-psr.o opal-sensor-groups.o
>  
>  obj-$(CONFIG_SMP)	+= smp.o subcore.o subcore-asm.o
> -obj-$(CONFIG_PCI)	+= pci.o pci-ioda.o npu-dma.o
> +obj-$(CONFIG_PCI)	+= pci.o pci-ioda.o npu-dma.o pci-dma.o
>  obj-$(CONFIG_CXL_BASE)	+= pci-cxl.o
>  obj-$(CONFIG_EEH)	+= eeh-powernv.o
>  obj-$(CONFIG_PPC_SCOM)	+= opal-xscom.o
> diff --git a/arch/powerpc/platforms/powernv/pci-dma.c b/arch/powerpc/platforms/powernv/pci-dma.c
> new file mode 100644
> index 000000000000..79382627c7be
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/pci-dma.c
> @@ -0,0 +1,243 @@
> +/*
> + * DMA operations supporting pseudo-bypass for PHB3+
> + *
> + * Author: Russell Currey <ruscur@russell.cc>
> + *
> + * Copyright 2018 IBM Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the
> + * Free Software Foundation; either version 2 of the License, or (at your
> + * option) any later version.
> + */
> +
> +#include <linux/export.h>
> +#include <linux/memblock.h>
> +#include <linux/device.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/hash.h>
> +
> +#include <asm/pci-bridge.h>
> +#include <asm/ppc-pci.h>
> +#include <asm/pnv-pci.h>
> +#include <asm/tce.h>
> +
> +#include "pci.h"
> +
> +/*
> + * This is a naive implementation that directly operates on TCEs, allocating
> + * on demand.  There is no locking or refcounts since no TCEs are ever removed
> + * and unmap does nothing.
> + */
> +static dma_addr_t dma_pseudo_bypass_get_address(struct device *dev,
> +					    phys_addr_t addr)
> +{
> +	struct pci_dev *pdev = container_of(dev, struct pci_dev, dev);
> +	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> +	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_ioda_pe *pe;
> +	u64 i, tce, ret, offset;
> +	__be64 entry;
> +
> +	offset = addr & ((1 << phb->ioda.max_tce_order) - 1);
> +
> +	pe = &phb->ioda.pe_array[pci_get_pdn(pdev)->pe_number];
> +
> +	/* look through the tracking table for a free entry */
> +	for (i = 0; i < pe->tce_count; i++) {
> +		/* skip between 2GB and 4GB */
> +		if ((i << phb->ioda.max_tce_order) >= 0x80000000 &&
> +		    (i << phb->ioda.max_tce_order) < 0x100000000)
> +			continue;
> +
> +		tce = be64_to_cpu(pe->tces[i]);
> +
> +		/* if the TCE is already valid (read + write) */
> +		if ((tce & 3) == 3) {
> +			/* check if we're already allocated, if not move on */
> +			if (tce >> phb->ioda.max_tce_order ==
> +			    addr >> phb->ioda.max_tce_order) {
> +				/* wait for the lock bit to clear */
> +				while (be64_to_cpu(pe->tces[i]) & 4)
> +					cpu_relax();
> +
> +				return (i << phb->ioda.max_tce_order) | offset;
> +			}
> +
> +			continue;
> +		}
> +
> +		/*
> +		 * The TCE isn't being used, so let's try and allocate it.
> +		 * Bits 0 and 1 are read/write, and we use bit 2 as a "lock"
> +		 * bit.  This is to prevent any race where the value is set in
> +		 * the TCE table but the invalidate/mb() hasn't finished yet.
> +		 */
> +		entry = cpu_to_be64((addr - offset) | 7);
> +		ret = cmpxchg(&pe->tces[i], tce, entry);
> +		if (ret != tce) {
> +			/* conflict, start looking again just in case */
> +			i--;
> +			continue;
> +		}
> +		pnv_pci_phb3_tce_invalidate(pe, 0, 0, addr - offset, 1);
> +		mb();
> +		/* clear the lock bit now that we know it's active */
> +		ret = cmpxchg(&pe->tces[i], entry, cpu_to_be64((addr - offset) | 3));
> +		if (ret != entry) {
> +			/* conflict, start looking again just in case */
> +			i--;
> +			continue;
> +		}
> +
> +		return (i << phb->ioda.max_tce_order) | offset;
> +	}
> +	/* If we get here, the table must be full, so error out. */
> +	return -1ULL;
> +}
> +
> +/*
> + * For now, don't actually do anything on unmap.
> + */
> +static void dma_pseudo_bypass_unmap_address(struct device *dev, dma_addr_t dma_addr)
> +{
> +}
> +
> +static int dma_pseudo_bypass_dma_supported(struct device *dev, u64 mask)
> +{
> +	/*
> +	 * Normally dma_supported() checks if the mask is capable of addressing
> +	 * all of memory.  Since we map physical memory in chunks that the
> +	 * device can address, the device will be able to address whatever it
> +	 * wants - just not all at once.
> +	 */
> +	return 1;
> +}
> +
> +static void *dma_pseudo_bypass_alloc_coherent(struct device *dev,
> +					  size_t size,
> +					  dma_addr_t *dma_handle,
> +					  gfp_t flag,
> +					  unsigned long attrs)
> +{
> +	void *ret;
> +	struct page *page;
> +	int node = dev_to_node(dev);
> +
> +	/* ignore region specifiers */
> +	flag &= ~(__GFP_HIGHMEM);
> +
> +	page = alloc_pages_node(node, flag, get_order(size));
> +	if (page == NULL)
> +		return NULL;
> +	ret = page_address(page);
> +	memset(ret, 0, size);
> +	*dma_handle = dma_pseudo_bypass_get_address(dev, __pa(ret));
> +
> +	return ret;
> +}
> +
> +static void dma_pseudo_bypass_free_coherent(struct device *dev,
> +					 size_t size,
> +					 void *vaddr,
> +					 dma_addr_t dma_handle,
> +					 unsigned long attrs)
> +{
> +	free_pages((unsigned long)vaddr, get_order(size));
> +}
> +
> +static int dma_pseudo_bypass_mmap_coherent(struct device *dev,
> +				       struct vm_area_struct *vma,
> +				       void *cpu_addr,
> +				       dma_addr_t handle,
> +				       size_t size,
> +				       unsigned long attrs)
> +{
> +	unsigned long pfn = page_to_pfn(virt_to_page(cpu_addr));
> +
> +	return remap_pfn_range(vma, vma->vm_start,
> +			       pfn + vma->vm_pgoff,
> +			       vma->vm_end - vma->vm_start,
> +			       vma->vm_page_prot);
> +}
> +
> +static inline dma_addr_t dma_pseudo_bypass_map_page(struct device *dev,
> +						struct page *page,
> +						unsigned long offset,
> +						size_t size,
> +						enum dma_data_direction dir,
> +						unsigned long attrs)
> +{
> +	BUG_ON(dir == DMA_NONE);
> +
> +	return dma_pseudo_bypass_get_address(dev, page_to_phys(page) + offset);
> +}
> +
> +static inline void dma_pseudo_bypass_unmap_page(struct device *dev,
> +					 dma_addr_t dma_address,
> +					 size_t size,
> +					 enum dma_data_direction direction,
> +					 unsigned long attrs)
> +{
> +	dma_pseudo_bypass_unmap_address(dev, dma_address);
> +}
> +
> +
> +static int dma_pseudo_bypass_map_sg(struct device *dev, struct scatterlist *sgl,
> +			     int nents, enum dma_data_direction direction,
> +			     unsigned long attrs)
> +{
> +	struct scatterlist *sg;
> +	int i;
> +
> +
> +	for_each_sg(sgl, sg, nents, i) {
> +		sg->dma_address = dma_pseudo_bypass_get_address(dev, sg_phys(sg));
> +		sg->dma_length = sg->length;
> +
> +		__dma_sync_page(sg_page(sg), sg->offset, sg->length, direction);
> +	}
> +
> +	return nents;
> +}
> +
> +static void dma_pseudo_bypass_unmap_sg(struct device *dev, struct scatterlist *sgl,
> +				int nents, enum dma_data_direction direction,
> +				unsigned long attrs)
> +{
> +	struct scatterlist *sg;
> +	int i;
> +
> +	for_each_sg(sgl, sg, nents, i) {
> +		dma_pseudo_bypass_unmap_address(dev, sg->dma_address);
> +	}
> +}
> +
> +static u64 dma_pseudo_bypass_get_required_mask(struct device *dev)
> +{
> +	/*
> +	 * there's no limitation on our end, the driver should just call
> +	 * set_mask() with as many bits as the device can address.
> +	 */
> +	return -1ULL;
> +}
> +
> +static int dma_pseudo_bypass_mapping_error(struct device *dev, dma_addr_t dma_addr)
> +{
> +	return dma_addr == -1ULL;




> +}
> +
> +
> +const struct dma_map_ops dma_pseudo_bypass_ops = {
> +	.alloc				= dma_pseudo_bypass_alloc_coherent,
> +	.free				= dma_pseudo_bypass_free_coherent,
> +	.mmap				= dma_pseudo_bypass_mmap_coherent,
> +	.map_sg				= dma_pseudo_bypass_map_sg,
> +	.unmap_sg			= dma_pseudo_bypass_unmap_sg,
> +	.dma_supported			= dma_pseudo_bypass_dma_supported,
> +	.map_page			= dma_pseudo_bypass_map_page,
> +	.unmap_page			= dma_pseudo_bypass_unmap_page,
> +	.get_required_mask		= dma_pseudo_bypass_get_required_mask,
> +	.mapping_error			= dma_pseudo_bypass_mapping_error,
> +};
> +EXPORT_SYMBOL(dma_pseudo_bypass_ops);
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 17c590087279..d2ca214610fd 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1088,6 +1088,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
>  	pe->pbus = NULL;
>  	pe->mve_number = -1;
>  	pe->rid = dev->bus->number << 8 | pdn->devfn;
> +	pe->tces = NULL;
>  
>  	pe_info(pe, "Associated device to PE\n");
>  
> @@ -1569,6 +1570,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>  		pe->mve_number = -1;
>  		pe->rid = (pci_iov_virtfn_bus(pdev, vf_index) << 8) |
>  			   pci_iov_virtfn_devfn(pdev, vf_index);
> +		pe->tces = NULL;
>  
>  		pe_info(pe, "VF %04d:%02d:%02d.%d associated with PE#%x\n",
>  			hose->global_number, pdev->bus->number,
> @@ -1774,43 +1776,22 @@ static bool pnv_pci_ioda_pe_single_vendor(struct pnv_ioda_pe *pe)
>  	return true;
>  }
>  
> -/*
> - * Reconfigure TVE#0 to be usable as 64-bit DMA space.
> - *
> - * The first 4GB of virtual memory for a PE is reserved for 32-bit accesses.
> - * Devices can only access more than that if bit 59 of the PCI address is set
> - * by hardware, which indicates TVE#1 should be used instead of TVE#0.
> - * Many PCI devices are not capable of addressing that many bits, and as a
> - * result are limited to the 4GB of virtual memory made available to 32-bit
> - * devices in TVE#0.
> - *
> - * In order to work around this, reconfigure TVE#0 to be suitable for 64-bit
> - * devices by configuring the virtual memory past the first 4GB inaccessible
> - * by 64-bit DMAs.  This should only be used by devices that want more than
> - * 4GB, and only on PEs that have no 32-bit devices.
> - *
> - * Currently this will only work on PHB3 (POWER8).
> - */
> -static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe)
> +static int pnv_pci_pseudo_bypass_setup(struct pnv_ioda_pe *pe)
>  {
> -	u64 window_size, table_size, tce_count, addr;
> +	u64 tce_count, table_size, window_size;
> +	struct pnv_phb *p = pe->phb;
>  	struct page *table_pages;
> -	u64 tce_order = 28; /* 256MB TCEs */
>  	__be64 *tces;
> -	s64 rc;
> +	int rc = -ENOMEM;
>  
> -	/*
> -	 * Window size needs to be a power of two, but needs to account for
> -	 * shifting memory by the 4GB offset required to skip 32bit space.
> -	 */
> -	window_size = roundup_pow_of_two(memory_hotplug_max() + (1ULL << 32));
> -	tce_count = window_size >> tce_order;
> +	window_size = roundup_pow_of_two(memory_hotplug_max());
> +	tce_count = window_size >> p->ioda.max_tce_order;
>  	table_size = tce_count << 3;
>  
>  	if (table_size < PAGE_SIZE)
>  		table_size = PAGE_SIZE;
>  
> -	table_pages = alloc_pages_node(pe->phb->hose->node, GFP_KERNEL,
> +	table_pages = alloc_pages_node(p->hose->node, GFP_KERNEL,
>  				       get_order(table_size));


pnv_pci_ioda2_create_table() allocates the table, why does not it work for this task?


>  	if (!table_pages)
>  		goto err;
> @@ -1821,26 +1802,23 @@ static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe)
>  
>  	memset(tces, 0, table_size);
>  
> -	for (addr = 0; addr < memory_hotplug_max(); addr += (1 << tce_order)) {
> -		tces[(addr + (1ULL << 32)) >> tce_order] =
> -			cpu_to_be64(addr | TCE_PCI_READ | TCE_PCI_WRITE);
> -	}
> +	pe->tces = tces;
> +	pe->tce_count = tce_count;
>  
>  	rc = opal_pci_map_pe_dma_window(pe->phb->opal_id,
>  					pe->pe_number,
> -					/* reconfigure window 0 */
>  					(pe->pe_number << 1) + 0,
>  					1,
>  					__pa(tces),
>  					table_size,
> -					1 << tce_order);
> +					1 << p->ioda.max_tce_order);
>  	if (rc == OPAL_SUCCESS) {
> -		pe_info(pe, "Using 64-bit DMA iommu bypass (through TVE#0)\n");
> +		pe_info(pe, "TCE tables configured for pseudo-bypass\n");
>  		return 0;
>  	}
>  err:
> -	pe_err(pe, "Error configuring 64-bit DMA bypass\n");
> -	return -EIO;
> +	pe_err(pe, "error configuring pseudo-bypass\n");
> +	return rc;
>  }
>  
>  static int pnv_pci_ioda_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)
> @@ -1851,7 +1829,6 @@ static int pnv_pci_ioda_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)
>  	struct pnv_ioda_pe *pe;
>  	uint64_t top;
>  	bool bypass = false;
> -	s64 rc;
>  
>  	if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
>  		return -ENODEV;
> @@ -1868,21 +1845,15 @@ static int pnv_pci_ioda_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)
>  	} else {
>  		/*
>  		 * If the device can't set the TCE bypass bit but still wants
> -		 * to access 4GB or more, on PHB3 we can reconfigure TVE#0 to
> -		 * bypass the 32-bit region and be usable for 64-bit DMAs.
> -		 * The device needs to be able to address all of this space.
> +		 * to access 4GB or more, we need to use a different set of DMA
> +		 * operations with an indirect mapping.
>  		 */
>  		if (dma_mask >> 32 &&
> -		    dma_mask > (memory_hotplug_max() + (1ULL << 32)) &&
> -		    pnv_pci_ioda_pe_single_vendor(pe) &&
> -		    phb->model == PNV_PHB_MODEL_PHB3) {
> -			/* Configure the bypass mode */
> -			rc = pnv_pci_ioda_dma_64bit_bypass(pe);
> -			if (rc)
> -				return rc;
> -			/* 4GB offset bypasses 32-bit space */
> -			set_dma_offset(&pdev->dev, (1ULL << 32));
> -			set_dma_ops(&pdev->dev, &dma_nommu_ops);
> +		    phb->model != PNV_PHB_MODEL_P7IOC &&
> +		    pnv_pci_ioda_pe_single_vendor(pe)) {
> +			if (!pe->tces)
> +				pnv_pci_pseudo_bypass_setup(pe);
> +			set_dma_ops(&pdev->dev, &dma_pseudo_bypass_ops);
>  		} else if (dma_mask >> 32 && dma_mask != DMA_BIT_MASK(64)) {
>  			/*
>  			 * Fail the request if a DMA mask between 32 and 64 bits
> @@ -2071,7 +2042,7 @@ static inline void pnv_pci_phb3_tce_invalidate_pe(struct pnv_ioda_pe *pe)
>  	__raw_writeq_be(val, invalidate);
>  }
>  
> -static void pnv_pci_phb3_tce_invalidate(struct pnv_ioda_pe *pe, bool rm,
> +void pnv_pci_phb3_tce_invalidate(struct pnv_ioda_pe *pe, bool rm,
>  					unsigned shift, unsigned long index,
>  					unsigned long npages)
>  {
> @@ -2611,10 +2582,15 @@ static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
>  static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
>  {
>  	struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> -						table_group);
> +					      table_group);
> +
>  	/* Store @tbl as pnv_pci_ioda2_unset_window() resets it */
>  	struct iommu_table *tbl = pe->table_group.tables[0];
>  
> +	if (pe->tces)
> +		free_pages((unsigned long)pe->tces,
> +			   get_order(pe->tce_count << 3));
> +
>  	pnv_pci_ioda2_set_bypass(pe, false);
>  	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
>  	if (pe->pbus)
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index c9952def5e93..56846ddc76a2 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -70,6 +70,10 @@ struct pnv_ioda_pe {
>  	bool			tce_bypass_enabled;
>  	uint64_t		tce_bypass_base;
>  
> +	/* TCE tables for DMA pseudo-bypass */
> +	__be64			*tces;
> +	u64			tce_count;


pnv_ioda_pe::table_group::tables[0] has a pointer to pages, a table size in entries and a page size, why duplicate all of this again?


> +
>  	/* MSIs. MVE index is identical for for 32 and 64 bit MSI
>  	 * and -1 if not supported. (It's actually identical to the
>  	 * PE number)
> @@ -211,6 +215,9 @@ extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
>  extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
>  		unsigned long *hpa, enum dma_data_direction *direction);
>  extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
> +extern void pnv_pci_phb3_tce_invalidate(struct pnv_ioda_pe *pe, bool rm,
> +					unsigned shift, unsigned long index,
> +					unsigned long npages);
>  
>  void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
>  				unsigned char *log_buff);
> -- 
> 2.17.1
> 



--
Alexey

^ permalink raw reply

* [PATCH v4 09/11] macintosh/via-pmu: Replace via-pmu68k driver with via-pmu driver
From: Finn Thain @ 2018-07-02  8:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Schmitz, linuxppc-dev, linux-m68k, linux-kernel,
	Geert Uytterhoeven
In-Reply-To: <cover.1530519301.git.fthain@telegraphics.com.au>

Now that the PowerMac via-pmu driver supports m68k PowerBooks,
switch over to that driver and remove the via-pmu68k driver.

Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
 arch/m68k/configs/mac_defconfig   |   2 +-
 arch/m68k/configs/multi_defconfig |   2 +-
 arch/m68k/mac/config.c            |   2 +-
 arch/m68k/mac/misc.c              |  48 +--
 drivers/macintosh/Kconfig         |  13 +-
 drivers/macintosh/Makefile        |   1 -
 drivers/macintosh/adb.c           |   2 +-
 drivers/macintosh/via-pmu68k.c    | 846 --------------------------------------
 include/uapi/linux/pmu.h          |   2 +-
 9 files changed, 14 insertions(+), 904 deletions(-)
 delete mode 100644 drivers/macintosh/via-pmu68k.c

diff --git a/arch/m68k/configs/mac_defconfig b/arch/m68k/configs/mac_defconfig
index b52e597899eb..087ca15e32f1 100644
--- a/arch/m68k/configs/mac_defconfig
+++ b/arch/m68k/configs/mac_defconfig
@@ -369,7 +369,7 @@ CONFIG_TCM_PSCSI=m
 CONFIG_ADB=y
 CONFIG_ADB_MACII=y
 CONFIG_ADB_IOP=y
-CONFIG_ADB_PMU68K=y
+CONFIG_ADB_PMU=y
 CONFIG_ADB_CUDA=y
 CONFIG_INPUT_ADBHID=y
 CONFIG_MAC_EMUMOUSEBTN=y
diff --git a/arch/m68k/configs/multi_defconfig b/arch/m68k/configs/multi_defconfig
index 2a84eeec5b02..3f9334084d55 100644
--- a/arch/m68k/configs/multi_defconfig
+++ b/arch/m68k/configs/multi_defconfig
@@ -402,7 +402,7 @@ CONFIG_TCM_PSCSI=m
 CONFIG_ADB=y
 CONFIG_ADB_MACII=y
 CONFIG_ADB_IOP=y
-CONFIG_ADB_PMU68K=y
+CONFIG_ADB_PMU=y
 CONFIG_ADB_CUDA=y
 CONFIG_INPUT_ADBHID=y
 CONFIG_MAC_EMUMOUSEBTN=y
diff --git a/arch/m68k/mac/config.c b/arch/m68k/mac/config.c
index e522307db47c..92e80cf0d8aa 100644
--- a/arch/m68k/mac/config.c
+++ b/arch/m68k/mac/config.c
@@ -891,7 +891,7 @@ static void __init mac_identify(void)
 #ifdef CONFIG_ADB_CUDA
 	find_via_cuda();
 #endif
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
 	find_via_pmu();
 #endif
 }
diff --git a/arch/m68k/mac/misc.c b/arch/m68k/mac/misc.c
index 7ccb799eeb57..28090a44fa09 100644
--- a/arch/m68k/mac/misc.c
+++ b/arch/m68k/mac/misc.c
@@ -85,7 +85,7 @@ static void cuda_write_pram(int offset, __u8 data)
 }
 #endif /* CONFIG_ADB_CUDA */
 
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
 static long pmu_read_time(void)
 {
 	struct adb_request req;
@@ -136,7 +136,7 @@ static void pmu_write_pram(int offset, __u8 data)
 	while (!req.complete)
 		pmu_poll();
 }
-#endif /* CONFIG_ADB_PMU68K */
+#endif /* CONFIG_ADB_PMU */
 
 /*
  * VIA PRAM/RTC access routines
@@ -367,38 +367,6 @@ static void cuda_shutdown(void)
 }
 #endif /* CONFIG_ADB_CUDA */
 
-#ifdef CONFIG_ADB_PMU68K
-
-void pmu_restart(void)
-{
-	struct adb_request req;
-	if (pmu_request(&req, NULL,
-			2, PMU_SET_INTR_MASK, PMU_INT_ADB|PMU_INT_TICK) < 0)
-		return;
-	while (!req.complete)
-		pmu_poll();
-	if (pmu_request(&req, NULL, 1, PMU_RESET) < 0)
-		return;
-	while (!req.complete)
-		pmu_poll();
-}
-
-void pmu_shutdown(void)
-{
-	struct adb_request req;
-	if (pmu_request(&req, NULL,
-			2, PMU_SET_INTR_MASK, PMU_INT_ADB|PMU_INT_TICK) < 0)
-		return;
-	while (!req.complete)
-		pmu_poll();
-	if (pmu_request(&req, NULL, 5, PMU_SHUTDOWN, 'M', 'A', 'T', 'T') < 0)
-		return;
-	while (!req.complete)
-		pmu_poll();
-}
-
-#endif
-
 /*
  *-------------------------------------------------------------------
  * Below this point are the generic routines; they'll dispatch to the
@@ -423,7 +391,7 @@ void mac_pram_read(int offset, __u8 *buffer, int len)
 		func = cuda_read_pram;
 		break;
 #endif
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
 	case MAC_ADB_PB2:
 		func = pmu_read_pram;
 		break;
@@ -453,7 +421,7 @@ void mac_pram_write(int offset, __u8 *buffer, int len)
 		func = cuda_write_pram;
 		break;
 #endif
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
 	case MAC_ADB_PB2:
 		func = pmu_write_pram;
 		break;
@@ -477,7 +445,7 @@ void mac_poweroff(void)
 	           macintosh_config->adb_type == MAC_ADB_CUDA) {
 		cuda_shutdown();
 #endif
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
 	} else if (macintosh_config->adb_type == MAC_ADB_PB2) {
 		pmu_shutdown();
 #endif
@@ -518,7 +486,7 @@ void mac_reset(void)
 	           macintosh_config->adb_type == MAC_ADB_CUDA) {
 		cuda_restart();
 #endif
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
 	} else if (macintosh_config->adb_type == MAC_ADB_PB2) {
 		pmu_restart();
 #endif
@@ -670,7 +638,7 @@ int mac_hwclk(int op, struct rtc_time *t)
 			now = cuda_read_time();
 			break;
 #endif
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
 		case MAC_ADB_PB2:
 			now = pmu_read_time();
 			break;
@@ -706,7 +674,7 @@ int mac_hwclk(int op, struct rtc_time *t)
 			cuda_write_time(now);
 			break;
 #endif
-#ifdef CONFIG_ADB_PMU68K
+#ifdef CONFIG_ADB_PMU
 		case MAC_ADB_PB2:
 			pmu_write_time(now);
 			break;
diff --git a/drivers/macintosh/Kconfig b/drivers/macintosh/Kconfig
index 26abae4c899d..47c350cdfb12 100644
--- a/drivers/macintosh/Kconfig
+++ b/drivers/macintosh/Kconfig
@@ -39,17 +39,6 @@ config ADB_IOP
 	  <http://www.angelfire.com/ca2/dev68k/iopdesc.html> to enable direct
 	  support for it, say 'Y' here.
 
-config ADB_PMU68K
-	bool "Include PMU (Powerbook) ADB driver"
-	depends on ADB && MAC
-	help
-	  Say Y here if want your kernel to support the m68k based Powerbooks.
-	  This includes the PowerBook 140, PowerBook 145, PowerBook 150,
-	  PowerBook 160, PowerBook 165, PowerBook 165c, PowerBook 170,
-	  PowerBook 180, PowerBook, 180c, PowerBook 190cs, PowerBook 520,
-	  PowerBook Duo 210, PowerBook Duo 230, PowerBook Duo 250,
-	  PowerBook Duo 270c, PowerBook Duo 280 and PowerBook Duo 280c.
-
 # we want to change this to something like CONFIG_SYSCTRL_CUDA/PMU
 config ADB_CUDA
 	bool "Support for Cuda/Egret based Macs and PowerMacs"
@@ -66,7 +55,7 @@ config ADB_CUDA
 
 config ADB_PMU
 	bool "Support for PMU based PowerMacs and PowerBooks"
-	depends on PPC_PMAC
+	depends on PPC_PMAC || MAC
 	help
 	  On PowerBooks, iBooks, and recent iMacs and Power Macintoshes, the
 	  PMU is an embedded microprocessor whose primary function is to
diff --git a/drivers/macintosh/Makefile b/drivers/macintosh/Makefile
index ee803638e595..49819b1b6f20 100644
--- a/drivers/macintosh/Makefile
+++ b/drivers/macintosh/Makefile
@@ -22,7 +22,6 @@ obj-$(CONFIG_PMAC_SMU)		+= smu.o
 obj-$(CONFIG_ADB)		+= adb.o
 obj-$(CONFIG_ADB_MACII)		+= via-macii.o
 obj-$(CONFIG_ADB_IOP)		+= adb-iop.o
-obj-$(CONFIG_ADB_PMU68K)	+= via-pmu68k.o
 obj-$(CONFIG_ADB_MACIO)		+= macio-adb.o
 
 obj-$(CONFIG_THERM_WINDTUNNEL)	+= therm_windtunnel.o
diff --git a/drivers/macintosh/adb.c b/drivers/macintosh/adb.c
index 4c8097e0e6fe..76e98f0f7a3e 100644
--- a/drivers/macintosh/adb.c
+++ b/drivers/macintosh/adb.c
@@ -65,7 +65,7 @@ static struct adb_driver *adb_driver_list[] = {
 #ifdef CONFIG_ADB_IOP
 	&adb_iop_driver,
 #endif
-#if defined(CONFIG_ADB_PMU) || defined(CONFIG_ADB_PMU68K)
+#ifdef CONFIG_ADB_PMU
 	&via_pmu_driver,
 #endif
 #ifdef CONFIG_ADB_MACIO
diff --git a/drivers/macintosh/via-pmu68k.c b/drivers/macintosh/via-pmu68k.c
deleted file mode 100644
index bec8e1837d7d..000000000000
--- a/drivers/macintosh/via-pmu68k.c
+++ /dev/null
@@ -1,846 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * Device driver for the PMU on 68K-based Apple PowerBooks
- *
- * The VIA (versatile interface adapter) interfaces to the PMU,
- * a 6805 microprocessor core whose primary function is to control
- * battery charging and system power on the PowerBooks.
- * The PMU also controls the ADB (Apple Desktop Bus) which connects
- * to the keyboard and mouse, as well as the non-volatile RAM
- * and the RTC (real time clock) chip.
- *
- * Adapted for 68K PMU by Joshua M. Thompson
- *
- * Based largely on the PowerMac PMU code by Paul Mackerras and
- * Fabio Riccardi.
- *
- * Also based on the PMU driver from MkLinux by Apple Computer, Inc.
- * and the Open Software Foundation, Inc.
- */
-
-#include <stdarg.h>
-#include <linux/types.h>
-#include <linux/errno.h>
-#include <linux/kernel.h>
-#include <linux/delay.h>
-#include <linux/miscdevice.h>
-#include <linux/blkdev.h>
-#include <linux/pci.h>
-#include <linux/init.h>
-#include <linux/interrupt.h>
-
-#include <linux/adb.h>
-#include <linux/pmu.h>
-#include <linux/cuda.h>
-
-#include <asm/macintosh.h>
-#include <asm/macints.h>
-#include <asm/mac_via.h>
-
-#include <asm/pgtable.h>
-#include <asm/irq.h>
-#include <linux/uaccess.h>
-
-/* Misc minor number allocated for /dev/pmu */
-#define PMU_MINOR	154
-
-/* VIA registers - spaced 0x200 bytes apart */
-#define RS		0x200		/* skip between registers */
-#define B		0		/* B-side data */
-#define A		RS		/* A-side data */
-#define DIRB		(2*RS)		/* B-side direction (1=output) */
-#define DIRA		(3*RS)		/* A-side direction (1=output) */
-#define T1CL		(4*RS)		/* Timer 1 ctr/latch (low 8 bits) */
-#define T1CH		(5*RS)		/* Timer 1 counter (high 8 bits) */
-#define T1LL		(6*RS)		/* Timer 1 latch (low 8 bits) */
-#define T1LH		(7*RS)		/* Timer 1 latch (high 8 bits) */
-#define T2CL		(8*RS)		/* Timer 2 ctr/latch (low 8 bits) */
-#define T2CH		(9*RS)		/* Timer 2 counter (high 8 bits) */
-#define SR		(10*RS)		/* Shift register */
-#define ACR		(11*RS)		/* Auxiliary control register */
-#define PCR		(12*RS)		/* Peripheral control register */
-#define IFR		(13*RS)		/* Interrupt flag register */
-#define IER		(14*RS)		/* Interrupt enable register */
-#define ANH		(15*RS)		/* A-side data, no handshake */
-
-/* Bits in B data register: both active low */
-#define TACK		0x02		/* Transfer acknowledge (input) */
-#define TREQ		0x04		/* Transfer request (output) */
-
-/* Bits in ACR */
-#define SR_CTRL		0x1c		/* Shift register control bits */
-#define SR_EXT		0x0c		/* Shift on external clock */
-#define SR_OUT		0x10		/* Shift out if 1 */
-
-/* Bits in IFR and IER */
-#define SR_INT		0x04		/* Shift register full/empty */
-#define CB1_INT		0x10		/* transition on CB1 input */
-
-static enum pmu_state {
-	idle,
-	sending,
-	intack,
-	reading,
-	reading_intr,
-} pmu_state;
-
-static struct adb_request *current_req;
-static struct adb_request *last_req;
-static struct adb_request *req_awaiting_reply;
-static unsigned char interrupt_data[32];
-static unsigned char *reply_ptr;
-static int data_index;
-static int data_len;
-static int adb_int_pending;
-static int pmu_adb_flags;
-static int adb_dev_map;
-static struct adb_request bright_req_1, bright_req_2, bright_req_3;
-static int pmu_kind = PMU_UNKNOWN;
-static int pmu_fully_inited;
-
-int asleep;
-
-static int pmu_probe(void);
-static int pmu_init(void);
-static void pmu_start(void);
-static irqreturn_t pmu_interrupt(int irq, void *arg);
-static int pmu_send_request(struct adb_request *req, int sync);
-static int pmu_autopoll(int devs);
-void pmu_poll(void);
-static int pmu_reset_bus(void);
-
-static int init_pmu(void);
-static void pmu_start(void);
-static void send_byte(int x);
-static void recv_byte(void);
-static void pmu_done(struct adb_request *req);
-static void pmu_handle_data(unsigned char *data, int len);
-static void set_volume(int level);
-static void pmu_enable_backlight(int on);
-static void pmu_set_brightness(int level);
-
-struct adb_driver via_pmu_driver = {
-	.name         = "68K PMU",
-	.probe        = pmu_probe,
-	.init         = pmu_init,
-	.send_request = pmu_send_request,
-	.autopoll     = pmu_autopoll,
-	.poll         = pmu_poll,
-	.reset_bus    = pmu_reset_bus,
-};
-
-/*
- * This table indicates for each PMU opcode:
- * - the number of data bytes to be sent with the command, or -1
- *   if a length byte should be sent,
- * - the number of response bytes which the PMU will return, or
- *   -1 if it will send a length byte.
- */
-static s8 pmu_data_len[256][2] = {
-/*	   0	   1	   2	   3	   4	   5	   6	   7  */
-/*00*/	{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},
-/*08*/	{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},
-/*10*/	{ 1, 0},{ 1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},
-/*18*/	{ 0, 1},{ 0, 1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{ 0, 0},
-/*20*/	{-1, 0},{ 0, 0},{ 2, 0},{ 1, 0},{ 1, 0},{-1, 0},{-1, 0},{-1, 0},
-/*28*/	{ 0,-1},{ 0,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{ 0,-1},
-/*30*/	{ 4, 0},{20, 0},{-1, 0},{ 3, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},
-/*38*/	{ 0, 4},{ 0,20},{ 2,-1},{ 2, 1},{ 3,-1},{-1,-1},{-1,-1},{ 4, 0},
-/*40*/	{ 1, 0},{ 1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},
-/*48*/	{ 0, 1},{ 0, 1},{-1,-1},{ 1, 0},{ 1, 0},{-1,-1},{-1,-1},{-1,-1},
-/*50*/	{ 1, 0},{ 0, 0},{ 2, 0},{ 2, 0},{-1, 0},{ 1, 0},{ 3, 0},{ 1, 0},
-/*58*/	{ 0, 1},{ 1, 0},{ 0, 2},{ 0, 2},{ 0,-1},{-1,-1},{-1,-1},{-1,-1},
-/*60*/	{ 2, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},
-/*68*/	{ 0, 3},{ 0, 3},{ 0, 2},{ 0, 8},{ 0,-1},{ 0,-1},{-1,-1},{-1,-1},
-/*70*/	{ 1, 0},{ 1, 0},{ 1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},
-/*78*/	{ 0,-1},{ 0,-1},{-1,-1},{-1,-1},{-1,-1},{ 5, 1},{ 4, 1},{ 4, 1},
-/*80*/	{ 4, 0},{-1, 0},{ 0, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},
-/*88*/	{ 0, 5},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},
-/*90*/	{ 1, 0},{ 2, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},
-/*98*/	{ 0, 1},{ 0, 1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},
-/*a0*/	{ 2, 0},{ 2, 0},{ 2, 0},{ 4, 0},{-1, 0},{ 0, 0},{-1, 0},{-1, 0},
-/*a8*/	{ 1, 1},{ 1, 0},{ 3, 0},{ 2, 0},{-1,-1},{-1,-1},{-1,-1},{-1,-1},
-/*b0*/	{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},
-/*b8*/	{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},
-/*c0*/	{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},
-/*c8*/	{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},
-/*d0*/	{ 0, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},
-/*d8*/	{ 1, 1},{ 1, 1},{-1,-1},{-1,-1},{ 0, 1},{ 0,-1},{-1,-1},{-1,-1},
-/*e0*/	{-1, 0},{ 4, 0},{ 0, 1},{-1, 0},{-1, 0},{ 4, 0},{-1, 0},{-1, 0},
-/*e8*/	{ 3,-1},{-1,-1},{ 0, 1},{-1,-1},{ 0,-1},{-1,-1},{-1,-1},{ 0, 0},
-/*f0*/	{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},{-1, 0},
-/*f8*/	{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},{-1,-1},
-};
-
-int __init find_via_pmu(void)
-{
-	switch (macintosh_config->adb_type) {
-	case MAC_ADB_PB2:
-		pmu_kind = PMU_68K_V2;
-		break;
-	default:
-		pmu_kind = PMU_UNKNOWN;
-		return -ENODEV;
-	}
-
-	pmu_state = idle;
-
-	if (!init_pmu())
-		goto fail_init;
-
-	pr_info("adb: PMU 68K driver v0.5 for Unified ADB\n");
-
-	return 1;
-
-fail_init:
-	pmu_kind = PMU_UNKNOWN;
-	return 0;
-}
-
-static int pmu_probe(void)
-{
-	if (pmu_kind == PMU_UNKNOWN)
-		return -ENODEV;
-	return 0;
-}
-
-static int pmu_init(void)
-{
-	if (pmu_kind == PMU_UNKNOWN)
-		return -ENODEV;
-	return 0;
-}
-
-static int __init via_pmu_start(void)
-{
-	if (pmu_kind == PMU_UNKNOWN)
-		return -ENODEV;
-
-	if (request_irq(IRQ_MAC_ADB_SR, pmu_interrupt, 0, "PMU_SR",
-			pmu_interrupt)) {
-		pr_err("%s: can't get SR irq\n", __func__);
-		return -ENODEV;
-	}
-	if (request_irq(IRQ_MAC_ADB_CL, pmu_interrupt, 0, "PMU_CL",
-			pmu_interrupt)) {
-		pr_err("%s: can't get CL irq\n", __func__);
-		free_irq(IRQ_MAC_ADB_SR, pmu_interrupt);
-		return -ENODEV;
-	}
-
-	pmu_fully_inited = 1;
-
-	/* Enable backlight */
-	pmu_enable_backlight(1);
-
-	return 0;
-}
-
-arch_initcall(via_pmu_start);
-
-static int __init init_pmu(void)
-{
-	int timeout;
-	volatile struct adb_request req;
-
-	via2[B] |= TREQ;				/* negate TREQ */
-	via2[DIRB] = (via2[DIRB] | TREQ) & ~TACK;	/* TACK in, TREQ out */
-
-	pmu_request((struct adb_request *) &req, NULL, 2, PMU_SET_INTR_MASK, PMU_INT_ADB);
-	timeout =  100000;
-	while (!req.complete) {
-		if (--timeout < 0) {
-			printk(KERN_ERR "pmu_init: no response from PMU\n");
-			return -EAGAIN;
-		}
-		udelay(10);
-		pmu_poll();
-	}
-
-	/* ack all pending interrupts */
-	timeout = 100000;
-	interrupt_data[0] = 1;
-	while (interrupt_data[0] || pmu_state != idle) {
-		if (--timeout < 0) {
-			printk(KERN_ERR "pmu_init: timed out acking intrs\n");
-			return -EAGAIN;
-		}
-		if (pmu_state == idle) {
-			adb_int_pending = 1;
-			pmu_interrupt(0, NULL);
-		}
-		pmu_poll();
-		udelay(10);
-	}
-
-	pmu_request((struct adb_request *) &req, NULL, 2, PMU_SET_INTR_MASK,
-			PMU_INT_ADB_AUTO|PMU_INT_SNDBRT|PMU_INT_ADB);
-	timeout =  100000;
-	while (!req.complete) {
-		if (--timeout < 0) {
-			printk(KERN_ERR "pmu_init: no response from PMU\n");
-			return -EAGAIN;
-		}
-		udelay(10);
-		pmu_poll();
-	}
-
-	bright_req_1.complete = 1;
-	bright_req_2.complete = 1;
-	bright_req_3.complete = 1;
-
-	return 1;
-}
-
-int
-pmu_get_model(void)
-{
-	return pmu_kind;
-}
-
-/* Send an ADB command */
-static int 
-pmu_send_request(struct adb_request *req, int sync)
-{
-    int i, ret;
-
-    if (!pmu_fully_inited)
-    {
- 	req->complete = 1;
-   	return -ENXIO;
-   }
-
-    ret = -EINVAL;
-	
-    switch (req->data[0]) {
-    case PMU_PACKET:
-		for (i = 0; i < req->nbytes - 1; ++i)
-			req->data[i] = req->data[i+1];
-		--req->nbytes;
-		if (pmu_data_len[req->data[0]][1] != 0) {
-			req->reply[0] = ADB_RET_OK;
-			req->reply_len = 1;
-		} else
-			req->reply_len = 0;
-		ret = pmu_queue_request(req);
-		break;
-    case CUDA_PACKET:
-		switch (req->data[1]) {
-		case CUDA_GET_TIME:
-			if (req->nbytes != 2)
-				break;
-			req->data[0] = PMU_READ_RTC;
-			req->nbytes = 1;
-			req->reply_len = 3;
-			req->reply[0] = CUDA_PACKET;
-			req->reply[1] = 0;
-			req->reply[2] = CUDA_GET_TIME;
-			ret = pmu_queue_request(req);
-			break;
-		case CUDA_SET_TIME:
-			if (req->nbytes != 6)
-				break;
-			req->data[0] = PMU_SET_RTC;
-			req->nbytes = 5;
-			for (i = 1; i <= 4; ++i)
-				req->data[i] = req->data[i+1];
-			req->reply_len = 3;
-			req->reply[0] = CUDA_PACKET;
-			req->reply[1] = 0;
-			req->reply[2] = CUDA_SET_TIME;
-			ret = pmu_queue_request(req);
-			break;
-		case CUDA_GET_PRAM:
-			if (req->nbytes != 4)
-				break;
-			req->data[0] = PMU_READ_NVRAM;
-			req->data[1] = req->data[2];
-			req->data[2] = req->data[3];
-			req->nbytes = 3;
-			req->reply_len = 3;
-			req->reply[0] = CUDA_PACKET;
-			req->reply[1] = 0;
-			req->reply[2] = CUDA_GET_PRAM;
-			ret = pmu_queue_request(req);
-			break;
-		case CUDA_SET_PRAM:
-			if (req->nbytes != 5)
-				break;
-			req->data[0] = PMU_WRITE_NVRAM;
-			req->data[1] = req->data[2];
-			req->data[2] = req->data[3];
-			req->data[3] = req->data[4];
-			req->nbytes = 4;
-			req->reply_len = 3;
-			req->reply[0] = CUDA_PACKET;
-			req->reply[1] = 0;
-			req->reply[2] = CUDA_SET_PRAM;
-			ret = pmu_queue_request(req);
-			break;
-		}
-		break;
-    case ADB_PACKET:
-		for (i = req->nbytes - 1; i > 1; --i)
-			req->data[i+2] = req->data[i];
-		req->data[3] = req->nbytes - 2;
-		req->data[2] = pmu_adb_flags;
-		/*req->data[1] = req->data[1];*/
-		req->data[0] = PMU_ADB_CMD;
-		req->nbytes += 2;
-		req->reply_expected = 1;
-		req->reply_len = 0;
-		ret = pmu_queue_request(req);
-		break;
-    }
-    if (ret)
-    {
-    	req->complete = 1;
-    	return ret;
-    }
-    	
-    if (sync) {
-	while (!req->complete)
-		pmu_poll();
-    }
-
-    return 0;
-}
-
-/* Enable/disable autopolling */
-static int 
-pmu_autopoll(int devs)
-{
-	struct adb_request req;
-
-	if (!pmu_fully_inited) return -ENXIO;
-
-	if (devs) {
-		adb_dev_map = devs;
-		pmu_request(&req, NULL, 5, PMU_ADB_CMD, 0, 0x86,
-			    adb_dev_map >> 8, adb_dev_map);
-		pmu_adb_flags = 2;
-	} else {
-		pmu_request(&req, NULL, 1, PMU_ADB_POLL_OFF);
-		pmu_adb_flags = 0;
-	}
-	while (!req.complete)
-		pmu_poll();
-	return 0;
-}
-
-/* Reset the ADB bus */
-static int 
-pmu_reset_bus(void)
-{
-	struct adb_request req;
-	long timeout;
-	int save_autopoll = adb_dev_map;
-
-	if (!pmu_fully_inited) return -ENXIO;
-
-	/* anyone got a better idea?? */
-	pmu_autopoll(0);
-
-	req.nbytes = 5;
-	req.done = NULL;
-	req.data[0] = PMU_ADB_CMD;
-	req.data[1] = 0;
-	req.data[2] = 3; /* ADB_BUSRESET ??? */
-	req.data[3] = 0;
-	req.data[4] = 0;
-	req.reply_len = 0;
-	req.reply_expected = 1;
-	if (pmu_queue_request(&req) != 0)
-	{
-		printk(KERN_ERR "pmu_adb_reset_bus: pmu_queue_request failed\n");
-		return -EIO;
-	}
-	while (!req.complete)
-		pmu_poll();
-	timeout = 100000;
-	while (!req.complete) {
-		if (--timeout < 0) {
-			printk(KERN_ERR "pmu_adb_reset_bus (reset): no response from PMU\n");
-			return -EIO;
-		}
-		udelay(10);
-		pmu_poll();
-	}
-
-	if (save_autopoll != 0)
-		pmu_autopoll(save_autopoll);
-		
-	return 0;
-}
-
-/* Construct and send a pmu request */
-int 
-pmu_request(struct adb_request *req, void (*done)(struct adb_request *),
-	    int nbytes, ...)
-{
-	va_list list;
-	int i;
-
-	if (nbytes < 0 || nbytes > 32) {
-		printk(KERN_ERR "pmu_request: bad nbytes (%d)\n", nbytes);
-		req->complete = 1;
-		return -EINVAL;
-	}
-	req->nbytes = nbytes;
-	req->done = done;
-	va_start(list, nbytes);
-	for (i = 0; i < nbytes; ++i)
-		req->data[i] = va_arg(list, int);
-	va_end(list);
-	if (pmu_data_len[req->data[0]][1] != 0) {
-		req->reply[0] = ADB_RET_OK;
-		req->reply_len = 1;
-	} else
-		req->reply_len = 0;
-	req->reply_expected = 0;
-	return pmu_queue_request(req);
-}
-
-int
-pmu_queue_request(struct adb_request *req)
-{
-	unsigned long flags;
-	int nsend;
-
-	if (req->nbytes <= 0) {
-		req->complete = 1;
-		return 0;
-	}
-	nsend = pmu_data_len[req->data[0]][0];
-	if (nsend >= 0 && req->nbytes != nsend + 1) {
-		req->complete = 1;
-		return -EINVAL;
-	}
-
-	req->next = NULL;
-	req->sent = 0;
-	req->complete = 0;
-	local_irq_save(flags);
-
-	if (current_req != 0) {
-		last_req->next = req;
-		last_req = req;
-	} else {
-		current_req = req;
-		last_req = req;
-		if (pmu_state == idle)
-			pmu_start();
-	}
-
-	local_irq_restore(flags);
-	return 0;
-}
-
-static void 
-send_byte(int x)
-{
-	via1[ACR] |= SR_CTRL;
-	via1[SR] = x;
-	via2[B] &= ~TREQ;		/* assert TREQ */
-}
-
-static void 
-recv_byte(void)
-{
-	char c;
-
-	via1[ACR] = (via1[ACR] | SR_EXT) & ~SR_OUT;
-	c = via1[SR];		/* resets SR */
-	via2[B] &= ~TREQ;
-}
-
-static void 
-pmu_start(void)
-{
-	unsigned long flags;
-	struct adb_request *req;
-
-	/* assert pmu_state == idle */
-	/* get the packet to send */
-	local_irq_save(flags);
-	req = current_req;
-	if (req == 0 || pmu_state != idle
-	    || (req->reply_expected && req_awaiting_reply))
-		goto out;
-
-	pmu_state = sending;
-	data_index = 1;
-	data_len = pmu_data_len[req->data[0]][0];
-
-	/* set the shift register to shift out and send a byte */
-	send_byte(req->data[0]);
-
-out:
-	local_irq_restore(flags);
-}
-
-void 
-pmu_poll(void)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	if (via1[IFR] & SR_INT) {
-		via1[IFR] = SR_INT;
-		pmu_interrupt(IRQ_MAC_ADB_SR, NULL);
-	}
-	if (via1[IFR] & CB1_INT) {
-		via1[IFR] = CB1_INT;
-		pmu_interrupt(IRQ_MAC_ADB_CL, NULL);
-	}
-	local_irq_restore(flags);
-}
-
-static irqreturn_t
-pmu_interrupt(int irq, void *dev_id)
-{
-	struct adb_request *req;
-	int timeout, bite = 0;	/* to prevent compiler warning */
-
-#if 0
-	printk("pmu_interrupt: irq %d state %d acr %02X, b %02X data_index %d/%d adb_int_pending %d\n",
-		irq, pmu_state, (uint) via1[ACR], (uint) via2[B], data_index, data_len, adb_int_pending);
-#endif
-
-	if (irq == IRQ_MAC_ADB_CL) {		/* CB1 interrupt */
-		adb_int_pending = 1;
-	} else if (irq == IRQ_MAC_ADB_SR) {	/* SR interrupt  */
-		if (via2[B] & TACK) {
-			printk(KERN_DEBUG "PMU: SR_INT but ack still high! (%x)\n", via2[B]);
-		}
-
-		/* if reading grab the byte */
-		if ((via1[ACR] & SR_OUT) == 0) bite = via1[SR];
-
-		/* reset TREQ and wait for TACK to go high */
-		via2[B] |= TREQ;
-		timeout = 3200;
-		while (!(via2[B] & TACK)) {
-			if (--timeout < 0) {
-				printk(KERN_ERR "PMU not responding (!ack)\n");
-				goto finish;
-			}
-			udelay(10);
-		}
-
-		switch (pmu_state) {
-		case sending:
-			req = current_req;
-			if (data_len < 0) {
-				data_len = req->nbytes - 1;
-				send_byte(data_len);
-				break;
-			}
-			if (data_index <= data_len) {
-				send_byte(req->data[data_index++]);
-				break;
-			}
-			req->sent = 1;
-			data_len = pmu_data_len[req->data[0]][1];
-			if (data_len == 0) {
-				pmu_state = idle;
-				current_req = req->next;
-				if (req->reply_expected)
-					req_awaiting_reply = req;
-				else
-					pmu_done(req);
-			} else {
-				pmu_state = reading;
-				data_index = 0;
-				reply_ptr = req->reply + req->reply_len;
-				recv_byte();
-			}
-			break;
-
-		case intack:
-			data_index = 0;
-			data_len = -1;
-			pmu_state = reading_intr;
-			reply_ptr = interrupt_data;
-			recv_byte();
-			break;
-
-		case reading:
-		case reading_intr:
-			if (data_len == -1) {
-				data_len = bite;
-				if (bite > 32)
-					printk(KERN_ERR "PMU: bad reply len %d\n",
-					       bite);
-			} else {
-				reply_ptr[data_index++] = bite;
-			}
-			if (data_index < data_len) {
-				recv_byte();
-				break;
-			}
-
-			if (pmu_state == reading_intr) {
-				pmu_handle_data(interrupt_data, data_index);
-			} else {
-				req = current_req;
-				current_req = req->next;
-				req->reply_len += data_index;
-				pmu_done(req);
-			}
-			pmu_state = idle;
-
-			break;
-
-		default:
-			printk(KERN_ERR "pmu_interrupt: unknown state %d?\n",
-			       pmu_state);
-		}
-	}
-finish:
-	if (pmu_state == idle) {
-		if (adb_int_pending) {
-			pmu_state = intack;
-			send_byte(PMU_INT_ACK);
-			adb_int_pending = 0;
-		} else if (current_req) {
-			pmu_start();
-		}
-	}
-
-#if 0
-	printk("pmu_interrupt: exit state %d acr %02X, b %02X data_index %d/%d adb_int_pending %d\n",
-		pmu_state, (uint) via1[ACR], (uint) via2[B], data_index, data_len, adb_int_pending);
-#endif
-	return IRQ_HANDLED;
-}
-
-static void 
-pmu_done(struct adb_request *req)
-{
-	req->complete = 1;
-	if (req->done)
-		(*req->done)(req);
-}
-
-/* Interrupt data could be the result data from an ADB cmd */
-static void 
-pmu_handle_data(unsigned char *data, int len)
-{
-	static int show_pmu_ints = 1;
-
-	asleep = 0;
-	if (len < 1) {
-		adb_int_pending = 0;
-		return;
-	}
-	if (data[0] & PMU_INT_ADB) {
-		if ((data[0] & PMU_INT_ADB_AUTO) == 0) {
-			struct adb_request *req = req_awaiting_reply;
-			if (req == 0) {
-				printk(KERN_ERR "PMU: extra ADB reply\n");
-				return;
-			}
-			req_awaiting_reply = NULL;
-			if (len <= 2)
-				req->reply_len = 0;
-			else {
-				memcpy(req->reply, data + 1, len - 1);
-				req->reply_len = len - 1;
-			}
-			pmu_done(req);
-		} else {
-			adb_input(data+1, len-1, 1);
-		}
-	} else {
-		if (data[0] == 0x08 && len == 3) {
-			/* sound/brightness buttons pressed */
-			pmu_set_brightness(data[1] >> 3);
-			set_volume(data[2]);
-		} else if (show_pmu_ints
-			   && !(data[0] == PMU_INT_TICK && len == 1)) {
-			int i;
-			printk(KERN_DEBUG "pmu intr");
-			for (i = 0; i < len; ++i)
-				printk(" %.2x", data[i]);
-			printk("\n");
-		}
-	}
-}
-
-static int backlight_level = -1;
-static int backlight_enabled = 0;
-
-#define LEVEL_TO_BRIGHT(lev)	((lev) < 1? 0x7f: 0x4a - ((lev) << 1))
-
-static void 
-pmu_enable_backlight(int on)
-{
-	struct adb_request req;
-
-	if (on) {
-	    /* first call: get current backlight value */
-	    if (backlight_level < 0) {
-		switch(pmu_kind) {
-		    case PMU_68K_V2:
-			pmu_request(&req, NULL, 3, PMU_READ_NVRAM, 0x14, 0xe);
-			while (!req.complete)
-				pmu_poll();
-			printk(KERN_DEBUG "pmu: nvram returned bright: %d\n", (int)req.reply[1]);
-			backlight_level = req.reply[1];
-			break;
-		    default:
-		        backlight_enabled = 0;
-		        return;
-		}
-	    }
-	    pmu_request(&req, NULL, 2, PMU_BACKLIGHT_BRIGHT,
-	    	LEVEL_TO_BRIGHT(backlight_level));
-	    while (!req.complete)
-		pmu_poll();
-	}
-	pmu_request(&req, NULL, 2, PMU_POWER_CTRL,
-	    PMU_POW_BACKLIGHT | (on ? PMU_POW_ON : PMU_POW_OFF));
-	while (!req.complete)
-		pmu_poll();
-	backlight_enabled = on;
-}
-
-static void 
-pmu_set_brightness(int level)
-{
-	int bright;
-
-	backlight_level = level;
-	bright = LEVEL_TO_BRIGHT(level);
-	if (!backlight_enabled)
-		return;
-	if (bright_req_1.complete)
-		pmu_request(&bright_req_1, NULL, 2, PMU_BACKLIGHT_BRIGHT,
-		    bright);
-	if (bright_req_2.complete)
-		pmu_request(&bright_req_2, NULL, 2, PMU_POWER_CTRL,
-		    PMU_POW_BACKLIGHT | (bright < 0x7f ? PMU_POW_ON : PMU_POW_OFF));
-}
-
-void 
-pmu_enable_irled(int on)
-{
-	struct adb_request req;
-
-	pmu_request(&req, NULL, 2, PMU_POWER_CTRL, PMU_POW_IRLED |
-	    (on ? PMU_POW_ON : PMU_POW_OFF));
-	while (!req.complete)
-		pmu_poll();
-}
-
-static void 
-set_volume(int level)
-{
-}
-
-int
-pmu_present(void)
-{
-	return (pmu_kind != PMU_UNKNOWN);
-}
diff --git a/include/uapi/linux/pmu.h b/include/uapi/linux/pmu.h
index e128f609281a..97256f90e6df 100644
--- a/include/uapi/linux/pmu.h
+++ b/include/uapi/linux/pmu.h
@@ -94,7 +94,7 @@ enum {
 	PMU_PADDINGTON_BASED,	/* 1999 PowerBook G3 */
 	PMU_KEYLARGO_BASED,	/* Core99 motherboard (PMU99) */
 	PMU_68K_V1,		/* Unused/deprecated */
-	PMU_68K_V2, 		/* 68K PMU, version 2 */
+	PMU_68K_V2,		/* Unused/deprecated */
 };
 
 /* PMU PMU_POWER_EVENTS commands */
-- 
2.16.4

^ permalink raw reply related

* [PATCH v4 11/11] macintosh/via-pmu: Disambiguate interrupt statistics
From: Finn Thain @ 2018-07-02  8:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Schmitz, linuxppc-dev, linux-m68k, linux-kernel
In-Reply-To: <cover.1530519301.git.fthain@telegraphics.com.au>

Some of the event counters are overloaded which makes it very
difficult to interpret their values.

Counter 0 is supposed to report CB1 interrupts but it can also count
PMU_INT_WAITING_CHARGER events.

Counter 1 is supposed to report GPIO interrupts but it can also count
other events (depending upon the value of the PMU_INT_ADB bit).

Disambiguate these statistics with dedicated counters for GPIO and
CB1 interrupts.

Comments in the MkLinux source code say that the type 0 and type 1
interrupts are model-specific. Label them as "unknown".

This change to the contents of /proc/pmu/interrupts is by necessity
visible in userland. However, packages which interact with the PMU
(that is, pbbuttonsd, pmac-utils and pmud) don't open this file.
AFAIK, user software has no need to poll these counters.

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
The file now looks like this,

  0:          0 (Unknown interrupt (type 0))
  1:          0 (Unknown interrupt (type 1))
  2:          0 (PC-Card eject button)
  3:         23 (Sound/Brightness button)
  4:         74 (ADB message)
  5:          0 (Battery state change)
  6:          0 (Environment interrupt)
  7:        121 (Tick timer)
  8:          0 (Ghost interrupt (zero len))
  9:          1 (Empty interrupt (empty mask))
 10:          2 (Max irqs in a row)
 11:        194 (Total CB1 triggered events)
 12:          0 (Total GPIO1 triggered events)

rather than this,

  0:        194 (Total CB1 triggered events)
  1:          0 (Total GPIO1 triggered events)
  2:          0 (PC-Card eject button)
  3:         23 (Sound/Brightness button)
  4:         74 (ADB message)
  5:          0 (Battery state change)
  6:          0 (Environment interrupt)
  7:        121 (Tick timer)
  8:          0 (Ghost interrupt (zero len))
  9:          1 (Empty interrupt (empty mask))
 10:          2 (Max irqs in a row)

If some parser exists for this file, and if this change is problematic,
we could increment the driver version number in /proc/pmu/info, to
correspond with the format change.
---
 drivers/macintosh/via-pmu.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index 3da5d40309d4..d72c450aebe5 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -172,7 +172,9 @@ static int drop_interrupts;
 static int option_lid_wakeup = 1;
 #endif /* CONFIG_SUSPEND && CONFIG_PPC32 */
 static unsigned long async_req_locks;
-static unsigned int pmu_irq_stats[11];
+
+#define NUM_IRQ_STATS 13
+static unsigned int pmu_irq_stats[NUM_IRQ_STATS];
 
 static struct proc_dir_entry *proc_pmu_root;
 static struct proc_dir_entry *proc_pmu_info;
@@ -873,9 +875,9 @@ static int pmu_info_proc_show(struct seq_file *m, void *v)
 static int pmu_irqstats_proc_show(struct seq_file *m, void *v)
 {
 	int i;
-	static const char *irq_names[] = {
-		"Total CB1 triggered events",
-		"Total GPIO1 triggered events",
+	static const char *irq_names[NUM_IRQ_STATS] = {
+		"Unknown interrupt (type 0)",
+		"Unknown interrupt (type 1)",
 		"PC-Card eject button",
 		"Sound/Brightness button",
 		"ADB message",
@@ -884,10 +886,12 @@ static int pmu_irqstats_proc_show(struct seq_file *m, void *v)
 		"Tick timer",
 		"Ghost interrupt (zero len)",
 		"Empty interrupt (empty mask)",
-		"Max irqs in a row"
+		"Max irqs in a row",
+		"Total CB1 triggered events",
+		"Total GPIO1 triggered events",
         };
 
-	for (i=0; i<11; i++) {
+	for (i = 0; i < NUM_IRQ_STATS; i++) {
 		seq_printf(m, " %2u: %10u (%s)\n",
 			     i, pmu_irq_stats[i], irq_names[i]);
 	}
@@ -1622,7 +1626,7 @@ via_pmu_interrupt(int irq, void *arg)
 		}
 		if (intr & CB1_INT) {
 			adb_int_pending = 1;
-			pmu_irq_stats[0]++;
+			pmu_irq_stats[11]++;
 		}
 		if (intr & SR_INT) {
 			req = pmu_sr_intr();
@@ -1709,7 +1713,7 @@ gpio1_interrupt(int irq, void *arg)
 			disable_irq_nosync(gpio_irq);
 			gpio_irq_enabled = 0;
 		}
-		pmu_irq_stats[1]++;
+		pmu_irq_stats[12]++;
 		adb_int_pending = 1;
 		spin_unlock_irqrestore(&pmu_lock, flags);
 		via_pmu_interrupt(0, NULL);
-- 
2.16.4

^ permalink raw reply related

* [PATCH v4 06/11] macintosh/via-pmu: Add support for m68k PowerBooks
From: Finn Thain @ 2018-07-02  8:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Schmitz, linuxppc-dev, linux-m68k, linux-kernel
In-Reply-To: <cover.1530519301.git.fthain@telegraphics.com.au>

Put #ifdefs around the Open Firmware, xmon, interrupt dispatch,
battery and suspend code. Add the necessary interrupt handling to
support m68k PowerBooks.

The pmu_kind value is available to userspace using the
PMU_IOC_GET_MODEL ioctl. It is not clear yet what hardware classes
are be needed to describe m68k PowerBook models, so pmu_kind is given
the provisional value PMU_UNKNOWN.

To find out about the hardware, user programs can use /proc/bootinfo
or /proc/hardware, or send the PMU_GET_VERSION command using /dev/adb.

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
 drivers/macintosh/Kconfig   |   2 +-
 drivers/macintosh/via-pmu.c | 101 +++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 91 insertions(+), 12 deletions(-)

diff --git a/drivers/macintosh/Kconfig b/drivers/macintosh/Kconfig
index 97a420c11eed..9c6452b38c36 100644
--- a/drivers/macintosh/Kconfig
+++ b/drivers/macintosh/Kconfig
@@ -65,7 +65,7 @@ config ADB_CUDA
 	  If unsure say Y.
 
 config ADB_PMU
-	bool "Support for PMU  based PowerMacs"
+	bool "Support for PMU based PowerMacs and PowerBooks"
 	depends on PPC_PMAC
 	help
 	  On PowerBooks, iBooks, and recent iMacs and Power Macintoshes, the
diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index 2557f3e49f18..a68e7a6f00cc 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
- * Device driver for the via-pmu on Apple Powermacs.
+ * Device driver for the PMU in Apple PowerBooks and PowerMacs.
  *
  * The VIA (versatile interface adapter) interfaces to the PMU,
  * a 6805 microprocessor core whose primary function is to control
@@ -49,20 +49,26 @@
 #include <linux/compat.h>
 #include <linux/of_address.h>
 #include <linux/of_irq.h>
-#include <asm/prom.h>
+#include <linux/uaccess.h>
 #include <asm/machdep.h>
 #include <asm/io.h>
 #include <asm/pgtable.h>
 #include <asm/sections.h>
 #include <asm/irq.h>
+#ifdef CONFIG_PPC_PMAC
 #include <asm/pmac_feature.h>
 #include <asm/pmac_pfunc.h>
 #include <asm/pmac_low_i2c.h>
-#include <linux/uaccess.h>
+#include <asm/prom.h>
 #include <asm/mmu_context.h>
 #include <asm/cputable.h>
 #include <asm/time.h>
 #include <asm/backlight.h>
+#else
+#include <asm/macintosh.h>
+#include <asm/macints.h>
+#include <asm/mac_via.h>
+#endif
 
 #include "via-pmu-event.h"
 
@@ -97,8 +103,13 @@ static DEFINE_MUTEX(pmu_info_proc_mutex);
 #define ANH		(15*RS)		/* A-side data, no handshake */
 
 /* Bits in B data register: both active low */
+#ifdef CONFIG_PPC_PMAC
 #define TACK		0x08		/* Transfer acknowledge (input) */
 #define TREQ		0x10		/* Transfer request (output) */
+#else
+#define TACK		0x02
+#define TREQ		0x04
+#endif
 
 /* Bits in ACR */
 #define SR_CTRL		0x1c		/* Shift register control bits */
@@ -140,13 +151,15 @@ static int data_index;
 static int data_len;
 static volatile int adb_int_pending;
 static volatile int disable_poll;
-static struct device_node *vias;
 static int pmu_kind = PMU_UNKNOWN;
 static int pmu_fully_inited;
 static int pmu_has_adb;
+#ifdef CONFIG_PPC_PMAC
 static volatile unsigned char __iomem *via1;
 static volatile unsigned char __iomem *via2;
+static struct device_node *vias;
 static struct device_node *gpio_node;
+#endif
 static unsigned char __iomem *gpio_reg;
 static int gpio_irq = 0;
 static int gpio_irq_enabled = -1;
@@ -273,6 +286,7 @@ static char *pbook_type[] = {
 
 int __init find_via_pmu(void)
 {
+#ifdef CONFIG_PPC_PMAC
 	u64 taddr;
 	const u32 *reg;
 
@@ -355,9 +369,6 @@ int __init find_via_pmu(void)
 	if (!init_pmu())
 		goto fail_init;
 
-	printk(KERN_INFO "PMU driver v%d initialized for %s, firmware: %02x\n",
-	       PMU_DRIVER_VERSION, pbook_type[pmu_kind], pmu_version);
-	       
 	sys_ctrler = SYS_CTRLER_PMU;
 	
 	return 1;
@@ -373,6 +384,30 @@ int __init find_via_pmu(void)
 	vias = NULL;
 	pmu_state = uninitialized;
 	return 0;
+#else
+	if (macintosh_config->adb_type != MAC_ADB_PB2)
+		return 0;
+
+	pmu_kind = PMU_UNKNOWN;
+
+	spin_lock_init(&pmu_lock);
+
+	pmu_has_adb = 1;
+
+	pmu_intr_mask =	PMU_INT_PCEJECT |
+			PMU_INT_SNDBRT |
+			PMU_INT_ADB |
+			PMU_INT_TICK;
+
+	pmu_state = idle;
+
+	if (!init_pmu()) {
+		pmu_state = uninitialized;
+		return 0;
+	}
+
+	return 1;
+#endif /* !CONFIG_PPC_PMAC */
 }
 
 #ifdef CONFIG_ADB
@@ -396,13 +431,14 @@ static int pmu_init(void)
  */
 static int __init via_pmu_start(void)
 {
-	unsigned int irq;
+	unsigned int __maybe_unused irq;
 
 	if (pmu_state == uninitialized)
 		return -ENODEV;
 
 	batt_req.complete = 1;
 
+#ifdef CONFIG_PPC_PMAC
 	irq = irq_of_parse_and_map(vias, 0);
 	if (!irq) {
 		printk(KERN_ERR "via-pmu: can't map interrupt\n");
@@ -439,6 +475,19 @@ static int __init via_pmu_start(void)
 
 	/* Enable interrupts */
 	out_8(&via1[IER], IER_SET | SR_INT | CB1_INT);
+#else
+	if (request_irq(IRQ_MAC_ADB_SR, via_pmu_interrupt, IRQF_NO_SUSPEND,
+			"VIA-PMU-SR", NULL)) {
+		pr_err("%s: couldn't get SR irq\n", __func__);
+		return -ENODEV;
+	}
+	if (request_irq(IRQ_MAC_ADB_CL, via_pmu_interrupt, IRQF_NO_SUSPEND,
+			"VIA-PMU-CL", NULL)) {
+		pr_err("%s: couldn't get CL irq\n", __func__);
+		free_irq(IRQ_MAC_ADB_SR, NULL);
+		return -ENODEV;
+	}
+#endif /* !CONFIG_PPC_PMAC */
 
 	pmu_fully_inited = 1;
 
@@ -589,6 +638,10 @@ init_pmu(void)
 			       option_server_mode ? "enabled" : "disabled");
 		}
 	}
+
+	printk(KERN_INFO "PMU driver v%d initialized for %s, firmware: %02x\n",
+	       PMU_DRIVER_VERSION, pbook_type[pmu_kind], pmu_version);
+
 	return 1;
 }
 
@@ -627,6 +680,7 @@ static void pmu_set_server_mode(int server_mode)
 static void
 done_battery_state_ohare(struct adb_request* req)
 {
+#ifdef CONFIG_PPC_PMAC
 	/* format:
 	 *  [0]    :  flags
 	 *    0x01 :  AC indicator
@@ -708,6 +762,7 @@ done_battery_state_ohare(struct adb_request* req)
 	pmu_batteries[pmu_cur_battery].amperage = amperage;
 	pmu_batteries[pmu_cur_battery].voltage = voltage;
 	pmu_batteries[pmu_cur_battery].time_remaining = time;
+#endif /* CONFIG_PPC_PMAC */
 
 	clear_bit(0, &async_req_locks);
 }
@@ -1356,6 +1411,7 @@ pmu_handle_data(unsigned char *data, int len)
 			}
 			pmu_done(req);
 		} else {
+#ifdef CONFIG_XMON
 			if (len == 4 && data[1] == 0x2c) {
 				extern int xmon_wants_key, xmon_adb_keycode;
 				if (xmon_wants_key) {
@@ -1363,6 +1419,7 @@ pmu_handle_data(unsigned char *data, int len)
 					return;
 				}
 			}
+#endif /* CONFIG_XMON */
 #ifdef CONFIG_ADB
 			/*
 			 * XXX On the [23]400 the PMU gives us an up
@@ -1530,7 +1587,25 @@ via_pmu_interrupt(int irq, void *arg)
 	++disable_poll;
 	
 	for (;;) {
-		intr = in_8(&via1[IFR]) & (SR_INT | CB1_INT);
+		/* On 68k Macs, VIA interrupts are dispatched individually.
+		 * Unless we are polling, the relevant IRQ flag has already
+		 * been cleared.
+		 */
+		intr = 0;
+		if (IS_ENABLED(CONFIG_PPC_PMAC) || !irq) {
+			intr = in_8(&via1[IFR]) & (SR_INT | CB1_INT);
+			out_8(&via1[IFR], intr);
+		}
+#ifndef CONFIG_PPC_PMAC
+		switch (irq) {
+		case IRQ_MAC_ADB_CL:
+			intr = CB1_INT;
+			break;
+		case IRQ_MAC_ADB_SR:
+			intr = SR_INT;
+			break;
+		}
+#endif
 		if (intr == 0)
 			break;
 		handled = 1;
@@ -1540,7 +1615,6 @@ via_pmu_interrupt(int irq, void *arg)
 			       intr, in_8(&via1[IER]), pmu_state);
 			break;
 		}
-		out_8(&via1[IFR], intr);
 		if (intr & CB1_INT) {
 			adb_int_pending = 1;
 			pmu_irq_stats[0]++;
@@ -1550,6 +1624,9 @@ via_pmu_interrupt(int irq, void *arg)
 			if (req)
 				break;
 		}
+#ifndef CONFIG_PPC_PMAC
+		break;
+#endif
 	}
 
 recheck:
@@ -1616,7 +1693,7 @@ pmu_unlock(void)
 }
 
 
-static irqreturn_t
+static __maybe_unused irqreturn_t
 gpio1_interrupt(int irq, void *arg)
 {
 	unsigned long flags;
@@ -2250,6 +2327,7 @@ static int pmu_ioctl(struct file *filp,
 	int error = -EINVAL;
 
 	switch (cmd) {
+#ifdef CONFIG_PPC_PMAC
 	case PMU_IOC_SLEEP:
 		if (!capable(CAP_SYS_ADMIN))
 			return -EACCES;
@@ -2259,6 +2337,7 @@ static int pmu_ioctl(struct file *filp,
 			return put_user(0, argp);
 		else
 			return put_user(1, argp);
+#endif
 
 #ifdef CONFIG_PMAC_BACKLIGHT_LEGACY
 	/* Compatibility ioctl's for backlight */
-- 
2.16.4

^ permalink raw reply related

* [PATCH v4 08/11] macintosh/via-pmu68k: Don't load driver on unsupported hardware
From: Finn Thain @ 2018-07-02  8:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Schmitz, linuxppc-dev, linux-m68k, linux-kernel,
	Geert Uytterhoeven
In-Reply-To: <cover.1530519301.git.fthain@telegraphics.com.au>

Don't load the via-pmu68k driver on early PowerBooks. The M50753 PMU
device found in those models was never supported by this driver.
Attempting to load the driver usually causes a boot hang.

Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Reviewed-by: Michael Schmitz <schmitzmic@gmail.com>
---
 arch/m68k/mac/misc.c           | 6 ++----
 drivers/macintosh/via-pmu68k.c | 4 ----
 include/uapi/linux/pmu.h       | 2 +-
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/m68k/mac/misc.c b/arch/m68k/mac/misc.c
index c68054361615..7ccb799eeb57 100644
--- a/arch/m68k/mac/misc.c
+++ b/arch/m68k/mac/misc.c
@@ -478,8 +478,7 @@ void mac_poweroff(void)
 		cuda_shutdown();
 #endif
 #ifdef CONFIG_ADB_PMU68K
-	} else if (macintosh_config->adb_type == MAC_ADB_PB1
-		|| macintosh_config->adb_type == MAC_ADB_PB2) {
+	} else if (macintosh_config->adb_type == MAC_ADB_PB2) {
 		pmu_shutdown();
 #endif
 	}
@@ -520,8 +519,7 @@ void mac_reset(void)
 		cuda_restart();
 #endif
 #ifdef CONFIG_ADB_PMU68K
-	} else if (macintosh_config->adb_type == MAC_ADB_PB1
-		|| macintosh_config->adb_type == MAC_ADB_PB2) {
+	} else if (macintosh_config->adb_type == MAC_ADB_PB2) {
 		pmu_restart();
 #endif
 	} else if (CPU_IS_030) {
diff --git a/drivers/macintosh/via-pmu68k.c b/drivers/macintosh/via-pmu68k.c
index d545ed45e482..bec8e1837d7d 100644
--- a/drivers/macintosh/via-pmu68k.c
+++ b/drivers/macintosh/via-pmu68k.c
@@ -175,9 +175,6 @@ static s8 pmu_data_len[256][2] = {
 int __init find_via_pmu(void)
 {
 	switch (macintosh_config->adb_type) {
-	case MAC_ADB_PB1:
-		pmu_kind = PMU_68K_V1;
-		break;
 	case MAC_ADB_PB2:
 		pmu_kind = PMU_68K_V2;
 		break;
@@ -785,7 +782,6 @@ pmu_enable_backlight(int on)
 	    /* first call: get current backlight value */
 	    if (backlight_level < 0) {
 		switch(pmu_kind) {
-		    case PMU_68K_V1:
 		    case PMU_68K_V2:
 			pmu_request(&req, NULL, 3, PMU_READ_NVRAM, 0x14, 0xe);
 			while (!req.complete)
diff --git a/include/uapi/linux/pmu.h b/include/uapi/linux/pmu.h
index 89cb1acea93a..e128f609281a 100644
--- a/include/uapi/linux/pmu.h
+++ b/include/uapi/linux/pmu.h
@@ -93,7 +93,7 @@ enum {
 	PMU_HEATHROW_BASED,	/* PowerBook G3 series */
 	PMU_PADDINGTON_BASED,	/* 1999 PowerBook G3 */
 	PMU_KEYLARGO_BASED,	/* Core99 motherboard (PMU99) */
-	PMU_68K_V1,		/* 68K PMU, version 1 */
+	PMU_68K_V1,		/* Unused/deprecated */
 	PMU_68K_V2, 		/* 68K PMU, version 2 */
 };
 
-- 
2.16.4

^ permalink raw reply related

* [PATCH v4 10/11] macintosh/via-pmu: Clean up interrupt statistics
From: Finn Thain @ 2018-07-02  8:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Schmitz, linuxppc-dev, linux-m68k, linux-kernel
In-Reply-To: <cover.1530519301.git.fthain@telegraphics.com.au>

Replace an open-coded ffs() with the function call.
Simplify an if-else cascade using a switch statement.
Correct a typo and an indentation issue.

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
---
 drivers/macintosh/via-pmu.c | 39 ++++++++++++++++++++++-----------------
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index a68e7a6f00cc..3da5d40309d4 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -1355,7 +1355,8 @@ pmu_resume(void)
 static void
 pmu_handle_data(unsigned char *data, int len)
 {
-	unsigned char ints, pirq;
+	unsigned char ints;
+	int idx;
 	int i = 0;
 
 	asleep = 0;
@@ -1377,25 +1378,24 @@ pmu_handle_data(unsigned char *data, int len)
 		ints &= ~(PMU_INT_ADB_AUTO | PMU_INT_AUTO_SRQ_POLL);
 
 next:
-
 	if (ints == 0) {
 		if (i > pmu_irq_stats[10])
 			pmu_irq_stats[10] = i;
 		return;
 	}
-
-	for (pirq = 0; pirq < 8; pirq++)
-		if (ints & (1 << pirq))
-			break;
-	pmu_irq_stats[pirq]++;
 	i++;
-	ints &= ~(1 << pirq);
+
+	idx = ffs(ints) - 1;
+	ints &= ~BIT(idx);
+
+	pmu_irq_stats[idx]++;
 
 	/* Note: for some reason, we get an interrupt with len=1,
 	 * data[0]==0 after each normal ADB interrupt, at least
 	 * on the Pismo. Still investigating...  --BenH
 	 */
-	if ((1 << pirq) & PMU_INT_ADB) {
+	switch (BIT(idx)) {
+	case PMU_INT_ADB:
 		if ((data[0] & PMU_INT_ADB_AUTO) == 0) {
 			struct adb_request *req = req_awaiting_reply;
 			if (!req) {
@@ -1433,25 +1433,28 @@ pmu_handle_data(unsigned char *data, int len)
 				adb_input(data+1, len-1, 1);
 #endif /* CONFIG_ADB */		
 		}
-	}
+		break;
+
 	/* Sound/brightness button pressed */
-	else if ((1 << pirq) & PMU_INT_SNDBRT) {
+	case PMU_INT_SNDBRT:
 #ifdef CONFIG_PMAC_BACKLIGHT
 		if (len == 3)
 			pmac_backlight_set_legacy_brightness_pmu(data[1] >> 4);
 #endif
-	}
+		break;
+
 	/* Tick interrupt */
-	else if ((1 << pirq) & PMU_INT_TICK) {
-		/* Environement or tick interrupt, query batteries */
+	case PMU_INT_TICK:
+		/* Environment or tick interrupt, query batteries */
 		if (pmu_battery_count) {
 			if ((--query_batt_timer) == 0) {
 				query_battery_state();
 				query_batt_timer = BATTERY_POLLING_COUNT;
 			}
 		}
-        }
-	else if ((1 << pirq) & PMU_INT_ENVIRONMENT) {
+		break;
+
+	case PMU_INT_ENVIRONMENT:
 		if (pmu_battery_count)
 			query_battery_state();
 		pmu_pass_intr(data, len);
@@ -1461,7 +1464,9 @@ pmu_handle_data(unsigned char *data, int len)
 			via_pmu_event(PMU_EVT_POWER, !!(data[1]&8));
 			via_pmu_event(PMU_EVT_LID, data[1]&1);
 		}
-	} else {
+		break;
+
+	default:
 	       pmu_pass_intr(data, len);
 	}
 	goto next;
-- 
2.16.4

^ permalink raw reply related

* [PATCH v4 07/11] macintosh/via-pmu: Explicitly specify CONFIG_PPC_PMAC dependencies
From: Finn Thain @ 2018-07-02  8:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Schmitz, linuxppc-dev, linux-m68k, linux-kernel
In-Reply-To: <cover.1530519301.git.fthain@telegraphics.com.au>

At present, CONFIG_ADB_PMU depends on CONFIG_PPC_PMAC. When this gets
relaxed to CONFIG_PPC_PMAC || CONFIG_MAC, those Kconfig symbols with
implicit deps on PPC_PMAC will need explicit deps. Add them now.
No functional change.

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
 drivers/macintosh/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/macintosh/Kconfig b/drivers/macintosh/Kconfig
index 9c6452b38c36..26abae4c899d 100644
--- a/drivers/macintosh/Kconfig
+++ b/drivers/macintosh/Kconfig
@@ -79,7 +79,7 @@ config ADB_PMU
 
 config ADB_PMU_LED
 	bool "Support for the Power/iBook front LED"
-	depends on ADB_PMU
+	depends on PPC_PMAC && ADB_PMU
 	select NEW_LEDS
 	select LEDS_CLASS
 	help
@@ -122,7 +122,7 @@ config PMAC_MEDIABAY
 
 config PMAC_BACKLIGHT
 	bool "Backlight control for LCD screens"
-	depends on ADB_PMU && FB = y && (BROKEN || !PPC64)
+	depends on PPC_PMAC && ADB_PMU && FB = y && (BROKEN || !PPC64)
 	select FB_BACKLIGHT
 	help
 	  Say Y here to enable Macintosh specific extensions of the generic
-- 
2.16.4

^ permalink raw reply related

* [PATCH v4 05/11] macintosh/via-pmu: Replace via pointer with via1 and via2 pointers
From: Finn Thain @ 2018-07-02  8:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael Schmitz, linuxppc-dev, linux-m68k, linux-kernel
In-Reply-To: <cover.1530519301.git.fthain@telegraphics.com.au>

On most PowerPC Macs, the PMU driver uses the shift register and
IO port B from a single VIA chip.

On 68k and early PowerPC PowerBooks, the driver uses the shift register
from one VIA chip together with IO port B from another.

Replace via with via1 and via2 to accommodate this. For the
CONFIG_PPC_PMAC case, set via1 = via2 so there is no change.

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
 drivers/macintosh/via-pmu.c | 142 +++++++++++++++++++++-----------------------
 1 file changed, 69 insertions(+), 73 deletions(-)

diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c
index 6a6f1666712e..2557f3e49f18 100644
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -76,7 +76,6 @@
 #define BATTERY_POLLING_COUNT	2
 
 static DEFINE_MUTEX(pmu_info_proc_mutex);
-static volatile unsigned char __iomem *via;
 
 /* VIA registers - spaced 0x200 bytes apart */
 #define RS		0x200		/* skip between registers */
@@ -145,6 +144,8 @@ static struct device_node *vias;
 static int pmu_kind = PMU_UNKNOWN;
 static int pmu_fully_inited;
 static int pmu_has_adb;
+static volatile unsigned char __iomem *via1;
+static volatile unsigned char __iomem *via2;
 static struct device_node *gpio_node;
 static unsigned char __iomem *gpio_reg;
 static int gpio_irq = 0;
@@ -340,14 +341,14 @@ int __init find_via_pmu(void)
 	} else
 		pmu_kind = PMU_UNKNOWN;
 
-	via = ioremap(taddr, 0x2000);
-	if (via == NULL) {
+	via1 = via2 = ioremap(taddr, 0x2000);
+	if (via1 == NULL) {
 		printk(KERN_ERR "via-pmu: Can't map address !\n");
 		goto fail_via_remap;
 	}
 	
-	out_8(&via[IER], IER_CLR | 0x7f);	/* disable all intrs */
-	out_8(&via[IFR], 0x7f);			/* clear IFR */
+	out_8(&via1[IER], IER_CLR | 0x7f);	/* disable all intrs */
+	out_8(&via1[IFR], 0x7f);			/* clear IFR */
 
 	pmu_state = idle;
 
@@ -362,8 +363,8 @@ int __init find_via_pmu(void)
 	return 1;
 
  fail_init:
-	iounmap(via);
-	via = NULL;
+	iounmap(via1);
+	via1 = via2 = NULL;
  fail_via_remap:
 	iounmap(gpio_reg);
 	gpio_reg = NULL;
@@ -437,7 +438,7 @@ static int __init via_pmu_start(void)
 	}
 
 	/* Enable interrupts */
-	out_8(&via[IER], IER_SET | SR_INT | CB1_INT);
+	out_8(&via1[IER], IER_SET | SR_INT | CB1_INT);
 
 	pmu_fully_inited = 1;
 
@@ -535,8 +536,8 @@ init_pmu(void)
 	struct adb_request req;
 
 	/* Negate TREQ. Set TACK to input and TREQ to output. */
-	out_8(&via[B], in_8(&via[B]) | TREQ);
-	out_8(&via[DIRB], (in_8(&via[DIRB]) | TREQ) & ~TACK);
+	out_8(&via2[B], in_8(&via2[B]) | TREQ);
+	out_8(&via2[DIRB], (in_8(&via2[DIRB]) | TREQ) & ~TACK);
 
 	pmu_request(&req, NULL, 2, PMU_SET_INTR_MASK, pmu_intr_mask);
 	timeout =  100000;
@@ -1137,7 +1138,7 @@ wait_for_ack(void)
 	 * reported
 	 */
 	int timeout = 4000;
-	while ((in_8(&via[B]) & TACK) == 0) {
+	while ((in_8(&via2[B]) & TACK) == 0) {
 		if (--timeout < 0) {
 			printk(KERN_ERR "PMU not responding (!ack)\n");
 			return;
@@ -1151,23 +1152,19 @@ wait_for_ack(void)
 static inline void
 send_byte(int x)
 {
-	volatile unsigned char __iomem *v = via;
-
-	out_8(&v[ACR], in_8(&v[ACR]) | SR_OUT | SR_EXT);
-	out_8(&v[SR], x);
-	out_8(&v[B], in_8(&v[B]) & ~TREQ);		/* assert TREQ */
-	(void)in_8(&v[B]);
+	out_8(&via1[ACR], in_8(&via1[ACR]) | SR_OUT | SR_EXT);
+	out_8(&via1[SR], x);
+	out_8(&via2[B], in_8(&via2[B]) & ~TREQ);	/* assert TREQ */
+	(void)in_8(&via2[B]);
 }
 
 static inline void
 recv_byte(void)
 {
-	volatile unsigned char __iomem *v = via;
-
-	out_8(&v[ACR], (in_8(&v[ACR]) & ~SR_OUT) | SR_EXT);
-	in_8(&v[SR]);		/* resets SR */
-	out_8(&v[B], in_8(&v[B]) & ~TREQ);
-	(void)in_8(&v[B]);
+	out_8(&via1[ACR], (in_8(&via1[ACR]) & ~SR_OUT) | SR_EXT);
+	in_8(&via1[SR]);		/* resets SR */
+	out_8(&via2[B], in_8(&via2[B]) & ~TREQ);
+	(void)in_8(&via2[B]);
 }
 
 static inline void
@@ -1270,7 +1267,7 @@ pmu_suspend(void)
 		if (!adb_int_pending && pmu_state == idle && !req_awaiting_reply) {
 			if (gpio_irq >= 0)
 				disable_irq_nosync(gpio_irq);
-			out_8(&via[IER], CB1_INT | IER_CLR);
+			out_8(&via1[IER], CB1_INT | IER_CLR);
 			spin_unlock_irqrestore(&pmu_lock, flags);
 			break;
 		}
@@ -1294,7 +1291,7 @@ pmu_resume(void)
 	adb_int_pending = 1;
 	if (gpio_irq >= 0)
 		enable_irq(gpio_irq);
-	out_8(&via[IER], CB1_INT | IER_SET);
+	out_8(&via1[IER], CB1_INT | IER_SET);
 	spin_unlock_irqrestore(&pmu_lock, flags);
 	pmu_poll();
 }
@@ -1419,20 +1416,20 @@ pmu_sr_intr(void)
 	struct adb_request *req;
 	int bite = 0;
 
-	if (in_8(&via[B]) & TREQ) {
-		printk(KERN_ERR "PMU: spurious SR intr (%x)\n", in_8(&via[B]));
+	if (in_8(&via2[B]) & TREQ) {
+		printk(KERN_ERR "PMU: spurious SR intr (%x)\n", in_8(&via2[B]));
 		return NULL;
 	}
 	/* The ack may not yet be low when we get the interrupt */
-	while ((in_8(&via[B]) & TACK) != 0)
+	while ((in_8(&via2[B]) & TACK) != 0)
 			;
 
 	/* if reading grab the byte, and reset the interrupt */
 	if (pmu_state == reading || pmu_state == reading_intr)
-		bite = in_8(&via[SR]);
+		bite = in_8(&via1[SR]);
 
 	/* reset TREQ and wait for TACK to go high */
-	out_8(&via[B], in_8(&via[B]) | TREQ);
+	out_8(&via2[B], in_8(&via2[B]) | TREQ);
 	wait_for_ack();
 
 	switch (pmu_state) {
@@ -1533,17 +1530,17 @@ via_pmu_interrupt(int irq, void *arg)
 	++disable_poll;
 	
 	for (;;) {
-		intr = in_8(&via[IFR]) & (SR_INT | CB1_INT);
+		intr = in_8(&via1[IFR]) & (SR_INT | CB1_INT);
 		if (intr == 0)
 			break;
 		handled = 1;
 		if (++nloop > 1000) {
 			printk(KERN_DEBUG "PMU: stuck in intr loop, "
 			       "intr=%x, ier=%x pmu_state=%d\n",
-			       intr, in_8(&via[IER]), pmu_state);
+			       intr, in_8(&via1[IER]), pmu_state);
 			break;
 		}
-		out_8(&via[IFR], intr);
+		out_8(&via1[IFR], intr);
 		if (intr & CB1_INT) {
 			adb_int_pending = 1;
 			pmu_irq_stats[0]++;
@@ -1725,29 +1722,29 @@ static u32 save_via[8];
 static void
 save_via_state(void)
 {
-	save_via[0] = in_8(&via[ANH]);
-	save_via[1] = in_8(&via[DIRA]);
-	save_via[2] = in_8(&via[B]);
-	save_via[3] = in_8(&via[DIRB]);
-	save_via[4] = in_8(&via[PCR]);
-	save_via[5] = in_8(&via[ACR]);
-	save_via[6] = in_8(&via[T1CL]);
-	save_via[7] = in_8(&via[T1CH]);
+	save_via[0] = in_8(&via1[ANH]);
+	save_via[1] = in_8(&via1[DIRA]);
+	save_via[2] = in_8(&via1[B]);
+	save_via[3] = in_8(&via1[DIRB]);
+	save_via[4] = in_8(&via1[PCR]);
+	save_via[5] = in_8(&via1[ACR]);
+	save_via[6] = in_8(&via1[T1CL]);
+	save_via[7] = in_8(&via1[T1CH]);
 }
 static void
 restore_via_state(void)
 {
-	out_8(&via[ANH], save_via[0]);
-	out_8(&via[DIRA], save_via[1]);
-	out_8(&via[B], save_via[2]);
-	out_8(&via[DIRB], save_via[3]);
-	out_8(&via[PCR], save_via[4]);
-	out_8(&via[ACR], save_via[5]);
-	out_8(&via[T1CL], save_via[6]);
-	out_8(&via[T1CH], save_via[7]);
-	out_8(&via[IER], IER_CLR | 0x7f);	/* disable all intrs */
-	out_8(&via[IFR], 0x7f);				/* clear IFR */
-	out_8(&via[IER], IER_SET | SR_INT | CB1_INT);
+	out_8(&via1[ANH],  save_via[0]);
+	out_8(&via1[DIRA], save_via[1]);
+	out_8(&via1[B],    save_via[2]);
+	out_8(&via1[DIRB], save_via[3]);
+	out_8(&via1[PCR],  save_via[4]);
+	out_8(&via1[ACR],  save_via[5]);
+	out_8(&via1[T1CL], save_via[6]);
+	out_8(&via1[T1CH], save_via[7]);
+	out_8(&via1[IER], IER_CLR | 0x7f);	/* disable all intrs */
+	out_8(&via1[IFR], 0x7f);			/* clear IFR */
+	out_8(&via1[IER], IER_SET | SR_INT | CB1_INT);
 }
 
 #define	GRACKLE_PM	(1<<7)
@@ -2389,33 +2386,33 @@ device_initcall(pmu_device_init);
 
 #ifdef DEBUG_SLEEP
 static inline void 
-polled_handshake(volatile unsigned char __iomem *via)
+polled_handshake(void)
 {
-	via[B] &= ~TREQ; eieio();
-	while ((via[B] & TACK) != 0)
+	via2[B] &= ~TREQ; eieio();
+	while ((via2[B] & TACK) != 0)
 		;
-	via[B] |= TREQ; eieio();
-	while ((via[B] & TACK) == 0)
+	via2[B] |= TREQ; eieio();
+	while ((via2[B] & TACK) == 0)
 		;
 }
 
 static inline void 
-polled_send_byte(volatile unsigned char __iomem *via, int x)
+polled_send_byte(int x)
 {
-	via[ACR] |= SR_OUT | SR_EXT; eieio();
-	via[SR] = x; eieio();
-	polled_handshake(via);
+	via1[ACR] |= SR_OUT | SR_EXT; eieio();
+	via1[SR] = x; eieio();
+	polled_handshake();
 }
 
 static inline int
-polled_recv_byte(volatile unsigned char __iomem *via)
+polled_recv_byte(void)
 {
 	int x;
 
-	via[ACR] = (via[ACR] & ~SR_OUT) | SR_EXT; eieio();
-	x = via[SR]; eieio();
-	polled_handshake(via);
-	x = via[SR]; eieio();
+	via1[ACR] = (via1[ACR] & ~SR_OUT) | SR_EXT; eieio();
+	x = via1[SR]; eieio();
+	polled_handshake();
+	x = via1[SR]; eieio();
 	return x;
 }
 
@@ -2424,7 +2421,6 @@ pmu_polled_request(struct adb_request *req)
 {
 	unsigned long flags;
 	int i, l, c;
-	volatile unsigned char __iomem *v = via;
 
 	req->complete = 1;
 	c = req->data[0];
@@ -2436,21 +2432,21 @@ pmu_polled_request(struct adb_request *req)
 	while (pmu_state != idle)
 		pmu_poll();
 
-	while ((via[B] & TACK) == 0)
+	while ((via2[B] & TACK) == 0)
 		;
-	polled_send_byte(v, c);
+	polled_send_byte(c);
 	if (l < 0) {
 		l = req->nbytes - 1;
-		polled_send_byte(v, l);
+		polled_send_byte(l);
 	}
 	for (i = 1; i <= l; ++i)
-		polled_send_byte(v, req->data[i]);
+		polled_send_byte(req->data[i]);
 
 	l = pmu_data_len[c][1];
 	if (l < 0)
-		l = polled_recv_byte(v);
+		l = polled_recv_byte();
 	for (i = 0; i < l; ++i)
-		req->reply[i + req->reply_len] = polled_recv_byte(v);
+		req->reply[i + req->reply_len] = polled_recv_byte();
 
 	if (req->done)
 		(*req->done)(req);
-- 
2.16.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox