LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [RFC PATCH 17/23] watchdog/hardlockup/hpet: Convert the timer's interrupt to NMI
From: Ricardo Neri @ 2018-06-15  2:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, H. Peter Anvin, Andi Kleen, Ashok Raj,
	Borislav Petkov, Tony Luck, Ravi V. Shankar, x86, sparclinux,
	linuxppc-dev, linux-kernel, Jacob Pan, Rafael J. Wysocki,
	Don Zickus, Nicholas Piggin, Michael Ellerman,
	Frederic Weisbecker, Alexei Starovoitov, Babu Moger,
	Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Byungchul Park, Paul E. McKenney, Luis R. Rodriguez, Waiman Long,
	Josh Poimboeuf, Randy Dunlap, Davidlohr Bueso, Christoffer Dall,
	Marc Zyngier, Kai-Heng Feng, Konrad Rzeszutek Wilk,
	David Rientjes, iommu
In-Reply-To: <alpine.DEB.2.21.1806131128570.2280@nanos.tec.linutronix.de>

On Wed, Jun 13, 2018 at 11:40:00AM +0200, Thomas Gleixner wrote:
> On Tue, 12 Jun 2018, Ricardo Neri wrote:
> > @@ -183,6 +184,8 @@ static irqreturn_t hardlockup_detector_irq_handler(int irq, void *data)
> >  	if (!(hdata->flags & HPET_DEV_PERI_CAP))
> >  		kick_timer(hdata);
> >  
> > +	pr_err("This interrupt should not have happened. Ensure delivery mode is NMI.\n");
> 
> Eeew.

If you don't mind me asking. What is the problem with this error message?
> 
> >  /**
> > + * hardlockup_detector_nmi_handler() - NMI Interrupt handler
> > + * @val:	Attribute associated with the NMI. Not used.
> > + * @regs:	Register values as seen when the NMI was asserted
> > + *
> > + * When an NMI is issued, look for hardlockups. If the timer is not periodic,
> > + * kick it. The interrupt is always handled when if delivered via the
> > + * Front-Side Bus.
> > + *
> > + * Returns:
> > + *
> > + * NMI_DONE if the HPET timer did not cause the interrupt. NMI_HANDLED
> > + * otherwise.
> > + */
> > +static int hardlockup_detector_nmi_handler(unsigned int val,
> > +					   struct pt_regs *regs)
> > +{
> > +	struct hpet_hld_data *hdata = hld_data;
> > +	unsigned int use_fsb;
> > +
> > +	/*
> > +	 * If FSB delivery mode is used, the timer interrupt is programmed as
> > +	 * edge-triggered and there is no need to check the ISR register.
> > +	 */
> > +	use_fsb = hdata->flags & HPET_DEV_FSB_CAP;
> > +
> > +	if (!use_fsb && !is_hpet_wdt_interrupt(hdata))
> > +		return NMI_DONE;
> 
> So for 'use_fsb == True' every single NMI will fall through into the
> watchdog code below.
> 
> > +	inspect_for_hardlockups(regs);
> > +
> > +	if (!(hdata->flags & HPET_DEV_PERI_CAP))
> > +		kick_timer(hdata);
> 
> And in case that the HPET does not support periodic mode this reprogramms
> the timer on every NMI which means that while perf is running the watchdog
> will never ever detect anything.

Yes. I see that this is wrong. With MSI interrupts, as far as I can
see, there is not a way to make sure that the HPET timer caused the NMI
perhaps the only option is to use an IO APIC interrupt and read the
interrupt status register.

> 
> Aside of that, reading TWO HPET registers for every NMI is insane. HPET
> access is horribly slow, so any high frequency perf monitoring will take a
> massive performance hit.

If an IO APIC interrupt is used, only HPET register (the status register)
would need to be read for every NMI. Would that be more acceptable? Otherwise,
there is no way to determine if the HPET cause the NMI.

Alternatively, there could be a counter that skips reading the HPET status
register (and the detection of hardlockups) for every X NMIs. This would
reduce the overall frequency of HPET register reads.

Is that more acceptable?

Thanks and BR,
Ricardo

^ permalink raw reply

* Re: [RFC PATCH 17/23] watchdog/hardlockup/hpet: Convert the timer's interrupt to NMI
From: Ricardo Neri @ 2018-06-15  2:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Ashok Raj, Borislav Petkov, Tony Luck, Ravi V. Shankar, x86,
	sparclinux, linuxppc-dev, linux-kernel, Jacob Pan,
	Rafael J. Wysocki, Don Zickus, Nicholas Piggin, Michael Ellerman,
	Frederic Weisbecker, Alexei Starovoitov, Babu Moger,
	Mathieu Desnoyers, Masami Hiramatsu, Andrew Morton,
	Philippe Ombredanne, Colin Ian King, Byungchul Park,
	Paul E. McKenney, Luis R. Rodriguez, Waiman Long, Josh Poimboeuf,
	Randy Dunlap, Davidlohr Bueso, Christoffer Dall, Marc Zyngier,
	Kai-Heng Feng, Konrad Rzeszutek Wilk, David Rientjes, iommu
In-Reply-To: <20180613090720.GV12258@hirez.programming.kicks-ass.net>

On Wed, Jun 13, 2018 at 11:07:20AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 12, 2018 at 05:57:37PM -0700, Ricardo Neri wrote:
> 
> +static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata)
> +{
> +       unsigned long this_isr;
> +       unsigned int lvl_trig;
> +
> +       this_isr = hpet_readl(HPET_STATUS) & BIT(hdata->num);
> +
> +       lvl_trig = hpet_readl(HPET_Tn_CFG(hdata->num)) & HPET_TN_LEVEL;
> +
> +       if (lvl_trig && this_isr)
> +               return true;
> +
> +       return false;
> +}
> 
> > +static int hardlockup_detector_nmi_handler(unsigned int val,
> > +					   struct pt_regs *regs)
> > +{
> > +	struct hpet_hld_data *hdata = hld_data;
> > +	unsigned int use_fsb;
> > +
> > +	/*
> > +	 * If FSB delivery mode is used, the timer interrupt is programmed as
> > +	 * edge-triggered and there is no need to check the ISR register.
> > +	 */
> > +	use_fsb = hdata->flags & HPET_DEV_FSB_CAP;
> 
> Please do explain.. That FSB thing basically means MSI. But there's only
> a single NMI vector. How do we know this NMI came from the HPET?

Indeed, I see now that this is wrong. There is no way to know. The only way
is to use an IO APIC interrupt and read the HPET status register.

> 
> > +
> > +	if (!use_fsb && !is_hpet_wdt_interrupt(hdata))
> 
> So you add _2_ HPET reads for every single NMI that gets triggered...
> and IIRC HPET reads are _sllooooowwwwww_.


Since the trigger mode of the HPET timer is not expected to change, 
perhaps is_hpet_wdt_interrupt() can only need the interrupt status
register. This would reduce the reads to one. Furthermore, the hardlockup
detector can skip an X number of NMIs and reduce further the frequency
of reads. Does this make sense?

> 
> > +		return NMI_DONE;
> > +
> > +	inspect_for_hardlockups(regs);
> > +
> > +	if (!(hdata->flags & HPET_DEV_PERI_CAP))
> > +		kick_timer(hdata);
> > +
> > +	/* Acknowledge interrupt if in level-triggered mode */
> > +	if (!use_fsb)
> > +		hpet_writel(BIT(hdata->num), HPET_STATUS);
> > +
> > +	return NMI_HANDLED;
> 
> So if I read this right, when in FSB/MSI mode, we'll basically _always_
> claim every single NMI as handled?
> 
> That's broken.

Yes, this is not correct. I will drop the functionality to use
FSB/MSI mode.

Thanks and BR,
Ricardo

^ permalink raw reply

* Re: [RFC PATCH 03/23] genirq: Introduce IRQF_DELIVER_AS_NMI
From: Ricardo Neri @ 2018-06-15  2:12 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Thomas Gleixner, Julien Thierry, Peter Zijlstra, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Ashok Raj, Borislav Petkov, Tony Luck,
	Ravi V. Shankar, x86, sparclinux, linuxppc-dev, linux-kernel,
	Jacob Pan, Daniel Lezcano, Andrew Morton,
	Levin, Alexander (Sasha Levin), Randy Dunlap, Masami Hiramatsu,
	Bartosz Golaszewski, Doug Berger, Palmer Dabbelt, iommu
In-Reply-To: <c000699c-7859-74ec-7944-ec484e887f60@arm.com>

On Wed, Jun 13, 2018 at 11:06:25AM +0100, Marc Zyngier wrote:
> On 13/06/18 10:20, Thomas Gleixner wrote:
> > On Wed, 13 Jun 2018, Julien Thierry wrote:
> >> On 13/06/18 09:34, Peter Zijlstra wrote:
> >>> On Tue, Jun 12, 2018 at 05:57:23PM -0700, Ricardo Neri wrote:
> >>>> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> >>>> index 5426627..dbc5e02 100644
> >>>> --- a/include/linux/interrupt.h
> >>>> +++ b/include/linux/interrupt.h
> >>>> @@ -61,6 +61,8 @@
> >>>>    *                interrupt handler after suspending interrupts. For
> >>>> system
> >>>>    *                wakeup devices users need to implement wakeup
> >>>> detection in
> >>>>    *                their interrupt handlers.
> >>>> + * IRQF_DELIVER_AS_NMI - Configure interrupt to be delivered as
> >>>> non-maskable, if
> >>>> + *                supported by the chip.
> >>>>    */
> >>>
> >>> NAK on the first 6 patches. You really _REALLY_ don't want to expose
> >>> NMIs to this level.
> >>>
> >>
> >> I've been working on something similar on arm64 side, and effectively the one
> >> thing that might be common to arm64 and intel is the interface to set an
> >> interrupt as NMI. So I guess it would be nice to agree on the right approach
> >> for this.
> >>
> >> The way I did it was by introducing a new irq_state and let the irqchip driver
> >> handle most of the work (if it supports that state):
> >>
> >> https://lkml.org/lkml/2018/5/25/181
> >>
> >> This has not been ACKed nor NAKed. So I am just asking whether this is a more
> >> suitable approach, and if not, is there any suggestions on how to do this?
> > 
> > I really didn't pay attention to that as it's burried in the GIC/ARM series
> > which is usually Marc's playground.
> 
> I'm working my way through it ATM now that I have some brain cycles back.
> 
> > Adding NMI delivery support at low level architecture irq chip level is
> > perfectly fine, but the exposure of that needs to be restricted very
> > much. Adding it to the generic interrupt control interfaces is not going to
> > happen. That's doomed to begin with and a complete abuse of the interface
> > as the handler can not ever be used for that.
> 
> I can only agree with that. Allowing random driver to use request_irq()
> to make anything an NMI ultimately turns it into a complete mess ("hey,
> NMI is *faster*, let's use that"), and a potential source of horrible
> deadlocks.
> 
> What I'd find more palatable is a way for an irqchip to be able to
> prioritize some interrupts based on a set of architecturally-defined
> requirements, and a separate NMI requesting/handling framework that is
> separate from the IRQ API, as the overall requirements are likely to
> completely different.
> 
> It shouldn't have to be nearly as complex as the IRQ API, and require
> much stricter requirements in terms of what you can do there (flow
> handling should definitely be different).

Marc, Julien, do you plan to actively work on this? Would you mind keeping
me in the loop? I also need this work for this watchdog. In the meantime,
I will go through Julien's patches and try to adapt it to my work.

Thanks and BR,
Ricardo

^ permalink raw reply

* Re: [RFC PATCH 20/23] watchdog/hardlockup/hpet: Rotate interrupt among all monitored CPUs
From: Ricardo Neri @ 2018-06-15  2:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, H. Peter Anvin, Andi Kleen, Ashok Raj,
	Borislav Petkov, Tony Luck, Ravi V. Shankar, x86, sparclinux,
	linuxppc-dev, linux-kernel, Jacob Pan, Rafael J. Wysocki,
	Don Zickus, Nicholas Piggin, Michael Ellerman,
	Frederic Weisbecker, Alexei Starovoitov, Babu Moger,
	Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Byungchul Park, Paul E. McKenney, Luis R. Rodriguez, Waiman Long,
	Josh Poimboeuf, Randy Dunlap, Davidlohr Bueso, Christoffer Dall,
	Marc Zyngier, Kai-Heng Feng, Konrad Rzeszutek Wilk,
	David Rientjes, iommu
In-Reply-To: <alpine.DEB.2.21.1806131140560.2280@nanos.tec.linutronix.de>

On Wed, Jun 13, 2018 at 11:48:09AM +0200, Thomas Gleixner wrote:
> On Tue, 12 Jun 2018, Ricardo Neri wrote:
> > +	/* There are no CPUs to monitor. */
> > +	if (!cpumask_weight(&hdata->monitored_mask))
> > +		return NMI_HANDLED;
> > +
> >  	inspect_for_hardlockups(regs);
> >  
> > +	/*
> > +	 * Target a new CPU. Keep trying until we find a monitored CPU. CPUs
> > +	 * are addded and removed to this mask at cpu_up() and cpu_down(),
> > +	 * respectively. Thus, the interrupt should be able to be moved to
> > +	 * the next monitored CPU.
> > +	 */
> > +	spin_lock(&hld_data->lock);
> 
> Yuck. Taking a spinlock from NMI ...

I am sorry. I will look into other options for locking. Do you think rcu_lock
would help in this case? I need this locking because the CPUs being monitored
changes as CPUs come online and offline.

> 
> > +	for_each_cpu_wrap(cpu, &hdata->monitored_mask, smp_processor_id() + 1) {
> > +		if (!irq_set_affinity(hld_data->irq, cpumask_of(cpu)))
> > +			break;
> 
> ... and then calling into generic interrupt code which will take even more
> locks is completely broken.


I will into reworking how the destination of the interrupt is set.

Thanks and BR,

Ricardo

^ permalink raw reply

* Re: [RFC PATCH 12/23] kernel/watchdog: Introduce a struct for NMI watchdog operations
From: Ricardo Neri @ 2018-06-15  2:21 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, Ashok Raj, Borislav Petkov, Tony Luck,
	Ravi V. Shankar, x86, sparclinux, linuxppc-dev, linux-kernel,
	Jacob Pan, Don Zickus, Michael Ellerman, Frederic Weisbecker,
	Babu Moger, David S. Miller, Benjamin Herrenschmidt,
	Paul Mackerras, Mathieu Desnoyers, Masami Hiramatsu,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Luis R. Rodriguez, iommu
In-Reply-To: <20180614123250.230667f6@roar.ozlabs.ibm.com>

On Thu, Jun 14, 2018 at 12:32:50PM +1000, Nicholas Piggin wrote:
> On Wed, 13 Jun 2018 18:31:17 -0700
> Ricardo Neri <ricardo.neri-calderon@linux.intel.com> wrote:
> 
> > On Wed, Jun 13, 2018 at 09:52:25PM +1000, Nicholas Piggin wrote:
> > > On Wed, 13 Jun 2018 11:26:49 +0200 (CEST)
> > > Thomas Gleixner <tglx@linutronix.de> wrote:
> > >   
> > > > On Wed, 13 Jun 2018, Peter Zijlstra wrote:  
> > > > > On Wed, Jun 13, 2018 at 05:41:41PM +1000, Nicholas Piggin wrote:    
> > > > > > On Tue, 12 Jun 2018 17:57:32 -0700
> > > > > > Ricardo Neri <ricardo.neri-calderon@linux.intel.com> wrote:
> > > > > >     
> > > > > > > Instead of exposing individual functions for the operations of the NMI
> > > > > > > watchdog, define a common interface that can be used across multiple
> > > > > > > implementations.
> > > > > > > 
> > > > > > > The struct nmi_watchdog_ops is defined for such operations. These initial
> > > > > > > definitions include the enable, disable, start, stop, and cleanup
> > > > > > > operations.
> > > > > > > 
> > > > > > > Only a single NMI watchdog can be used in the system. The operations of
> > > > > > > this NMI watchdog are accessed via the new variable nmi_wd_ops. This
> > > > > > > variable is set to point the operations of the first NMI watchdog that
> > > > > > > initializes successfully. Even though at this moment, the only available
> > > > > > > NMI watchdog is the perf-based hardlockup detector. More implementations
> > > > > > > can be added in the future.    
> > > > > > 
> > > > > > Cool, this looks pretty nice at a quick glance. sparc and powerpc at
> > > > > > least have their own NMI watchdogs, it would be good to have those
> > > > > > converted as well.    
> > > > > 
> > > > > Yeah, agreed, this looks like half a patch.    
> > > > 
> > > > Though I'm not seeing the advantage of it. That kind of NMI watchdogs are
> > > > low level architecture details so having yet another 'ops' data structure
> > > > with a gazillion of callbacks, checks and indirections does not provide
> > > > value over the currently available weak stubs.  
> > > 
> > > The other way to go of course is librify the perf watchdog and make an
> > > x86 watchdog that selects between perf and hpet... I also probably
> > > prefer that for code such as this, but I wouldn't strongly object to
> > > ops struct if I'm not writing the code. It's not that bad is it?  
> > 
> > My motivation to add the ops was that the hpet and perf watchdog share
> > significant portions of code.
> 
> Right, a good motivation.
> 
> > I could look into creating the library for
> > common code and relocate the hpet watchdog into arch/x86 for the hpet-
> > specific parts.
> 
> If you can investigate that approach, that would be appreciated. I hope
> I did not misunderstand you there, Thomas.
> 
> Basically you would have perf infrastructure and hpet infrastructure,
> and then the x86 watchdog driver will use one or the other of those. The
> generic watchdog driver will be just a simple shim that uses the perf
> infrastructure. Then hopefully the powerpc driver would require almost
> no change.

Sure, I will try to structure code to minimize the changes to the powerpc
watchdog... without breaking the sparc one.

Thanks and BR,
Ricardo

^ permalink raw reply

* Re: [RFC PATCH 14/23] watchdog/hardlockup: Decouple the hardlockup detector from perf
From: Ricardo Neri @ 2018-06-15  2:23 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, Ashok Raj, Borislav Petkov, Tony Luck,
	Ravi V. Shankar, x86, sparclinux, linuxppc-dev, linux-kernel,
	Jacob Pan, Don Zickus, Michael Ellerman, Frederic Weisbecker,
	Babu Moger, David S. Miller, Benjamin Herrenschmidt,
	Paul Mackerras, Mathieu Desnoyers, Masami Hiramatsu,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Luis R. Rodriguez, iommu
In-Reply-To: <20180614114144.05891a04@roar.ozlabs.ibm.com>

On Thu, Jun 14, 2018 at 11:41:44AM +1000, Nicholas Piggin wrote:
> On Wed, 13 Jun 2018 18:19:01 -0700
> Ricardo Neri <ricardo.neri-calderon@linux.intel.com> wrote:
> 
> > On Wed, Jun 13, 2018 at 10:43:24AM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 12, 2018 at 05:57:34PM -0700, Ricardo Neri wrote:  
> > > > The current default implementation of the hardlockup detector assumes that
> > > > it is implemented using perf events.  
> > > 
> > > The sparc and powerpc things are very much not using perf.  
> > 
> > Isn't it true that the current hardlockup detector
> > (under kernel/watchdog_hld.c) is based on perf?
> 
> arch/powerpc/kernel/watchdog.c is a powerpc implementation that uses
> the kernel/watchdog_hld.c framework.
> 
> > As far as I understand,
> > this hardlockup detector is constructed using perf events for architectures
> > that don't provide an NMI watchdog. Perhaps I can be more specific and say
> > that this synthetized detector is based on perf.
> 
> The perf detector is like that, but we want NMI watchdogs to share
> the watchdog_hld code as much as possible even for arch specific NMI
> watchdogs, so that kernel and user interfaces and behaviour are
> consistent.
> 
> Other arch watchdogs like sparc are a little older so they are not
> using HLD. You don't have to change those for your series, but it
> would be good to bring them into the fold if possible at some time.
> IIRC sparc was slightly non-trivial because it has some differences
> in sysctl or cmdline APIs that we don't want to break.
> 
> But powerpc at least needs to be updated if you change hld apis.

I will look into updating at least the powerpc implementation as part
of these changes.

Thanks and BR,
Ricardo

^ permalink raw reply

* Re: [PATCH kernel 6/6] powerpc/powernv/ioda: Allocate indirect TCE levels on demand
From: Alexey Kardashevskiy @ 2018-06-15  7:15 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, kvm-ppc, kvm, Alex Williamson,
	Benjamin Herrenschmidt
In-Reply-To: <20180614123525.GB19339@umbus.fritz.box>

On 14/6/18 10:35 pm, David Gibson wrote:
> On Thu, Jun 14, 2018 at 04:35:18PM +1000, Alexey Kardashevskiy wrote:
>> On 12/6/18 2:17 pm, David Gibson wrote:
>>> On Fri, Jun 08, 2018 at 03:46:33PM +1000, Alexey Kardashevskiy wrote:
>>>> At the moment we allocate the entire TCE table, twice (hardware part and
>>>> userspace translation cache). This normally works as we normally have
>>>> contigous memory and the guest will map entire RAM for 64bit DMA.
>>>>
>>>> However if we have sparse RAM (one example is a memory device), then
>>>> we will allocate TCEs which will never be used as the guest only maps
>>>> actual memory for DMA. If it is a single level TCE table, there is nothing
>>>> we can really do but if it a multilevel table, we can skip allocating
>>>> TCEs we know we won't need.
>>>>
>>>> This adds ability to allocate only first level, saving memory.
>>>>
>>>> This changes iommu_table::free() to avoid allocating of an extra level;
>>>> iommu_table::set() will do this when needed.
>>>>
>>>> This adds @alloc parameter to iommu_table::exchange() to tell the callback
>>>> if it can allocate an extra level; the flag is set to "false" for
>>>> the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns
>>>> H_TOO_HARD.
>>>>
>>>> This still requires the entire table to be counted in mm::locked_vm.
>>>>
>>>> To be conservative, this only does on-demand allocation when
>>>> the usespace cache table is requested which is the case of VFIO.
>>>>
>>>> The example math for a system replicating a powernv setup with NVLink2
>>>> in a guest:
>>>> 16GB RAM mapped at 0x0
>>>> 128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000
>>>>
>>>> the table to cover that all with 64K pages takes:
>>>> (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB
>>>>
>>>> If we allocate only necessary TCE levels, we will only need:
>>>> (((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect
>>>> levels).
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>  arch/powerpc/include/asm/iommu.h              |  7 ++-
>>>>  arch/powerpc/platforms/powernv/pci.h          |  6 ++-
>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c           |  4 +-
>>>>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 69 ++++++++++++++++++++-------
>>>>  arch/powerpc/platforms/powernv/pci-ioda.c     |  8 ++--
>>>>  drivers/vfio/vfio_iommu_spapr_tce.c           |  2 +-
>>>>  6 files changed, 69 insertions(+), 27 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>>>> index 4bdcf22..daa3ee5 100644
>>>> --- a/arch/powerpc/include/asm/iommu.h
>>>> +++ b/arch/powerpc/include/asm/iommu.h
>>>> @@ -70,7 +70,7 @@ struct iommu_table_ops {
>>>>  			unsigned long *hpa,
>>>>  			enum dma_data_direction *direction);
>>>>  
>>>> -	__be64 *(*useraddrptr)(struct iommu_table *tbl, long index);
>>>> +	__be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc);
>>>>  #endif
>>>>  	void (*clear)(struct iommu_table *tbl,
>>>>  			long index, long npages);
>>>> @@ -122,10 +122,13 @@ struct iommu_table {
>>>>  	__be64 *it_userspace; /* userspace view of the table */
>>>>  	struct iommu_table_ops *it_ops;
>>>>  	struct kref    it_kref;
>>>> +	int it_nid;
>>>>  };
>>>>  
>>>> +#define IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry) \
>>>> +		((tbl)->it_ops->useraddrptr((tbl), (entry), false))
>>>
>>> Is real mode really the only case where you want to inhibit new
>>> allocations?  I would have thought some paths would be read-only and
>>> you wouldn't want to allocate, even in virtual mode.
>>
>> There are paths when I do not want allocation but I can figure out that
>> from dma direction flag, for example, I am cleaning up the table and I do
>> not want any extra  allocation to happen there and they do happen because I
>> made a mistake so I'll repost. Other than that, this @alloc flag is for
>> real mode only.
> 
> Hm, ok.
> 
> Something just occurred to me: IIRC, going from the vmalloc() to the
> explicit table structure was to avoid allocating pages for memory
> regions that aren't there due to sparseness.  But.. won't you get that
> with vmalloc() anyway, if the pages for the gaps never get
> instantiated?


Nope, vmalloc() always allocates all the pages, these are not swapable
(checked with Paul).


-- 
Alexey

^ permalink raw reply

* Re: [RFC PATCH 03/23] genirq: Introduce IRQF_DELIVER_AS_NMI
From: Julien Thierry @ 2018-06-15  8:01 UTC (permalink / raw)
  To: Ricardo Neri, Marc Zyngier
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, Ashok Raj, Borislav Petkov, Tony Luck,
	Ravi V. Shankar, x86, sparclinux, linuxppc-dev, linux-kernel,
	Jacob Pan, Daniel Lezcano, Andrew Morton,
	Levin, Alexander (Sasha Levin), Randy Dunlap, Masami Hiramatsu,
	Bartosz Golaszewski, Doug Berger, Palmer Dabbelt, iommu
In-Reply-To: <20180615021213.GC11625@voyager>

Hi Ricardo,

On 15/06/18 03:12, Ricardo Neri wrote:
> On Wed, Jun 13, 2018 at 11:06:25AM +0100, Marc Zyngier wrote:
>> On 13/06/18 10:20, Thomas Gleixner wrote:
>>> On Wed, 13 Jun 2018, Julien Thierry wrote:
>>>> On 13/06/18 09:34, Peter Zijlstra wrote:
>>>>> On Tue, Jun 12, 2018 at 05:57:23PM -0700, Ricardo Neri wrote:
>>>>>> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
>>>>>> index 5426627..dbc5e02 100644
>>>>>> --- a/include/linux/interrupt.h
>>>>>> +++ b/include/linux/interrupt.h
>>>>>> @@ -61,6 +61,8 @@
>>>>>>     *                interrupt handler after suspending interrupts. For
>>>>>> system
>>>>>>     *                wakeup devices users need to implement wakeup
>>>>>> detection in
>>>>>>     *                their interrupt handlers.
>>>>>> + * IRQF_DELIVER_AS_NMI - Configure interrupt to be delivered as
>>>>>> non-maskable, if
>>>>>> + *                supported by the chip.
>>>>>>     */
>>>>>
>>>>> NAK on the first 6 patches. You really _REALLY_ don't want to expose
>>>>> NMIs to this level.
>>>>>
>>>>
>>>> I've been working on something similar on arm64 side, and effectively the one
>>>> thing that might be common to arm64 and intel is the interface to set an
>>>> interrupt as NMI. So I guess it would be nice to agree on the right approach
>>>> for this.
>>>>
>>>> The way I did it was by introducing a new irq_state and let the irqchip driver
>>>> handle most of the work (if it supports that state):
>>>>
>>>> https://lkml.org/lkml/2018/5/25/181
>>>>
>>>> This has not been ACKed nor NAKed. So I am just asking whether this is a more
>>>> suitable approach, and if not, is there any suggestions on how to do this?
>>>
>>> I really didn't pay attention to that as it's burried in the GIC/ARM series
>>> which is usually Marc's playground.
>>
>> I'm working my way through it ATM now that I have some brain cycles back.
>>
>>> Adding NMI delivery support at low level architecture irq chip level is
>>> perfectly fine, but the exposure of that needs to be restricted very
>>> much. Adding it to the generic interrupt control interfaces is not going to
>>> happen. That's doomed to begin with and a complete abuse of the interface
>>> as the handler can not ever be used for that.
>>
>> I can only agree with that. Allowing random driver to use request_irq()
>> to make anything an NMI ultimately turns it into a complete mess ("hey,
>> NMI is *faster*, let's use that"), and a potential source of horrible
>> deadlocks.
>>
>> What I'd find more palatable is a way for an irqchip to be able to
>> prioritize some interrupts based on a set of architecturally-defined
>> requirements, and a separate NMI requesting/handling framework that is
>> separate from the IRQ API, as the overall requirements are likely to
>> completely different.
>>
>> It shouldn't have to be nearly as complex as the IRQ API, and require
>> much stricter requirements in terms of what you can do there (flow
>> handling should definitely be different).
> 
> Marc, Julien, do you plan to actively work on this? Would you mind keeping
> me in the loop? I also need this work for this watchdog. In the meantime,
> I will go through Julien's patches and try to adapt it to my work.

We are going to work on this and of course your input is most welcome to 
make sure we have an interface usable across different architectures.

In my patches, I'm not sure there is much to adapt to your work as most 
of it is arch specific (although I wont say no to another pair of eyes 
looking at them). From what I've seen of your patches, the point where 
we converge is that need for some code to be able to tell the irqchip "I 
want that particular interrupt line to be treated/setup as an NMI".

We'll make sure to keep you in the loop for discussions/suggestions on this.

Thanks,

-- 
Julien Thierry

^ permalink raw reply

* Re: [PATCH] media: fsl-viu: Use proper check for presence of {out,in}_be32()
From: Hans Verkuil @ 2018-06-15  8:10 UTC (permalink / raw)
  To: Geert Uytterhoeven, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Mauro Carvalho Chehab
  Cc: Arnd Bergmann, linuxppc-dev, linux-media, linux-kernel
In-Reply-To: <1528451328-21316-1-git-send-email-geert@linux-m68k.org>

On 08/06/18 11:48, Geert Uytterhoeven wrote:
> When compile-testing on m68k or microblaze:
> 
>     drivers/media/platform/fsl-viu.c:41:1: warning: "out_be32" redefined
>     drivers/media/platform/fsl-viu.c:42:1: warning: "in_be32" redefined
> 
> Fix this by replacing the check for PowerPC by checks for the presence
> of {out,in}_be32().
> 
> As PowerPC implements the be32 accessors using inline functions instead
> of macros, identity definions are added for all accessors to make the
> above checks work.
> 
> Fixes: 29d750686331a1a9 ("media: fsl-viu: allow building it with COMPILE_TEST")
> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>

Reviewed-by: Hans Verkuil <hans.verkuil@cisco.com>

Should this go through the media tree or powerpc tree? Either way works for me.

Regards,

	Hans

> ---
> Compile-tested on m68k, microblaze, and powerpc.
> Assembler output before/after compared for powerpc.
> ---
>  arch/powerpc/include/asm/io.h    | 14 ++++++++++++++
>  drivers/media/platform/fsl-viu.c |  4 +++-
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
> index e0331e7545685c5f..3741183ae09349f1 100644
> --- a/arch/powerpc/include/asm/io.h
> +++ b/arch/powerpc/include/asm/io.h
> @@ -222,6 +222,20 @@ static inline void out_be64(volatile u64 __iomem *addr, u64 val)
>  #endif
>  #endif /* __powerpc64__ */
>  
> +#define in_be16 in_be16
> +#define in_be32 in_be32
> +#define in_be64 in_be64
> +#define in_le16 in_le16
> +#define in_le32 in_le32
> +#define in_le64 in_le64
> +
> +#define out_be16 out_be16
> +#define out_be32 out_be32
> +#define out_be64 out_be64
> +#define out_le16 out_le16
> +#define out_le32 out_le32
> +#define out_le64 out_le64
> +
>  /*
>   * Low level IO stream instructions are defined out of line for now
>   */
> diff --git a/drivers/media/platform/fsl-viu.c b/drivers/media/platform/fsl-viu.c
> index e41510ce69a40815..5d5e030c9c980647 100644
> --- a/drivers/media/platform/fsl-viu.c
> +++ b/drivers/media/platform/fsl-viu.c
> @@ -37,8 +37,10 @@
>  #define VIU_VERSION		"0.5.1"
>  
>  /* Allow building this driver with COMPILE_TEST */
> -#ifndef CONFIG_PPC
> +#ifndef out_be32
>  #define out_be32(v, a)	iowrite32be(a, (void __iomem *)v)
> +#endif
> +#ifndef in_be32
>  #define in_be32(a)	ioread32be((void __iomem *)a)
>  #endif
>  
> 

^ permalink raw reply

* Re: [PATCH kernel 0/2] powerpc/powernv: Rework sketchy bypass
From: Alexey Kardashevskiy @ 2018-06-15  9:01 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alistair Popple, Benjamin Herrenschmidt, Russell Currey
In-Reply-To: <20180601081028.29401-1-aik@ozlabs.ru>

On Fri,  1 Jun 2018 18:10:26 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> I came across this adhoc implementation and thought it could use some
> polishing. This fixes memory leaks and add P9 support. Based on the current
> upstream.
> 
> 
> Please comment. Thanks.


Ping?


> 
> 
> 
> Alexey Kardashevskiy (2):
>   powerpc/powernv: Reuse existing TCE code for sketchy bypass
>   powerpc/powernv: Define PHB4 type and enable sketchy bypass on POWER9
> 
>  arch/powerpc/platforms/powernv/pci.h      |  1 +
>  arch/powerpc/platforms/powernv/pci-ioda.c | 76 +++++++++++++++++--------------
>  2 files changed, 44 insertions(+), 33 deletions(-)
> 



--
Alexey

^ permalink raw reply

* Re: [RFC PATCH 17/23] watchdog/hardlockup/hpet: Convert the timer's interrupt to NMI
From: Thomas Gleixner @ 2018-06-15  9:19 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Ingo Molnar, H. Peter Anvin, Andi Kleen, Ashok Raj,
	Borislav Petkov, Tony Luck, Ravi V. Shankar, x86, sparclinux,
	linuxppc-dev, linux-kernel, Jacob Pan, Rafael J. Wysocki,
	Don Zickus, Nicholas Piggin, Michael Ellerman,
	Frederic Weisbecker, Alexei Starovoitov, Babu Moger,
	Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Byungchul Park, Paul E. McKenney, Luis R. Rodriguez, Waiman Long,
	Josh Poimboeuf, Randy Dunlap, Davidlohr Bueso, Christoffer Dall,
	Marc Zyngier, Kai-Heng Feng, Konrad Rzeszutek Wilk,
	David Rientjes, iommu
In-Reply-To: <20180615020314.GA11625@voyager>

On Thu, 14 Jun 2018, Ricardo Neri wrote:
> On Wed, Jun 13, 2018 at 11:40:00AM +0200, Thomas Gleixner wrote:
> > On Tue, 12 Jun 2018, Ricardo Neri wrote:
> > > @@ -183,6 +184,8 @@ static irqreturn_t hardlockup_detector_irq_handler(int irq, void *data)
> > >  	if (!(hdata->flags & HPET_DEV_PERI_CAP))
> > >  		kick_timer(hdata);
> > >  
> > > +	pr_err("This interrupt should not have happened. Ensure delivery mode is NMI.\n");
> > 
> > Eeew.
> 
> If you don't mind me asking. What is the problem with this error message?

The problem is not the error message. The problem is the abuse of
request_irq() and the fact that this irq handler function exists in the
first place for something which is NMI based.

> > And in case that the HPET does not support periodic mode this reprogramms
> > the timer on every NMI which means that while perf is running the watchdog
> > will never ever detect anything.
> 
> Yes. I see that this is wrong. With MSI interrupts, as far as I can
> see, there is not a way to make sure that the HPET timer caused the NMI
> perhaps the only option is to use an IO APIC interrupt and read the
> interrupt status register.
> 
> > Aside of that, reading TWO HPET registers for every NMI is insane. HPET
> > access is horribly slow, so any high frequency perf monitoring will take a
> > massive performance hit.
> 
> If an IO APIC interrupt is used, only HPET register (the status register)
> would need to be read for every NMI. Would that be more acceptable? Otherwise,
> there is no way to determine if the HPET cause the NMI.

You need level trigger for the HPET status register to be useful at all
because in edge mode the interrupt status bits read always 0.

That means you have to fiddle with the IOAPIC acknowledge magic from NMI
context. Brilliant idea. If the NMI hits in the middle of a regular
io_apic_read() then the interrupted code will endup with the wrong index
register. Not to talk about the fun which the affinity rotation from NMI
context would bring.

Do not even think about using IOAPIC and level for this.

> Alternatively, there could be a counter that skips reading the HPET status
> register (and the detection of hardlockups) for every X NMIs. This would
> reduce the overall frequency of HPET register reads.

Great plan. So if the watchdog is the only NMI (because perf is off) then
you delay the watchdog detection by that count.

You neither can do a time based check, because time might be corrupted and
then you end up in lala land as well.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH] powerpc/mm: Fix kernel crash on page table free
From: Christophe Leroy @ 2018-06-14 21:54 UTC (permalink / raw)
  To: Aneesh Kumar K.V, benh, paulus, mpe; +Cc: linuxppc-dev
In-Reply-To: <20180530123225.7732-1-aneesh.kumar@linux.ibm.com>



On 05/30/2018 12:32 PM, Aneesh Kumar K.V wrote:
> Fix the below crash on BookE 64. pgtable_page_dtor expects struct page *arg.
> 
> Also call the destructor on non book3s platforms correctly. This free up the
> split ptl locks correctly if we had allocated them before.
> 
> Call Trace:
> [c0000000f30c7520] [c00000000021eeec] .kmem_cache_free+0x9c/0x44c (unreliable)
> [c0000000f30c75c0] [c0000000001ee07c] .ptlock_free+0x1c/0x30
> [c0000000f30c7630] [c0000000001ee260] .tlb_remove_table+0xdc/0x224
> [c0000000f30c76c0] [c0000000001ee640] .free_pgd_range+0x298/0x500
> [c0000000f30c77d0] [c000000000232afc] .shift_arg_pages+0x10c/0x1e0
> [c0000000f30c7910] [c000000000232dd0] .setup_arg_pages+0x200/0x25c
> [c0000000f30c79c0] [c0000000002ad4fc] .load_elf_binary+0x450/0x16c8
> [c0000000f30c7b10] [c000000000234914] .search_binary_handler.part.11+0x9c/0x248
> [c0000000f30c7bb0] [c00000000023595c] .do_execveat_common.isra.13+0x868/0xc18
> [c0000000f30c7cb0] [c00000000000184c] .run_init_process+0x34/0x4c
> [c0000000f30c7d30] [c000000000001880] .try_to_run_init_process+0x1c/0x68
> [c0000000f30c7db0] [c000000000002bd8] .kernel_init+0xdc/0x130
> [c0000000f30c7e30] [c0000000000009dc] .ret_from_kernel_thread+0x58/0x7c
> 
> Fixes: 702346768 ("powerpc/mm/nohash: Remove pte fragment dependency from nohash")
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>   arch/powerpc/include/asm/book3s/32/pgalloc.h | 1 +
>   arch/powerpc/include/asm/nohash/32/pgalloc.h | 1 +
>   arch/powerpc/include/asm/nohash/64/pgalloc.h | 2 +-
>   3 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
> index 5073cc75f1c8..6a6673907e45 100644
> --- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
> +++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
> @@ -99,6 +99,7 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
>   static inline void pgtable_free(void *table, unsigned index_size)
>   {
>   	if (!index_size) {
> +		pgtable_page_dtor(virt_to_page(table));

__pte_free_tlb() already calls pgtable_page_dtor(table) before calling
pgtable_free() via pgtable_free_tlb().

Is it normal to call it twice ?

>   		free_page((unsigned long)table);
>   	} else {
>   		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
> diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h b/arch/powerpc/include/asm/nohash/32/pgalloc.h
> index 29d37bd1f3b3..1707781d2f20 100644
> --- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
> +++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
> @@ -100,6 +100,7 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
>   static inline void pgtable_free(void *table, unsigned index_size)
>   {
>   	if (!index_size) {
> +		pgtable_page_dtor(virt_to_page(table));

Same here

Christophe

>   		free_page((unsigned long)table);
>   	} else {
>   		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
> diff --git a/arch/powerpc/include/asm/nohash/64/pgalloc.h b/arch/powerpc/include/asm/nohash/64/pgalloc.h
> index 21624ff1f065..0e693f322cb2 100644
> --- a/arch/powerpc/include/asm/nohash/64/pgalloc.h
> +++ b/arch/powerpc/include/asm/nohash/64/pgalloc.h
> @@ -133,7 +133,7 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
>   static inline void pgtable_free(void *table, int shift)
>   {
>   	if (!shift) {
> -		pgtable_page_dtor(table);
> +		pgtable_page_dtor(virt_to_page(table));
>   		free_page((unsigned long)table);
>   	} else {
>   		BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
> 

^ permalink raw reply

* Re: [PATCH] powerpc/mm: Fix kernel crash on page table free
From: Aneesh Kumar K.V @ 2018-06-15 10:15 UTC (permalink / raw)
  To: Christophe Leroy, benh, paulus, mpe; +Cc: linuxppc-dev
In-Reply-To: <1bbbfc4d-c47c-2383-454c-430a014beb9e@c-s.fr>

On 06/15/2018 03:24 AM, Christophe Leroy wrote:
> 
> 
> On 05/30/2018 12:32 PM, Aneesh Kumar K.V wrote:
>> Fix the below crash on BookE 64. pgtable_page_dtor expects struct page 
>> *arg.
>>
>> Also call the destructor on non book3s platforms correctly. This free 
>> up the
>> split ptl locks correctly if we had allocated them before.
>>
>> Call Trace:
>> [c0000000f30c7520] [c00000000021eeec] .kmem_cache_free+0x9c/0x44c 
>> (unreliable)
>> [c0000000f30c75c0] [c0000000001ee07c] .ptlock_free+0x1c/0x30
>> [c0000000f30c7630] [c0000000001ee260] .tlb_remove_table+0xdc/0x224
>> [c0000000f30c76c0] [c0000000001ee640] .free_pgd_range+0x298/0x500
>> [c0000000f30c77d0] [c000000000232afc] .shift_arg_pages+0x10c/0x1e0
>> [c0000000f30c7910] [c000000000232dd0] .setup_arg_pages+0x200/0x25c
>> [c0000000f30c79c0] [c0000000002ad4fc] .load_elf_binary+0x450/0x16c8
>> [c0000000f30c7b10] [c000000000234914] 
>> .search_binary_handler.part.11+0x9c/0x248
>> [c0000000f30c7bb0] [c00000000023595c] 
>> .do_execveat_common.isra.13+0x868/0xc18
>> [c0000000f30c7cb0] [c00000000000184c] .run_init_process+0x34/0x4c
>> [c0000000f30c7d30] [c000000000001880] .try_to_run_init_process+0x1c/0x68
>> [c0000000f30c7db0] [c000000000002bd8] .kernel_init+0xdc/0x130
>> [c0000000f30c7e30] [c0000000000009dc] .ret_from_kernel_thread+0x58/0x7c
>>
>> Fixes: 702346768 ("powerpc/mm/nohash: Remove pte fragment dependency 
>> from nohash")
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   arch/powerpc/include/asm/book3s/32/pgalloc.h | 1 +
>>   arch/powerpc/include/asm/nohash/32/pgalloc.h | 1 +
>>   arch/powerpc/include/asm/nohash/64/pgalloc.h | 2 +-
>>   3 files changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h 
>> b/arch/powerpc/include/asm/book3s/32/pgalloc.h
>> index 5073cc75f1c8..6a6673907e45 100644
>> --- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
>> +++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
>> @@ -99,6 +99,7 @@ static inline void pte_free(struct mm_struct *mm, 
>> pgtable_t ptepage)
>>   static inline void pgtable_free(void *table, unsigned index_size)
>>   {
>>       if (!index_size) {
>> +        pgtable_page_dtor(virt_to_page(table));
> 
> __pte_free_tlb() already calls pgtable_page_dtor(table) before calling
> pgtable_free() via pgtable_free_tlb().
> 
> Is it normal to call it twice ?

No. We should call pgtable_page_dtor only in the rcu callback. So we 
should remove it from __pte_free_tlb().

> 
>>           free_page((unsigned long)table);
>>       } else {
>>           BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
>> diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h 
>> b/arch/powerpc/include/asm/nohash/32/pgalloc.h
>> index 29d37bd1f3b3..1707781d2f20 100644
>> --- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
>> +++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
>> @@ -100,6 +100,7 @@ static inline void pte_free(struct mm_struct *mm, 
>> pgtable_t ptepage)
>>   static inline void pgtable_free(void *table, unsigned index_size)
>>   {
>>       if (!index_size) {
>> +        pgtable_page_dtor(virt_to_page(table));
> 
> Same here

I don't have the facility to test anything other than book3s64. Can you 
do a patch for this?

-aneesh

^ permalink raw reply

* Re: [RFC PATCH 20/23] watchdog/hardlockup/hpet: Rotate interrupt among all monitored CPUs
From: Thomas Gleixner @ 2018-06-15 10:29 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Ingo Molnar, H. Peter Anvin, Andi Kleen, Ashok Raj,
	Borislav Petkov, Tony Luck, Ravi V. Shankar, x86, sparclinux,
	linuxppc-dev, linux-kernel, Jacob Pan, Rafael J. Wysocki,
	Don Zickus, Nicholas Piggin, Michael Ellerman,
	Frederic Weisbecker, Alexei Starovoitov, Babu Moger,
	Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Byungchul Park, Paul E. McKenney, Luis R. Rodriguez, Waiman Long,
	Josh Poimboeuf, Randy Dunlap, Davidlohr Bueso, Christoffer Dall,
	Marc Zyngier, Kai-Heng Feng, Konrad Rzeszutek Wilk,
	David Rientjes, iommu
In-Reply-To: <20180615021629.GD11625@voyager>

On Thu, 14 Jun 2018, Ricardo Neri wrote:
> On Wed, Jun 13, 2018 at 11:48:09AM +0200, Thomas Gleixner wrote:
> > On Tue, 12 Jun 2018, Ricardo Neri wrote:
> > > +	/* There are no CPUs to monitor. */
> > > +	if (!cpumask_weight(&hdata->monitored_mask))
> > > +		return NMI_HANDLED;
> > > +
> > >  	inspect_for_hardlockups(regs);
> > >  
> > > +	/*
> > > +	 * Target a new CPU. Keep trying until we find a monitored CPU. CPUs
> > > +	 * are addded and removed to this mask at cpu_up() and cpu_down(),
> > > +	 * respectively. Thus, the interrupt should be able to be moved to
> > > +	 * the next monitored CPU.
> > > +	 */
> > > +	spin_lock(&hld_data->lock);
> > 
> > Yuck. Taking a spinlock from NMI ...
> 
> I am sorry. I will look into other options for locking. Do you think rcu_lock
> would help in this case? I need this locking because the CPUs being monitored
> changes as CPUs come online and offline.

Sure, but you _cannot_ take any locks in NMI context which are also taken
in !NMI context. And RCU will not help either. How so? The NMI can hit
exactly before the CPU bit is cleared and then the CPU goes down. So RCU
_cannot_ protect anything.

All you can do there is make sure that the TIMn_CONF is only ever accessed
in !NMI code. Then you can stop the timer _before_ a CPU goes down and make
sure that the eventually on the fly NMI is finished. After that you can
fiddle with the CPU mask and restart the timer. Be aware that this is going
to be more corner case handling that actual functionality.

> > > +	for_each_cpu_wrap(cpu, &hdata->monitored_mask, smp_processor_id() + 1) {
> > > +		if (!irq_set_affinity(hld_data->irq, cpumask_of(cpu)))
> > > +			break;
> > 
> > ... and then calling into generic interrupt code which will take even more
> > locks is completely broken.
> 
> I will into reworking how the destination of the interrupt is set.

You have to consider two cases:

 1) !remapped mode:

    That's reasonably simple because you just have to deal with the HPET
    TIMERn_PROCMSG_ROUT register. But then you need to do this directly and
    not through any of the existing interrupt facilities.

 2) remapped mode:

    That's way more complex as you _cannot_ ever do anything which touches
    the IOMMU and the related tables.

    So you'd need to reserve an IOMMU remapping entry for each CPU upfront,
    store the resulting value for the HPET TIMERn_PROCMSG_ROUT register in
    per cpu storage and just modify that one from NMI.

    Though there might be subtle side effects involved, which are related to
    the acknowledge part. You need to talk to the IOMMU wizards first.

All in all, the idea itself is interesting, but the envisioned approach of
round robin and no fast accessible NMI reason detection is going to create
more problems than it solves.

This all could have been avoided if Intel hadn't decided to reuse the APIC
timer registers for the TSC deadline timer. If both would be available we'd
have a CPU local fast accessible watchdog timer when TSC deadline is used
for general timer purposes. But why am I complaining? I've resigned to the
fact that timers are designed^Wcobbled together by janitors long ago.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH] powerpc/64s: Report SLB multi-hit rather than parity error
From: Michael Ellerman @ 2018-06-15 11:37 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linuxppc-dev
In-Reply-To: <20180614004036.7c71cf1b@roar.ozlabs.ibm.com>

Nicholas Piggin <npiggin@gmail.com> writes:
> On Wed, 13 Jun 2018 23:24:14 +1000
> Michael Ellerman <mpe@ellerman.id.au> wrote:
>
>> When we take an SLB multi-hit on bare metal, we see both the multi-hit
>> and parity error bits set in DSISR. The user manuals indicates this is
>> expected to always happen on Power8, whereas on Power9 it says a
>> multi-hit will "usually" also cause a parity error.
>> 
>> We decide what to do based on the various error tables in mce_power.c,
>> and because we process them in order and only report the first, we
>> currently always report a parity error but not the multi-hit, eg:
>> 
>>   Severe Machine check interrupt [Recovered]
>>     Initiator: CPU
>>     Error type: SLB [Parity]
>>       Effective address: c000000ffffd4300
>> 
>> Although this is correct, it leaves the user wondering why they got a
>> parity error. It would be clearer instead if we reported the
>> multi-hit because that is more likely to be simply a software bug,
>> whereas a true parity error is possibly an indication of a bad core.
>> 
>> We can do that simply by reordering the error tables so that multi-hit
>> appears before parity. That doesn't affect the error recovery at all,
>> because we flush the SLB either way.
>
> Yeah this is a good idea. I wonder if there are any other conditions
> like this that should be reordered.

Yeah good point, this one just caught my eye because I was testing it.
Ideally it wouldn't matter and we could actually report multiple, but
that would be a bit of a bigger change.

> I think the i-side should not have to be changed here because it
> matches the value not bits, so that shouldn't matter.

Ah OK, will check.

> A bit of a shame we don't report i/d side, and ideally we'd be able
> to report multiple conditions. The reporting APIs really want to be
> massaged a bit, but for now this is a good step.

Ah snap, yep, more detail & multiple conditions would be nice.

I don't really understand the way we do the reporting now. The
struct machine_check_event is all carefully laid out with reserved
fields and a version number and everything as if it's an ABI. But AFAICS
it's purely internal to the kernel.

And then we have struct mce_error_info, but that's a separate thing and
struct machine_check_event doesn't contain one of them?

cheers

^ permalink raw reply

* Re: UBSAN: Undefined behaviour in ../include/linux/percpu_counter.h:137:13
From: Michael Ellerman @ 2018-06-15 12:34 UTC (permalink / raw)
  To: Mathieu Malaterre; +Cc: linuxppc-dev
In-Reply-To: <CA+7wUswJcba_Yi63VS5kX+wdbBgsZfkptU0E8J7w3t602bz5sA@mail.gmail.com>

Mathieu Malaterre <malat@debian.org> writes:
> On Wed, Jun 13, 2018 at 10:43 AM Mathieu Malaterre <malat@debian.org> wrote:
>> On Wed, Jun 13, 2018 at 3:43 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>> > Mathieu Malaterre <malat@debian.org> writes:
>> > > Hi there,
>> > >
>> > > I have a reproducible UBSAN appearing in dmesg after a while on my G4
>> > > (*). Could anyone suggest a way to diagnose the actual root issue here
>> > > (or is it just a false positive) ?
>> >
>> > It looks like a real overflow, I guess the question is why are we seeing it.
>> >
>> > The first thing to work out would be what exactly is overflowing.
>> >
>> > Is it in here?
>> >
>> >         cfqg_stats_update_completion(cfqq->cfqg, rq->start_time_ns,
>> >                                      rq->io_start_time_ns, rq->cmd_flags);
>> >
>> >
>> > If so that would suggest something is taking multiple hours to complete,
>> > which seems unlikely. Is time going backward?
>>
>> There is also something suspicious in the kern.log file:
>>
>> Jun 12 20:09:04 debian kernel: [    5.504182]
>> ================================================================================
>> Jun 12 20:09:04 debian kernel: [    5.508945] UBSAN: Undefined
>> behaviour in ../drivers/rtc/rtc-lib.c:87:22
>> Jun 12 20:09:04 debian kernel: [    5.513658] signed integer overflow:
>> Jun 12 20:09:04 debian kernel: [    5.518211] 1193024 * 3600 cannot be
>> represented in type 'int'
>> Jun 12 20:09:04 debian kernel: [    5.522866] CPU: 0 PID: 1 Comm:
>> swapper Not tainted 4.17.0+ #1
>> Jun 12 20:09:04 debian kernel: [    5.527567] Call Trace:
>> Jun 12 20:09:04 debian kernel: [    5.532200] [df4e7b00] [c0481074]
>> ubsan_epilogue+0x18/0x4c (unreliable)
>> Jun 12 20:09:04 debian kernel: [    5.537019] [df4e7b10] [c0481a14]
>> handle_overflow+0xbc/0xdc
>> Jun 12 20:09:04 debian kernel: [    5.541832] [df4e7b90] [c060d698]
>> rtc_time64_to_tm+0x344/0x388
>> Jun 12 20:09:04 debian kernel: [    5.546655] [df4e7bd0] [c001076c]
>> rtc_generic_get_time+0x2c/0x40
>> Jun 12 20:09:04 debian kernel: [    5.551477] [df4e7be0] [c06113d4]
>> __rtc_read_time+0x70/0x13c
>> Jun 12 20:09:04 debian kernel: [    5.556288] [df4e7c00] [c061150c]
>> rtc_read_time+0x6c/0x130
>> Jun 12 20:09:04 debian kernel: [    5.561088] [df4e7c30] [c061271c]
>> __rtc_read_alarm+0x34/0x684
>> Jun 12 20:09:04 debian kernel: [    5.565884] [df4e7ce0] [c060f234]
>> rtc_device_register+0x88/0x218
>> Jun 12 20:09:04 debian kernel: [    5.570695] [df4e7d40] [c060f428]
>> devm_rtc_device_register+0x64/0xc4
>> Jun 12 20:09:04 debian kernel: [    5.575528] [df4e7d60] [c09d15d4]
>> generic_rtc_probe+0x50/0x78
>> Jun 12 20:09:04 debian kernel: [    5.580359] [df4e7d70] [c055e4a4]
>> platform_drv_probe+0xa8/0x128
>> Jun 12 20:09:04 debian kernel: [    5.585210] [df4e7d90] [c0559d28]
>> driver_probe_device+0x354/0x6fc
>> Jun 12 20:09:04 debian kernel: [    5.590064] [df4e7dd0] [c055a270]
>> __driver_attach+0x1a0/0x22c
>> Jun 12 20:09:04 debian kernel: [    5.594917] [df4e7df0] [c0555b70]
>> bus_for_each_dev+0x84/0xdc
>> Jun 12 20:09:04 debian kernel: [    5.599750] [df4e7e20] [c0558420]
>> bus_add_driver+0x188/0x348
>> Jun 12 20:09:04 debian kernel: [    5.604584] [df4e7e40] [c055b7b4]
>> driver_register+0xa0/0x18c
>> Jun 12 20:09:04 debian kernel: [    5.609433] [df4e7e50] [c055e950]
>> __platform_driver_probe+0x8c/0x198
>> Jun 12 20:09:04 debian kernel: [    5.614330] [df4e7e70] [c0005800]
>> do_one_initcall+0x64/0x280
>> Jun 12 20:09:04 debian kernel: [    5.619237] [df4e7ee0] [c0997c04]
>> kernel_init_freeable+0x3a4/0x444
>> Jun 12 20:09:04 debian kernel: [    5.624145] [df4e7f30] [c00049f8]
>> kernel_init+0x24/0x118
>> Jun 12 20:09:04 debian kernel: [    5.629029] [df4e7f40] [c001b1c4]
>> ret_from_kernel_thread+0x14/0x1c
>> Jun 12 20:09:04 debian kernel: [    5.633878]
>> ================================================================================
>>
>>
>> Grep-ing all leads to:
>>
>> $ grep  "cannot be represented" kern.log | colrm 1 45|sort -u
>>  1193022 * 3600 cannot be represented in type 'int'
>>  1193024 * 3600 cannot be represented in type 'int'
>>  1193032 * 3600 cannot be represented in type 'int'
>>  1193033 * 3600 cannot be represented in type 'int'
>>  1193034 * 3600 cannot be represented in type 'int'
>>  1193035 * 3600 cannot be represented in type 'int'
>>
>> How come tm_hour can store a value of 1193035 ?
>
> It appears that I am getting a negative value for  time64_t :

That sounds bad :)

Thanks for tracking it down.

cheers

^ permalink raw reply

* Re: [powerpc/powervmc]kernel BUG at arch/powerpc/mm/pgtable-book3s64.c:414!
From: Michael Ellerman @ 2018-06-15 13:44 UTC (permalink / raw)
  To: vrbagal1, linuxppc-dev; +Cc: sachinp, aneesh.kumar, Nicholas Piggin
In-Reply-To: <4ca83e8bdd7d9f573a51837fc66a2c45@linux.vnet.ibm.com>

vrbagal1 <vrbagal1@linux.vnet.ibm.com> writes:

> Hi,
>
> Observing kernel bug followed by kernel oops and system reboots, while 
> running kselftest on Power8 LPAR.
>
> Machine Details: Power8 LPAR
> Git Tree: 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> Commit ID: f5b7769eb0400ec5217a47e41148a9f816ca1f9f
> Kernel version: 4.17.0-autotest
> GCC version: (gcc version 6.3.1 20161221 (Red Hat 6.3.1-1) (GCC))
>

This is fixed by your patch Aneesh?

http://patchwork.ozlabs.org/patch/929325/

cheers

^ permalink raw reply

* [PATCH 1/3] powerpc/tm: Remove msr_tm_active()
From: Breno Leitao @ 2018-06-15 17:37 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Breno Leitao

Currently msr_tm_active() is a wrapper around MSR_TM_ACTIVE() if
CONFIG_PPC_TRANSACTIONAL_MEM is set, or it is just a function that
returns false if CONFIG_PPC_TRANSACTIONAL_MEM is not set.

This function is not necessary, since MSR_TM_ACTIVE() just do the same,
checking for the TS bits and does not require any TM facility.

This patchset remove every instance of msr_tm_active() and replaced it
by MSR_TM_ACTIVE();

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 arch/powerpc/kernel/process.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 9ef4aea9fffe..6b73d74793c2 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -102,24 +102,18 @@ static void check_if_tm_restore_required(struct task_struct *tsk)
 	}
 }
 
-static inline bool msr_tm_active(unsigned long msr)
-{
-	return MSR_TM_ACTIVE(msr);
-}
-
 static bool tm_active_with_fp(struct task_struct *tsk)
 {
-	return msr_tm_active(tsk->thread.regs->msr) &&
+	return MSR_TM_ACTIVE(tsk->thread.regs->msr) &&
 		(tsk->thread.ckpt_regs.msr & MSR_FP);
 }
 
 static bool tm_active_with_altivec(struct task_struct *tsk)
 {
-	return msr_tm_active(tsk->thread.regs->msr) &&
+	return MSR_TM_ACTIVE(tsk->thread.regs->msr) &&
 		(tsk->thread.ckpt_regs.msr & MSR_VEC);
 }
 #else
-static inline bool msr_tm_active(unsigned long msr) { return false; }
 static inline void check_if_tm_restore_required(struct task_struct *tsk) { }
 static inline bool tm_active_with_fp(struct task_struct *tsk) { return false; }
 static inline bool tm_active_with_altivec(struct task_struct *tsk) { return false; }
@@ -247,7 +241,7 @@ void enable_kernel_fp(void)
 		 * giveup as this would save  to the 'live' structure not the
 		 * checkpointed structure.
 		 */
-		if(!msr_tm_active(cpumsr) && msr_tm_active(current->thread.regs->msr))
+		if(!MSR_TM_ACTIVE(cpumsr) && MSR_TM_ACTIVE(current->thread.regs->msr))
 			return;
 		__giveup_fpu(current);
 	}
@@ -311,7 +305,7 @@ void enable_kernel_altivec(void)
 		 * giveup as this would save  to the 'live' structure not the
 		 * checkpointed structure.
 		 */
-		if(!msr_tm_active(cpumsr) && msr_tm_active(current->thread.regs->msr))
+		if(!MSR_TM_ACTIVE(cpumsr) && MSR_TM_ACTIVE(current->thread.regs->msr))
 			return;
 		__giveup_altivec(current);
 	}
@@ -397,7 +391,7 @@ void enable_kernel_vsx(void)
 		 * giveup as this would save  to the 'live' structure not the
 		 * checkpointed structure.
 		 */
-		if(!msr_tm_active(cpumsr) && msr_tm_active(current->thread.regs->msr))
+		if(!MSR_TM_ACTIVE(cpumsr) && MSR_TM_ACTIVE(current->thread.regs->msr))
 			return;
 		__giveup_vsx(current);
 	}
@@ -530,7 +524,7 @@ void restore_math(struct pt_regs *regs)
 {
 	unsigned long msr;
 
-	if (!msr_tm_active(regs->msr) &&
+	if (!MSR_TM_ACTIVE(regs->msr) &&
 		!current->thread.load_fp && !loadvec(current->thread))
 		return;
 
-- 
2.16.3

^ permalink raw reply related

* [PATCH 2/3] powerpc/tm: Fix HTM documentation
From: Breno Leitao @ 2018-06-15 17:42 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Breno Leitao
In-Reply-To: <1529084262-16372-1-git-send-email-leitao@debian.org>

This patch simply fix part of the documentation on the HTM code.

This fixes reference to old fields that were renamed in commit
000ec280e3dd ("powerpc: tm: Rename transct_(*) to ck(\1)_state").

It also documents better the flow after commit eb5c3f1c8647 ("powerpc:
Always save/restore checkpointed regs during treclaim/trecheckpoint"),
where tm_recheckpoint can recheckpoint what is in ck{fp,vr}_state blindly.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 arch/powerpc/kernel/tm.S    | 10 +++++-----
 arch/powerpc/kernel/traps.c | 15 +++++++++------
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/tm.S b/arch/powerpc/kernel/tm.S
index ff12f47a96b6..019d73053cd3 100644
--- a/arch/powerpc/kernel/tm.S
+++ b/arch/powerpc/kernel/tm.S
@@ -95,9 +95,9 @@ EXPORT_SYMBOL_GPL(tm_abort);
  *		   uint8_t cause)
  *
  *	- Performs a full reclaim.  This destroys outstanding
- *	  transactions and updates thread->regs.tm_ckpt_* with the
- *	  original checkpointed state.  Note that thread->regs is
- *	  unchanged.
+ *	  transactions and updates thread.ckpt_regs, thread.ckfp_state and
+ *	  thread.ckvr_state with the original checkpointed state.  Note that
+ *	  thread->regs is unchanged.
  *
  * Purpose is to both abort transactions of, and preserve the state of,
  * a transactions at a context switch. We preserve/restore both sets of process
@@ -260,7 +260,7 @@ _GLOBAL(tm_reclaim)
 
 	/* Altivec (VEC/VMX/VR)*/
 	addi	r7, r3, THREAD_CKVRSTATE
-	SAVE_32VRS(0, r6, r7)	/* r6 scratch, r7 transact vr state */
+	SAVE_32VRS(0, r6, r7)	/* r6 scratch, r7 ckvr_state */
 	mfvscr	v0
 	li	r6, VRSTATE_VSCR
 	stvx	v0, r7, r6
@@ -271,7 +271,7 @@ _GLOBAL(tm_reclaim)
 
 	/* Floating Point (FP) */
 	addi	r7, r3, THREAD_CKFPSTATE
-	SAVE_32FPRS_VSRS(0, R6, R7)	/* r6 scratch, r7 transact fp state */
+	SAVE_32FPRS_VSRS(0, R6, R7)	/* r6 scratch, r7 ckfp_state */
 	mffs    fr0
 	stfd    fr0,FPSTATE_FPSCR(r7)
 
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 0e17dcb48720..6742b6b3eb37 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1719,16 +1719,19 @@ void fp_unavailable_tm(struct pt_regs *regs)
          * checkpointed FP registers need to be loaded.
 	 */
 	tm_reclaim_current(TM_CAUSE_FAC_UNAV);
-	/* Reclaim didn't save out any FPRs to transact_fprs. */
+
+	/* Reclaim initially saved out bogus (lazy) FPRs to ckfp_state, and
+	 * then it was overwrite by the thr->fp_state by tm_reclaim_thread().
+	 *
+	 * At this point, ck{fp,vr}_state contains the exact values we want to
+	 * recheckpoint.
+	 */
 
 	/* Enable FP for the task: */
 	current->thread.load_fp = 1;
 
-	/* This loads and recheckpoints the FP registers from
-	 * thread.fpr[].  They will remain in registers after the
-	 * checkpoint so we don't need to reload them after.
-	 * If VMX is in use, the VRs now hold checkpointed values,
-	 * so we don't want to load the VRs from the thread_struct.
+	/*
+	 * Recheckpoint all the checkpointed ckpt, ck{fp, vr}_state registers.
 	 */
 	tm_recheckpoint(&current->thread);
 }
-- 
2.16.3

^ permalink raw reply related

* [PATCH 3/3] powerpc/tm: Remove struct thread_info param from tm_reclaim_thread()
From: Breno Leitao @ 2018-06-15 17:42 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Cyril Bur, Breno Leitao
In-Reply-To: <1529084522-17034-1-git-send-email-leitao@debian.org>

From: Cyril Bur <cyrilbur@gmail.com>

tm_reclaim_thread() doesn't use the parameter anymore, both callers have
to bother getting it as they have no need for a struct thread_info
either.

It was previously used but became unused in dc3106690b20 ("powerpc: tm:
Always use fp_state and vr_state to store live registers")

Just remove it and adjust the callers.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 arch/powerpc/kernel/process.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 6b73d74793c2..061a1c4cb3a8 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -860,8 +860,7 @@ static inline bool tm_enabled(struct task_struct *tsk)
 	return tsk && tsk->thread.regs && (tsk->thread.regs->msr & MSR_TM);
 }
 
-static void tm_reclaim_thread(struct thread_struct *thr,
-			      struct thread_info *ti, uint8_t cause)
+static void tm_reclaim_thread(struct thread_struct *thr, uint8_t cause)
 {
 	/*
 	 * Use the current MSR TM suspended bit to track if we have
@@ -908,7 +907,7 @@ static void tm_reclaim_thread(struct thread_struct *thr,
 void tm_reclaim_current(uint8_t cause)
 {
 	tm_enable();
-	tm_reclaim_thread(&current->thread, current_thread_info(), cause);
+	tm_reclaim_thread(&current->thread, cause);
 }
 
 static inline void tm_reclaim_task(struct task_struct *tsk)
@@ -939,7 +938,7 @@ static inline void tm_reclaim_task(struct task_struct *tsk)
 		 thr->regs->ccr, thr->regs->msr,
 		 thr->regs->trap);
 
-	tm_reclaim_thread(thr, task_thread_info(tsk), TM_CAUSE_RESCHED);
+	tm_reclaim_thread(thr, TM_CAUSE_RESCHED);
 
 	TM_DEBUG("--- tm_reclaim on pid %d complete\n",
 		 tsk->pid);
-- 
2.16.3

^ permalink raw reply related

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Christoph Hellwig @ 2018-06-15  9:16 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christoph Hellwig, Ram Pai, Michael S. Tsirkin, robh, pawel.moll,
	Tom Lendacky, aik, jasowang, cohuck, linux-kernel, virtualization,
	joe, Rustad, Mark D, david, linuxppc-dev, elfring,
	Anshuman Khandual
In-Reply-To: <10bbd7122aaa67f51de7a8328df8154212a13f23.camel@kernel.crashing.org>

On Wed, Jun 13, 2018 at 11:11:01PM +1000, Benjamin Herrenschmidt wrote:
> Actually ... the stuff in lib/dma-direct.c seems to be just it, no ?
> 
> There's no cache flushing and there's no architecture hooks that I can
> see other than the AMD security stuff which is probably fine.
> 
> Or am I missing something ?

You are missing the __phys_to_dma arch hook that allows architectures
to adjust the dma address.  Various systems have offsets, or even
multiple banks with different offsets there.  Most of them don't
use the dma-direct code yet (working on it), but there are a few
examples in the tree already.

^ permalink raw reply

* [PATCH] ipmi/powernv: Fix spurious warnings at boot
From: William A. Kennington III @ 2018-06-15 18:33 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: William A. Kennington III

Sometimes we have stale messages showing up in the recv queue that are
being processed by the pollers. We don't want to print out warnings for
these messages not having an outstanding request as they can be expected
when doing a kexec from the petitroot environment or from another
running kernel.

Signed-off-by: William A. Kennington III <wak@google.com>
---
 drivers/char/ipmi/ipmi_powernv.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/char/ipmi/ipmi_powernv.c b/drivers/char/ipmi/ipmi_powernv.c
index e96500372ce20..36b07f26cebd9 100644
--- a/drivers/char/ipmi/ipmi_powernv.c
+++ b/drivers/char/ipmi/ipmi_powernv.c
@@ -31,6 +31,12 @@ struct ipmi_smi_powernv {
 	spinlock_t		msg_lock;
 	struct ipmi_smi_msg	*cur_msg;
 	struct opal_ipmi_msg	*opal_msg;
+
+	/**
+	 * Marker denoting if we should be draining the ipmi queue of any
+	 * outstanding messages
+	 */
+	bool			in_drain;
 };
 
 static int ipmi_powernv_start_processing(void *send_info, ipmi_smi_t intf)
@@ -96,6 +102,7 @@ static void ipmi_powernv_send(void *send_info, struct ipmi_smi_msg *msg)
 
 	if (!rc) {
 		smi->cur_msg = msg;
+		smi->in_drain = false;
 		spin_unlock_irqrestore(&smi->msg_lock, flags);
 		return;
 	}
@@ -121,8 +128,14 @@ static int ipmi_powernv_recv(struct ipmi_smi_powernv *smi)
 	spin_lock_irqsave(&smi->msg_lock, flags);
 
 	if (!smi->cur_msg) {
+		bool in_drain = READ_ONCE(smi->in_drain);
 		spin_unlock_irqrestore(&smi->msg_lock, flags);
-		pr_warn("no current message?\n");
+		/**
+		 * We don't want to print spurious errors if we are draining
+		 * leftover messages prior to sending our first message.
+		 */
+		if (!in_drain)
+			pr_warn("no current message?\n");
 		return 0;
 	}
 
@@ -226,6 +239,11 @@ static int ipmi_powernv_probe(struct platform_device *pdev)
 
 	spin_lock_init(&ipmi->msg_lock);
 
+	/* Our channel may have stale messages from a previous kernel
+	 * let the recv code know we haven't actually sent anything yet
+	 */
+	ipmi->in_drain = true;
+
 	rc = of_property_read_u32(dev->of_node, "ibm,ipmi-interface-id",
 			&prop);
 	if (rc) {
-- 
2.18.0.rc1.244.gcf134e6275-goog

^ permalink raw reply related

* Re: [PATCH 1/3] powerpc/tm: Remove msr_tm_active()
From: kbuild test robot @ 2018-06-15 20:06 UTC (permalink / raw)
  To: Breno Leitao; +Cc: kbuild-all, linuxppc-dev, Breno Leitao
In-Reply-To: <1529084262-16372-1-git-send-email-leitao@debian.org>

[-- Attachment #1: Type: text/plain, Size: 7752 bytes --]

Hi Breno,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.17 next-20180615]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Breno-Leitao/powerpc-tm-Remove-msr_tm_active/20180616-015124
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-mpc8272_ads_defconfig (attached as .config)
compiler: powerpc-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=powerpc 

All error/warnings (new ones prefixed by >>):

   In file included from arch/powerpc/include/asm/processor.h:13:0,
                    from arch/powerpc/include/asm/thread_info.h:26,
                    from include/linux/thread_info.h:38,
                    from arch/powerpc/include/asm/ptrace.h:158,
                    from arch/powerpc/include/asm/hw_irq.h:12,
                    from arch/powerpc/include/asm/irqflags.h:12,
                    from include/linux/irqflags.h:16,
                    from include/asm-generic/cmpxchg-local.h:6,
                    from arch/powerpc/include/asm/cmpxchg.h:537,
                    from arch/powerpc/include/asm/atomic.h:11,
                    from include/linux/atomic.h:5,
                    from include/linux/rcupdate.h:38,
                    from include/linux/rculist.h:11,
                    from include/linux/pid.h:5,
                    from include/linux/sched.h:14,
                    from arch/powerpc/kernel/process.c:18:
   arch/powerpc/kernel/process.c: In function 'enable_kernel_fp':
>> arch/powerpc/include/asm/reg.h:65:23: error: left shift count >= width of type [-Werror=shift-count-overflow]
    #define __MASK(X) (1UL<<(X))
                          ^
>> arch/powerpc/include/asm/reg.h:117:18: note: in expansion of macro '__MASK'
    #define MSR_TS_T __MASK(MSR_TS_T_LG) /*  Transaction Transactional */
                     ^~~~~~
>> arch/powerpc/include/asm/reg.h:118:22: note: in expansion of macro 'MSR_TS_T'
    #define MSR_TS_MASK (MSR_TS_T | MSR_TS_S)   /* Transaction State bits */
                         ^~~~~~~~
>> arch/powerpc/include/asm/reg.h:119:34: note: in expansion of macro 'MSR_TS_MASK'
    #define MSR_TM_ACTIVE(x) (((x) & MSR_TS_MASK) != 0) /* Transaction active? */
                                     ^~~~~~~~~~~
>> arch/powerpc/kernel/process.c:244:7: note: in expansion of macro 'MSR_TM_ACTIVE'
      if(!MSR_TM_ACTIVE(cpumsr) && MSR_TM_ACTIVE(current->thread.regs->msr))
          ^~~~~~~~~~~~~
>> arch/powerpc/include/asm/reg.h:65:23: error: left shift count >= width of type [-Werror=shift-count-overflow]
    #define __MASK(X) (1UL<<(X))
                          ^
   arch/powerpc/include/asm/reg.h:116:18: note: in expansion of macro '__MASK'
    #define MSR_TS_S __MASK(MSR_TS_S_LG) /*  Transaction Suspended */
                     ^~~~~~
>> arch/powerpc/include/asm/reg.h:118:33: note: in expansion of macro 'MSR_TS_S'
    #define MSR_TS_MASK (MSR_TS_T | MSR_TS_S)   /* Transaction State bits */
                                    ^~~~~~~~
>> arch/powerpc/include/asm/reg.h:119:34: note: in expansion of macro 'MSR_TS_MASK'
    #define MSR_TM_ACTIVE(x) (((x) & MSR_TS_MASK) != 0) /* Transaction active? */
                                     ^~~~~~~~~~~
>> arch/powerpc/kernel/process.c:244:7: note: in expansion of macro 'MSR_TM_ACTIVE'
      if(!MSR_TM_ACTIVE(cpumsr) && MSR_TM_ACTIVE(current->thread.regs->msr))
          ^~~~~~~~~~~~~
>> arch/powerpc/include/asm/reg.h:65:23: error: left shift count >= width of type [-Werror=shift-count-overflow]
    #define __MASK(X) (1UL<<(X))
                          ^
>> arch/powerpc/include/asm/reg.h:117:18: note: in expansion of macro '__MASK'
    #define MSR_TS_T __MASK(MSR_TS_T_LG) /*  Transaction Transactional */
                     ^~~~~~
>> arch/powerpc/include/asm/reg.h:118:22: note: in expansion of macro 'MSR_TS_T'
    #define MSR_TS_MASK (MSR_TS_T | MSR_TS_S)   /* Transaction State bits */
                         ^~~~~~~~
>> arch/powerpc/include/asm/reg.h:119:34: note: in expansion of macro 'MSR_TS_MASK'
    #define MSR_TM_ACTIVE(x) (((x) & MSR_TS_MASK) != 0) /* Transaction active? */
                                     ^~~~~~~~~~~
   arch/powerpc/kernel/process.c:244:32: note: in expansion of macro 'MSR_TM_ACTIVE'
      if(!MSR_TM_ACTIVE(cpumsr) && MSR_TM_ACTIVE(current->thread.regs->msr))
                                   ^~~~~~~~~~~~~
>> arch/powerpc/include/asm/reg.h:65:23: error: left shift count >= width of type [-Werror=shift-count-overflow]
    #define __MASK(X) (1UL<<(X))
                          ^
   arch/powerpc/include/asm/reg.h:116:18: note: in expansion of macro '__MASK'
    #define MSR_TS_S __MASK(MSR_TS_S_LG) /*  Transaction Suspended */
                     ^~~~~~
>> arch/powerpc/include/asm/reg.h:118:33: note: in expansion of macro 'MSR_TS_S'
    #define MSR_TS_MASK (MSR_TS_T | MSR_TS_S)   /* Transaction State bits */
                                    ^~~~~~~~
>> arch/powerpc/include/asm/reg.h:119:34: note: in expansion of macro 'MSR_TS_MASK'
    #define MSR_TM_ACTIVE(x) (((x) & MSR_TS_MASK) != 0) /* Transaction active? */
                                     ^~~~~~~~~~~
   arch/powerpc/kernel/process.c:244:32: note: in expansion of macro 'MSR_TM_ACTIVE'
      if(!MSR_TM_ACTIVE(cpumsr) && MSR_TM_ACTIVE(current->thread.regs->msr))
                                   ^~~~~~~~~~~~~
   arch/powerpc/kernel/process.c: In function 'restore_math':
>> arch/powerpc/include/asm/reg.h:65:23: error: left shift count >= width of type [-Werror=shift-count-overflow]
    #define __MASK(X) (1UL<<(X))
                          ^
>> arch/powerpc/include/asm/reg.h:117:18: note: in expansion of macro '__MASK'
    #define MSR_TS_T __MASK(MSR_TS_T_LG) /*  Transaction Transactional */
                     ^~~~~~
>> arch/powerpc/include/asm/reg.h:118:22: note: in expansion of macro 'MSR_TS_T'
    #define MSR_TS_MASK (MSR_TS_T | MSR_TS_S)   /* Transaction State bits */
                         ^~~~~~~~
>> arch/powerpc/include/asm/reg.h:119:34: note: in expansion of macro 'MSR_TS_MASK'
    #define MSR_TM_ACTIVE(x) (((x) & MSR_TS_MASK) != 0) /* Transaction active? */
                                     ^~~~~~~~~~~
   arch/powerpc/kernel/process.c:527:7: note: in expansion of macro 'MSR_TM_ACTIVE'
     if (!MSR_TM_ACTIVE(regs->msr) &&
          ^~~~~~~~~~~~~

vim +/MSR_TM_ACTIVE +244 arch/powerpc/kernel/process.c

   226	
   227	void enable_kernel_fp(void)
   228	{
   229		unsigned long cpumsr;
   230	
   231		WARN_ON(preemptible());
   232	
   233		cpumsr = msr_check_and_set(MSR_FP);
   234	
   235		if (current->thread.regs && (current->thread.regs->msr & MSR_FP)) {
   236			check_if_tm_restore_required(current);
   237			/*
   238			 * If a thread has already been reclaimed then the
   239			 * checkpointed registers are on the CPU but have definitely
   240			 * been saved by the reclaim code. Don't need to and *cannot*
   241			 * giveup as this would save  to the 'live' structure not the
   242			 * checkpointed structure.
   243			 */
 > 244			if(!MSR_TM_ACTIVE(cpumsr) && MSR_TM_ACTIVE(current->thread.regs->msr))
   245				return;
   246			__giveup_fpu(current);
   247		}
   248	}
   249	EXPORT_SYMBOL(enable_kernel_fp);
   250	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 12632 bytes --]

^ permalink raw reply

* Re: [RFC PATCH 03/23] genirq: Introduce IRQF_DELIVER_AS_NMI
From: Ricardo Neri @ 2018-06-16  0:39 UTC (permalink / raw)
  To: Julien Thierry
  Cc: Marc Zyngier, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Ashok Raj, Borislav Petkov, Tony Luck,
	Ravi V. Shankar, x86, sparclinux, linuxppc-dev, linux-kernel,
	Jacob Pan, Daniel Lezcano, Andrew Morton,
	Levin, Alexander (Sasha Levin), Randy Dunlap, Masami Hiramatsu,
	Bartosz Golaszewski, Doug Berger, Palmer Dabbelt, iommu
In-Reply-To: <4eb34b18-11f8-7d70-46a5-f206d127b768@arm.com>

On Fri, Jun 15, 2018 at 09:01:02AM +0100, Julien Thierry wrote:
> Hi Ricardo,
> 
> On 15/06/18 03:12, Ricardo Neri wrote:
> >On Wed, Jun 13, 2018 at 11:06:25AM +0100, Marc Zyngier wrote:
> >>On 13/06/18 10:20, Thomas Gleixner wrote:
> >>>On Wed, 13 Jun 2018, Julien Thierry wrote:
> >>>>On 13/06/18 09:34, Peter Zijlstra wrote:
> >>>>>On Tue, Jun 12, 2018 at 05:57:23PM -0700, Ricardo Neri wrote:
> >>>>>>diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> >>>>>>index 5426627..dbc5e02 100644
> >>>>>>--- a/include/linux/interrupt.h
> >>>>>>+++ b/include/linux/interrupt.h
> >>>>>>@@ -61,6 +61,8 @@
> >>>>>>    *                interrupt handler after suspending interrupts. For
> >>>>>>system
> >>>>>>    *                wakeup devices users need to implement wakeup
> >>>>>>detection in
> >>>>>>    *                their interrupt handlers.
> >>>>>>+ * IRQF_DELIVER_AS_NMI - Configure interrupt to be delivered as
> >>>>>>non-maskable, if
> >>>>>>+ *                supported by the chip.
> >>>>>>    */
> >>>>>
> >>>>>NAK on the first 6 patches. You really _REALLY_ don't want to expose
> >>>>>NMIs to this level.
> >>>>>
> >>>>
> >>>>I've been working on something similar on arm64 side, and effectively the one
> >>>>thing that might be common to arm64 and intel is the interface to set an
> >>>>interrupt as NMI. So I guess it would be nice to agree on the right approach
> >>>>for this.
> >>>>
> >>>>The way I did it was by introducing a new irq_state and let the irqchip driver
> >>>>handle most of the work (if it supports that state):
> >>>>
> >>>>https://lkml.org/lkml/2018/5/25/181
> >>>>
> >>>>This has not been ACKed nor NAKed. So I am just asking whether this is a more
> >>>>suitable approach, and if not, is there any suggestions on how to do this?
> >>>
> >>>I really didn't pay attention to that as it's burried in the GIC/ARM series
> >>>which is usually Marc's playground.
> >>
> >>I'm working my way through it ATM now that I have some brain cycles back.
> >>
> >>>Adding NMI delivery support at low level architecture irq chip level is
> >>>perfectly fine, but the exposure of that needs to be restricted very
> >>>much. Adding it to the generic interrupt control interfaces is not going to
> >>>happen. That's doomed to begin with and a complete abuse of the interface
> >>>as the handler can not ever be used for that.
> >>
> >>I can only agree with that. Allowing random driver to use request_irq()
> >>to make anything an NMI ultimately turns it into a complete mess ("hey,
> >>NMI is *faster*, let's use that"), and a potential source of horrible
> >>deadlocks.
> >>
> >>What I'd find more palatable is a way for an irqchip to be able to
> >>prioritize some interrupts based on a set of architecturally-defined
> >>requirements, and a separate NMI requesting/handling framework that is
> >>separate from the IRQ API, as the overall requirements are likely to
> >>completely different.
> >>
> >>It shouldn't have to be nearly as complex as the IRQ API, and require
> >>much stricter requirements in terms of what you can do there (flow
> >>handling should definitely be different).
> >
> >Marc, Julien, do you plan to actively work on this? Would you mind keeping
> >me in the loop? I also need this work for this watchdog. In the meantime,
> >I will go through Julien's patches and try to adapt it to my work.
> 
> We are going to work on this and of course your input is most welcome to
> make sure we have an interface usable across different architectures.

Great! Thanks! I will keep an eye to future version of your "arm64: provide
pseudo NMI with GICv3" series.
> 
> In my patches, I'm not sure there is much to adapt to your work as most of
> it is arch specific (although I wont say no to another pair of eyes looking
> at them). From what I've seen of your patches, the point where we converge
> is that need for some code to be able to tell the irqchip "I want that
> particular interrupt line to be treated/setup as an NMI".

Indeed, there has to be a generic way for the irqchip to announce that it
supports configuring an interrupt as NMI... and a way to actually configuring
it.

> 
> We'll make sure to keep you in the loop for discussions/suggestions on this.

Thank you!

Thanks and BR,
Ricardo

^ permalink raw reply

* Re: [RFC PATCH 20/23] watchdog/hardlockup/hpet: Rotate interrupt among all monitored CPUs
From: Ricardo Neri @ 2018-06-16  0:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, H. Peter Anvin, Andi Kleen, Ashok Raj,
	Borislav Petkov, Tony Luck, Ravi V. Shankar, x86, sparclinux,
	linuxppc-dev, linux-kernel, Jacob Pan, Rafael J. Wysocki,
	Don Zickus, Nicholas Piggin, Michael Ellerman,
	Frederic Weisbecker, Alexei Starovoitov, Babu Moger,
	Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
	Andrew Morton, Philippe Ombredanne, Colin Ian King,
	Byungchul Park, Paul E. McKenney, Luis R. Rodriguez, Waiman Long,
	Josh Poimboeuf, Randy Dunlap, Davidlohr Bueso, Christoffer Dall,
	Marc Zyngier, Kai-Heng Feng, Konrad Rzeszutek Wilk,
	David Rientjes, iommu
In-Reply-To: <alpine.DEB.2.21.1806151122070.2079@nanos.tec.linutronix.de>

On Fri, Jun 15, 2018 at 12:29:06PM +0200, Thomas Gleixner wrote:
> On Thu, 14 Jun 2018, Ricardo Neri wrote:
> > On Wed, Jun 13, 2018 at 11:48:09AM +0200, Thomas Gleixner wrote:
> > > On Tue, 12 Jun 2018, Ricardo Neri wrote:
> > > > +	/* There are no CPUs to monitor. */
> > > > +	if (!cpumask_weight(&hdata->monitored_mask))
> > > > +		return NMI_HANDLED;
> > > > +
> > > >  	inspect_for_hardlockups(regs);
> > > >  
> > > > +	/*
> > > > +	 * Target a new CPU. Keep trying until we find a monitored CPU. CPUs
> > > > +	 * are addded and removed to this mask at cpu_up() and cpu_down(),
> > > > +	 * respectively. Thus, the interrupt should be able to be moved to
> > > > +	 * the next monitored CPU.
> > > > +	 */
> > > > +	spin_lock(&hld_data->lock);
> > > 
> > > Yuck. Taking a spinlock from NMI ...
> > 
> > I am sorry. I will look into other options for locking. Do you think rcu_lock
> > would help in this case? I need this locking because the CPUs being monitored
> > changes as CPUs come online and offline.
> 
> Sure, but you _cannot_ take any locks in NMI context which are also taken
> in !NMI context. And RCU will not help either. How so? The NMI can hit
> exactly before the CPU bit is cleared and then the CPU goes down. So RCU
> _cannot_ protect anything.
> 
> All you can do there is make sure that the TIMn_CONF is only ever accessed
> in !NMI code. Then you can stop the timer _before_ a CPU goes down and make
> sure that the eventually on the fly NMI is finished. After that you can
> fiddle with the CPU mask and restart the timer. Be aware that this is going
> to be more corner case handling that actual functionality.

Thanks for the suggestion. It makes sense to stop the timer when updating the
CPU mask. In this manner the timer will not cause any NMI.
> 
> > > > +	for_each_cpu_wrap(cpu, &hdata->monitored_mask, smp_processor_id() + 1) {
> > > > +		if (!irq_set_affinity(hld_data->irq, cpumask_of(cpu)))
> > > > +			break;
> > > 
> > > ... and then calling into generic interrupt code which will take even more
> > > locks is completely broken.
> > 
> > I will into reworking how the destination of the interrupt is set.
> 
> You have to consider two cases:
> 
>  1) !remapped mode:
> 
>     That's reasonably simple because you just have to deal with the HPET
>     TIMERn_PROCMSG_ROUT register. But then you need to do this directly and
>     not through any of the existing interrupt facilities.

Indeed, there is no need to use the generic interrupt faciities to set affinity;
I am dealing with an NMI anyways.
> 
>  2) remapped mode:
> 
>     That's way more complex as you _cannot_ ever do anything which touches
>     the IOMMU and the related tables.
> 
>     So you'd need to reserve an IOMMU remapping entry for each CPU upfront,
>     store the resulting value for the HPET TIMERn_PROCMSG_ROUT register in
>     per cpu storage and just modify that one from NMI.
> 
>     Though there might be subtle side effects involved, which are related to
>     the acknowledge part. You need to talk to the IOMMU wizards first.

I see. I will look into the code and prototype something that makes sense for
the IOMMU maintainers.

> 
> All in all, the idea itself is interesting, but the envisioned approach of
> round robin and no fast accessible NMI reason detection is going to create
> more problems than it solves.

I see it more clearly now.

Thanks and BR,
Ricardo

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox