LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface
From: Michael Ellerman @ 2013-10-01  7:51 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linuxppc-dev, Joerg Roedel, x86@kernel.org,
	linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org,
	Jan Beulich, linux-pci@vger.kernel.org, Tejun Heo, Bjorn Helgaas,
	Ingo Molnar
In-Reply-To: <20130918094759.GA2353@dhcp-26-207.brq.redhat.com>

On Wed, Sep 18, 2013 at 11:48:00AM +0200, Alexander Gordeev wrote:
> On Wed, Sep 18, 2013 at 12:30:23AM +1000, Michael Ellerman wrote:
> > How about no?
> > 
> > We have a small number of MSIs available, limited by hardware &
> > firmware, if we don't impose a quota then the first device that probes
> > will get most/all of the MSIs and other devices miss out.
> 
> Out of curiosity - how pSeries has had done it without quotas before
> 448e2ca ("powerpc/pseries: Implement a quota system for MSIs")?
> 
> > Anyway I don't see what problem you're trying to solve? I agree the
> > -ve/0/+ve return value pattern is ugly, but it's hardly the end of the
> > world.
> 
> Well, the interface recently has been re-classified from "ugly" to
> "unnecessarily complex and actively harmful" in Tejun's words ;)
> 
> Indeed, I checked most of the drivers and it is incredible how people
> are creative in misusing the interface: from innocent pci_disable_msix()
> calls when if pci_enable_msix() failed to assuming MSI-Xs were enabled
> if pci_enable_msix() returned a positive value (apparently untested).

OK, but we have the source to the drivers, we could just fix them.

We could even add:

  pci_enable_msix_i_am_stupid()

Which swallows the positive return and just gives back -ve/0.

> Roughly third of the drivers just do not care and bail out once
> pci_enable_msix() has not succeeded. Not sure how many of these are
> mandated by the hardware.

Sure, that's fine if those drivers do that, it's up to the drivers after
all.

> Another quite common pattern is a call to pci_enable_msix() to figure out
> the number of MSI-Xs available and a repeated call of pci_enable_msix()
> to enable those MSI-Xs, this time.

Also fine, though as the documentation suggests a loop is the best
construct rather than two explicit calls.

> The recommended practice would be:
> 
> 	/*
> 	 * Retrieving 'nvec' by means other than pci_msix_table_size()
> 	 */
> 
> 	rc = pci_get_msix_limit(pdev);
> 	if (rc < 0)
> 		return rc;
> 
> 	/*
> 	 * nvec = min(rc, nvec);
> 	 */
> 
> 	for (i = 0; i < nvec; i++)
> 		msix_entry[i].entry = i;
> 
> 	rc = pci_enable_msix(pdev, msix_entry, nvec);
> 	if (rc)
> 		return rc;
> 
> Thoughts?

We could probably make that work.

The disadvantage is that any restriction imposed on us above the quota
can only be reported as an error from pci_enable_msix().

The quota code, called from pci_get_msix_limit(), can only do so much to
interogate firmware about the limitations. The ultimate way to check if
firmware will give us enough MSIs is to try and allocate them. But we
can't do that from pci_get_msix_limit() because the driver is not asking
us to enable MSIs, just query them.

You'll also need to add another arch hook, for the quota check, and
we'll have to add it to our per-platform indirection as well.

All a lot of bother for no real gain IMHO.

cheers

^ permalink raw reply

* Re: [PATCH V2] powerpc/kernel/sysfs: disable writing to purr in non-powernv
From: Benjamin Herrenschmidt @ 2013-10-01  7:50 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: Madhavan Srinivasan, linuxppc-dev
In-Reply-To: <20131001063129.GF17966@concordia>

On Tue, 2013-10-01 at 16:31 +1000, Michael Ellerman wrote:

> > 1)Changed the test for to hypervisor mode instead of platform
> 
> I think Ben's wrong about that.
> 
> Almost all existing code uses FW_FEATURE_LPAR to differentiate
> hypervisor vs guest mode, so I think we should do the same here.

I didn't object to using the FW test, it's a reasonable way to do it, I
objected to using the platform as an indication (powernv vs. pseries)

Ben.

> So it would be:
> 
> > +	if (cpu_has_feature(CPU_FTR_PURR)) {
> > +		if (!firmware_has_feature(FW_FEATURE_LPAR))
> > +			add_write_permission_dev_attr((void *)&dev_attr_purr);
> >  		device_create_file(s, &dev_attr_purr);
> > +	}
> 
> 
> > +static void add_write_permission_dev_attr(void *ptr)
> > +{
> > +	struct device_attribute *attr = (struct device_attribute *)ptr;
> > +
> > +	attr->attr.mode |= (unsigned short) 0200;
> > +}
> 
> Why does it take a void *, which then requires a cast at the call site?
> 
> And do you need the cast to short? If so shouldn't you use umode_t
> directly?
> 
> cheers

^ permalink raw reply

* Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface
From: Michael Ellerman @ 2013-10-01  7:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Joerg Roedel, x86@kernel.org, linux-kernel@vger.kernel.org,
	linux-ide@vger.kernel.org, Alexander Gordeev, Jan Beulich,
	linux-pci@vger.kernel.org, Bjorn Helgaas, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <20130918142231.GA21650@mtj.dyndns.org>

On Wed, Sep 18, 2013 at 09:22:31AM -0500, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 18, 2013 at 11:48:00AM +0200, Alexander Gordeev wrote:
> > On Wed, Sep 18, 2013 at 12:30:23AM +1000, Michael Ellerman wrote:
> > > How about no?
> > > 
> > > We have a small number of MSIs available, limited by hardware &
> > > firmware, if we don't impose a quota then the first device that probes
> > > will get most/all of the MSIs and other devices miss out.
> > 
> > Out of curiosity - how pSeries has had done it without quotas before
> > 448e2ca ("powerpc/pseries: Implement a quota system for MSIs")?
> 
> Hmmm... do we need to treat this any differently?  If the platform
> can't allocate full range of requested MSIs, just failing should be
> enough regardless of why such allocation can't be met, no?
> 
> > > Anyway I don't see what problem you're trying to solve? I agree the
> > > -ve/0/+ve return value pattern is ugly, but it's hardly the end of the
> > > world.
> > 
> > Well, the interface recently has been re-classified from "ugly" to
> > "unnecessarily complex and actively harmful" in Tejun's words ;)
> 
> LOL. :)
> 
> > Indeed, I checked most of the drivers and it is incredible how people
> > are creative in misusing the interface: from innocent pci_disable_msix()
> > calls when if pci_enable_msix() failed to assuming MSI-Xs were enabled
> > if pci_enable_msix() returned a positive value (apparently untested).
> > 
> > Roughly third of the drivers just do not care and bail out once
> > pci_enable_msix() has not succeeded. Not sure how many of these are
> > mandated by the hardware.
> 
> Yeah, I mean, this type of interface is a trap.  People have to
> actively resist to avoid doing silly stuff which is a lot to ask.

I really think you're overstating the complexity here.

Functions typically return a boolean   -> nothing to see here
This function returns a tristate value -> brain explosion!


> > 	/*
> > 	 * Retrieving 'nvec' by means other than pci_msix_table_size()
> > 	 */
> > 
> > 	rc = pci_get_msix_limit(pdev);
> > 	if (rc < 0)
> > 		return rc;
> > 
> > 	/*
> > 	 * nvec = min(rc, nvec);
> > 	 */
> > 
> > 	for (i = 0; i < nvec; i++)
> > 		msix_entry[i].entry = i;
> > 
> > 	rc = pci_enable_msix(pdev, msix_entry, nvec);
> > 	if (rc)
> > 		return rc;
> 
> I really think what we should do is
> 
> * Determine the number of MSIs the controller wants.  Don't worry
>   about quotas or limits or anything.  Just determine the number
>   necessary to enable enhanced interrupt handling.
> 	
> * Try allocating that number of MSIs.  If it fails, then just revert
>   to single interrupt mode.  It's not the end of the world and mostly
>   guaranteed to work.  Let's please not even try to do partial
>   multiple interrupts.  I really don't think it's worth the risk or
>   complexity.

It will potentially break existing setups on our hardware.

Can I make that any clearer?

cheers

^ permalink raw reply

* Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface
From: Michael Ellerman @ 2013-10-01  7:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Joerg Roedel, x86@kernel.org, linux-kernel@vger.kernel.org,
	linux-ide@vger.kernel.org, Alexander Gordeev, Jan Beulich,
	linux-pci@vger.kernel.org, Bjorn Helgaas, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <20130920122603.GC7630@mtj.dyndns.org>

On Fri, Sep 20, 2013 at 07:26:03AM -0500, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 18, 2013 at 06:50:45PM +0200, Alexander Gordeev wrote:
> > Actually, I do not see much contradiction with what I proposed. The
> > key words here "determine the number of MSIs the controller wants".
> > 
> > In general case it is not what pci_msix_table_size() returns (or at
> > least we should not limit ourselves to it) - there could be non-
> > standard means to report number of MSIs: hardcoded, version-dependant,
> > device-specific registers etc.
> > 
> > Next, if we opt to determine the number of MSIs by non-MSI standard
> > means then there is no reason not to call pci_get_msix_limit() (or
> > whatever) at this step.
> 
> Yeah, that's all fine.  My point is that we shouldn't try to use
> "degraded" multiple MSI mode where the number of MSIs allocated is
> smaller than performing full multiple MSI operation.  How that number
> is determined doesn't really matter but that number is a property
> which is solely decided by the device driver, right?  If a device
> needs full multiple MSI mode, given specific configuration, it needs
> >= X number of MSIs and that's the number it should request.

Sure, the driver is in full control. If it can ONLY work with N MSIs
then it should try for N, else fallback to 1.

But some drivers are able to work with a range of values for N, and
performance is improved vs using a single MSI.

> > Being Captain Obvious here, but it is up to the device driver to handle
> > a failure. There could be no such option as single MSI mode after all :)
> 
> I don't think there actually is a mainstream device which can't
> fallback to single interrupt.  Anyways, the point is the same, let's
> please not try to create an interface which encourages complex retry
> logic in its users which are likely to involve less traveled and
> tested paths in both the driver and firmware.

Why support > 1 MSI at all? It just adds complex logic and less travelled
paths in the driver and firmware.

cheers

^ permalink raw reply

* [PATCH] Revert "powerpc: 52xx: provide a default in mpc52xx_irqhost_map()"
From: Wolfram Sang @ 2013-10-01  7:26 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Sebastian Andrzej Siewior, Anatolij Gustschin, linux-rt-users,
	Wolfram Sang

This reverts commit 6391f697d4892a6f233501beea553e13f7745a23. The
compiler warning it wants to fix does not appear with my gcc 4.6.2. IMO
we don't need superfluous (and here even misleading) code to make old
compilers happy. Fixing the printout was bogus, too. We want to know
WHICH critical irq failed, not which level it had.

Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Anatolij Gustschin <agust@denx.de>
---

Have I been on CC when the original patch was sent?

 arch/powerpc/platforms/52xx/mpc52xx_pic.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/52xx/mpc52xx_pic.c b/arch/powerpc/platforms/52xx/mpc52xx_pic.c
index b69221b..b89ef65 100644
--- a/arch/powerpc/platforms/52xx/mpc52xx_pic.c
+++ b/arch/powerpc/platforms/52xx/mpc52xx_pic.c
@@ -373,9 +373,8 @@ static int mpc52xx_irqhost_map(struct irq_domain *h, unsigned int virq,
 	case MPC52xx_IRQ_L1_PERP: irqchip = &mpc52xx_periph_irqchip; break;
 	case MPC52xx_IRQ_L1_SDMA: irqchip = &mpc52xx_sdma_irqchip; break;
 	case MPC52xx_IRQ_L1_CRIT:
-	default:
 		pr_warn("%s: Critical IRQ #%d is unsupported! Nopping it.\n",
-			__func__, l1irq);
+			__func__, l2irq);
 		irq_set_chip(virq, &no_irq_chip);
 		return 0;
 	}
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface
From: Michael Ellerman @ 2013-10-01  7:19 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-ide@vger.kernel.org, linux-pci@vger.kernel.org,
	Joerg Roedel, x86@kernel.org, linux-kernel@vger.kernel.org,
	Bjorn Helgaas, Jan Beulich, Tejun Heo, linuxppc-dev, Ingo Molnar
In-Reply-To: <20130926143901.GE16774@dhcp-26-207.brq.redhat.com>

On Thu, Sep 26, 2013 at 04:39:02PM +0200, Alexander Gordeev wrote:
> On Thu, Sep 26, 2013 at 09:11:47AM -0400, Tejun Heo wrote:
> > > Because otherwise we will re-introduce a problem described by Michael:
> > > "We have a small number of MSIs available, limited by hardware &
> > > firmware, if we don't impose a quota then the first device that probes
> > > will get most/all of the MSIs and other devices miss out."
> > 
> > Still not following.  Why wouldn't just letting the drivers request
> > the optimal number they want and falling back to single interrupt mode
> > work?  ie. why can't we just have an all or nothing interface?
> 
> I can imagine a scenario where the first device probes in, requests its
> optimal number, acquires that number and exhausts MSIs in pSeries firmware.
> The next few devices possibly end up with single MSI, since no MSIs left
> to satisfy their optimal numbers. If one of those single-MSI'ed devices
> happened to be a high-performance HBA hitting a degraded performance that
> alone would force (IBM) to introduce the quotas.

Yes that's exactly the scenario, and I didn't imagine it, our test
people actually hit it and yelled at me.

I don't remember exactly which adapters it was, I might be able to find
the details if I looked hard, a quick search through my mail archive
didn't find it - it might have come in via irc / bugzilla etc.

cheers

^ permalink raw reply

* Re: [PATCH 2/2] iommu: Update platform initialisation of iommu to use it_page_shift
From: Alistair Popple @ 2013-10-01  7:13 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <20131001041228.GB17966@concordia>

On Tue, 1 Oct 2013 14:12:29 Michael Ellerman wrote:
> On Tue, Oct 01, 2013 at 01:54:10PM +1000, Alistair Popple wrote:
> > This patch initialises the iommu page size used for vio, cell, powernv
> > and pseries platforms to 4K.  It has been boot tested on a pseries
> > machine with vio.
> 
> This patch fixes the build errors introduced by the previous patch
> right?
> 
> In which case you need to rework it so that you introduce the new
> constants, use them everywhere, before removing the old constants. ie.
> the build should never break.

Thanks, I'll rework and resubmit.

Alistair

> cheers
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

* [PATCH] Restore registers on error exit from csum_partial_copy_generic()
From: Anton Blanchard @ 2013-10-01  7:11 UTC (permalink / raw)
  To: benh, paulus, paulmck; +Cc: linuxppc-dev

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The csum_partial_copy_generic() function saves the PowerPC non-volatile
r14, r15, and r16 registers for the main checksum-and-copy loop.
Unfortunately, it fails to restore them upon error exit from this loop,
which results in silent corruption of these registers in the presumably
rare event of an access exception within that loop.

This commit therefore restores these register on error exit from the loop.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org
---

Index: b/arch/powerpc/lib/checksum_64.S
===================================================================
--- a/arch/powerpc/lib/checksum_64.S
+++ b/arch/powerpc/lib/checksum_64.S
@@ -226,19 +226,35 @@ _GLOBAL(csum_partial)
 	blr
 
 
-	.macro source
+	.macro srcnr
 100:
 	.section __ex_table,"a"
 	.align 3
-	.llong 100b,.Lsrc_error
+	.llong 100b,.Lsrc_error_nr
 	.previous
 	.endm
 
-	.macro dest
+	.macro source
+150:
+	.section __ex_table,"a"
+	.align 3
+	.llong 150b,.Lsrc_error
+	.previous
+	.endm
+
+	.macro dstnr
 200:
 	.section __ex_table,"a"
 	.align 3
-	.llong 200b,.Ldest_error
+	.llong 200b,.Ldest_error_nr
+	.previous
+	.endm
+
+	.macro dest
+250:
+	.section __ex_table,"a"
+	.align 3
+	.llong 250b,.Ldest_error
 	.previous
 	.endm
 
@@ -274,11 +290,11 @@ _GLOBAL(csum_partial_copy_generic)
 	mtctr	r6
 
 1:
-source;	lhz	r6,0(r3)		/* align to doubleword */
+srcnr;	lhz	r6,0(r3)		/* align to doubleword */
 	subi	r5,r5,2
 	addi	r3,r3,2
 	adde	r0,r0,r6
-dest;	sth	r6,0(r4)
+dstnr;	sth	r6,0(r4)
 	addi	r4,r4,2
 	bdnz	1b
 
@@ -392,10 +408,10 @@ dest;	std	r16,56(r4)
 
 	mtctr	r6
 3:
-source;	ld	r6,0(r3)
+srcnr;	ld	r6,0(r3)
 	addi	r3,r3,8
 	adde	r0,r0,r6
-dest;	std	r6,0(r4)
+dstnr;	std	r6,0(r4)
 	addi	r4,r4,8
 	bdnz	3b
 
@@ -405,10 +421,10 @@ dest;	std	r6,0(r4)
 	srdi.	r6,r5,2
 	beq	.Lcopy_tail_halfword
 
-source;	lwz	r6,0(r3)
+srcnr;	lwz	r6,0(r3)
 	addi	r3,r3,4
 	adde	r0,r0,r6
-dest;	stw	r6,0(r4)
+dstnr;	stw	r6,0(r4)
 	addi	r4,r4,4
 	subi	r5,r5,4
 
@@ -416,10 +432,10 @@ dest;	stw	r6,0(r4)
 	srdi.	r6,r5,1
 	beq	.Lcopy_tail_byte
 
-source;	lhz	r6,0(r3)
+srcnr;	lhz	r6,0(r3)
 	addi	r3,r3,2
 	adde	r0,r0,r6
-dest;	sth	r6,0(r4)
+dstnr;	sth	r6,0(r4)
 	addi	r4,r4,2
 	subi	r5,r5,2
 
@@ -427,10 +443,10 @@ dest;	sth	r6,0(r4)
 	andi.	r6,r5,1
 	beq	.Lcopy_finish
 
-source;	lbz	r6,0(r3)
+srcnr;	lbz	r6,0(r3)
 	sldi	r9,r6,8			/* Pad the byte out to 16 bits */
 	adde	r0,r0,r9
-dest;	stb	r6,0(r4)
+dstnr;	stb	r6,0(r4)
 
 .Lcopy_finish:
 	addze	r0,r0			/* add in final carry */
@@ -440,6 +456,11 @@ dest;	stb	r6,0(r4)
 	blr
 
 .Lsrc_error:
+	ld	r14,STK_REG(R14)(r1)
+	ld	r15,STK_REG(R15)(r1)
+	ld	r16,STK_REG(R16)(r1)
+	addi	r1,r1,STACKFRAMESIZE
+.Lsrc_error_nr:
 	cmpdi	0,r7,0
 	beqlr
 	li	r6,-EFAULT
@@ -447,6 +468,11 @@ dest;	stb	r6,0(r4)
 	blr
 
 .Ldest_error:
+	ld	r14,STK_REG(R14)(r1)
+	ld	r15,STK_REG(R15)(r1)
+	ld	r16,STK_REG(R16)(r1)
+	addi	r1,r1,STACKFRAMESIZE
+.Ldest_error_nr:
 	cmpdi	0,r8,0
 	beqlr
 	li	r6,-EFAULT

^ permalink raw reply

* [PATCH] powerpc: Fix parameter clobber in csum_partial_copy_generic()
From: Anton Blanchard @ 2013-10-01  6:54 UTC (permalink / raw)
  To: benh, paulus, paulmck; +Cc: linuxppc-dev

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The csum_partial_copy_generic() uses register r7 to adjust the remaining
bytes to process.  Unfortunately, r7 also holds a parameter, namely the
address of the flag to set in case of access exceptions while reading
the source buffer.  Lacking a quantum implementation of PowerPC, this
commit instead uses register r9 to do the adjusting, leaving r7's
pointer uncorrupted.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org
---

Index: b/arch/powerpc/lib/checksum_64.S
===================================================================
--- a/arch/powerpc/lib/checksum_64.S
+++ b/arch/powerpc/lib/checksum_64.S
@@ -269,8 +269,8 @@ _GLOBAL(csum_partial_copy_generic)
 	rldicl. r6,r3,64-1,64-2		/* r6 = (r3 & 0x3) >> 1 */
 	beq	.Lcopy_aligned
 
-	li	r7,4
-	sub	r6,r7,r6
+	li	r9,4
+	sub	r6,r9,r6
 	mtctr	r6
 
 1:

^ permalink raw reply

* Re: [PATCH] powerpc/kernel/sysfs: cleanup set up macros for pmc/non pmc sprs
From: Madhavan Srinivasan @ 2013-10-01  6:44 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev
In-Reply-To: <20131001045035.GC17966@concordia>

On Tuesday 01 October 2013 10:20 AM, Michael Ellerman wrote:
> On Mon, Sep 30, 2013 at 04:47:29PM +0530, Madhavan Srinivasan wrote:
>> Currently pmc setup macros are used for non pmc sprs. This patch
>> add new set of macros and cleans up the code to use the new setup macro
>> for non pmc sprs.
> 
> Hi Maddy,
> 
> Firstly you should use "PMC" not pmc, it's an acronym. You should also
> spell out what it means the first time you use it, eg:
> 
>   Currently the PMC (Performance Monitor Counter) macros are used ..
> 
> Secondly you need to say _why_ it is a bad idea to use the PMC macros
> for non-PMC SPRs.
> 
Will change it.
>> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
>> index 27a90b9..73b6f9f 100644
>> --- a/arch/powerpc/kernel/sysfs.c
>> +++ b/arch/powerpc/kernel/sysfs.c
>> @@ -139,6 +139,37 @@ static ssize_t __used \
>>  	return count; \
>>  }
>>
>> +#define SYSFS_SPRSETUP(NAME, ADDRESS) \
>> +static void read_##NAME(void *val) \
>> +{ \
>> +	*(unsigned long *)val = mfspr(ADDRESS);	\
>> +} \
>> +static void write_##NAME(void *val) \
>> +{ \
>> +	mtspr(ADDRESS, *(unsigned long *)val);	\
>> +} \
>> +static ssize_t show_##NAME(struct device *dev, \
>> +			struct device_attribute *attr, \
>> +			char *buf) \
>> +{ \
>> +	struct cpu *cpu = container_of(dev, struct cpu, dev); \
>> +	unsigned long val; \
>> +	smp_call_function_single(cpu->dev.id, read_##NAME, &val, 1);	\
>> +	return sprintf(buf, "%lx\n", val); \
>> +} \
>> +static ssize_t __used \
>> +	store_##NAME(struct device *dev, struct device_attribute *attr, \
>> +			const char *buf, size_t count) \
>> +{ \
>> +	struct cpu *cpu = container_of(dev, struct cpu, dev); \
>> +	unsigned long val; \
>> +	int ret = sscanf(buf, "%lx", &val); \
>> +	if (ret != 1) \
>> +		return -EINVAL; \
>> +	smp_call_function_single(cpu->dev.id, write_##NAME, &val, 1); \
>> +	return count; \
>> +}
> 
> This is basically a complete copy of the SYSFS_PMCSETUP() macro, except
> for the one line removal of the call to ppc_enable_pmcs().
> 
> You should be able to do better, by defining a macro that does all the
> boiler plate and takes an "extra" argument, which for PMCs is
> "ppc_enable_pmcs()" and for regular SPRs is empty.
>
Sorry about that. Will make the changes.

> cheers
> 
Thanks for the feedback

^ permalink raw reply

* Re: [PATCH V2] powerpc/kernel/sysfs: disable writing to purr in non-powernv
From: Michael Ellerman @ 2013-10-01  6:31 UTC (permalink / raw)
  To: Madhavan Srinivasan; +Cc: linuxppc-dev
In-Reply-To: <1380281634-9921-1-git-send-email-maddy@linux.vnet.ibm.com>

Hi Maddy,

On Fri, Sep 27, 2013 at 05:03:54PM +0530, Madhavan Srinivasan wrote:
> powerpc/kernel/sysfs.c exports purr with write permission.

PURR

> This is only valid for kernel in hypervisor mode.
> But writing to the file in PowerVM lpar causes crash.

In the kernel history/source we refer to it as "phyp". However in this
case it's not an issue with phyp, it's simply that you are not in
hypervisor mode, ie. the same crash would occur under KVM.

So you should just say "writing to the file in guest mode .."

> # echo 0 > purr
> cpu 0x0: Vector: 700 (Program Check) at [c000000000d072b0]
>     pc: c00000000001770c: .write_purr+0x1c/0x40
>     lr: c000000000017708: .write_purr+0x18/0x40
>     sp: c000000000d07530
>    msr: 8000000000049032
>   current = 0xc000000000c53de0
>   paca    = 0xc00000000ec70000	 softe: 0	 irq_happened: 0x01
>     pid   = 0, comm = swapper/0
> enter ? for help
> [c000000000d075b0] c0000000000fba64
> .generic_smp_call_function_single_interrupt+0x104/0x190
> [c000000000d07650] c000000000037748 .smp_ipi_demux+0xa8/0xf0
> [c000000000d076e0] c000000000035314 .doorbell_exception+0x74/0xb0
> [c000000000d07760] c000000000002950 doorbell_super_common+0x150/0x180
> --- Exception: a01 (Doorbell) at c000000000060904
> .plpar_hcall_norets+0x84/0xd4
> [link register   ] c00000000006dbd4 .check_and_cede_processor+0x24/0x40
> [c000000000d07a50] c000000001002558 (unreliable)
> [c000000000d07ac0] c00000000006dd0c .shared_cede_loop+0x2c/0x70
> [c000000000d07b40] c0000000006ae954 .cpuidle_enter_state+0x64/0x150
> [c000000000d07c00] c0000000006aeb30 .cpuidle_idle_call+0xf0/0x300
> [c000000000d07cb0] c000000000062fa0 .pseries_lpar_idle+0x10/0x50
> [c000000000d07d20] c000000000016d14 .arch_cpu_idle+0x64/0x150
> [c000000000d07da0] c0000000000e0060 .cpu_startup_entry+0x1a0/0x2c0
> [c000000000d07e80] c00000000000bca4 .rest_init+0x94/0xb0
> [c000000000d07ef0] c000000000b54530 .start_kernel+0x478/0x494
> [c000000000d07f90] c000000000009be0 .start_here_common+0x20/0x40
> 0:mon>
> 
> Changes:
> 
> 1)Changed the test for to hypervisor mode instead of platform

I think Ben's wrong about that.

Almost all existing code uses FW_FEATURE_LPAR to differentiate
hypervisor vs guest mode, so I think we should do the same here.

So it would be:

> +	if (cpu_has_feature(CPU_FTR_PURR)) {
> +		if (!firmware_has_feature(FW_FEATURE_LPAR))
> +			add_write_permission_dev_attr((void *)&dev_attr_purr);
>  		device_create_file(s, &dev_attr_purr);
> +	}


> +static void add_write_permission_dev_attr(void *ptr)
> +{
> +	struct device_attribute *attr = (struct device_attribute *)ptr;
> +
> +	attr->attr.mode |= (unsigned short) 0200;
> +}

Why does it take a void *, which then requires a cast at the call site?

And do you need the cast to short? If so shouldn't you use umode_t
directly?

cheers

^ permalink raw reply

* Re: [PATCH] powerpc: fix section mismatch warning in free_lppacas
From: Michael Ellerman @ 2013-10-01  6:18 UTC (permalink / raw)
  To: Vladimir Murzin; +Cc: geoff, paulus, linuxppc-dev
In-Reply-To: <1380458478-2324-1-git-send-email-murzin.v@gmail.com>

On Sun, Sep 29, 2013 at 02:41:18PM +0200, Vladimir Murzin wrote:
> While cross-building for PPC64 I've got bunch of
> 
> WARNING: arch/powerpc/kernel/built-in.o(.text.unlikely+0x2d2): Section
> mismatch in reference from the function .free_lppacas() to the variable
> .init.data:lppaca_size The function .free_lppacas() references the variable
> __initdata lppaca_size. This is often because .free_lppacas lacks a __initdata
> annotation or the annotation of lppaca_size is wrong.
> 
> Fix it by using proper annotation for free_lppacas. Additionally, annotate
> {allocate,new}_llpcas properly.

Yep looks good.

Acked-by: Michael Ellerman <michael@ellerman.id.au>

cheers

^ permalink raw reply

* Re: [PATCH] powerpc/legacy_serial: fix incorrect placement of __initdata tag
From: Michael Ellerman @ 2013-10-01  6:13 UTC (permalink / raw)
  To: Bartlomiej Zolnierkiewicz
  Cc: Kyungmin Park, linuxppc-dev, linux-kernel, Paul Mackerras
In-Reply-To: <35419229.o6Cv8JYbpg@amdc1032>

On Mon, Sep 30, 2013 at 03:11:42PM +0200, Bartlomiej Zolnierkiewicz wrote:
> __initdata tag should be placed between the variable name and equal
> sign for the variable to be placed in the intended .init.data section.

I see lots of other occurences of that in arch/powerpc. Why not send a
single patch to update them all?

cheers

^ permalink raw reply

* Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
From: Michael Neuling @ 2013-10-01  4:52 UTC (permalink / raw)
  To: paulmck
  Cc: Waiman Long, ppc-dev, Peter Zijlstra, George Spelvin,
	Linux Kernel Mailing List, Chandramouleeswaran, Aswin,
	Norton, Scott J, linux-fsdevel, Linus Torvalds, Ingo Molnar
In-Reply-To: <20131001031356.GP19582@linux.vnet.ibm.com>

>> Well we don't have to, I think Mikey wasn't totally clear about that
>> "making all registers volatile" business :-) This is just something we
>> need to handle in assembly if we are going to reclaim the suspended
>> transaction.

Yeah, sorry.  The slow path with all registers as volatile is only
needed if we get pre-empted during the transaction.

>>
>> So basically, what we need is something along the lines of
>> enable_kernel_tm() which checks if there's a suspended user transaction
>> and if yes, kills/reclaims it.
>>
>> Then we also need to handle in our interrupt handlers that we have an
>> active/suspended transaction from a kernel state, which we don't deal
>> with at this point, and do whatever has to be done to kill it... we
>> might get away with something simple if we can state that we only allow
>> kernel transactions at task level and not from interrupt/softirq
>> contexts, at least initially.
>
> Call me a coward, but this is starting to sound a bit scary.  ;-)

We are just wanting to prototype it for now to see if we could make it
go faster.  If it's worth it, then we'd consider the additional
complexity this would bring.

I don't think it'll be that bad, but I'd certainly want to make sure
it's worth it before trying :-)

Mikey

^ permalink raw reply

* Re: [PATCH] powerpc/kernel/sysfs: cleanup set up macros for pmc/non pmc sprs
From: Michael Ellerman @ 2013-10-01  4:50 UTC (permalink / raw)
  To: Madhavan Srinivasan; +Cc: linuxppc-dev
In-Reply-To: <1380539849-6162-1-git-send-email-maddy@linux.vnet.ibm.com>

On Mon, Sep 30, 2013 at 04:47:29PM +0530, Madhavan Srinivasan wrote:
> Currently pmc setup macros are used for non pmc sprs. This patch
> add new set of macros and cleans up the code to use the new setup macro
> for non pmc sprs.

Hi Maddy,

Firstly you should use "PMC" not pmc, it's an acronym. You should also
spell out what it means the first time you use it, eg:

  Currently the PMC (Performance Monitor Counter) macros are used ..

Secondly you need to say _why_ it is a bad idea to use the PMC macros
for non-PMC SPRs.

> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> index 27a90b9..73b6f9f 100644
> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -139,6 +139,37 @@ static ssize_t __used \
>  	return count; \
>  }
> 
> +#define SYSFS_SPRSETUP(NAME, ADDRESS) \
> +static void read_##NAME(void *val) \
> +{ \
> +	*(unsigned long *)val = mfspr(ADDRESS);	\
> +} \
> +static void write_##NAME(void *val) \
> +{ \
> +	mtspr(ADDRESS, *(unsigned long *)val);	\
> +} \
> +static ssize_t show_##NAME(struct device *dev, \
> +			struct device_attribute *attr, \
> +			char *buf) \
> +{ \
> +	struct cpu *cpu = container_of(dev, struct cpu, dev); \
> +	unsigned long val; \
> +	smp_call_function_single(cpu->dev.id, read_##NAME, &val, 1);	\
> +	return sprintf(buf, "%lx\n", val); \
> +} \
> +static ssize_t __used \
> +	store_##NAME(struct device *dev, struct device_attribute *attr, \
> +			const char *buf, size_t count) \
> +{ \
> +	struct cpu *cpu = container_of(dev, struct cpu, dev); \
> +	unsigned long val; \
> +	int ret = sscanf(buf, "%lx", &val); \
> +	if (ret != 1) \
> +		return -EINVAL; \
> +	smp_call_function_single(cpu->dev.id, write_##NAME, &val, 1); \
> +	return count; \
> +}

This is basically a complete copy of the SYSFS_PMCSETUP() macro, except
for the one line removal of the call to ppc_enable_pmcs().

You should be able to do better, by defining a macro that does all the
boiler plate and takes an "extra" argument, which for PMCs is
"ppc_enable_pmcs()" and for regular SPRs is empty.

cheers

^ permalink raw reply

* Re: [PATCH v2 RESEND 0/3] cpufreq/ondemand support for Xserve G5 & iMac G5 iSight
From: Rafael J. Wysocki @ 2013-10-01  5:00 UTC (permalink / raw)
  To: Aaro Koskinen; +Cc: Viresh Kumar, linux-pm, linuxppc-dev
In-Reply-To: <1380573873-14448-1-git-send-email-aaro.koskinen@iki.fi>

On Monday, September 30, 2013 11:44:30 PM Aaro Koskinen wrote:
> Hi,
> 
> This is a resend of the v2 patchset:
> 
> 	http://marc.info/?t=137797013200001&r=1&w=2
> 
> No changes except rebasing. Any chance to get these into v3.13?

Well, Ben's ACKs are missing as far as I'm concerned ...

> Aaro Koskinen (3):
>   cpufreq: pmac64: speed up frequency switch
>   cpufreq: pmac64: provide cpufreq transition latency for older G5
>     models
>   cpufreq: pmac64: enable cpufreq on iMac G5 (iSight) model
> 
>  drivers/cpufreq/pmac64-cpufreq.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [PATCH 2/2] iommu: Update platform initialisation of iommu to use it_page_shift
From: Michael Ellerman @ 2013-10-01  4:12 UTC (permalink / raw)
  To: Alistair Popple; +Cc: linuxppc-dev
In-Reply-To: <1380599650-9541-3-git-send-email-alistair@popple.id.au>

On Tue, Oct 01, 2013 at 01:54:10PM +1000, Alistair Popple wrote:
> This patch initialises the iommu page size used for vio, cell, powernv
> and pseries platforms to 4K.  It has been boot tested on a pseries
> machine with vio.

This patch fixes the build errors introduced by the previous patch
right?

In which case you need to rework it so that you introduce the new
constants, use them everywhere, before removing the old constants. ie.
the build should never break.

cheers

^ permalink raw reply

* Re: Does iommu_init_table need to use GFP_ATOMIC allocations?
From: Michael Ellerman @ 2013-10-01  4:09 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Paul Mackerras, linuxppc-dev, Anton Blanchard,
	Thadeu Lima de Souza Cascardo
In-Reply-To: <20130919165035.GA20144@linux.vnet.ibm.com>

On Thu, Sep 19, 2013 at 09:50:35AM -0700, Nishanth Aravamudan wrote:
> Under heavy (DLPAR?) stress, we tripped this panic() in
> arch/powerpc/kernel/iommu.c::iommu_init_table():
> 
>         page = alloc_pages_node(nid, GFP_ATOMIC, get_order(sz));
>         if (!page)
>                 panic("iommu_init_table: Can't allocate %ld bytes\n",
> sz);
> 
> Before the panic() we got a page allocation failure for an order-2
> allocation. There appears to be memory free, but perhaps not in the
> ATOMIC context. I looked through all the call-sites of
> iommu_init_table() and didn't see any obvious reason to need an ATOMIC
> allocation. Most call-sites in fact have an explicit GFP_KERNEL
> allocation shortly before the call to iommu_init_table(), indicating we
> are not in an atomic context. There is some indirection for some paths,
> but I didn't see any locks indicating that GFP_KERNEL is inappropriate.
> Does anyone know if/why ATOMIC allocations are necessary here?

I can't see any reason for it.

It was GFP_ATOMIC in the initial ppc64 code submission, so there's no
explanation in the commit history for it.

cheers

^ permalink raw reply

* [PATCH 2/2] iommu: Update platform initialisation of iommu to use it_page_shift
From: Alistair Popple @ 2013-10-01  3:54 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alistair Popple
In-Reply-To: <1380599650-9541-1-git-send-email-alistair@popple.id.au>

This patch initialises the iommu page size used for vio, cell, powernv
and pseries platforms to 4K.  It has been boot tested on a pseries
machine with vio.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
---
 arch/powerpc/kernel/vio.c              |   20 +++++++++++---------
 arch/powerpc/platforms/cell/iommu.c    |   14 ++++++++------
 arch/powerpc/platforms/powernv/pci.c   |    3 ++-
 arch/powerpc/platforms/pseries/iommu.c |   10 ++++++----
 arch/powerpc/platforms/pseries/setup.c |    4 ++--
 drivers/net/ethernet/ibm/ibmveth.c     |    9 +++++----
 6 files changed, 34 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 536016d..1dbab4e 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -520,14 +520,14 @@ static dma_addr_t vio_dma_iommu_map_page(struct device *dev, struct page *page,
 	struct vio_dev *viodev = to_vio_dev(dev);
 	dma_addr_t ret = DMA_ERROR_CODE;
 
-	if (vio_cmo_alloc(viodev, roundup(size, IOMMU_PAGE_SIZE))) {
+	if (vio_cmo_alloc(viodev, roundup(size, IOMMU_PAGE_SIZE_4K))) {
 		atomic_inc(&viodev->cmo.allocs_failed);
 		return ret;
 	}
 
 	ret = dma_iommu_ops.map_page(dev, page, offset, size, direction, attrs);
 	if (unlikely(dma_mapping_error(dev, ret))) {
-		vio_cmo_dealloc(viodev, roundup(size, IOMMU_PAGE_SIZE));
+		vio_cmo_dealloc(viodev, roundup(size, IOMMU_PAGE_SIZE_4K));
 		atomic_inc(&viodev->cmo.allocs_failed);
 	}
 
@@ -543,7 +543,7 @@ static void vio_dma_iommu_unmap_page(struct device *dev, dma_addr_t dma_handle,
 
 	dma_iommu_ops.unmap_page(dev, dma_handle, size, direction, attrs);
 
-	vio_cmo_dealloc(viodev, roundup(size, IOMMU_PAGE_SIZE));
+	vio_cmo_dealloc(viodev, roundup(size, IOMMU_PAGE_SIZE_4K));
 }
 
 static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
@@ -556,7 +556,7 @@ static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
 	size_t alloc_size = 0;
 
 	for (sgl = sglist; count < nelems; count++, sgl++)
-		alloc_size += roundup(sgl->length, IOMMU_PAGE_SIZE);
+		alloc_size += roundup(sgl->length, IOMMU_PAGE_SIZE_4K);
 
 	if (vio_cmo_alloc(viodev, alloc_size)) {
 		atomic_inc(&viodev->cmo.allocs_failed);
@@ -572,7 +572,7 @@ static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
 	}
 
 	for (sgl = sglist, count = 0; count < ret; count++, sgl++)
-		alloc_size -= roundup(sgl->dma_length, IOMMU_PAGE_SIZE);
+		alloc_size -= roundup(sgl->dma_length, IOMMU_PAGE_SIZE_4K);
 	if (alloc_size)
 		vio_cmo_dealloc(viodev, alloc_size);
 
@@ -590,7 +590,7 @@ static void vio_dma_iommu_unmap_sg(struct device *dev,
 	int count = 0;
 
 	for (sgl = sglist; count < nelems; count++, sgl++)
-		alloc_size += roundup(sgl->dma_length, IOMMU_PAGE_SIZE);
+		alloc_size += roundup(sgl->dma_length, IOMMU_PAGE_SIZE_4K);
 
 	dma_iommu_ops.unmap_sg(dev, sglist, nelems, direction, attrs);
 
@@ -736,7 +736,8 @@ static int vio_cmo_bus_probe(struct vio_dev *viodev)
 			return -EINVAL;
 		}
 
-		viodev->cmo.desired = IOMMU_PAGE_ALIGN(viodrv->get_desired_dma(viodev));
+		viodev->cmo.desired =
+			IOMMU_PAGE_ALIGN_4K(viodrv->get_desired_dma(viodev));
 		if (viodev->cmo.desired < VIO_CMO_MIN_ENT)
 			viodev->cmo.desired = VIO_CMO_MIN_ENT;
 		size = VIO_CMO_MIN_ENT;
@@ -1170,9 +1171,10 @@ static struct iommu_table *vio_build_iommu_table(struct vio_dev *dev)
 			    &tbl->it_index, &offset, &size);
 
 	/* TCE table size - measured in tce entries */
-	tbl->it_size = size >> IOMMU_PAGE_SHIFT;
+	tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
+	tbl->it_size = size >> tbl->it_page_shift;
 	/* offset for VIO should always be 0 */
-	tbl->it_offset = offset >> IOMMU_PAGE_SHIFT;
+	tbl->it_offset = offset >> tbl->it_page_shift;
 	tbl->it_busno = 0;
 	tbl->it_type = TCE_VB;
 	tbl->it_blocksize = 16;
diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index 946306b..c0bb88a 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -197,7 +197,7 @@ static int tce_build_cell(struct iommu_table *tbl, long index, long npages,
 
 	io_pte = (unsigned long *)tbl->it_base + (index - tbl->it_offset);
 
-	for (i = 0; i < npages; i++, uaddr += IOMMU_PAGE_SIZE)
+	for (i = 0; i < npages; i++, uaddr += tbl->it_page_shift)
 		io_pte[i] = base_pte | (__pa(uaddr) & CBE_IOPTE_RPN_Mask);
 
 	mb();
@@ -430,7 +430,7 @@ static void cell_iommu_setup_hardware(struct cbe_iommu *iommu,
 {
 	cell_iommu_setup_stab(iommu, base, size, 0, 0);
 	iommu->ptab = cell_iommu_alloc_ptab(iommu, base, size, 0, 0,
-					    IOMMU_PAGE_SHIFT);
+					    IOMMU_PAGE_SHIFT_4K);
 	cell_iommu_enable_hardware(iommu);
 }
 
@@ -487,8 +487,10 @@ cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
 	window->table.it_blocksize = 16;
 	window->table.it_base = (unsigned long)iommu->ptab;
 	window->table.it_index = iommu->nid;
-	window->table.it_offset = (offset >> IOMMU_PAGE_SHIFT) + pte_offset;
-	window->table.it_size = size >> IOMMU_PAGE_SHIFT;
+	window->table.it_page_shift = IOMMU_PAGE_SHIFT_4K
+	window->table.it_offset =
+		(offset >> window->table.it_page_shift) + pte_offset;
+	window->table.it_size = size >> window->table.it_page_shift;
 
 	iommu_init_table(&window->table, iommu->nid);
 
@@ -773,7 +775,7 @@ static void __init cell_iommu_init_one(struct device_node *np,
 
 	/* Setup the iommu_table */
 	cell_iommu_setup_window(iommu, np, base, size,
-				offset >> IOMMU_PAGE_SHIFT);
+				offset >> IOMMU_PAGE_SHIFT_4K);
 }
 
 static void __init cell_disable_iommus(void)
@@ -1122,7 +1124,7 @@ static int __init cell_iommu_fixed_mapping_init(void)
 
 		cell_iommu_setup_stab(iommu, dbase, dsize, fbase, fsize);
 		iommu->ptab = cell_iommu_alloc_ptab(iommu, dbase, dsize, 0, 0,
-						    IOMMU_PAGE_SHIFT);
+						    IOMMU_PAGE_SHIFT_4K);
 		cell_iommu_setup_fixed_ptab(iommu, np, dbase, dsize,
 					     fbase, fsize);
 		cell_iommu_enable_hardware(iommu);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index a28d3b5..cfab147 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -452,7 +452,8 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 {
 	tbl->it_blocksize = 16;
 	tbl->it_base = (unsigned long)tce_mem;
-	tbl->it_offset = dma_offset >> IOMMU_PAGE_SHIFT;
+	tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
+	tbl->it_offset = dma_offset >> tbl->it_page_shift;
 	tbl->it_index = 0;
 	tbl->it_size = tce_size >> 3;
 	tbl->it_busno = 0;
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 23fc1dc..7a18f24 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -486,9 +486,10 @@ static void iommu_table_setparms(struct pci_controller *phb,
 		memset((void *)tbl->it_base, 0, *sizep);
 
 	tbl->it_busno = phb->bus->number;
+	tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
 
 	/* Units of tce entries */
-	tbl->it_offset = phb->dma_window_base_cur >> IOMMU_PAGE_SHIFT;
+	tbl->it_offset = phb->dma_window_base_cur >> tbl->it_page_shift;
 
 	/* Test if we are going over 2GB of DMA space */
 	if (phb->dma_window_base_cur + phb->dma_window_size > 0x80000000ul) {
@@ -499,7 +500,7 @@ static void iommu_table_setparms(struct pci_controller *phb,
 	phb->dma_window_base_cur += phb->dma_window_size;
 
 	/* Set the tce table size - measured in entries */
-	tbl->it_size = phb->dma_window_size >> IOMMU_PAGE_SHIFT;
+	tbl->it_size = phb->dma_window_size >> tbl->it_page_shift;
 
 	tbl->it_index = 0;
 	tbl->it_blocksize = 16;
@@ -537,11 +538,12 @@ static void iommu_table_setparms_lpar(struct pci_controller *phb,
 	of_parse_dma_window(dn, dma_window, &tbl->it_index, &offset, &size);
 
 	tbl->it_busno = phb->bus->number;
+	tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
 	tbl->it_base   = 0;
 	tbl->it_blocksize  = 16;
 	tbl->it_type = TCE_PCI;
-	tbl->it_offset = offset >> IOMMU_PAGE_SHIFT;
-	tbl->it_size = size >> IOMMU_PAGE_SHIFT;
+	tbl->it_offset = offset >> tbl->it_page_shift;
+	tbl->it_size = size >> tbl->it_page_shift;
 }
 
 static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index c11c823..a1b9e40 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -72,7 +72,7 @@
 
 int CMO_PrPSP = -1;
 int CMO_SecPSP = -1;
-unsigned long CMO_PageSize = (ASM_CONST(1) << IOMMU_PAGE_SHIFT);
+unsigned long CMO_PageSize = (ASM_CONST(1) << IOMMU_PAGE_SHIFT_4K);
 EXPORT_SYMBOL(CMO_PageSize);
 
 int fwnmi_active;  /* TRUE if an FWNMI handler is present */
@@ -532,7 +532,7 @@ void pSeries_cmo_feature_init(void)
 {
 	char *ptr, *key, *value, *end;
 	int call_status;
-	int page_order = IOMMU_PAGE_SHIFT;
+	int page_order = IOMMU_PAGE_SHIFT_4K;
 
 	pr_debug(" -> fw_cmo_feature_init()\n");
 	spin_lock(&rtas_data_buf_lock);
diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index 70fd559..465c7b0 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1282,24 +1282,25 @@ static unsigned long ibmveth_get_desired_dma(struct vio_dev *vdev)
 
 	/* netdev inits at probe time along with the structures we need below*/
 	if (netdev == NULL)
-		return IOMMU_PAGE_ALIGN(IBMVETH_IO_ENTITLEMENT_DEFAULT);
+		return IOMMU_PAGE_ALIGN_4K(IBMVETH_IO_ENTITLEMENT_DEFAULT);
 
 	adapter = netdev_priv(netdev);
 
 	ret = IBMVETH_BUFF_LIST_SIZE + IBMVETH_FILT_LIST_SIZE;
-	ret += IOMMU_PAGE_ALIGN(netdev->mtu);
+	ret += IOMMU_PAGE_ALIGN_4K(netdev->mtu);
 
 	for (i = 0; i < IBMVETH_NUM_BUFF_POOLS; i++) {
 		/* add the size of the active receive buffers */
 		if (adapter->rx_buff_pool[i].active)
 			ret +=
 			    adapter->rx_buff_pool[i].size *
-			    IOMMU_PAGE_ALIGN(adapter->rx_buff_pool[i].
+			    IOMMU_PAGE_ALIGN_4K(adapter->rx_buff_pool[i].
 			            buff_size);
 		rxqentries += adapter->rx_buff_pool[i].size;
 	}
 	/* add the size of the receive queue entries */
-	ret += IOMMU_PAGE_ALIGN(rxqentries * sizeof(struct ibmveth_rx_q_entry));
+	ret += IOMMU_PAGE_ALIGN_4K(
+		rxqentries * sizeof(struct ibmveth_rx_q_entry));
 
 	return ret;
 }
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 1/2] iommu: Add support for iommu page sizes other than 4K
From: Alistair Popple @ 2013-10-01  3:54 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alistair Popple
In-Reply-To: <1380599650-9541-1-git-send-email-alistair@popple.id.au>

Currently the iommu uses hardcoded pages sizes of 4K even though some
hardware supports other page sizes. This patch adds a field (it_page_shift)
to struct iommu_table to support different page sizes and updates the generic
iommu code to use that field.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
---
 arch/powerpc/include/asm/iommu.h |   28 ++++++++++-------
 arch/powerpc/kernel/dma-iommu.c  |    4 +--
 arch/powerpc/kernel/iommu.c      |   64 +++++++++++++++++++-------------------
 3 files changed, 51 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index c34656a..e412a15 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -30,22 +30,19 @@
 #include <asm/machdep.h>
 #include <asm/types.h>
 
-#define IOMMU_PAGE_SHIFT      12
-#define IOMMU_PAGE_SIZE       (ASM_CONST(1) << IOMMU_PAGE_SHIFT)
-#define IOMMU_PAGE_MASK       (~((1 << IOMMU_PAGE_SHIFT) - 1))
-#define IOMMU_PAGE_ALIGN(addr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE)
+#define IOMMU_PAGE_SIZE(tblptr) (ASM_CONST(1) << (tblptr)->it_page_shift)
+#define IOMMU_PAGE_MASK(tblptr) (~((1 << (tblptr)->it_page_shift) - 1))
+#define IOMMU_PAGE_ALIGN(addr, tblptr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE(tblptr))
+
+#define IOMMU_PAGE_SHIFT_4K	(12)
+#define IOMMU_PAGE_SIZE_4K	(ASM_CONST(1) << 12)
+#define IOMMU_PAGE_MASK_4k	(~((1 << 12) - 1))
+#define IOMMU_PAGE_ALIGN_4K(addr)	_ALIGN_UP(addr, IOMMU_PAGE_SIZE_4K)
 
 /* Boot time flags */
 extern int iommu_is_off;
 extern int iommu_force_on;
 
-/* Pure 2^n version of get_order */
-static __inline__ __attribute_const__ int get_iommu_order(unsigned long size)
-{
-	return __ilog2((size - 1) >> IOMMU_PAGE_SHIFT) + 1;
-}
-
-
 /*
  * IOMAP_MAX_ORDER defines the largest contiguous block
  * of dma space we can get.  IOMAP_MAX_ORDER = 13
@@ -76,11 +73,20 @@ struct iommu_table {
 	struct iommu_pool large_pool;
 	struct iommu_pool pools[IOMMU_NR_POOLS];
 	unsigned long *it_map;       /* A simple allocation bitmap for now */
+	unsigned long  it_page_shift;/* table iommu page size */
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
 #endif
 };
 
+/* Pure 2^n version of get_order */
+static inline __attribute_const__
+int get_iommu_order(unsigned long size, struct iommu_table *tbl)
+{
+	return __ilog2((size - 1) >> tbl->it_page_shift) + 1;
+}
+
+
 struct scatterlist;
 
 static inline void set_iommu_table_base(struct device *dev, void *base)
diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index e489752..54d0116 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -83,10 +83,10 @@ static int dma_iommu_dma_supported(struct device *dev, u64 mask)
 		return 0;
 	}
 
-	if (tbl->it_offset > (mask >> IOMMU_PAGE_SHIFT)) {
+	if (tbl->it_offset > (mask >> tbl->it_page_shift)) {
 		dev_info(dev, "Warning: IOMMU offset too big for device mask\n");
 		dev_info(dev, "mask: 0x%08llx, table offset: 0x%08lx\n",
-				mask, tbl->it_offset << IOMMU_PAGE_SHIFT);
+				mask, tbl->it_offset << tbl->it_page_shift);
 		return 0;
 	} else
 		return 1;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b20ff17..38f6e0c 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -251,14 +251,13 @@ again:
 
 	if (dev)
 		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
-				      1 << IOMMU_PAGE_SHIFT);
+				      1 << tbl->it_page_shift);
 	else
-		boundary_size = ALIGN(1UL << 32, 1 << IOMMU_PAGE_SHIFT);
+		boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
 	/* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
 
-	n = iommu_area_alloc(tbl->it_map, limit, start, npages,
-			     tbl->it_offset, boundary_size >> IOMMU_PAGE_SHIFT,
-			     align_mask);
+	n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
+			     boundary_size >> tbl->it_page_shift, align_mask);
 	if (n == -1) {
 		if (likely(pass == 0)) {
 			/* First try the pool from the start */
@@ -320,12 +319,12 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
 		return DMA_ERROR_CODE;
 
 	entry += tbl->it_offset;	/* Offset into real TCE table */
-	ret = entry << IOMMU_PAGE_SHIFT;	/* Set the return dma address */
+	ret = entry << tbl->it_page_shift;	/* Set the return dma address */
 
 	/* Put the TCEs in the HW table */
 	build_fail = ppc_md.tce_build(tbl, entry, npages,
-	                              (unsigned long)page & IOMMU_PAGE_MASK,
-	                              direction, attrs);
+				      (unsigned long)page &
+				      IOMMU_PAGE_MASK(tbl), direction, attrs);
 
 	/* ppc_md.tce_build() only returns non-zero for transient errors.
 	 * Clean up the table bitmap in this case and return
@@ -352,7 +351,7 @@ static bool iommu_free_check(struct iommu_table *tbl, dma_addr_t dma_addr,
 {
 	unsigned long entry, free_entry;
 
-	entry = dma_addr >> IOMMU_PAGE_SHIFT;
+	entry = dma_addr >> tbl->it_page_shift;
 	free_entry = entry - tbl->it_offset;
 
 	if (((free_entry + npages) > tbl->it_size) ||
@@ -401,7 +400,7 @@ static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
 	unsigned long flags;
 	struct iommu_pool *pool;
 
-	entry = dma_addr >> IOMMU_PAGE_SHIFT;
+	entry = dma_addr >> tbl->it_page_shift;
 	free_entry = entry - tbl->it_offset;
 
 	pool = get_pool(tbl, free_entry);
@@ -468,13 +467,13 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 		}
 		/* Allocate iommu entries for that segment */
 		vaddr = (unsigned long) sg_virt(s);
-		npages = iommu_num_pages(vaddr, slen, IOMMU_PAGE_SIZE);
+		npages = iommu_num_pages(vaddr, slen, IOMMU_PAGE_SIZE(tbl));
 		align = 0;
-		if (IOMMU_PAGE_SHIFT < PAGE_SHIFT && slen >= PAGE_SIZE &&
+		if (tbl->it_page_shift < PAGE_SHIFT && slen >= PAGE_SIZE &&
 		    (vaddr & ~PAGE_MASK) == 0)
-			align = PAGE_SHIFT - IOMMU_PAGE_SHIFT;
+			align = PAGE_SHIFT - tbl->it_page_shift;
 		entry = iommu_range_alloc(dev, tbl, npages, &handle,
-					  mask >> IOMMU_PAGE_SHIFT, align);
+					  mask >> tbl->it_page_shift, align);
 
 		DBG("  - vaddr: %lx, size: %lx\n", vaddr, slen);
 
@@ -489,16 +488,16 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 
 		/* Convert entry to a dma_addr_t */
 		entry += tbl->it_offset;
-		dma_addr = entry << IOMMU_PAGE_SHIFT;
-		dma_addr |= (s->offset & ~IOMMU_PAGE_MASK);
+		dma_addr = entry << tbl->it_page_shift;
+		dma_addr |= (s->offset & ~IOMMU_PAGE_MASK(tbl));
 
 		DBG("  - %lu pages, entry: %lx, dma_addr: %lx\n",
 			    npages, entry, dma_addr);
 
 		/* Insert into HW table */
 		build_fail = ppc_md.tce_build(tbl, entry, npages,
-		                              vaddr & IOMMU_PAGE_MASK,
-		                              direction, attrs);
+					      vaddr & IOMMU_PAGE_MASK(tbl),
+					      direction, attrs);
 		if(unlikely(build_fail))
 			goto failure;
 
@@ -559,9 +558,9 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 		if (s->dma_length != 0) {
 			unsigned long vaddr, npages;
 
-			vaddr = s->dma_address & IOMMU_PAGE_MASK;
+			vaddr = s->dma_address & IOMMU_PAGE_MASK(tbl);
 			npages = iommu_num_pages(s->dma_address, s->dma_length,
-						 IOMMU_PAGE_SIZE);
+						 IOMMU_PAGE_SIZE(tbl));
 			__iommu_free(tbl, vaddr, npages);
 			s->dma_address = DMA_ERROR_CODE;
 			s->dma_length = 0;
@@ -592,7 +591,7 @@ void iommu_unmap_sg(struct iommu_table *tbl, struct scatterlist *sglist,
 		if (sg->dma_length == 0)
 			break;
 		npages = iommu_num_pages(dma_handle, sg->dma_length,
-					 IOMMU_PAGE_SIZE);
+					 IOMMU_PAGE_SIZE(tbl));
 		__iommu_free(tbl, dma_handle, npages);
 		sg = sg_next(sg);
 	}
@@ -676,7 +675,7 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 		set_bit(0, tbl->it_map);
 
 	/* We only split the IOMMU table if we have 1GB or more of space */
-	if ((tbl->it_size << IOMMU_PAGE_SHIFT) >= (1UL * 1024 * 1024 * 1024))
+	if ((tbl->it_size << tbl->it_page_shift) >= (1UL * 1024 * 1024 * 1024))
 		tbl->nr_pools = IOMMU_NR_POOLS;
 	else
 		tbl->nr_pools = 1;
@@ -768,16 +767,16 @@ dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl,
 
 	vaddr = page_address(page) + offset;
 	uaddr = (unsigned long)vaddr;
-	npages = iommu_num_pages(uaddr, size, IOMMU_PAGE_SIZE);
+	npages = iommu_num_pages(uaddr, size, IOMMU_PAGE_SIZE(tbl));
 
 	if (tbl) {
 		align = 0;
-		if (IOMMU_PAGE_SHIFT < PAGE_SHIFT && size >= PAGE_SIZE &&
+		if (tbl->it_page_shift < PAGE_SHIFT && size >= PAGE_SIZE &&
 		    ((unsigned long)vaddr & ~PAGE_MASK) == 0)
-			align = PAGE_SHIFT - IOMMU_PAGE_SHIFT;
+			align = PAGE_SHIFT - tbl->it_page_shift;
 
 		dma_handle = iommu_alloc(dev, tbl, vaddr, npages, direction,
-					 mask >> IOMMU_PAGE_SHIFT, align,
+					 mask >> tbl->it_page_shift, align,
 					 attrs);
 		if (dma_handle == DMA_ERROR_CODE) {
 			if (printk_ratelimit())  {
@@ -786,7 +785,7 @@ dma_addr_t iommu_map_page(struct device *dev, struct iommu_table *tbl,
 					 npages);
 			}
 		} else
-			dma_handle |= (uaddr & ~IOMMU_PAGE_MASK);
+			dma_handle |= (uaddr & ~IOMMU_PAGE_MASK(tbl));
 	}
 
 	return dma_handle;
@@ -801,7 +800,8 @@ void iommu_unmap_page(struct iommu_table *tbl, dma_addr_t dma_handle,
 	BUG_ON(direction == DMA_NONE);
 
 	if (tbl) {
-		npages = iommu_num_pages(dma_handle, size, IOMMU_PAGE_SIZE);
+		npages = iommu_num_pages(dma_handle, size,
+					 IOMMU_PAGE_SIZE(tbl));
 		iommu_free(tbl, dma_handle, npages);
 	}
 }
@@ -845,10 +845,10 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
 	memset(ret, 0, size);
 
 	/* Set up tces to cover the allocated range */
-	nio_pages = size >> IOMMU_PAGE_SHIFT;
-	io_order = get_iommu_order(size);
+	nio_pages = size >> tbl->it_page_shift;
+	io_order = get_iommu_order(size, tbl);
 	mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
-			      mask >> IOMMU_PAGE_SHIFT, io_order, NULL);
+			      mask >> tbl->it_page_shift, io_order, NULL);
 	if (mapping == DMA_ERROR_CODE) {
 		free_pages((unsigned long)ret, order);
 		return NULL;
@@ -864,7 +864,7 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
 		unsigned int nio_pages;
 
 		size = PAGE_ALIGN(size);
-		nio_pages = size >> IOMMU_PAGE_SHIFT;
+		nio_pages = size >> tbl->it_page_shift;
 		iommu_free(tbl, dma_handle, nio_pages);
 		size = PAGE_ALIGN(size);
 		free_pages((unsigned long)vaddr, get_order(size));
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 0/2] iommu: Support pages size other than 4K
From: Alistair Popple @ 2013-10-01  3:54 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alistair Popple

This series of patches adds support for iommu page sizes other than
4K. Currently iommu page sizes are hardcoded to 4K. This series does
not actually change the page size but adds support for doing so.

It has been tested on a pSeries machine.

Alistair Popple (2):
  iommu: Add support for iommu page sizes other than 4K
  iommu: Update platform initialisation of iommu to use it_page_shift

 arch/powerpc/include/asm/iommu.h       |   28 ++++++++------
 arch/powerpc/kernel/dma-iommu.c        |    4 +-
 arch/powerpc/kernel/iommu.c            |   64 ++++++++++++++++----------------
 arch/powerpc/kernel/vio.c              |   20 +++++-----
 arch/powerpc/platforms/cell/iommu.c    |   14 ++++---
 arch/powerpc/platforms/powernv/pci.c   |    3 +-
 arch/powerpc/platforms/pseries/iommu.c |   10 +++--
 arch/powerpc/platforms/pseries/setup.c |    4 +-
 drivers/net/ethernet/ibm/ibmveth.c     |    9 +++--
 9 files changed, 85 insertions(+), 71 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* Re: [PATCH 1/2][v7] powerpc/mpc85xx:Add initial device tree support of T104x
From: Prabhakar Kushwaha @ 2013-10-01  3:49 UTC (permalink / raw)
  To: Scott Wood; +Cc: Varun Sethi, linuxppc-dev, Poonam Aggrwal, Priyanka Jain
In-Reply-To: <1380570471.24959.517.camel@snotra.buserror.net>

On 10/01/2013 01:17 AM, Scott Wood wrote:
> On Mon, 2013-09-30 at 12:24 +0530, Prabhakar Kushwaha wrote:
>>      - Removed l2switch. It will be added later
> Why?

I am not aware of bindings required for l2switch as we are not working 
on the driver.
Earlier I thought of putting a place holder. but as you suggested to put 
bindings in documentation.
it will require proper discussion this may delay the base patch acceptance.
so It will be good if it is put by actual driver owner.

>> +sata@220000 {
>> +			fsl,iommu-parent = <&pamu0>;
>> +			fsl,liodn-reg = <&guts 0x550>; /* SATA1LIODNR */
>> +};
>> +/include/ "qoriq-sata2-1.dtsi"
>> +sata@221000 {
>> +			fsl,iommu-parent = <&pamu0>;
>> +			fsl,liodn-reg = <&guts 0x554>; /* SATA2LIODNR */
>> +};
> Whitespace

do we have any scripts which check for whitespace as checkpatch never 
give any warning/error.
it is a very silly mistake which I am doing continuously :(

>> +/include/ "qoriq-sec5.0-0.dtsi"
>> +};
>> diff --git a/arch/powerpc/boot/dts/fsl/t1042si-post.dtsi b/arch/powerpc/boot/dts/fsl/t1042si-post.dtsi
>> new file mode 100644
>> index 0000000..f286a50
>> --- /dev/null
>> +++ b/arch/powerpc/boot/dts/fsl/t1042si-post.dtsi
>> @@ -0,0 +1,35 @@
>> +/*
>> + * T1042 Silicon/SoC Device Tree Source (post include)
>> + *
>> + * Copyright 2013 Freescale Semiconductor Inc.
>> + *
>> + * Redistribution and use in source and binary forms, with or without
>> + * modification, are permitted provided that the following conditions are met:
>> + *     * Redistributions of source code must retain the above copyright
>> + *       notice, this list of conditions and the following disclaimer.
>> + *     * Redistributions in binary form must reproduce the above copyright
>> + *       notice, this list of conditions and the following disclaimer in the
>> + *       documentation and/or other materials provided with the distribution.
>> + *     * Neither the name of Freescale Semiconductor nor the
>> + *       names of its contributors may be used to endorse or promote products
>> + *       derived from this software without specific prior written permission.
>> + *
>> + *
>> + * ALTERNATIVELY, this software may be distributed under the terms of the
>> + * GNU General Public License ("GPL") as published by the Free Software
>> + * Foundation, either version 2 of that License or (at your option) any
>> + * later version.
>> + *
>> + * THIS SOFTWARE IS PROVIDED BY Freescale Semiconductor ``AS IS'' AND ANY
>> + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
>> + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
>> + * DISCLAIMED. IN NO EVENT SHALL Freescale Semiconductor BE LIABLE FOR ANY
>> + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
>> + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
>> + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
>> + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
>> + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
>> + */
>> +
>> +/include/ "t1040si-post.dtsi"
> Should at least have a comment indicating that eventually this should
> hold the l2 switch node.

yes. Ideally it should be.
but if i put a comment then I believe this patch will not be completed. 
it will think as a RFC.
as I believe putting of TODO is generally for RFC patches.

should I put comment without TODO?

>> +	aliases {
>> +		ccsr = &soc;
>> +		dcsr = &dcsr;
>> +
>> +		serial0 = &serial0;
>> +		serial1 = &serial1;
>> +		serial2 = &serial2;
>> +		serial3 = &serial3;
>> +		pci0 = &pci0;
>> +		pci1 = &pci1;
>> +		pci2 = &pci2;
>> +		pci3 = &pci3;
>> +		usb0 = &usb0;
>> +		usb1 = &usb1;
>> +		sdhc = &sdhc;
>> +
>> +		crypto = &crypto;
>> +
>> +	};
>> +
>> +	cpus {
>> +		#address-cells = <1>;
>> +		#size-cells = <0>;
> [snip]
>> +			L2_4: l2-cache {
>> +				next-level-cache = <&cpc>;
>> +			};
>> +		};
>> +
>> +	};
>> +};
> Don't leave a blank line before a closing brace.
>
>
>
>    
>
My mistake. I will take care.

thanks,
Prabhakar

^ permalink raw reply

* Re: [PATCH 1/2][v7] powerpc/mpc85xx:Add initial device tree support of T104x
From: Prabhakar Kushwaha @ 2013-10-01  3:26 UTC (permalink / raw)
  To: Scott Wood; +Cc: Varun Sethi, linuxppc-dev, Poonam Aggrwal, Priyanka Jain
In-Reply-To: <1380570471.24959.517.camel@snotra.buserror.net>

On 10/01/2013 01:17 AM, Scott Wood wrote:
> On Mon, 2013-09-30 at 12:24 +0530, Prabhakar Kushwaha wrote:
>>      - Removed l2switch. It will be added later
> Why?

I am not aware of bindings required for l2switch as we are not working 
on the driver.
Earlier I thought of putting a place holder. but as you suggested to put 
bindings in documentation.
It will be good if it is put by actual driver owner.

>
>> +sata@220000 {
>> +			fsl,iommu-parent = <&pamu0>;
>> +			fsl,liodn-reg = <&guts 0x550>; /* SATA1LIODNR */
>> +};
>> +/include/ "qoriq-sata2-1.dtsi"
>> +sata@221000 {
>> +			fsl,iommu-parent = <&pamu0>;
>> +			fsl,liodn-reg = <&guts 0x554>; /* SATA2LIODNR */
>> +};
> Whitespace

do we have any scripts which check for whitespace as checkpatch never 
give any warning/error.
it is a very silly mistake which I am doing continuously :(

>> +/include/ "qoriq-sec5.0-0.dtsi"
>> +};
>> diff --git a/arch/powerpc/boot/dts/fsl/t1042si-post.dtsi b/arch/powerpc/boot/dts/fsl/t1042si-post.dtsi
>> new file mode 100644
>> index 0000000..f286a50
>> --- /dev/null
>> +++ b/arch/powerpc/boot/dts/fsl/t1042si-post.dtsi
>> @@ -0,0 +1,35 @@
>> +/*
>> + * T1042 Silicon/SoC Device Tree Source (post include)
>> + *
>> + * Copyright 2013 Freescale Semiconductor Inc.
>> + *
>> + * Redistribution and use in source and binary forms, with or without
>> + * modification, are permitted provided that the following conditions are met:
>> + *     * Redistributions of source code must retain the above copyright
>> + *       notice, this list of conditions and the following disclaimer.
>> + *     * Redistributions in binary form must reproduce the above copyright
>> + *       notice, this list of conditions and the following disclaimer in the
>> + *       documentation and/or other materials provided with the distribution.
>> + *     * Neither the name of Freescale Semiconductor nor the
>> + *       names of its contributors may be used to endorse or promote products
>> + *       derived from this software without specific prior written permission.
>> + *
>> + *
>> + * ALTERNATIVELY, this software may be distributed under the terms of the
>> + * GNU General Public License ("GPL") as published by the Free Software
>> + * Foundation, either version 2 of that License or (at your option) any
>> + * later version.
>> + *
>> + * THIS SOFTWARE IS PROVIDED BY Freescale Semiconductor ``AS IS'' AND ANY
>> + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
>> + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
>> + * DISCLAIMED. IN NO EVENT SHALL Freescale Semiconductor BE LIABLE FOR ANY
>> + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
>> + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
>> + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
>> + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
>> + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
>> + */
>> +
>> +/include/ "t1040si-post.dtsi"
> Should at least have a comment indicating that eventually this should
> hold the l2 switch node.

yes. Ideally it should be.
but if I put a comment then I believe this patch will not be completed. 
it will think as a RFC.
as I believe putting of TODO is generally for RFC patches.

>> +	aliases {
>> +		ccsr = &soc;
>> +		dcsr = &dcsr;
>> +
>> +		serial0 = &serial0;
>> +		serial1 = &serial1;
>> +		serial2 = &serial2;
>> +		serial3 = &serial3;
>> +		pci0 = &pci0;
>> +		pci1 = &pci1;
>> +		pci2 = &pci2;
>> +		pci3 = &pci3;
>> +		usb0 = &usb0;
>> +		usb1 = &usb1;
>> +		sdhc = &sdhc;
>> +
>> +		crypto = &crypto;
>> +
>> +	};
>> +
>> +	cpus {
>> +		#address-cells = <1>;
>> +		#size-cells = <0>;
> [snip]
>> +			L2_4: l2-cache {
>> +				next-level-cache = <&cpc>;
>> +			};
>> +		};
>> +
>> +	};
>> +};
> Don't leave a blank line before a closing brace.
my mistake. I will take care.

Thanks,
Prabhakar

^ permalink raw reply

* Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
From: Paul E. McKenney @ 2013-10-01  3:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Waiman Long, Michael Neuling, Peter Zijlstra, Norton, Scott J,
	ppc-dev, Linux Kernel Mailing List, Chandramouleeswaran, Aswin,
	George Spelvin, linux-fsdevel, Linus Torvalds, Ingo Molnar
In-Reply-To: <1380593103.6396.38.camel@pasglop>

On Tue, Oct 01, 2013 at 12:05:03PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2013-09-30 at 17:56 -0700, Linus Torvalds wrote:
> > On Mon, Sep 30, 2013 at 5:36 PM, Michael Neuling <mikey@neuling.org> wrote:
> > >
> > > The scary part is that we to make all register volatile.  You were not
> > > that keen on doing this as there are a lot of places in exception
> > > entry/exit where we only save/restore a subset of the registers.  We'd
> > > need to catch all these.
> > 
> > Ugh. It's very possible it's not worth using for the kernel then. The
> > example I posted is normally fine *without* any transactional support,
> > since it's a very local per-dentry lock, and since we only take that
> > lock when the last reference drops (so it's not some common directory
> > dentry, it's a end-point file dentry). In fact, on ARM they just made
> > the cmpxchg much faster by making it entirely non-serializing (since
> > it only updates a reference count, there is no locking involved apart
> > from checking that the lock state is unlocked)

A memory-barrier-free cmpxchg() would be easy on Power as well.

> > So there is basically never any contention, and the transaction needs
> > to basically be pretty much the same cost as a "cmpxchg". It's not
> > clear if the intel TSX is good enough for that, and if you have to
> > save a lot of registers in order to use transactions on POWER8, I
> > doubt it's worthwhile.
> 
> Well we don't have to, I think Mikey wasn't totally clear about that
> "making all registers volatile" business :-) This is just something we
> need to handle in assembly if we are going to reclaim the suspended
> transaction.
> 
> So basically, what we need is something along the lines of
> enable_kernel_tm() which checks if there's a suspended user transaction
> and if yes, kills/reclaims it.
> 
> Then we also need to handle in our interrupt handlers that we have an
> active/suspended transaction from a kernel state, which we don't deal
> with at this point, and do whatever has to be done to kill it... we
> might get away with something simple if we can state that we only allow
> kernel transactions at task level and not from interrupt/softirq
> contexts, at least initially.

Call me a coward, but this is starting to sound a bit scary.  ;-)

> > We have very few - if any - locks where contention or even cache
> > bouncing is common or normal. Sure, we have a few particular loads
> > that can trigger it, but even that is becoming rare. So from a
> > performance standpoint, the target always needs to be "comparable to
> > hot spinlock in local cache".
> 
> I am not quite familiar with the performance profile of our
> transactional hardware. I think we should definitely try to hack
> something together for that dput() case and measure it.
> 
> > >> They also have interesting ordering semantics vs. locks, we need to be
> > >> a tad careful (as long as we don't access a lock variable
> > >> transactionally we should be ok. If we do, then spin_unlock needs a
> > >> stronger barrier).
> > >
> > > Yep.
> > 
> > Well, just about any kernel transaction will at least read the state
> > of a lock. Without that, it's generally totally useless. My dput()
> > example sequence very much verified that the lock was not held, for
> > example.
> > 
> > I'm not sure how that affects anything. The actual transaction had
> > better not be visible inside the locked region (ie as far as any lock
> > users go, transactions better all happen fully before or after the
> > lock, if they read the lock and see it being unlocked).
> > 
> > That said, I cannot see how POWER8 could possibly violate that rule.
> > The whole "transactions are atomic" is kind of the whole and only
> > point of a transaction. So I'm not sure what odd lock restrictions
> > POWER8 could have.
> 
> Has to do with the memory model :-(
> 
> I dug the whole story from my mbox and the situation is indeed as dire
> as feared. If the transaction reads the lock, then the corresponding
> spin_lock must have a full sync barrier in it instead of the current
> lighter one.
> 
> Now I believe we are already "on the fence" with our locks today since
> technically speaking, our unlock + lock sequence is *not* exactly a full
> barrier (it is only if it's the same lock I think)
> 
> CC'ing Paul McKenney here who's been chasing that issue. In the end, we
> might end up having to turn our locks into sync anyway

Well, there have been a lot of fixed software bugs since the last
suspicious sighting, but on the other hand, I am just now getting my
RCU testing going again on Power.  I would certainly feel better
about things if unlock-lock was really truly a full barrier, but
this clearly needs a clean sighting.

> Yay ! The isanity^Wjoy of an OO memory model !

;-) ;-) ;-)

							Thanx, Paul

> > > FWIW eg.
> > >
> > >      tbegin
> > >      beq abort /* passes first time through */
> > >      ....
> > >      transactional stuff
> > >      ....
> > >      tend
> > >      b pass
> > >
> > > abort:
> > >
> > > pass:
> > 
> > That's fine, and matches the x86 semantics fairly closely, except
> > "xbegin" kind of "contains" that "jump to abort address". But we could
> > definitely use the same models. Call it
> > "transaction_begin/abort/end()", and it should be architecture-neutral
> > naming-wise.
> 
> Right.
> 
> > Of course, if tbegin then acts basically like some crazy
> > assembly-level setjmp (I'm guessing it does exactly, and presumably
> > precisely that kind of compiler support - ie a function with
> > "__attribute((returns_twice))" in gcc-speak), the overhead of doing it
> > may kill it.
> 
> Well, all the registers are checkpointed so we *should* be ok but gcc
> always makes me nervous in those circumstances ...
> 
> Ben.
> 
> 
> >             Linus
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 
> 

^ permalink raw reply

* Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory
From: Benjamin Herrenschmidt @ 2013-10-01  2:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Michael Neuling, Peter Zijlstra, Norton, Scott J,
	Linux Kernel Mailing List, Chandramouleeswaran, Aswin,
	George Spelvin, linux-fsdevel, Paul E. McKenney, ppc-dev,
	Ingo Molnar
In-Reply-To: <CA+55aFwi5sZK4mFxs3PT1wU4_vNzsLOFiocqi3g14fPamKXYsw@mail.gmail.com>

On Mon, 2013-09-30 at 17:56 -0700, Linus Torvalds wrote:
> On Mon, Sep 30, 2013 at 5:36 PM, Michael Neuling <mikey@neuling.org> wrote:
> >
> > The scary part is that we to make all register volatile.  You were not
> > that keen on doing this as there are a lot of places in exception
> > entry/exit where we only save/restore a subset of the registers.  We'd
> > need to catch all these.
> 
> Ugh. It's very possible it's not worth using for the kernel then. The
> example I posted is normally fine *without* any transactional support,
> since it's a very local per-dentry lock, and since we only take that
> lock when the last reference drops (so it's not some common directory
> dentry, it's a end-point file dentry). In fact, on ARM they just made
> the cmpxchg much faster by making it entirely non-serializing (since
> it only updates a reference count, there is no locking involved apart
> from checking that the lock state is unlocked)
> 
> So there is basically never any contention, and the transaction needs
> to basically be pretty much the same cost as a "cmpxchg". It's not
> clear if the intel TSX is good enough for that, and if you have to
> save a lot of registers in order to use transactions on POWER8, I
> doubt it's worthwhile.

Well we don't have to, I think Mikey wasn't totally clear about that
"making all registers volatile" business :-) This is just something we
need to handle in assembly if we are going to reclaim the suspended
transaction.

So basically, what we need is something along the lines of
enable_kernel_tm() which checks if there's a suspended user transaction
and if yes, kills/reclaims it.

Then we also need to handle in our interrupt handlers that we have an
active/suspended transaction from a kernel state, which we don't deal
with at this point, and do whatever has to be done to kill it... we
might get away with something simple if we can state that we only allow
kernel transactions at task level and not from interrupt/softirq
contexts, at least initially.

> We have very few - if any - locks where contention or even cache
> bouncing is common or normal. Sure, we have a few particular loads
> that can trigger it, but even that is becoming rare. So from a
> performance standpoint, the target always needs to be "comparable to
> hot spinlock in local cache".

I am not quite familiar with the performance profile of our
transactional hardware. I think we should definitely try to hack
something together for that dput() case and measure it.

> >> They also have interesting ordering semantics vs. locks, we need to be
> >> a tad careful (as long as we don't access a lock variable
> >> transactionally we should be ok. If we do, then spin_unlock needs a
> >> stronger barrier).
> >
> > Yep.
> 
> Well, just about any kernel transaction will at least read the state
> of a lock. Without that, it's generally totally useless. My dput()
> example sequence very much verified that the lock was not held, for
> example.
> 
> I'm not sure how that affects anything. The actual transaction had
> better not be visible inside the locked region (ie as far as any lock
> users go, transactions better all happen fully before or after the
> lock, if they read the lock and see it being unlocked).
> 
> That said, I cannot see how POWER8 could possibly violate that rule.
> The whole "transactions are atomic" is kind of the whole and only
> point of a transaction. So I'm not sure what odd lock restrictions
> POWER8 could have.

Has to do with the memory model :-(

I dug the whole story from my mbox and the situation is indeed as dire
as feared. If the transaction reads the lock, then the corresponding
spin_lock must have a full sync barrier in it instead of the current
lighter one.

Now I believe we are already "on the fence" with our locks today since
technically speaking, our unlock + lock sequence is *not* exactly a full
barrier (it is only if it's the same lock I think)

CC'ing Paul McKenney here who's been chasing that issue. In the end, we
might end up having to turn our locks into sync anyway

Yay ! The isanity^Wjoy of an OO memory model !

> > FWIW eg.
> >
> >      tbegin
> >      beq abort /* passes first time through */
> >      ....
> >      transactional stuff
> >      ....
> >      tend
> >      b pass
> >
> > abort:
> >
> > pass:
> 
> That's fine, and matches the x86 semantics fairly closely, except
> "xbegin" kind of "contains" that "jump to abort address". But we could
> definitely use the same models. Call it
> "transaction_begin/abort/end()", and it should be architecture-neutral
> naming-wise.

Right.

> Of course, if tbegin then acts basically like some crazy
> assembly-level setjmp (I'm guessing it does exactly, and presumably
> precisely that kind of compiler support - ie a function with
> "__attribute((returns_twice))" in gcc-speak), the overhead of doing it
> may kill it.

Well, all the registers are checkpointed so we *should* be ok but gcc
always makes me nervous in those circumstances ...

Ben.


>             Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox