LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: pci and pcie device-tree binding - range No cells
From: Rob Herring @ 2012-12-10 16:02 UTC (permalink / raw)
  To: monstr
  Cc: linux-pci, devicetree-discuss, Thierry Reding, Rob Herring,
	linuxppc-dev
In-Reply-To: <50C601B6.2080107@monstr.eu>

On 12/10/2012 09:37 AM, Michal Simek wrote:
> On 12/10/2012 04:21 PM, Rob Herring wrote:
>> On 12/10/2012 09:05 AM, Michal Simek wrote:
>>> On 12/10/2012 03:26 PM, Rob Herring wrote:
>>>> On 12/10/2012 06:20 AM, Michal Simek wrote:
>>>>> Hi Grant and others,
>>>>>
>>>>> I have a question regarding number of cells in ranges property
>>>>> for pci and pcie nodes.
>>>>>
>>>>> Linux pci/pcie powerpc DTSes contain 7 cells (xpedite5370.dts,
>>>>> sequoia.dts, etc)
>>>>> but also 6 cells format too (mpc832x_mds.dts)
>>>>>
>>>>> Here is shown 6 cells ranges format and describe
>>>>> http://devicetree.org/Device_Tree_Usage#PCI_Host_Bridge
>>>>>
>>>>> And also in documentation in the linux
>>>>> Documentation/devicetree/bindings/pci/83xx-512x-pci.txt
>>>>>
>>>>> Both format uses:
>>>>> #size-cells = <2>;
>>>>> #address-cells = <3>;
>>>>>
>>>>> What is valid format?
>>>>
>>>> Both. 7 cells are valid when the host (parent) bus is 64-bit and 6
>>>> cells
>>>> are valid when the host bus is 32-bit. The ranges property is <<child
>>>> address> <parent address> <size>>. The parent address #address-cells is
>>>> taken from the parent node.
>>>
>>> Ok. Got it.
>>>
>>> Here is what we use on zynq and microblaze - both 32bit which should be
>>> fine.
>>>
>>>      ps7_axi_interconnect_0: axi@0 {
>>>          #address-cells = <1>;
>>>          #size-cells = <1>;
>>>          axi_pcie_0: axi-pcie@50000000 {
>>>              #address-cells = <3>;
>>>              #size-cells = <2>;
>>>              compatible = "xlnx,axi-pcie-1.05.a";
>>>              ranges = < 0x02000000 0 0x60000000 0x60000000 0
>>> 0x10000000 >;
>>>              ...
>>>          }
>>>      }
>>>
>>> What I am wondering is pci_process_bridge_OF_ranges() at
>>> arch/powerpc/kernel/pci-common.c
>>> where there are used some hardcoded values which should be probably
>>> loaded from device-tree.
>>>
>>> For example:
>>> 683         int np = pna + 5;
>>> ...
>>> 702                 pci_addr = of_read_number(ranges + 1, 2);
>>> 703                 cpu_addr = of_translate_address(dev, ranges + 3);
>>> 704                 size = of_read_number(ranges + pna + 3, 2);
>>
>> These would always be correct whether you have 6 or 7 cells. pna is the
>> parent bus address cells size. The pci address is fixed at 3 cells.
> 
> Sorry for my pci ignorance (have never got hw for mb/zynq)
> I just want to get better overview how we should we our drivers to be
> compatible.
> 
> Does it mean that pci is supposed be always 64 bit wide?
> And there is no option to have just 32bit values.

That's what the DT documentation says.

BTW, you can use 2 cells even if you only have a 32-bit bus.

> 
>>> Unfortunately we have copied it to microblaze.
>>
>> I look at the PCI DT code in powerpc and see a whole bunch of code that
>> seems like it should be common. The different per arch pci structs
>> complicates that. No one has really gotten to looking at PCI DT on ARM
>> yet except you and Thierry for Tegra. We definitely don't want to create
>> a 3rd copy. Starting the process of moving it to something like
>> drivers/pci/pci-of.c would be great.
> 
> We have done pcie working on zynq and the same pcie host is working on
> Microblaze too.
> There are just small differences. That's why I have sent another email
> ("Sharing PCIE driver between Microblaze and Arm zynq") to find out
> proper location.

Yes, I've followed the thread. There are lots of areas for consolidation
with PCIe hosts both across architectures and within ARM. There are
multiple people using DW PCIe IP although not upstream yet that I'm
aware of.

Rob

> 
> PCI: Microblaze shares almost the same code with powerpc that's why I am
> definitely open
> to move this code out of architecture.
> 
> Thanks,
> Michal
> 
> 
> 

^ permalink raw reply

* RE: pci and pcie device-tree binding - range No cells
From: David Laight @ 2012-12-10 15:52 UTC (permalink / raw)
  To: monstr, Rob Herring
  Cc: linux-pci, devicetree-discuss, Thierry Reding, linuxppc-dev,
	Rob Herring
In-Reply-To: <50C601B6.2080107@monstr.eu>

> Does it mean that pci is supposed be always 64 bit wide?
> And there is no option to have just 32bit values.

I certainly believe that all PCIe (not PCI) transfers are
nominally multiples of 64bit data.
There are (effectively) 8 byte-lane enables to allow partial word
transfers (I'm not sure whether disjoint subwords are valid).

Some PCIe slave->32bit bus bridges always do two slave cycles.
Even though, for a 32bit word transfer, one of them ends up having
no byte enables asserted.

	David

^ permalink raw reply

* Re: pci and pcie device-tree binding - range No cells
From: Michal Simek @ 2012-12-10 15:37 UTC (permalink / raw)
  To: Rob Herring
  Cc: linux-pci, devicetree-discuss, Thierry Reding, Rob Herring,
	linuxppc-dev
In-Reply-To: <50C5FE0F.3050108@gmail.com>

On 12/10/2012 04:21 PM, Rob Herring wrote:
> On 12/10/2012 09:05 AM, Michal Simek wrote:
>> On 12/10/2012 03:26 PM, Rob Herring wrote:
>>> On 12/10/2012 06:20 AM, Michal Simek wrote:
>>>> Hi Grant and others,
>>>>
>>>> I have a question regarding number of cells in ranges property
>>>> for pci and pcie nodes.
>>>>
>>>> Linux pci/pcie powerpc DTSes contain 7 cells (xpedite5370.dts,
>>>> sequoia.dts, etc)
>>>> but also 6 cells format too (mpc832x_mds.dts)
>>>>
>>>> Here is shown 6 cells ranges format and describe
>>>> http://devicetree.org/Device_Tree_Usage#PCI_Host_Bridge
>>>>
>>>> And also in documentation in the linux
>>>> Documentation/devicetree/bindings/pci/83xx-512x-pci.txt
>>>>
>>>> Both format uses:
>>>> #size-cells = <2>;
>>>> #address-cells = <3>;
>>>>
>>>> What is valid format?
>>>
>>> Both. 7 cells are valid when the host (parent) bus is 64-bit and 6 cells
>>> are valid when the host bus is 32-bit. The ranges property is <<child
>>> address> <parent address> <size>>. The parent address #address-cells is
>>> taken from the parent node.
>>
>> Ok. Got it.
>>
>> Here is what we use on zynq and microblaze - both 32bit which should be
>> fine.
>>
>>      ps7_axi_interconnect_0: axi@0 {
>>          #address-cells = <1>;
>>          #size-cells = <1>;
>>          axi_pcie_0: axi-pcie@50000000 {
>>              #address-cells = <3>;
>>              #size-cells = <2>;
>>              compatible = "xlnx,axi-pcie-1.05.a";
>>              ranges = < 0x02000000 0 0x60000000 0x60000000 0 0x10000000 >;
>>              ...
>>          }
>>      }
>>
>> What I am wondering is pci_process_bridge_OF_ranges() at
>> arch/powerpc/kernel/pci-common.c
>> where there are used some hardcoded values which should be probably
>> loaded from device-tree.
>>
>> For example:
>> 683         int np = pna + 5;
>> ...
>> 702                 pci_addr = of_read_number(ranges + 1, 2);
>> 703                 cpu_addr = of_translate_address(dev, ranges + 3);
>> 704                 size = of_read_number(ranges + pna + 3, 2);
>
> These would always be correct whether you have 6 or 7 cells. pna is the
> parent bus address cells size. The pci address is fixed at 3 cells.

Sorry for my pci ignorance (have never got hw for mb/zynq)
I just want to get better overview how we should we our drivers to be compatible.

Does it mean that pci is supposed be always 64 bit wide?
And there is no option to have just 32bit values.

>> Unfortunately we have copied it to microblaze.
>
> I look at the PCI DT code in powerpc and see a whole bunch of code that
> seems like it should be common. The different per arch pci structs
> complicates that. No one has really gotten to looking at PCI DT on ARM
> yet except you and Thierry for Tegra. We definitely don't want to create
> a 3rd copy. Starting the process of moving it to something like
> drivers/pci/pci-of.c would be great.

We have done pcie working on zynq and the same pcie host is working on Microblaze too.
There are just small differences. That's why I have sent another email
("Sharing PCIE driver between Microblaze and Arm zynq") to find out proper location.

PCI: Microblaze shares almost the same code with powerpc that's why I am definitely open
to move this code out of architecture.

Thanks,
Michal



-- 
Michal Simek, Ing. (M.Eng)
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian

^ permalink raw reply

* Re: pci and pcie device-tree binding - range No cells
From: Rob Herring @ 2012-12-10 15:21 UTC (permalink / raw)
  To: monstr
  Cc: linux-pci, devicetree-discuss, Thierry Reding, Rob Herring,
	linuxppc-dev
In-Reply-To: <50C5FA3E.9030303@monstr.eu>

On 12/10/2012 09:05 AM, Michal Simek wrote:
> On 12/10/2012 03:26 PM, Rob Herring wrote:
>> On 12/10/2012 06:20 AM, Michal Simek wrote:
>>> Hi Grant and others,
>>>
>>> I have a question regarding number of cells in ranges property
>>> for pci and pcie nodes.
>>>
>>> Linux pci/pcie powerpc DTSes contain 7 cells (xpedite5370.dts,
>>> sequoia.dts, etc)
>>> but also 6 cells format too (mpc832x_mds.dts)
>>>
>>> Here is shown 6 cells ranges format and describe
>>> http://devicetree.org/Device_Tree_Usage#PCI_Host_Bridge
>>>
>>> And also in documentation in the linux
>>> Documentation/devicetree/bindings/pci/83xx-512x-pci.txt
>>>
>>> Both format uses:
>>> #size-cells = <2>;
>>> #address-cells = <3>;
>>>
>>> What is valid format?
>>
>> Both. 7 cells are valid when the host (parent) bus is 64-bit and 6 cells
>> are valid when the host bus is 32-bit. The ranges property is <<child
>> address> <parent address> <size>>. The parent address #address-cells is
>> taken from the parent node.
> 
> Ok. Got it.
> 
> Here is what we use on zynq and microblaze - both 32bit which should be
> fine.
> 
>     ps7_axi_interconnect_0: axi@0 {
>         #address-cells = <1>;
>         #size-cells = <1>;
>         axi_pcie_0: axi-pcie@50000000 {
>             #address-cells = <3>;
>             #size-cells = <2>;
>             compatible = "xlnx,axi-pcie-1.05.a";
>             ranges = < 0x02000000 0 0x60000000 0x60000000 0 0x10000000 >;
>             ...
>         }
>     }
> 
> What I am wondering is pci_process_bridge_OF_ranges() at
> arch/powerpc/kernel/pci-common.c
> where there are used some hardcoded values which should be probably
> loaded from device-tree.
> 
> For example:
> 683         int np = pna + 5;
> ...
> 702                 pci_addr = of_read_number(ranges + 1, 2);
> 703                 cpu_addr = of_translate_address(dev, ranges + 3);
> 704                 size = of_read_number(ranges + pna + 3, 2);

These would always be correct whether you have 6 or 7 cells. pna is the
parent bus address cells size. The pci address is fixed at 3 cells.

> 
> 
> Unfortunately we have copied it to microblaze.

I look at the PCI DT code in powerpc and see a whole bunch of code that
seems like it should be common. The different per arch pci structs
complicates that. No one has really gotten to looking at PCI DT on ARM
yet except you and Thierry for Tegra. We definitely don't want to create
a 3rd copy. Starting the process of moving it to something like
drivers/pci/pci-of.c would be great.

Rob

> 
> Thanks,
> Michal
> 
> 
> 
> 
> 
> 

^ permalink raw reply

* Re: pci and pcie device-tree binding - range No cells
From: Michal Simek @ 2012-12-10 15:05 UTC (permalink / raw)
  To: Rob Herring; +Cc: linux-pci, devicetree-discuss, Rob Herring, linuxppc-dev
In-Reply-To: <50C5F11D.9060006@gmail.com>

On 12/10/2012 03:26 PM, Rob Herring wrote:
> On 12/10/2012 06:20 AM, Michal Simek wrote:
>> Hi Grant and others,
>>
>> I have a question regarding number of cells in ranges property
>> for pci and pcie nodes.
>>
>> Linux pci/pcie powerpc DTSes contain 7 cells (xpedite5370.dts,
>> sequoia.dts, etc)
>> but also 6 cells format too (mpc832x_mds.dts)
>>
>> Here is shown 6 cells ranges format and describe
>> http://devicetree.org/Device_Tree_Usage#PCI_Host_Bridge
>>
>> And also in documentation in the linux
>> Documentation/devicetree/bindings/pci/83xx-512x-pci.txt
>>
>> Both format uses:
>> #size-cells = <2>;
>> #address-cells = <3>;
>>
>> What is valid format?
>
> Both. 7 cells are valid when the host (parent) bus is 64-bit and 6 cells
> are valid when the host bus is 32-bit. The ranges property is <<child
> address> <parent address> <size>>. The parent address #address-cells is
> taken from the parent node.

Ok. Got it.

Here is what we use on zynq and microblaze - both 32bit which should be fine.

	ps7_axi_interconnect_0: axi@0 {
		#address-cells = <1>;
		#size-cells = <1>;
		axi_pcie_0: axi-pcie@50000000 {
			#address-cells = <3>;
			#size-cells = <2>;
			compatible = "xlnx,axi-pcie-1.05.a";
			ranges = < 0x02000000 0 0x60000000 0x60000000 0 0x10000000 >;
			...
		}
	}

What I am wondering is pci_process_bridge_OF_ranges() at arch/powerpc/kernel/pci-common.c
where there are used some hardcoded values which should be probably loaded from device-tree.

For example:
683         int np = pna + 5;
...
702                 pci_addr = of_read_number(ranges + 1, 2);
703                 cpu_addr = of_translate_address(dev, ranges + 3);
704                 size = of_read_number(ranges + pna + 3, 2);


Unfortunately we have copied it to microblaze.

Thanks,
Michal






-- 
Michal Simek, Ing. (M.Eng)
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian

^ permalink raw reply

* Re: pci and pcie device-tree binding - range No cells
From: Rob Herring @ 2012-12-10 14:26 UTC (permalink / raw)
  To: monstr; +Cc: linux-pci, devicetree-discuss, Rob Herring, linuxppc-dev
In-Reply-To: <50C5D387.90908@monstr.eu>

On 12/10/2012 06:20 AM, Michal Simek wrote:
> Hi Grant and others,
> 
> I have a question regarding number of cells in ranges property
> for pci and pcie nodes.
> 
> Linux pci/pcie powerpc DTSes contain 7 cells (xpedite5370.dts,
> sequoia.dts, etc)
> but also 6 cells format too (mpc832x_mds.dts)
> 
> Here is shown 6 cells ranges format and describe
> http://devicetree.org/Device_Tree_Usage#PCI_Host_Bridge
> 
> And also in documentation in the linux
> Documentation/devicetree/bindings/pci/83xx-512x-pci.txt
> 
> Both format uses:
> #size-cells = <2>;
> #address-cells = <3>;
> 
> What is valid format?

Both. 7 cells are valid when the host (parent) bus is 64-bit and 6 cells
are valid when the host bus is 32-bit. The ranges property is <<child
address> <parent address> <size>>. The parent address #address-cells is
taken from the parent node.

Rob

^ permalink raw reply

* pci and pcie device-tree binding - range No cells
From: Michal Simek @ 2012-12-10 12:20 UTC (permalink / raw)
  To: devicetree-discuss, Grant Likely, linux-pci, Rob Herring,
	linuxppc-dev, Benjamin Herrenschmidt

Hi Grant and others,

I have a question regarding number of cells in ranges property
for pci and pcie nodes.

Linux pci/pcie powerpc DTSes contain 7 cells (xpedite5370.dts, sequoia.dts, etc)
but also 6 cells format too (mpc832x_mds.dts)

Here is shown 6 cells ranges format and describe
http://devicetree.org/Device_Tree_Usage#PCI_Host_Bridge

And also in documentation in the linux
Documentation/devicetree/bindings/pci/83xx-512x-pci.txt

Both format uses:
#size-cells = <2>;
#address-cells = <3>;

What is valid format?

Thanks,
Michal

-- 
Michal Simek, Ing. (M.Eng)
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian

^ permalink raw reply

* Re: [PATCH v2 2/4] powerpc: Move the single step enable code to a generic path
From: Ananth N Mavinakayanahalli @ 2012-12-10 10:34 UTC (permalink / raw)
  To: Suzuki K. Poulose
  Cc: srikar, peterz, bigeasy, oleg, linuxppc-dev, anton, mingo,
	linux-kernel
In-Reply-To: <20121203150754.7727.42629.stgit@suzukikp>

On Mon, Dec 03, 2012 at 08:38:37PM +0530, Suzuki K. Poulose wrote:
> From: Suzuki K. Poulose <suzuki@in.ibm.com>
> 
> This patch moves the single step enable code used by kprobe to a generic
> routine header so that, it can be re-used by other code, in this case,
> uprobes. No functional changes.
> 
> Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com>
> Cc:	Ananth N Mavinakaynahalli <ananth@in.ibm.com>
> Cc:	Kumar Gala <galak@kernel.crashing.org>
> Cc:	linuxppc-dev@ozlabs.org

Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>

^ permalink raw reply

* RE: [PATCH 3/4 v5] iommu/fsl: Add iommu domain attributes required by fsl PAMU driver.
From: Sethi Varun-B16395 @ 2012-12-10 10:10 UTC (permalink / raw)
  To: Wood Scott-B07421
  Cc: Tabi Timur-B04825, Joerg Roedel, linuxppc-dev@lists.ozlabs.org,
	linux-kernel@vger.kernel.org, iommu@lists.linux-foundation.org
In-Reply-To: <1354645371.16710.15@snotra>



> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Tuesday, December 04, 2012 11:53 PM
> To: Sethi Varun-B16395
> Cc: Wood Scott-B07421; Joerg Roedel; linux-kernel@vger.kernel.org;
> iommu@lists.linux-foundation.org; linuxppc-dev@lists.ozlabs.org; Tabi
> Timur-B04825
> Subject: Re: [PATCH 3/4 v5] iommu/fsl: Add iommu domain attributes
> required by fsl PAMU driver.
>=20
> On 12/04/2012 05:53:33 AM, Sethi Varun-B16395 wrote:
> >
> >
> > > -----Original Message-----
> > > From: Wood Scott-B07421
> > > Sent: Monday, December 03, 2012 10:34 PM
> > > To: Sethi Varun-B16395
> > > Cc: Joerg Roedel; linux-kernel@vger.kernel.org; iommu@lists.linux-
> > > foundation.org; Wood Scott-B07421; linuxppc-dev@lists.ozlabs.org;
> > Tabi
> > > Timur-B04825
> > > Subject: Re: [PATCH 3/4 v5] iommu/fsl: Add iommu domain attributes
> > > required by fsl PAMU driver.
> > >
> > > On 12/03/2012 10:57:29 AM, Sethi Varun-B16395 wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: iommu-bounces@lists.linux-foundation.org [mailto:iommu-
> > > > > bounces@lists.linux-foundation.org] On Behalf Of Joerg Roedel
> > > > > Sent: Sunday, December 02, 2012 7:33 PM
> > > > > To: Sethi Varun-B16395
> > > > > Cc: linux-kernel@vger.kernel.org;
> > iommu@lists.linux-foundation.org;
> > > > Wood
> > > > > Scott-B07421; linuxppc-dev@lists.ozlabs.org; Tabi Timur-B04825
> > > > > Subject: Re: [PATCH 3/4 v5] iommu/fsl: Add iommu domain
> > attributes
> > > > > required by fsl PAMU driver.
> > > > >
> > > > > Hmm, we need to work out a good abstraction for this.
> > > > >
> > > > > On Tue, Nov 20, 2012 at 07:24:56PM +0530, Varun Sethi wrote:
> > > > > > Added the following domain attributes required by FSL PAMU
> > driver:
> > > > > > 1. Subwindows field added to the iommu domain geometry
> > attribute.
> > > > >
> > > > > Are the Subwindows mapped with full size or do you map only
> > parts
> > > > of the
> > > > > subwindows?
> > > > >
> > > > [Sethi Varun-B16395] It's possible to map a part of the subwindow
> > i.e.
> > > > size of the mapping can be less than the sub window size.
> > > >
> > > > > > +	 * This attribute indicates number of DMA subwindows
> > > supported
> > > > by
> > > > > > +	 * the geometry. If there is a single window that maps
> > the
> > > > entire
> > > > > > +	 * geometry, attribute must be set to "1". A value of
> > "0"
> > > > implies
> > > > > > +	 * that this mechanism is not used at all(normal paging
> > is
> > > > used).
> > > > > > +	 * Value other than* "0" or "1" indicates the actual
> > number
> > > of
> > > > > > +	 * subwindows.
> > > > > > +	 */
> > > > >
> > > > > This semantic is ugly, how about a feature detection mechanism?
> > > > >
> > > > [Sethi Varun-B16395] A feature mechanism to query the type of
> > IOMMU?
> > >
> > > A feature mechanism to determine whether this subwindow mechanism is
> > > available, and what the limits are.
> > >
> > So, we use the IOMMU capability interface to find out if IOMMU
> > supports sub windows or not, right? But still number of sub windows
> > would be specified as a part of the geometry and the valid value for
> > sub windows would  0,1 or actual number of sub windows.
>=20
> How does a user of the interface find out what values are possible for
> the "actual number of subwindows"?  How does a user of the interface find
> out whether there are any limitations on specifying a value of zero (in
> the case of PAMU, that would be a maximum 1 MiB naturally-aligned
> aperture to support arbitrary 4KiB mappings)?
How about if we say that the default value for subwindows is zero and this =
what you get when you read the geometry (iommu_get_attr) after initializing=
 the domain? In that case the user would know that implication of setting s=
ubwindows to zero with respect to the aperture size. Also, should we introd=
uce an additional API like "iommu_check_attr", which the user can use to va=
lidate the attribute value. For example in case of geometry, the user can f=
ill up the structure and pass it to the iommu driver in order to verify the=
 aperture and subwindows field.

-Varun

^ permalink raw reply

* Re: [RFC] Add IBM Blue Gene/Q Platform
From: Michael Neuling @ 2012-12-10  6:06 UTC (permalink / raw)
  To: Jimi Xenidis
  Cc: Andrew T Tauferner, Kumar Gala, Jay Bryant, Josh Boyer,
	Todd Inglett, linuxppc-dev
In-Reply-To: <FAE4A73A-BDC0-43E3-A416-E8F0F161C5B4@pobox.com>

Jimi Xenidis <jimix@pobox.com> wrote:

> 
> 
> On Dec 9, 2012, at 6:47 PM, Michael Neuling <mikey@neuling.org> wrote:
> 
> > Jimi Xenidis <jimix@pobox.com> wrote:
> > 
> >> 
> >> On Dec 7, 2012, at 7:38 AM, Jimi Xenidis <jimix@pobox.com> wrote:
> >> 
> >>> 
> >>> On Dec 6, 2012, at 11:54 PM, Michael Neuling <mikey@neuling.org> wrote:
> >>> 
> >>>>> commit 279c0615917b959a652e81f4ad0d886e2d426d85
> >>>>> Author: Jimi Xenidis <jimix@pobox.com>
> >>>>> Date:   Wed Dec 5 13:43:22 2012 -0500
> >>>>> 
> >>>>>  powerpc/book3e: IBM Blue Gene/Q Quad Processing eXtention (QPX)
> >>>>> 
> >>>>>  This enables kernel support for the QPX extention and is intended for
> >>>>>  processors that support it, usually an IBM Blue Gene processor.
> >>>>>  Turning it on does not effect other processors but it does add code
> >>>>>  and will quadruple the per thread save and restore area for the FPU
> >>>>>  (hense the name).  If you have enabled VSX it will only double the
> >>>>>  space.
> >>>>> 
> >>>>>  Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> <<snip>>
> >>>> 
> >>>> 
> >>>> 
> >>>> +BEGIN_FTR_SECTION                            \
> >>>> +    SAVE_32VSRS(n,c,base);                        \
> >>>> +END_FTR_SECTION_IFSET(CPU_FTR_VSX);                    \
> >>>> +BEGIN_FTR_SECTION                            \
> >>>> +    SAVE_32QRS(n,c,base);                        \
> >>>> +END_FTR_SECTION_IFSET(CPU_FTR_QPX);    
> >>>> 
> >>>> I don't think we want to do this.  We are going to end up with 64
> >>>> NOPS here somewhere.
> >>> 
> >>> Excellent point, NOPs are cheap on most processors but not A2 and a lot of embedded, I can wrap some branches with the FTR instead.
> >>> Do you have a concern on the code size?
> >> 
> >> Thought about it a bit and came up with this solution for arch/powerpc/kernel/fpu.S.
> >> This should address the following issues
> >> - MSR_VSX vs MSR_VEC
> >> - Big chunks of NOPs in the code path
> >> - Less FTR space fixups at boot time.
> >> - IMNHSO easier to read especially when disassembled
> > 
> > Indeed, I think it looks better.  I was going to mention that it was
> > already pretty complex to read, so a rewrite like this was probably
> > needed.  So thanks!!
> > 
> > That being said, there is a pretty complex testing matrix of
> > CONFIG_VSX/VMX/FPU/QPX/SMP/64/32 CPU_FTR/VSX/FPU/QPX/VMX so I'd need to
> > look/test more carefully to make sure all of these are covered.
> > 
> > Also, transactional memory (see
> > http://lists.ozlabs.org/pipermail/linuxppc-dev/2012-November/102216.html)
> > will change this code.  You should rebase on top of that if you really
> > want it considered for upstream.
> 
> Is this in a git tree anywhere? perhaps BenH's next branch?

It's not in benh's tree as yet (so could change).

Grab the power8 branch here git://github.com/mikey/linux.git
or checkout https://github.com/mikey/linux/commits/power8.

It's about 20 patches on top of benh's next tree.  Also includes the
power8 doorbell changes from Ian.

Mikey

^ permalink raw reply

* Re: [RFC] Add IBM Blue Gene/Q Platform
From: Jimi Xenidis @ 2012-12-10  5:56 UTC (permalink / raw)
  To: Michael Neuling
  Cc: Andrew T Tauferner, Kumar Gala, Jay Bryant, Josh Boyer,
	Todd Inglett, linuxppc-dev
In-Reply-To: <3051.1355100425@neuling.org>



On Dec 9, 2012, at 6:47 PM, Michael Neuling <mikey@neuling.org> wrote:

> Jimi Xenidis <jimix@pobox.com> wrote:
>=20
>>=20
>> On Dec 7, 2012, at 7:38 AM, Jimi Xenidis <jimix@pobox.com> wrote:
>>=20
>>>=20
>>> On Dec 6, 2012, at 11:54 PM, Michael Neuling <mikey@neuling.org> wrote:
>>>=20
>>>>> commit 279c0615917b959a652e81f4ad0d886e2d426d85
>>>>> Author: Jimi Xenidis <jimix@pobox.com>
>>>>> Date:   Wed Dec 5 13:43:22 2012 -0500
>>>>>=20
>>>>>  powerpc/book3e: IBM Blue Gene/Q Quad Processing eXtention (QPX)
>>>>>=20
>>>>>  This enables kernel support for the QPX extention and is intended for=

>>>>>  processors that support it, usually an IBM Blue Gene processor.
>>>>>  Turning it on does not effect other processors but it does add code
>>>>>  and will quadruple the per thread save and restore area for the FPU
>>>>>  (hense the name).  If you have enabled VSX it will only double the
>>>>>  space.
>>>>>=20
>>>>>  Signed-off-by: Jimi Xenidis <jimix@pobox.com>
>>=20
>> <<snip>>
>>>>=20
>>>>=20
>>>>=20
>>>> +BEGIN_FTR_SECTION                            \
>>>> +    SAVE_32VSRS(n,c,base);                        \
>>>> +END_FTR_SECTION_IFSET(CPU_FTR_VSX);                    \
>>>> +BEGIN_FTR_SECTION                            \
>>>> +    SAVE_32QRS(n,c,base);                        \
>>>> +END_FTR_SECTION_IFSET(CPU_FTR_QPX);   =20
>>>>=20
>>>> I don't think we want to do this.  We are going to end up with 64
>>>> NOPS here somewhere.
>>>=20
>>> Excellent point, NOPs are cheap on most processors but not A2 and a lot o=
f embedded, I can wrap some branches with the FTR instead.
>>> Do you have a concern on the code size?
>>=20
>> Thought about it a bit and came up with this solution for arch/powerpc/ke=
rnel/fpu.S.
>> This should address the following issues
>> - MSR_VSX vs MSR_VEC
>> - Big chunks of NOPs in the code path
>> - Less FTR space fixups at boot time.
>> - IMNHSO easier to read especially when disassembled
>=20
> Indeed, I think it looks better.  I was going to mention that it was
> already pretty complex to read, so a rewrite like this was probably
> needed.  So thanks!!
>=20
> That being said, there is a pretty complex testing matrix of
> CONFIG_VSX/VMX/FPU/QPX/SMP/64/32 CPU_FTR/VSX/FPU/QPX/VMX so I'd need to
> look/test more carefully to make sure all of these are covered.
>=20
> Also, transactional memory (see
> http://lists.ozlabs.org/pipermail/linuxppc-dev/2012-November/102216.html)
> will change this code.  You should rebase on top of that if you really
> want it considered for upstream.

Is this in a git tree anywhere? perhaps BenH's next branch?
-jx

>=20
> Mikey
>=20
>>=20
>> I did consider using the LR and BLR, but the !CONFIG_SMP case only adds o=
ne more special block and uses a different register set.
>> Also if this is agreeable I would like us to consider removing the *_32FP=
VSRS* macros entirely and put the FTR tests in the actual code.
>> This would allow us to use #ifdefs and reduce the amount of code that act=
ually gets compiled.
>>=20
>> Thoughts?
>>=20
>> diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
>> index e0ada05..5964067 100644
>> --- a/arch/powerpc/kernel/fpu.S
>> +++ b/arch/powerpc/kernel/fpu.S
>> @@ -25,30 +25,81 @@
>> #include <asm/asm-offsets.h>
>> #include <asm/ptrace.h>
>>=20
>> -#ifdef CONFIG_VSX
>> -#define __REST_32FPVSRS(n,c,base)                    \
>> -BEGIN_FTR_SECTION                            \
>> -    b    2f;                            \
>> -END_FTR_SECTION_IFSET(CPU_FTR_VSX);                    \
>> -    REST_32FPRS(n,base);                        \
>> -    b    3f;                            \
>> -2:    REST_32VSRS(n,c,base);                        \
>> -3:
>> -
>> -#define __SAVE_32FPVSRS(n,c,base)                    \
>> -BEGIN_FTR_SECTION                            \
>> -    b    2f;                            \
>> -END_FTR_SECTION_IFSET(CPU_FTR_VSX);                    \
>> -    SAVE_32FPRS(n,base);                        \
>> -    b    3f;                            \
>> -2:    SAVE_32VSRS(n,c,base);                        \
>> -3:
>> -#else
>> -#define __REST_32FPVSRS(n,b,base)    REST_32FPRS(n, base)
>> -#define __SAVE_32FPVSRS(n,b,base)    SAVE_32FPRS(n, base)
>> -#endif
>> -#define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base=
)
>> -#define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base=
)
>> +
>> +/*
>> + * Restore subroutines, R4 is scratch and R5 is base
>> + */
>> +vsx_restore:
>> +    REST_32VSRS(0, __REG_R4, __REG_R5)
>> +    b after_restore
>> +qpx_restore:
>> +    REST_32QRS(0, __REG_R4, __REG_R5)
>> +    b after_restore
>> +fpu_restore:
>> +    REST_32FPRS(0, __REG_R5)
>> +    b after_restore
>> +
>> +#define REST_32FPVSRS(n, c, base)        \
>> +BEGIN_FTR_SECTION                \
>> +    b vsx_restore;                \
>> +END_FTR_SECTION_IFSET(CPU_FTR_VSX)        \
>> +BEGIN_FTR_SECTION                \
>> +    b qpx_restore;                \
>> +END_FTR_SECTION_IFSET(CPU_FTR_QPX)        \
>> +    b fpu_restore;                \
>> +after_restore:
>> +
>> +/*
>> + * Save subroutines, R4 is scratch and R3 is base
>> + */
>> +vsx_save:
>> +    SAVE_32VSRS(0, __REG_R4, __REG_R3)
>> +    b after_save
>> +qpx_save:
>> +    SAVE_32QRS(0, __REG_R4, __REG_R3)
>> +    b after_save
>> +fpu_save:
>> +    SAVE_32FPRS(0, __REG_R3)
>> +    b after_save
>> +
>> +#define SAVE_32FPVSRS(n, c, base)        \
>> +BEGIN_FTR_SECTION                \
>> +    b vsx_save;                \
>> +END_FTR_SECTION_IFSET(CPU_FTR_VSX)        \
>> +BEGIN_FTR_SECTION                \
>> +    b qpx_save;                \
>> +END_FTR_SECTION_IFSET(CPU_FTR_QPX)        \
>> +    b fpu_save;                \
>> +after_save:
>> +
>> +#ifndef CONFIG_SMP
>> +/*
>> + * we need an extra save set for the !CONFIG_SMP case, see below
>> + * Scratch it R5 and base is R4
>> + */
>> +vsx_save_nosmp:
>> +    SAVE_32VSRS(0,R5,R4)
>> +    b after_save_nosmp
>> +
>> +qpx_save_nosmp:
>> +    SAVE_32QRS(0,R5,R4)
>> +    b after_save_nosmp
>> +
>> +fpu_save_nosmp:
>> +    SAVE_32FPRS(0,R4)
>> +    b after_save_nosmp
>> +
>> +#define SAVE_32FPVSRS_NOSMP(n,c,base)        \
>> +BEGIN_FTR_SECTION                \
>> +    b vsx_save_nosmp;            \
>> +END_FTR_SECTION_IFSET(CPU_FTR_VSX)        \
>> +BEGIN_FTR_SECTION                \
>> +    b qpx_save_nosmp;            \
>> +END_FTR_SECTION_IFSET(CPU_FTR_QPX)        \
>> +    b fpu_save_nosmp;            \
>> +after_save_nosmp:
>> +
>> +#endif /* !CONFIG_SMP */
>>=20
>> /*
>>  * This task wants to use the FPU now.
>> @@ -65,6 +116,11 @@ BEGIN_FTR_SECTION
>>    oris    r5,r5,MSR_VSX@h
>> END_FTR_SECTION_IFSET(CPU_FTR_VSX)
>> #endif
>> +#ifdef CONFIG_PPC_QPX
>> +BEGIN_FTR_SECTION
>> +    oris    r5,r5,MSR_VEC@h
>> +END_FTR_SECTION_IFSET(CPU_FTR_QPX)
>> +#endif
>>    SYNC
>>    MTMSRD(r5)            /* enable use of fpu now */
>>    isync
>> @@ -81,7 +137,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
>>    beq    1f
>>    toreal(r4)
>>    addi    r4,r4,THREAD        /* want last_task_used_math->thread */
>> -    SAVE_32FPVSRS(0, R5, R4)
>> +    SAVE_32FPVSRS_NOSMP(0, R5, R4)
>>    mffs    fr0
>>    stfd    fr0,THREAD_FPSCR(r4)
>>    PPC_LL    r5,PT_REGS(r4)
>> @@ -94,7 +150,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
>> #endif /* CONFIG_SMP */
>>    /* enable use of FP after return */
>> #ifdef CONFIG_PPC32
>> -    mfspr    r5,SPRN_SPRG_THREAD        /* current task's THREAD (phys) *=
/
>> +    mfspr    r5,SPRN_SPRG_THREAD    /* current task's THREAD (phys) */
>>    lwz    r4,THREAD_FPEXC_MODE(r5)
>>    ori    r9,r9,MSR_FP        /* enable FP for current */
>>    or    r9,r9,r4
>> @@ -132,6 +188,11 @@ BEGIN_FTR_SECTION
>>    oris    r5,r5,MSR_VSX@h
>> END_FTR_SECTION_IFSET(CPU_FTR_VSX)
>> #endif
>> +#ifdef CONFIG_PPC_QPX
>> +BEGIN_FTR_SECTION
>> +    oris    r5,r5,MSR_VEC@h
>> +END_FTR_SECTION_IFSET(CPU_FTR_QPX)
>> +#endif
>>    SYNC_601
>>    ISYNC_601
>>    MTMSRD(r5)            /* enable use of fpu now */
>> @@ -148,10 +209,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
>>    beq    1f
>>    PPC_LL    r4,_MSR-STACK_FRAME_OVERHEAD(r5)
>>    li    r3,MSR_FP|MSR_FE0|MSR_FE1
>> -#ifdef CONFIG_VSX
>> +#if defined(CONFIG_VSX) || defined(CONFIG_PPC_QPX)
>> BEGIN_FTR_SECTION
>>    oris    r3,r3,MSR_VSX@h
>> -END_FTR_SECTION_IFSET(CPU_FTR_VSX)
>> +END_FTR_SECTION_IFSET(CPU_FTR_VSX | CPU_FTR_QPX)
>> #endif
>>    andc    r4,r4,r3        /* disable FP for previous task */
>>    PPC_STL    r4,_MSR-STACK_FRAME_OVERHEAD(r5)
>>=20
>> -jx
>>=20
>>=20
>>>=20
>>>>=20
>>>> I'd like to see this patch broken into different parts.
>>>=20
>>> I'm not sure how _this_ patch:
>>> <https://github.com/jimix/linux-bgq/commit/279c0615917b959a652e81f4ad0d8=
86e2d426d85>
>>> could be broken up, please advise.
>>>=20
>>>>=20
>>>> Also, have you boot tested this change on a VSX enabled box?
>>>=20
>>> I can try, I may bug you for help.
>>> Is there a commonly test (or apps) I should run?
>>>=20
>>> -jx
>>>=20
>>>=20
>>>>=20
>>>> Mikey
>>=20

^ permalink raw reply

* Re: [RFC] Add IBM Blue Gene/Q Platform
From: Michael Neuling @ 2012-12-10  0:47 UTC (permalink / raw)
  To: Jimi Xenidis
  Cc: Andrew T Tauferner, Kumar Gala, Jay Bryant, Josh Boyer,
	Todd Inglett, linuxppc-dev
In-Reply-To: <1ECB6402-9BC0-4468-B405-A6F6E971486B@pobox.com>

Jimi Xenidis <jimix@pobox.com> wrote:

> 
> On Dec 7, 2012, at 7:38 AM, Jimi Xenidis <jimix@pobox.com> wrote:
> 
> > 
> > On Dec 6, 2012, at 11:54 PM, Michael Neuling <mikey@neuling.org> wrote:
> > 
> >>> commit 279c0615917b959a652e81f4ad0d886e2d426d85
> >>> Author: Jimi Xenidis <jimix@pobox.com>
> >>> Date:   Wed Dec 5 13:43:22 2012 -0500
> >>> 
> >>>   powerpc/book3e: IBM Blue Gene/Q Quad Processing eXtention (QPX)
> >>> 
> >>>   This enables kernel support for the QPX extention and is intended for
> >>>   processors that support it, usually an IBM Blue Gene processor.
> >>>   Turning it on does not effect other processors but it does add code
> >>>   and will quadruple the per thread save and restore area for the FPU
> >>>   (hense the name).  If you have enabled VSX it will only double the
> >>>   space.
> >>> 
> >>>   Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> 
> <<snip>>
> >> 
> >> 
> >> 
> >> +BEGIN_FTR_SECTION							\
> >> +	SAVE_32VSRS(n,c,base);						\
> >> +END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
> >> +BEGIN_FTR_SECTION							\
> >> +	SAVE_32QRS(n,c,base);						\
> >> +END_FTR_SECTION_IFSET(CPU_FTR_QPX);	
> >> 
> >> I don't think we want to do this.  We are going to end up with 64
> >> NOPS here somewhere.
> > 
> > Excellent point, NOPs are cheap on most processors but not A2 and a lot of embedded, I can wrap some branches with the FTR instead.
> > Do you have a concern on the code size?
> 
> Thought about it a bit and came up with this solution for arch/powerpc/kernel/fpu.S.
> This should address the following issues
>  - MSR_VSX vs MSR_VEC
>  - Big chunks of NOPs in the code path
>  - Less FTR space fixups at boot time.
>  - IMNHSO easier to read especially when disassembled

Indeed, I think it looks better.  I was going to mention that it was
already pretty complex to read, so a rewrite like this was probably
needed.  So thanks!!

That being said, there is a pretty complex testing matrix of
CONFIG_VSX/VMX/FPU/QPX/SMP/64/32 CPU_FTR/VSX/FPU/QPX/VMX so I'd need to
look/test more carefully to make sure all of these are covered.

Also, transactional memory (see
http://lists.ozlabs.org/pipermail/linuxppc-dev/2012-November/102216.html)
will change this code.  You should rebase on top of that if you really
want it considered for upstream.

Mikey

> 
> I did consider using the LR and BLR, but the !CONFIG_SMP case only adds one more special block and uses a different register set.
> Also if this is agreeable I would like us to consider removing the *_32FPVSRS* macros entirely and put the FTR tests in the actual code.
> This would allow us to use #ifdefs and reduce the amount of code that actually gets compiled.
> 
> Thoughts?
> 
> diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
> index e0ada05..5964067 100644
> --- a/arch/powerpc/kernel/fpu.S
> +++ b/arch/powerpc/kernel/fpu.S
> @@ -25,30 +25,81 @@
>  #include <asm/asm-offsets.h>
>  #include <asm/ptrace.h>
>  
> -#ifdef CONFIG_VSX
> -#define __REST_32FPVSRS(n,c,base)					\
> -BEGIN_FTR_SECTION							\
> -	b	2f;							\
> -END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
> -	REST_32FPRS(n,base);						\
> -	b	3f;							\
> -2:	REST_32VSRS(n,c,base);						\
> -3:
> -
> -#define __SAVE_32FPVSRS(n,c,base)					\
> -BEGIN_FTR_SECTION							\
> -	b	2f;							\
> -END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
> -	SAVE_32FPRS(n,base);						\
> -	b	3f;							\
> -2:	SAVE_32VSRS(n,c,base);						\
> -3:
> -#else
> -#define __REST_32FPVSRS(n,b,base)	REST_32FPRS(n, base)
> -#define __SAVE_32FPVSRS(n,b,base)	SAVE_32FPRS(n, base)
> -#endif
> -#define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base)
> -#define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base)
> +
> +/*
> + * Restore subroutines, R4 is scratch and R5 is base
> + */
> +vsx_restore:
> +	REST_32VSRS(0, __REG_R4, __REG_R5)
> +	b after_restore
> +qpx_restore:
> +	REST_32QRS(0, __REG_R4, __REG_R5)
> +	b after_restore
> +fpu_restore:
> +	REST_32FPRS(0, __REG_R5)
> +	b after_restore
> +
> +#define REST_32FPVSRS(n, c, base)		\
> +BEGIN_FTR_SECTION				\
> +	b vsx_restore;				\
> +END_FTR_SECTION_IFSET(CPU_FTR_VSX)		\
> +BEGIN_FTR_SECTION				\
> +	b qpx_restore;				\
> +END_FTR_SECTION_IFSET(CPU_FTR_QPX)		\
> +	b fpu_restore;				\
> +after_restore:
> +
> +/*
> + * Save subroutines, R4 is scratch and R3 is base
> + */
> +vsx_save:
> +	SAVE_32VSRS(0, __REG_R4, __REG_R3)
> +	b after_save
> +qpx_save:
> +	SAVE_32QRS(0, __REG_R4, __REG_R3)
> +	b after_save
> +fpu_save:
> +	SAVE_32FPRS(0, __REG_R3)
> +	b after_save
> +
> +#define SAVE_32FPVSRS(n, c, base)		\
> +BEGIN_FTR_SECTION				\
> +	b vsx_save;				\
> +END_FTR_SECTION_IFSET(CPU_FTR_VSX)		\
> +BEGIN_FTR_SECTION				\
> +	b qpx_save;				\
> +END_FTR_SECTION_IFSET(CPU_FTR_QPX)		\
> +	b fpu_save;				\
> +after_save:
> +
> +#ifndef CONFIG_SMP
> +/*
> + * we need an extra save set for the !CONFIG_SMP case, see below
> + * Scratch it R5 and base is R4
> + */
> +vsx_save_nosmp:
> +	SAVE_32VSRS(0,R5,R4)
> +	b after_save_nosmp
> +
> +qpx_save_nosmp:
> +	SAVE_32QRS(0,R5,R4)
> +	b after_save_nosmp
> +
> +fpu_save_nosmp:
> +	SAVE_32FPRS(0,R4)
> +	b after_save_nosmp
> +
> +#define SAVE_32FPVSRS_NOSMP(n,c,base)		\
> +BEGIN_FTR_SECTION				\
> +	b vsx_save_nosmp;			\
> +END_FTR_SECTION_IFSET(CPU_FTR_VSX)		\
> +BEGIN_FTR_SECTION				\
> +	b qpx_save_nosmp;			\
> +END_FTR_SECTION_IFSET(CPU_FTR_QPX)		\
> +	b fpu_save_nosmp;			\
> +after_save_nosmp:
> +
> +#endif /* !CONFIG_SMP */
>  
>  /*
>   * This task wants to use the FPU now.
> @@ -65,6 +116,11 @@ BEGIN_FTR_SECTION
>  	oris	r5,r5,MSR_VSX@h
>  END_FTR_SECTION_IFSET(CPU_FTR_VSX)
>  #endif
> +#ifdef CONFIG_PPC_QPX
> +BEGIN_FTR_SECTION
> +	oris	r5,r5,MSR_VEC@h
> +END_FTR_SECTION_IFSET(CPU_FTR_QPX)
> +#endif
>  	SYNC
>  	MTMSRD(r5)			/* enable use of fpu now */
>  	isync
> @@ -81,7 +137,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
>  	beq	1f
>  	toreal(r4)
>  	addi	r4,r4,THREAD		/* want last_task_used_math->thread */
> -	SAVE_32FPVSRS(0, R5, R4)
> +	SAVE_32FPVSRS_NOSMP(0, R5, R4)
>  	mffs	fr0
>  	stfd	fr0,THREAD_FPSCR(r4)
>  	PPC_LL	r5,PT_REGS(r4)
> @@ -94,7 +150,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
>  #endif /* CONFIG_SMP */
>  	/* enable use of FP after return */
>  #ifdef CONFIG_PPC32
> -	mfspr	r5,SPRN_SPRG_THREAD		/* current task's THREAD (phys) */
> +	mfspr	r5,SPRN_SPRG_THREAD	/* current task's THREAD (phys) */
>  	lwz	r4,THREAD_FPEXC_MODE(r5)
>  	ori	r9,r9,MSR_FP		/* enable FP for current */
>  	or	r9,r9,r4
> @@ -132,6 +188,11 @@ BEGIN_FTR_SECTION
>  	oris	r5,r5,MSR_VSX@h
>  END_FTR_SECTION_IFSET(CPU_FTR_VSX)
>  #endif
> +#ifdef CONFIG_PPC_QPX
> +BEGIN_FTR_SECTION
> +	oris	r5,r5,MSR_VEC@h
> +END_FTR_SECTION_IFSET(CPU_FTR_QPX)
> +#endif
>  	SYNC_601
>  	ISYNC_601
>  	MTMSRD(r5)			/* enable use of fpu now */
> @@ -148,10 +209,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
>  	beq	1f
>  	PPC_LL	r4,_MSR-STACK_FRAME_OVERHEAD(r5)
>  	li	r3,MSR_FP|MSR_FE0|MSR_FE1
> -#ifdef CONFIG_VSX
> +#if defined(CONFIG_VSX) || defined(CONFIG_PPC_QPX)
>  BEGIN_FTR_SECTION
>  	oris	r3,r3,MSR_VSX@h
> -END_FTR_SECTION_IFSET(CPU_FTR_VSX)
> +END_FTR_SECTION_IFSET(CPU_FTR_VSX | CPU_FTR_QPX)
>  #endif
>  	andc	r4,r4,r3		/* disable FP for previous task */
>  	PPC_STL	r4,_MSR-STACK_FRAME_OVERHEAD(r5)
> 
> -jx
> 
> 
> > 
> >> 
> >> I'd like to see this patch broken into different parts.
> > 
> > I'm not sure how _this_ patch:
> >  <https://github.com/jimix/linux-bgq/commit/279c0615917b959a652e81f4ad0d886e2d426d85>
> > could be broken up, please advise.
> > 
> >> 
> >> Also, have you boot tested this change on a VSX enabled box?
> > 
> > I can try, I may bug you for help.
> > Is there a commonly test (or apps) I should run?
> > 
> > -jx
> > 
> > 
> >> 
> >> Mikey
> > 
> 

^ permalink raw reply

* Re: [RFC] Add IBM Blue Gene/Q Platform
From: Michael Neuling @ 2012-12-10  0:26 UTC (permalink / raw)
  To: Jimi Xenidis
  Cc: Andrew T Tauferner, Kumar Gala, Jay Bryant, Josh Boyer,
	Todd Inglett, linuxppc-dev
In-Reply-To: <644E8572-68F8-4C31-AA9D-96EBC95BC37F@pobox.com>

Jimi Xenidis <jimix@pobox.com> wrote:

> 
> On Dec 6, 2012, at 11:56 PM, Michael Neuling <mikey@neuling.org> wrote:
> 
> > Jimi Xenidis <jimix@pobox.com> wrote:
> > 
> >> Rather than flood the mailing list with the patches, I've arranged for a git repo to hold the changesets.
> >> You can find the repo here:
> >>  <https://github.com/jimix/linux-bgq>
> >> 
> >> They are against GregKH's linux-stable.git long-term 3.4.y (y=22) branch.
> >> The first 9 (6e58088f..) effect common code and the rest are BGQ specific.
> > 
> > Do you actually want this upstream?  I assume no.
> 
> I needed to get these long-term patches out there for the BGQ
> community for test.  I would very much like to get a version of these
> upstream.

Ok, cool.  

> I expect only the QPX, kexec, and (maybe) the DCR changes to cause any controversy, but I've been wrong before.

>From my perspective, the QPX stuff should be fine to get upstream once
we beat the patches into shape.  No comment on kexec and DCR.  

Mikey

> I'll be making those patches soon and hope to get a lot of feedback
> from these patches.  -jx
> 
> 
> > 
> > Mikey
> > 
> >> 
> >> Here is a are the summary logs:
> >> 
> >> $ git log --reverse linux-stable/linux-3.4.y..
> >> commit 5a8edb2bdd914597693eed299119ff4c2e6d31f2
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Fri Nov 9 09:26:00 2012 -0600
> >> 
> >>    powerpc: Fix cputable #ifdef where CONFIG_PPC_A2 is used for CONFIG_PPC_BOOK3E_64
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit ea51920d7035c8d23801d6de46261e7d0a537dfd
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Fri Nov 9 08:58:27 2012 -0600
> >> 
> >>    powerpc/book3e: Remove config for PPC_A2_DD2 since there is no reference to it
> >> 
> >>    This must have been leftover from early DD1 days which is not
> >>    present in any current kernel code.
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit 08151401a5db4ff0d441a1b7bf8ad92bd92b14c5
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Mon Nov 5 09:38:01 2012 -0600
> >> 
> >>    powerpc/dcr: Some native DCR fixes
> >> 
> >>    The following fixes have been made:
> >>     - dcr_read/write_native() must use the indexed version of the
> >>       m[ft]dcrx since the non-indexed version only allows a 10-bit
> >>       numerical space, but the C interface allows a full 32-bits.
> >>     - C bindings for m[ft]dcrx, and the "table" versions, should use
> >>       "unsigned long" so that they are 64/32 bit neutral.
> >>     - The "table" versions (__m[ft]cdr) should obtain the table address
> >>       with LOAD_REG_ADDR(), this will also make it 64/32bit neutral.
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit c8320a5daaceed03992d763302020834ea8e17dd
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Mon Nov 5 09:12:00 2012 -0600
> >> 
> >>    powerpc/dcr: Add 64-bit DCR access methods.
> >> 
> >>    This patch adds the ability to make 64-bit Device Control Register
> >>    (DCR) accesses.
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit a763b3f8453b3bd83d7dded8c6644939863af430
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Thu Nov 29 12:49:24 2012 -0500
> >> 
> >>    powerpc/boot: Add a "spin_threads" hook to platform_ops
> >> 
> >>    It is useful for the boot program to arrange for all secondary cpus
> >>    and threads to enter the kernel in a "kexec" fashion.  This hook makes
> >>    it possible.
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit 391e43393380b514d4d02a42d059619542c7597b
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Thu Nov 29 13:01:23 2012 -0500
> >> 
> >>    powerpc/kexec: Add kexec "hold" support for Book3e processors
> >> 
> >>    This patch add two items:
> >>    1) Book3e requires that GPR4 survive the "hold" process, so we make
> >>       sure that happens.
> >>    2) Book3e has no real mode, and the hold code exploits this.  Since
> >>       these processors ares always translated, we arrange for the kexeced
> >>       threads to enter the hold code using the normal kernel linear mapping.
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit f6e3c1f706cb6922349d639a74ff6c50acc8b9f8
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Wed Dec 5 13:41:25 2012 -0500
> >> 
> >>    powerpc: Remove unecessary VSX symbols
> >> 
> >>    The symbol THREAD_VSR0 is defined to be the same as THREAD_FPR0.  Its
> >>    presence causes build issues with more complex configurations.
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit 4e817bb42ec8e3d3689877528dd97c4286a870eb
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Tue Nov 20 10:10:52 2012 -0600
> >> 
> >>    Blue Gene/Q wicked optimizing compiler does not know the rfdi instruction yet
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit 2071aa58b2f3b33d97c94e3a127f7c5d4ffaeb34
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Tue Nov 20 10:14:22 2012 -0600
> >> 
> >>    Blue Gene/Q wicked optimizing compiler does not know the mfdcrx instruction yet
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit 6e58088fabedbb2d724637b539ba180c03ed8b68
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Wed Oct 31 16:33:21 2012 -0500
> >> 
> >>    powerpc/book3e: IBM Blue Gene/Q Boot
> >> 
> >>    This patch specifically deals with the initial program load
> >>    environment so that a boot image (dtbImage.bgq) can be loaded by the
> >>    BGQ management tools.  The boot code is a little odd because it has to
> >>    deal with the following issues:
> >>     - Linux boot image wrappers are 32-bit programs
> >>     - BGQ Tools only load 64bit ELF programs
> >>     - BGQ Firmware information is typically loaded at an address > 4G
> >>     - BGQ FW information contains 64-bit ABI function pointers (which are
> >>       actually function descriptors) to access firmware methods
> >>     - BGQ FW methods must be called in 64-bit mode
> >> 
> >>    Includes code contributed from:
> >>      Andrew Tauferner <ataufer@us.ibm.com>
> >>      Todd Inglett <tinglett@us.ibm.com>
> >>      Eric Van Hensbergen <ericvh@gmail.com>
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit 3bc841935eb4955ce2b2db69bff16f7928464597
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Wed Oct 31 22:36:54 2012 -0500
> >> 
> >>    powerpc/book3e: IBM Blue Gene/Q Platform and SMP
> >> 
> >>    This patch introduces BGQ as a platform and adds SMP functionality
> >> 
> >>    Includes code contributed from:
> >>      Andrew Tauferner <ataufer@us.ibm.com>
> >>      Todd Inglett <tinglett@us.ibm.com>
> >>      Eric Van Hensbergen <ericvh@gmail.com>
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit 279c0615917b959a652e81f4ad0d886e2d426d85
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Wed Dec 5 13:43:22 2012 -0500
> >> 
> >>    powerpc/book3e: IBM Blue Gene/Q Quad Processing eXtention (QPX)
> >> 
> >>    This enables kernel support for the QPX extention and is intended for
> >>    processors that support it, usually an IBM Blue Gene processor.
> >>    Turning it on does not effect other processors but it does add code
> >>    and will quadruple the per thread save and restore area for the FPU
> >>    (hense the name).  If you have enabled VSX it will only double the
> >>    space.
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit 6ff45170ab0463fb34d7011e08c7e47c396f0fd7
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Fri Nov 23 14:52:14 2012 -0600
> >> 
> >>    powerpc/book3e: IBM Blue Gene/Q HVC Based Firmware Console
> >> 
> >>    New HVC device that uses the Blue Gene Firmware methods to erad and
> >>    write to console.
> >> 
> >>        Includes code contributed from:
> >>          Andrew Tauferner <ataufer@us.ibm.com>
> >>          Todd Inglett <tinglett@us.ibm.com>
> >>          Eric Van Hensbergen <ericvh@gmail.com>
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit e4ddc0c2ad8b3f0260d15d73e153095e95da84ac
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Thu Nov 29 15:52:04 2012 -0500
> >> 
> >>    powerpc/book3e: IBM Blue Gene/Q PCIe and MSI
> >> 
> >>    The following patch adds support for the BG/Q PCIe bridge and MSI interrupts.
> >> 
> >>    Includes code contributed from:
> >>      Jay S. Bryant <jsbryant@us.ibm.com>
> >>      Eric Van Hensbergen <ericvh@gmail.com>
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit 9fc0b6f729f7bd7e31338283640a718fa4b1693b
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Wed Dec 5 07:01:49 2012 -0500
> >> 
> >>    powerpc/book3e: IBM Blue Gene/Q Character Drivers
> >> 
> >>    The following patch adds support for user and administration
> >>    applications to access the BG/Q control system.
> >> 
> >>    Includes code contributed from:
> >>      Jay S. Bryant <jsbryant@us.ibm.com>
> >>      Eric Van Hensbergen <ericvh@gmail.com>
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> >> commit 9df2c4dfde0ac75f8b2bfcf565f78c2b7382b031
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Thu Dec 6 18:07:16 2012 -0500
> >> 
> >>    Linux 3.4.22-BGQ-rc1
> >> 
> >> 
> >> 
> >> 
> >> _______________________________________________
> >> Linuxppc-dev mailing list
> >> Linuxppc-dev@lists.ozlabs.org
> >> https://lists.ozlabs.org/listinfo/linuxppc-dev
> >> 
> 

^ permalink raw reply

* Re: [RFC] Add IBM Blue Gene/Q Platform
From: Michael Neuling @ 2012-12-10  0:18 UTC (permalink / raw)
  To: Jimi Xenidis
  Cc: Andrew T Tauferner, Kumar Gala, Jay Bryant, Josh Boyer,
	Todd Inglett, linuxppc-dev
In-Reply-To: <F49F5B26-671F-4D3F-BAC3-67F852E03798@pobox.com>

Jimi Xenidis <jimix@pobox.com> wrote:

> 
> On Dec 6, 2012, at 11:54 PM, Michael Neuling <mikey@neuling.org> wrote:
> 
> >> commit 279c0615917b959a652e81f4ad0d886e2d426d85
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Wed Dec 5 13:43:22 2012 -0500
> >> 
> >>    powerpc/book3e: IBM Blue Gene/Q Quad Processing eXtention (QPX)
> >> 
> >>    This enables kernel support for the QPX extention and is intended for
> >>    processors that support it, usually an IBM Blue Gene processor.
> >>    Turning it on does not effect other processors but it does add code
> >>    and will quadruple the per thread save and restore area for the FPU
> >>    (hense the name).  If you have enabled VSX it will only double the
> >>    space.
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> > 
> > Can you give a diagram of how the QPX registers are layed out.
> > 
> > +#if defined(CONFIG_PPC_QPX)
> > +#define TS_FPRWIDTH 4
> > +#elif defined(CONFIG_VSX)
> > 
> > Are they 256 bits wide?
> 
> Yes, this is why we nicknamed it the "Quad Hummer" :)
>  - 4-wide double precision FPU SIMD
>  - 2-wide complex SIMD
>  - 4R/2W register file (32x256 bits per thread)
>  - 32B (256 bits) datapath to/from L1 cache

OK, can you add a comment like this to the commit log and to the code so
that people know what it looks like.

> 
> > 
> > 
> > +#define QVLFDXA(QRT,RA,RB)	\
> > +	.long (0x7c00048f | ((QRT) << 21) | ((RA) << 16) | ((RB) << 11))
> > 
> > Put this in ppc-opcode.h.
> > 
> > +#if defined(CONFIG_VSX) || defined(CONFIG_PPC_QPX)
> > +	/* they are the same MSR bit */
> > 
> > OMG!
> 
> Ooops, you are correct, this was in the original patch.
> I'll double check the work book, but it should be the architected VEC/SPV bit which is really for VMX.
> I'll track it down.
> 
> > 
> > 
> > +BEGIN_FTR_SECTION							\
> > +	SAVE_32VSRS(n,c,base);						\
> > +END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
> > +BEGIN_FTR_SECTION							\
> > +	SAVE_32QRS(n,c,base);						\
> > +END_FTR_SECTION_IFSET(CPU_FTR_QPX);	
> > 
> > I don't think we want to do this.  We are going to end up with 64
> > NOPS here somewhere.
> 
> Excellent point, NOPs are cheap on most processors but not A2 and a
> lot of embedded, I can wrap some branches with the FTR instead.  Do
> you have a concern on the code size?

Code size is not the issue.  Just running over 64NOPS for no reason.  In
the past, we've preferred a branch over a section of code.

> 
> > 
> > I'd like to see this patch broken into different parts.
> 
> I'm not sure how _this_ patch:
>   <https://github.com/jimix/linux-bgq/commit/279c0615917b959a652e81f4ad0d886e2d426d85>
> could be broken up, please advise.

Add it in reviewable chunks.  Add the infrastructure (instructions,
macros, config options) then hook it into the existing code.


> > Also, have you boot tested this change on a VSX enabled box?
> 
> I can try, I may bug you for help.  Is there a commonly test (or apps)
> I should run?

I have some tests squirred away when I did the initial VSX stuff.  I can
grab them. I suspect this will either completely blow up VSX context
switching or work perfectly well.  It's unlikely to introduce subtle
bugs.

Mikey 

^ permalink raw reply

* Re: [RFC] Add IBM Blue Gene/Q Platform
From: Michael Neuling @ 2012-12-10  0:12 UTC (permalink / raw)
  To: Jimi Xenidis
  Cc: Andrew T Tauferner, Kumar Gala, Jay Bryant, Todd Inglett,
	linuxppc-dev
In-Reply-To: <17203923-C5C4-4659-9F3C-786D20C5C62E@pobox.com>

Jimi Xenidis <jimix@pobox.com> wrote:

> 
> On Dec 6, 2012, at 11:41 PM, Michael Neuling <mikey@neuling.org> wrote:
> 
> >> commit f6e3c1f706cb6922349d639a74ff6c50acc8b9f8
> >> Author: Jimi Xenidis <jimix@pobox.com>
> >> Date:   Wed Dec 5 13:41:25 2012 -0500
> >> 
> >>    powerpc: Remove unecessary VSX symbols
> >> 
> >>    The symbol THREAD_VSR0 is defined to be the same as THREAD_FPR0.  Its
> >>    presence causes build issues with more complex configurations.
> >> 
> >>    Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> >> 
> > 
> > Can you explain what these "complex configurations" are?
> 
> In an earlier email we discussed the possibility that there was the
> possibility/desire that a single binary could support either FPU, VSX
> and the new QPX.  However, if a CONFIG_VSX is not defined then
> THREAD_VSR0 does not get defined even though there might be some code
> that refers to it.  Since it is an alias for the same piece of storage
> I was hoping to solve my config issue and be simplify the code.

Yep, I remember the conversation.  I just want that explanation in the
commit log so others know *why* it was changed.

Mikey

^ permalink raw reply

* Re: Understanding how kernel updates MMU hash table
From: Benjamin Herrenschmidt @ 2012-12-09 21:10 UTC (permalink / raw)
  To: Pegasus11; +Cc: linuxppc-dev
In-Reply-To: <34775807.post@talk.nabble.com>

On Sat, 2012-12-08 at 23:18 -0800, Pegasus11 wrote:
> 3. For 64bit architecture there is no such 'segment register we use a
> segment table entry (STE) from within an SLB (segment lookaside buffer)
> which caches recently used mappings from ESID (part of effective address) to
> VSID (part of Virtual address). This SLB is again maintained in main memory
> by the OS.

Almost ;-) Older processors (pre-power4) used the STAB in memory. Since
power4 however (and thus including 970), the segments are
software-loaded in the SLB. IE. Whenever there's a segment "miss", an
interrupt is generated (0x380 or 0x480) which directly reloads a new
entry in the SLB HW buffer, without going via an STAB in memory. The SLB
is 64-entries up to power6 and 32 entries on power7 and 8.

The reduction in the number of entries comes with the support for 1T
segments which was added on power5+ (arch 2.03), which reduces the
pressure on the SLB significantly.

> 4. This hashed page table is located in a fixed region of main memory, the
> starting address of which is given by the SDR1 register.

Yes, and it's size too.

> 5. (Now this is something that I was perhaps missing. Please correct me if I
> am wrong). Every access to a memory location will picture the MMU since it
> is a hardware component which is always between the CPU bus and the memory
> bus. This basic fact of computer design was somehow escaping me,,i wonder
> why :thinking:. Thus the MMU first consulting the hardware TLB and on
> encountering a TLB miss, it looking for the same in the hashed page table,

Yes. There's actually two levels of HW TLB but you generally don't need
to be too much concerned about it. The ERAT is right in the core, is
small, fast, split (I and D are separate) and translates all the way
from effective addresses to real addresses (the translations in there
encompass both the SLB and hash). It's flushed automatically when either
flushing a TLB entry or an SLB entry. The TLB is larger, a bit further
away from the core (thus higher latency to access), and is more strictly
a cache of the hash table.

So the actual scenario for translating an address is:

 * Lookup in I or D ERAT, hit->done, miss->...
 * Lookup in SLB, miss->fault (0x480 or 0x380), hit->...
 * With SLB entry, construct the VA (replace the top bits with VSID)...
 * Lookup in TLB, hit->populate ERAT & done, miss->...
 * Lookup in hash table, hit->populate TLB & ERAT, done, miss->fault
(0400 or 0x300).

> is something that happens without any sort of OS interference (as the HW has
> been programmed to do).

Yes.

> 6. So now when you say that the kernel's job is to maintain this hashed
> paged table, since the MMU will need it during a TLB miss, makes sense to me
> now. And this page table has a peculiar format of Page table entry groups
> (PTEGs) and for each translation first the primary PTEG is searched and if
> the entry isn't found in it, the MMU searches the secondary PTEG for the
> same.

Yes.

>  All this happens in the background without the OS having as much as a
> hint for the same unless of course the entry is not found even in the
> secondary PTEG upon which a page fault exception is generated and the
> subsequent handling code ensues.

Correct.

> Now that I have spelled out what I understand (and ask you to please let me
> know if I am missing anything anywhere), what is there for me to understand
> is the relation between Linux's page table that is a pure software construct
> dictated by the kernel itself and the hardware dictated page table (which in
> my case here is an inverted page table maintained in a fixed location in
> main memory). I stumbled upon this link:
> http://yarchive.net/comp/linux/page_tables.html . Although its an old link,
> linus, in his usual candid style explains to a curios fellow the
> significance of maintaining a seperate page table distinct from the hardware
> dictated page table.

This is the story. Basically Linux can't easily be made to use something
other than a radix tree (though there is some flexibility as to the
format of the tree) so on "server class" powerpc, we pretty much have to
maintain a separate construct.

> Now, pardon me if my post hereon digresses a bit on the semantics of Linux
> page tables in general. I believe understanding why things are the way they
> are, would ultimately help me understand how Linux works so well on a
> plethora of hardware architectures including powerpc. In the link, he talks
> about 'Linux page tables matching hardware architectures closely' for a lot
> of architectures and machines. Which means Linux is using the page tables
> to, sort of, mirror the virtual memory related hardware as closely as
> possible. So in addition to satisfying what the architecture vendor
> specifies as the job of the OS in maintaining the VM infrastructure, it has
> its own VM infrastructure which it used to keep track of the Virtual memory.
> Right?

Yes.

> In that same link, Linus again stresses the fact that, such hash tables can
> be used as extended TLBs for the kernel. And he seems to have a particular
> dislike for PPC virtual memory management. He calls the architecture (or
> called it back then) a 'sick puppy' =^D.

Linus is known for is strong opinions and associated language :-) It did
still enjoy the use of a Mac G5 for a few years so he wasn't *that*
disgusted about it that he wouldn't use it :-)

> Now coming to the topic of TLB flush, all we are really talking about is
> invalidating the MMU hash table right?

Not exactly, see below.

>  But you mentioned that the kernel
> does not populate the TLB, the MMU does that from the hash table. So what
> exactly are we referring to as a TLB here? Linus considers the hash table as
> an 'extended TLB' but extended to what? The hardware TLBs? 

Yes, the hardware TLB, it needs to be flushed when a translation is
modified in the hash table.

> So when we talk about flushing the TLB which one are we talking about? The
> in memory hash table or the TLB or both? Or does it depend on the virtual
> address(es)?

Well, from a Linux perspective both. Ie. we modify the Linux PTE which
in our case is a SW only construct, and if the HASHPTE bit is set, which
is whenever we have hashed that at least once, we then also update the
hash table, and if something was found in there, perform the appropriate
HW invalidation sequence to make sure the HW TLB is also aware of the
change (tlbie instruction along with various synchronization).

> And since it is NOT in the form of a tree, invalidating an entire hash
> table, should be faster than clear a page table atleast on paper. Right?

We never invalidate the entire hash. When invalidating a single page we
target that specific page. On recent kernels at least (dunno about
2.6.10) we also keep track in the PTE of which slot in which group a
given linux PTE has been hashed which makes it faster to target it for
invalidation.

Unfortunately, when invalidating an entire process address space, we
have no choice but walk all the PTEs and target the ones that have been
hashed, though we do a little bit of batching here to improve
performances.

>  Is
> there any way one can actually speedup the TLB flush if one has such
> inverted Hash tables (which I think) are being used as extended TLBs? Linus
> seems to have a pretty nasty opinion about them old PPC machines
> though...but im still interested to know if any good could come out of it.
> 
> You also said that, most hash table loads tend to be cache misses. I believe
> you've used the term 'cache' here loosely and it corresponds to the three
> hardware TLBs that you had mentioned. Right? Since it there the MMU first
> looks for before taking a shot at the in-memory hash table isn't it?

No, I mean L2/L3 caches. IE, the powerpc hash table is fairly big and
the access pattern fairly random, meaning that the chances of hitting
the L2 or L3 when doing the HW lookup in the hash table are pretty low.

Cheers,
Ben.

> Keen to know more Ben. Thanks in advance.
> :-)
> Cheers. 
> 
> 
> 
> > By your words, I understand that the hash table is an in-memory cache of
> > translations meaning it is implemented in software.
> 
> Well, it's populated by software and read by HW. IE. On x86, the MMU
> will walk a radix tree of page tables, on powerpc it will walk an in
> memory hash table. The main difference is that on x86, there is usually
> a tree per process while the powerpc hash table tends to be global.
> 
> > So whenever the MMU wishes to translate a virtual address, it first checks
> > the TLB and if it
> > isn't found there, it looks for it in the hash table. Now this seems fine
> > to
> > me when looked at from the perspective of the MMU. Now when I look at it
> > from the kernel's perspective, I am a bit confused.
> > 
> > So when we (the kernel) encounter a virtual address, we walk the page
> > tables
> > and if we find that there is no valid entry for this address, we page
> > fault
> > which causes an exception right?
> 
> Hrm ... not sure what we mean by "the kernel". There are two different
> path here, but let's focus on the usual case... the processor encounters
> an address, whether it's trying to fetch an instruction, or having done
> that, is performing a load or a store. This will use what we call in
> powerpc lingua an "effective" address. This gets in turn turned into a
> "virtual address" after an SLB lookup.
> 
> I refer you to the architecture here, it's a bit tricky but basically
> the principle is that the virtual address space is *somewhat* the
> effective address space along with the process id. Except that on
> powerpc, we do that per-segment (we divide the address space into
> segments) so each segment has its top bits "transformed" into something
> larger called the VSID.
> 
> In any case, this results in a virtual address which is then looked up
> in the TLB (I'm ignoring the ERAT here which is the 1-st level TLB but
> let's not complicate things even more). If that misses, the CPU looks up
> in the hash table. If that misses, it causes an exception (0x300 for
> data accesses, 0x400 for instruction accesses).
> 
> There, Linux will usually go into hash_page which looks for the Linux
> PTE. If the PTE is absent (or has any other reason to be unusable such
> as being read-only for a write access), we get to do_page_fault.
> 
> Else, we populate the hash table with a translation, set the HASHPTE bit
> in the PTE, and retry the access.
> 
> >  And this exception then takes us to the
> > exception handler which I guess is 'do_page_fault'. On checking this
> > function I see that it gets the PGD, allocates a PMD, allocates a PTE and
> > then it calls handle_pte_fault. The comment banner for handle_pte_fault
> > reads:
> > 
> > 1638 /* These routines also need to handle stuff like marking pages dirty
> > 1639 * and/or accessed for architectures that don't do it in hardware
> > (most
> > 1640 * RISC architectures).  The early dirtying is also good on the i386.
> > 1641 *
> > 1642 * There is also a hook called "update_mmu_cache()" that architectures
> > 1643 * with external mmu caches can use to update those (ie the Sparc or
> > 1644 * PowerPC hashed page tables that act as extended TLBs)....
> > .........
> > */
> 
> Yes, when we go to do_page_fault() because the PTE wasn't populated in
> the first place, we have a hook to pre-fill the hash table instead of
> taking a fault again which will fill it the second time around. It's
> just a shortcut.
> 
> > It is from such comments that I inferred that the hash tables were being
> > used as "extended TLBs". However the above also infers (atleast to me)
> > that
> > these caches are in hardware as theyve used the word 'extended'. Pardon me
> > if I am being nitpicky but these things are confusing me a bit. So to
> > clear
> > this confusion, there are three things I would like to know.
> > 1. Is the MMU cache implemented in hardware or software? I trust you on it
> > being software but it would be great if you could address my concern in
> > the
> > above paragraph.
> 
> The TLB is a piece of HW. (there's really three in fact, the I-ERAT, the
> D-ERAT and the TLB ;-)
> 
> The Hash Table is a piece of RAM (pointed to by the SDR1 register) setup
> by the OS and populated by the OS but read by the HW. Just like the page
> tables on x86.
> 
> > 2. The kernel, it looks from the do_page_fault sequence, is updating its
> > internal page table first and then it goes on to update the mmu cache. So
> > this only means it is satisfying the requirement of someone else, perhaps
> > the MMU here. 
> 
> update_mmu_cache() is just a shortcut.
> 
> As I explained above, we populate the hash table lazily on fault.
> However, when taking an actual high level page fault (do_page_fault), we
> *know* the hash doesn't have an appropriate translation, so rather than
> just filling up the linux PTE and then taking the fault again to fill
> the hash from the linux PTE, we have a hook so we can pre-fill the hash.
> 
> > This should imply that this MMU cache does the kernel no good
> > in fact it adds one more entry in its to-do list when it plays around with
> > a
> > process's page table.
> 
> This is a debatable topic ;-) Some of us do indeed thing that the hash
> table isn't a very useful construct in the grand scheme of things and
> ends up being fairly inefficient, for a variety of reasons including the
> added overhead of maintaining it that you mention above, though that can
> easily be dwarfed by the overhead caused by the fact that most hash
> table loads tend to be cache misses (the hash is simply not very cache
> friendly).
> 
> On the other hand, it means that unlike a page table tree, the hash tend
> to resolve a translation in a single load, at least when well primed and
> big enough. So for some types of workloads, it makes quite a bit of
> sense, at least on paper.
> 
> > 3. If the above is true, where is the TLB for the kernel? I mean when I
> > see
> > head.S for the ppc64 architecture (all files are from 2.6.10 by the way),
> > I
> > do see an unconditional branch for do_hash_page wherein we "try to insert
> > an
> > HPTE". Within do_hash_page, after doing some sanity checking to make sure
> > we
> > don't have any weird conditions here, we jump to 'handle_page_fault' which
> > is again encoded in assembly and in the same file viz. head.S. Following
> > it
> > I again arrive back to handle_mm_fault from within 'do_page_fault' and we
> > are back to square one. I understand that stuff is happening transparently
> > behind our backs, but what and where exactly? I mean if I could understand
> > this sequence of what is in hardware, what is in software and the
> > sequence,
> > perhaps I could get my head around it a lot better...
> > 
> > Again, I am keen to hear from you and I am sorry if I going round round
> > and
> > round..but I seriously am a bit confused with this..
> 
> The TLB is not directly populated by the kernel, the HW does it by
> reading from the hash table, though we do invalidate it ourselves.
> 
> Cheers,
> Ben.
> 
> > Thanks again.
> > 
> > Benjamin Herrenschmidt wrote:
> > > 
> > > On Wed, 2012-12-05 at 09:14 -0800, Pegasus11 wrote:
> > >> Hi Ben.
> > >> 
> > >> Thanks for your input. Please find my comments inline.
> > > 
> > > Please don't quote your replies ! Makes it really hard to read.
> > > 
> > >> 
> > >> Benjamin Herrenschmidt wrote:
> > >> > 
> > >> > On Tue, 2012-12-04 at 21:56 -0800, Pegasus11 wrote:
> > >> >> Hello.
> > >> >> 
> > >> >> Ive been trying to understand how an hash PTE is updated. Im on a
> > >> >> PPC970MP
> > >> >> machine which using the IBM PowerPC 604e core. 
> > >> > 
> > >> > Ben: Ah no, the 970 is a ... 970 core :-) It's a derivative of
> > POWER4+
> > >> > which
> > >> > is quite different from the old 32-bit 604e.
> > >> > 
> > >> > Peg: So the 970 is a 64bit core whereas the 604e is a 32 bit core.
> > The
> > >> > former is used in the embedded segment whereas the latter for server
> > >> > market right?
> > > 
> > > Not quite. The 604e is an ancient core, I don't think it's still used
> > > anymore. It was a "server class" (sort-of) 32-bit core. Embedded
> > > nowadays would be things like FSL e500 etc...
> > > 
> > > 970 aka G5 is a 64-bit server class core designed originally for Apple
> > > G5 machines, derivative of the POWER4+ design.
> > > 
> > > IE. They are both server-class (or "classic") processors, not embedded
> > > though of course they can be used in embedded setups as well.
> > > 
> > >> >> My Linux version is 2.6.10 (I
> > >> >> am sorry I cannot migrate at the moment. Management issues and I
> > can't
> > >> >> help
> > >> >> :-(( )
> > >> >> 
> > >> >> Now onto the problem:
> > >> >> hpte_update is invoked to sync the on-chip MMU cache which Linux
> > uses
> > >> as
> > >> >> its
> > >> >> TLB.
> > >> > 
> > >> > Ben: It's actually in-memory cache. There's also an on-chip TLB.
> > > 
> > >> > Peg: An in-memory cache of what?
> > > 
> > > Of translations :-) It's sort-of a memory overflow of the TLB, it's read
> > > by HW though.
> > > 
> > >>  You mean the kernel caches the PTEs in its own software cache as well?
> > > 
> > > No. The HW MMU will look into the hash table if it misses the TLB, so
> > > the hash table is part of the HW architecture definition. It can be seen
> > > as a kind of in-memory cache of the TLB.
> > > 
> > > The kernel populates it from the Linux page table PTEs "on demand".
> > > 
> > >> And is this cache not related in anyway to
> > >> > the on-chip TLB? 
> > > 
> > > It is in that it's accessed by HW when the TLB misses.
> > > 
> > >> If that is indeed the case, then ive read a paper on some
> > >> > of the MMU tricks for the PPC by court dougan which says Linux uses
> > (or
> > >> > perhaps used to when he wrote that) the MMU hardware cache as the
> > >> hardware
> > >> > TLB. What is that all about? Its called : Optimizing the Idle Task
> > and
> > >> > Other MMU Tricks - Usenix
> > > 
> > > Probably very ancient and not very relevant anymore :-)
> > > 
> > >> >>  So whenever a change is made to the PTE, it has to be propagated to
> > >> the
> > >> >> corresponding TLB entry. And this uses hpte_update for the same. Am
> > I
> > >> >> right
> > >> >> here?
> > >> > 
> > >> > Ben: hpte_update takes care of tracking whether a Linux PTE was also
> > >> > cached
> > >> > into the hash, in which case the hash is marked for invalidation. I
> > >> > don't remember precisely how we did it in 2.6.10 but it's possible
> > that
> > >> > the actual invalidation of the hash and the corresponding TLB
> > >> > invalidations are delayed.
> > >> > Peg: But in 2.6.10, Ive seen the code first check for the existence
> > of
> > >> the
> > >> > HASHPTE flag in a given PTE and if it exists, only then is this
> > >> > hpte_update function being called. Could you for the love of tux
> > >> elaborate
> > >> > a bit on how the hash and the underlying TLB entries are related?
> > I'll
> > >> > then try to see how it was done back then..since it would probably be
> > >> > quite similar at least conceptually (if I am lucky :jumping:)
> > > 
> > > Basically whenever there's a HW fault (TLB miss -> hash miss), we try to
> > > populate the hash table based on the content of the linux PTE and if we
> > > succeed (permission ok etc...) we set the HASHPTE flag in the PTE. This
> > > indicates that this PTE was hashed at least once.
> > > 
> > > This is used in a couple of cases, such as when doing invalidations, in
> > > order to know whether it's worth searching the hash for a match that
> > > needs to be cleared as well, and issuing a tlbie instruction to flush
> > > any corresponding TLB entry or not.
> > > 
> > >> >> Now  http://lxr.linux.no/linux-bk+*/+code=hpte_update hpte_update 
> > is
> > >> >> declared as
> > >> >>  
> > >> >> ' void hpte_update(pte_t *ptep, unsigned long pte, int wrprot) '. 
> > >> >> The arguments to this function is a POINTER to the PTE entry (needed
> > >> to
> > >> >> make
> > >> >> a change persistent across function call right?), the PTE entry (as
> > in
> > >> >> the
> > >> >> value) as well the wrprot flag.
> > >> >> 
> > >> >> Now the code snippet thats bothering me is this:
> > >> >> '
> > >> >>   86        ptepage = virt_to_page(ptep);
> > >> >>   87        mm = (struct mm_struct *) ptepage->mapping;
> > >> >>   88        addr = ptepage->index +
> > >> >>   89                (((unsigned long)ptep & ~PAGE_MASK) *
> > >> PTRS_PER_PTE);
> > >> >> '
> > >> >> 
> > >> >> On line 86, we get the page structure for a given PTE but we pass
> > the
> > >> >> pointer to PTE not the PTE itself whereas virt_to_page is a macro
> > >> defined
> > >> >> as:
> > >> > 
> > >> > I don't remember why we did that in 2.6.10 however...
> > >> > 
> > >> >> #define virt_to_page(kaddr)   pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
> > >> >> 
> > >> >> Why are passing the POINTER to pte here? I mean are we looking for
> > the
> > >> >> PAGE
> > >> >> that is described by the PTE or are we looking for the PAGE which
> > >> >> contains
> > >> >> the pointer to PTE? Me things it is the later since the former is
> > >> given
> > >> >> by
> > >> >> the VALUE of the PTE not its POINTER. Right?
> > >> > 
> > >> > Ben: The above gets the page that contains the PTEs indeed, in order
> > to
> > >> > get
> > >> > the associated mapping pointer which points to the struct mm_struct,
> > >> and
> > >> > the index, which together are used to re-constitute the virtual
> > >> address,
> > >> > probably in order to perform the actual invalidation. Nowadays, we
> > just
> > >> > pass the virtual address down from the call site.
> > >> > Peg: Re-constitute the virtual address of what exactly? The virtual
> > >> > address that led us to the PTE is the most natural thought that comes
> > >> to
> > >> > mind.
> > > 
> > > Yes.
> > > 
> > >>  However, the page which contains all these PTEs, would be typically
> > >> > categorized as a page directory right? So are we trying to get the
> > page
> > >> > directory here...Sorry for sounding a bit hazy on this one...but I
> > >> really
> > >> > am on this...:confused:
> > > 
> > > The struct page corresponding to the page directory page contains some
> > > information about the context which allows us to re-constitute the
> > > virtual address. It's nasty and awkward and we don't do it that way
> > > anymore in recent kernels, the vaddr is passed all the way down as
> > > argument.
> > > 
> > > That vaddr is necessary to locate the corresponding hash entries and to
> > > perform TLB invalidations if needed.
> > > 
> > >> >> So if it indeed the later, what trickery are we here after? Perhaps
> > >> >> following the snippet will make us understand? As I see from above,
> > >> after
> > >> >> that we get the 'address space object' associated with this page. 
> > >> >> 
> > >> >> What I don't understand is the following line:
> > >> >>  addr = ptepage->index + (((unsigned long)ptep & ~PAGE_MASK) *
> > >> >> PTRS_PER_PTE);
> > >> >> 
> > >> >> First we get the index of the page in the file i.e. the number of
> > >> pages
> > >> >> preceding the page which holds the address of PTEP. Then we get the
> > >> lower
> > >> >> 12
> > >> >> bits of this page. Then we shift that these bits to the left by 12
> > >> again
> > >> >> and
> > >> >> to it we add the above index. What is this doing?
> > >> >> 
> > >> >> There are other things in this function that I do not understand.
> > I'd
> > >> be
> > >> >> glad if someone could give me a heads up on this.
> > >> > 
> > >> > Ben: It's gross, the point is to rebuild the virtual address. You
> > >> should
> > >> > *REALLY* update to a more recent kernel, that ancient code is broken
> > in
> > >> > many ways as far as I can tell.
> > >> > Peg: Well Ben, if I could I would..but you do know the higher
> > ups..and
> > >> the
> > >> > way those baldies think now don't u? Its hard as such to work with
> > >> > them..helping them to a platter of such goodies would only mean that
> > >> one
> > >> > is trying to undermine them (or so they'll think)...So Im between a
> > >> rock
> > >> > and a hard place here....hence..i'd rather go with the hard
> > place..and
> > >> > hope nice folks like yourself would help me make my life just a lil
> > bit
> > >> > easier...:handshake:
> > > 
> > > Are you aware of how old 2.6.10 is ? I know higher ups and I know they
> > > are capable of getting it sometimes ... :-)
> > > 
> > > Cheers,
> > > Ben.
> > > 
> > >> > Thanks again.
> > >> > 
> > >> > Pegasus
> > >> > 
> > >> > Cheers,
> > >> > Ben.
> > >> > 
> > >> > 
> > >> > _______________________________________________
> > >> > Linuxppc-dev mailing list
> > >> > Linuxppc-dev@lists.ozlabs.org
> > >> > https://lists.ozlabs.org/listinfo/linuxppc-dev
> > >> > 
> > >> > 
> > >> 
> > > 
> > > 
> > > _______________________________________________
> > > Linuxppc-dev mailing list
> > > Linuxppc-dev@lists.ozlabs.org
> > > https://lists.ozlabs.org/listinfo/linuxppc-dev
> > > 
> > > 
> > 
> 
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 
> 
> 

^ permalink raw reply

* Re: Understanding how kernel updates MMU hash table
From: Pegasus11 @ 2012-12-09  7:18 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <1354794967.28585.16.camel@pasglop>


Hi Ben.
Firstly thanks a lot for being so succint and patient in explaining these
things to me. It helped me guide my way through an assortment of documents
and things are slowly becoming clear. So summing it all up, what I have
understood is this (pls correct me if I am wrong anywhere):
1. The particulars for address translation differ slightly between 32bit and
64bit processors
2. For the 32bit architecture, the 4GB address space is divided into 16
segments, which are addressed using the upper 4bits of the effective address
(EA) by means of a 'segment register'. From this , the VSID is obtained
which is then concatenated with the next 16 bits of the EA and the 40bit
resulting bitstream is used to index into a hash paged table to get the page
frame number (PFN).
3. For 64bit architecture there is no such 'segment register we use a
segment table entry (STE) from within an SLB (segment lookaside buffer)
which caches recently used mappings from ESID (part of effective address) to
VSID (part of Virtual address). This SLB is again maintained in main memory
by the OS.
4. This hashed page table is located in a fixed region of main memory, the
starting address of which is given by the SDR1 register.
5. (Now this is something that I was perhaps missing. Please correct me if I
am wrong). Every access to a memory location will picture the MMU since it
is a hardware component which is always between the CPU bus and the memory
bus. This basic fact of computer design was somehow escaping me,,i wonder
why :thinking:. Thus the MMU first consulting the hardware TLB and on
encountering a TLB miss, it looking for the same in the hashed page table,
is something that happens without any sort of OS interference (as the HW has
been programmed to do).
6. So now when you say that the kernel's job is to maintain this hashed
paged table, since the MMU will need it during a TLB miss, makes sense to me
now. And this page table has a peculiar format of Page table entry groups
(PTEGs) and for each translation first the primary PTEG is searched and if
the entry isn't found in it, the MMU searches the secondary PTEG for the
same. All this happens in the background without the OS having as much as a
hint for the same unless of course the entry is not found even in the
secondary PTEG upon which a page fault exception is generated and the
subsequent handling code ensues.

Now that I have spelled out what I understand (and ask you to please let me
know if I am missing anything anywhere), what is there for me to understand
is the relation between Linux's page table that is a pure software construct
dictated by the kernel itself and the hardware dictated page table (which in
my case here is an inverted page table maintained in a fixed location in
main memory). I stumbled upon this link:
http://yarchive.net/comp/linux/page_tables.html . Although its an old link,
linus, in his usual candid style explains to a curios fellow the
significance of maintaining a seperate page table distinct from the hardware
dictated page table.

Now, pardon me if my post hereon digresses a bit on the semantics of Linux
page tables in general. I believe understanding why things are the way they
are, would ultimately help me understand how Linux works so well on a
plethora of hardware architectures including powerpc. In the link, he talks
about 'Linux page tables matching hardware architectures closely' for a lot
of architectures and machines. Which means Linux is using the page tables
to, sort of, mirror the virtual memory related hardware as closely as
possible. So in addition to satisfying what the architecture vendor
specifies as the job of the OS in maintaining the VM infrastructure, it has
its own VM infrastructure which it used to keep track of the Virtual memory.
Right?

In that same link, Linus again stresses the fact that, such hash tables can
be used as extended TLBs for the kernel. And he seems to have a particular
dislike for PPC virtual memory management. He calls the architecture (or
called it back then) a 'sick puppy' =^D.

Now coming to the topic of TLB flush, all we are really talking about is
invalidating the MMU hash table right? But you mentioned that the kernel
does not populate the TLB, the MMU does that from the hash table. So what
exactly are we referring to as a TLB here? Linus considers the hash table as
an 'extended TLB' but extended to what? The hardware TLBs? 

So when we talk about flushing the TLB which one are we talking about? The
in memory hash table or the TLB or both? Or does it depend on the virtual
address(es)?
And since it is NOT in the form of a tree, invalidating an entire hash
table, should be faster than clear a page table atleast on paper. Right? Is
there any way one can actually speedup the TLB flush if one has such
inverted Hash tables (which I think) are being used as extended TLBs? Linus
seems to have a pretty nasty opinion about them old PPC machines
though...but im still interested to know if any good could come out of it.

You also said that, most hash table loads tend to be cache misses. I believe
you've used the term 'cache' here loosely and it corresponds to the three
hardware TLBs that you had mentioned. Right? Since it there the MMU first
looks for before taking a shot at the in-memory hash table isn't it?

Keen to know more Ben. Thanks in advance.
:-)
Cheers. 



> By your words, I understand that the hash table is an in-memory cache of
> translations meaning it is implemented in software.

Well, it's populated by software and read by HW. IE. On x86, the MMU
will walk a radix tree of page tables, on powerpc it will walk an in
memory hash table. The main difference is that on x86, there is usually
a tree per process while the powerpc hash table tends to be global.

> So whenever the MMU wishes to translate a virtual address, it first checks
> the TLB and if it
> isn't found there, it looks for it in the hash table. Now this seems fine
> to
> me when looked at from the perspective of the MMU. Now when I look at it
> from the kernel's perspective, I am a bit confused.
> 
> So when we (the kernel) encounter a virtual address, we walk the page
> tables
> and if we find that there is no valid entry for this address, we page
> fault
> which causes an exception right?

Hrm ... not sure what we mean by "the kernel". There are two different
path here, but let's focus on the usual case... the processor encounters
an address, whether it's trying to fetch an instruction, or having done
that, is performing a load or a store. This will use what we call in
powerpc lingua an "effective" address. This gets in turn turned into a
"virtual address" after an SLB lookup.

I refer you to the architecture here, it's a bit tricky but basically
the principle is that the virtual address space is *somewhat* the
effective address space along with the process id. Except that on
powerpc, we do that per-segment (we divide the address space into
segments) so each segment has its top bits "transformed" into something
larger called the VSID.

In any case, this results in a virtual address which is then looked up
in the TLB (I'm ignoring the ERAT here which is the 1-st level TLB but
let's not complicate things even more). If that misses, the CPU looks up
in the hash table. If that misses, it causes an exception (0x300 for
data accesses, 0x400 for instruction accesses).

There, Linux will usually go into hash_page which looks for the Linux
PTE. If the PTE is absent (or has any other reason to be unusable such
as being read-only for a write access), we get to do_page_fault.

Else, we populate the hash table with a translation, set the HASHPTE bit
in the PTE, and retry the access.

>  And this exception then takes us to the
> exception handler which I guess is 'do_page_fault'. On checking this
> function I see that it gets the PGD, allocates a PMD, allocates a PTE and
> then it calls handle_pte_fault. The comment banner for handle_pte_fault
> reads:
> 
> 1638 /* These routines also need to handle stuff like marking pages dirty
> 1639 * and/or accessed for architectures that don't do it in hardware
> (most
> 1640 * RISC architectures).  The early dirtying is also good on the i386.
> 1641 *
> 1642 * There is also a hook called "update_mmu_cache()" that architectures
> 1643 * with external mmu caches can use to update those (ie the Sparc or
> 1644 * PowerPC hashed page tables that act as extended TLBs)....
> .........
> */

Yes, when we go to do_page_fault() because the PTE wasn't populated in
the first place, we have a hook to pre-fill the hash table instead of
taking a fault again which will fill it the second time around. It's
just a shortcut.

> It is from such comments that I inferred that the hash tables were being
> used as "extended TLBs". However the above also infers (atleast to me)
> that
> these caches are in hardware as theyve used the word 'extended'. Pardon me
> if I am being nitpicky but these things are confusing me a bit. So to
> clear
> this confusion, there are three things I would like to know.
> 1. Is the MMU cache implemented in hardware or software? I trust you on it
> being software but it would be great if you could address my concern in
> the
> above paragraph.

The TLB is a piece of HW. (there's really three in fact, the I-ERAT, the
D-ERAT and the TLB ;-)

The Hash Table is a piece of RAM (pointed to by the SDR1 register) setup
by the OS and populated by the OS but read by the HW. Just like the page
tables on x86.

> 2. The kernel, it looks from the do_page_fault sequence, is updating its
> internal page table first and then it goes on to update the mmu cache. So
> this only means it is satisfying the requirement of someone else, perhaps
> the MMU here. 

update_mmu_cache() is just a shortcut.

As I explained above, we populate the hash table lazily on fault.
However, when taking an actual high level page fault (do_page_fault), we
*know* the hash doesn't have an appropriate translation, so rather than
just filling up the linux PTE and then taking the fault again to fill
the hash from the linux PTE, we have a hook so we can pre-fill the hash.

> This should imply that this MMU cache does the kernel no good
> in fact it adds one more entry in its to-do list when it plays around with
> a
> process's page table.

This is a debatable topic ;-) Some of us do indeed thing that the hash
table isn't a very useful construct in the grand scheme of things and
ends up being fairly inefficient, for a variety of reasons including the
added overhead of maintaining it that you mention above, though that can
easily be dwarfed by the overhead caused by the fact that most hash
table loads tend to be cache misses (the hash is simply not very cache
friendly).

On the other hand, it means that unlike a page table tree, the hash tend
to resolve a translation in a single load, at least when well primed and
big enough. So for some types of workloads, it makes quite a bit of
sense, at least on paper.

> 3. If the above is true, where is the TLB for the kernel? I mean when I
> see
> head.S for the ppc64 architecture (all files are from 2.6.10 by the way),
> I
> do see an unconditional branch for do_hash_page wherein we "try to insert
> an
> HPTE". Within do_hash_page, after doing some sanity checking to make sure
> we
> don't have any weird conditions here, we jump to 'handle_page_fault' which
> is again encoded in assembly and in the same file viz. head.S. Following
> it
> I again arrive back to handle_mm_fault from within 'do_page_fault' and we
> are back to square one. I understand that stuff is happening transparently
> behind our backs, but what and where exactly? I mean if I could understand
> this sequence of what is in hardware, what is in software and the
> sequence,
> perhaps I could get my head around it a lot better...
> 
> Again, I am keen to hear from you and I am sorry if I going round round
> and
> round..but I seriously am a bit confused with this..

The TLB is not directly populated by the kernel, the HW does it by
reading from the hash table, though we do invalidate it ourselves.

Cheers,
Ben.

> Thanks again.
> 
> Benjamin Herrenschmidt wrote:
> > 
> > On Wed, 2012-12-05 at 09:14 -0800, Pegasus11 wrote:
> >> Hi Ben.
> >> 
> >> Thanks for your input. Please find my comments inline.
> > 
> > Please don't quote your replies ! Makes it really hard to read.
> > 
> >> 
> >> Benjamin Herrenschmidt wrote:
> >> > 
> >> > On Tue, 2012-12-04 at 21:56 -0800, Pegasus11 wrote:
> >> >> Hello.
> >> >> 
> >> >> Ive been trying to understand how an hash PTE is updated. Im on a
> >> >> PPC970MP
> >> >> machine which using the IBM PowerPC 604e core. 
> >> > 
> >> > Ben: Ah no, the 970 is a ... 970 core :-) It's a derivative of
> POWER4+
> >> > which
> >> > is quite different from the old 32-bit 604e.
> >> > 
> >> > Peg: So the 970 is a 64bit core whereas the 604e is a 32 bit core.
> The
> >> > former is used in the embedded segment whereas the latter for server
> >> > market right?
> > 
> > Not quite. The 604e is an ancient core, I don't think it's still used
> > anymore. It was a "server class" (sort-of) 32-bit core. Embedded
> > nowadays would be things like FSL e500 etc...
> > 
> > 970 aka G5 is a 64-bit server class core designed originally for Apple
> > G5 machines, derivative of the POWER4+ design.
> > 
> > IE. They are both server-class (or "classic") processors, not embedded
> > though of course they can be used in embedded setups as well.
> > 
> >> >> My Linux version is 2.6.10 (I
> >> >> am sorry I cannot migrate at the moment. Management issues and I
> can't
> >> >> help
> >> >> :-(( )
> >> >> 
> >> >> Now onto the problem:
> >> >> hpte_update is invoked to sync the on-chip MMU cache which Linux
> uses
> >> as
> >> >> its
> >> >> TLB.
> >> > 
> >> > Ben: It's actually in-memory cache. There's also an on-chip TLB.
> > 
> >> > Peg: An in-memory cache of what?
> > 
> > Of translations :-) It's sort-of a memory overflow of the TLB, it's read
> > by HW though.
> > 
> >>  You mean the kernel caches the PTEs in its own software cache as well?
> > 
> > No. The HW MMU will look into the hash table if it misses the TLB, so
> > the hash table is part of the HW architecture definition. It can be seen
> > as a kind of in-memory cache of the TLB.
> > 
> > The kernel populates it from the Linux page table PTEs "on demand".
> > 
> >> And is this cache not related in anyway to
> >> > the on-chip TLB? 
> > 
> > It is in that it's accessed by HW when the TLB misses.
> > 
> >> If that is indeed the case, then ive read a paper on some
> >> > of the MMU tricks for the PPC by court dougan which says Linux uses
> (or
> >> > perhaps used to when he wrote that) the MMU hardware cache as the
> >> hardware
> >> > TLB. What is that all about? Its called : Optimizing the Idle Task
> and
> >> > Other MMU Tricks - Usenix
> > 
> > Probably very ancient and not very relevant anymore :-)
> > 
> >> >>  So whenever a change is made to the PTE, it has to be propagated to
> >> the
> >> >> corresponding TLB entry. And this uses hpte_update for the same. Am
> I
> >> >> right
> >> >> here?
> >> > 
> >> > Ben: hpte_update takes care of tracking whether a Linux PTE was also
> >> > cached
> >> > into the hash, in which case the hash is marked for invalidation. I
> >> > don't remember precisely how we did it in 2.6.10 but it's possible
> that
> >> > the actual invalidation of the hash and the corresponding TLB
> >> > invalidations are delayed.
> >> > Peg: But in 2.6.10, Ive seen the code first check for the existence
> of
> >> the
> >> > HASHPTE flag in a given PTE and if it exists, only then is this
> >> > hpte_update function being called. Could you for the love of tux
> >> elaborate
> >> > a bit on how the hash and the underlying TLB entries are related?
> I'll
> >> > then try to see how it was done back then..since it would probably be
> >> > quite similar at least conceptually (if I am lucky :jumping:)
> > 
> > Basically whenever there's a HW fault (TLB miss -> hash miss), we try to
> > populate the hash table based on the content of the linux PTE and if we
> > succeed (permission ok etc...) we set the HASHPTE flag in the PTE. This
> > indicates that this PTE was hashed at least once.
> > 
> > This is used in a couple of cases, such as when doing invalidations, in
> > order to know whether it's worth searching the hash for a match that
> > needs to be cleared as well, and issuing a tlbie instruction to flush
> > any corresponding TLB entry or not.
> > 
> >> >> Now  http://lxr.linux.no/linux-bk+*/+code=hpte_update hpte_update 
> is
> >> >> declared as
> >> >>  
> >> >> ' void hpte_update(pte_t *ptep, unsigned long pte, int wrprot) '. 
> >> >> The arguments to this function is a POINTER to the PTE entry (needed
> >> to
> >> >> make
> >> >> a change persistent across function call right?), the PTE entry (as
> in
> >> >> the
> >> >> value) as well the wrprot flag.
> >> >> 
> >> >> Now the code snippet thats bothering me is this:
> >> >> '
> >> >>   86        ptepage = virt_to_page(ptep);
> >> >>   87        mm = (struct mm_struct *) ptepage->mapping;
> >> >>   88        addr = ptepage->index +
> >> >>   89                (((unsigned long)ptep & ~PAGE_MASK) *
> >> PTRS_PER_PTE);
> >> >> '
> >> >> 
> >> >> On line 86, we get the page structure for a given PTE but we pass
> the
> >> >> pointer to PTE not the PTE itself whereas virt_to_page is a macro
> >> defined
> >> >> as:
> >> > 
> >> > I don't remember why we did that in 2.6.10 however...
> >> > 
> >> >> #define virt_to_page(kaddr)   pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
> >> >> 
> >> >> Why are passing the POINTER to pte here? I mean are we looking for
> the
> >> >> PAGE
> >> >> that is described by the PTE or are we looking for the PAGE which
> >> >> contains
> >> >> the pointer to PTE? Me things it is the later since the former is
> >> given
> >> >> by
> >> >> the VALUE of the PTE not its POINTER. Right?
> >> > 
> >> > Ben: The above gets the page that contains the PTEs indeed, in order
> to
> >> > get
> >> > the associated mapping pointer which points to the struct mm_struct,
> >> and
> >> > the index, which together are used to re-constitute the virtual
> >> address,
> >> > probably in order to perform the actual invalidation. Nowadays, we
> just
> >> > pass the virtual address down from the call site.
> >> > Peg: Re-constitute the virtual address of what exactly? The virtual
> >> > address that led us to the PTE is the most natural thought that comes
> >> to
> >> > mind.
> > 
> > Yes.
> > 
> >>  However, the page which contains all these PTEs, would be typically
> >> > categorized as a page directory right? So are we trying to get the
> page
> >> > directory here...Sorry for sounding a bit hazy on this one...but I
> >> really
> >> > am on this...:confused:
> > 
> > The struct page corresponding to the page directory page contains some
> > information about the context which allows us to re-constitute the
> > virtual address. It's nasty and awkward and we don't do it that way
> > anymore in recent kernels, the vaddr is passed all the way down as
> > argument.
> > 
> > That vaddr is necessary to locate the corresponding hash entries and to
> > perform TLB invalidations if needed.
> > 
> >> >> So if it indeed the later, what trickery are we here after? Perhaps
> >> >> following the snippet will make us understand? As I see from above,
> >> after
> >> >> that we get the 'address space object' associated with this page. 
> >> >> 
> >> >> What I don't understand is the following line:
> >> >>  addr = ptepage->index + (((unsigned long)ptep & ~PAGE_MASK) *
> >> >> PTRS_PER_PTE);
> >> >> 
> >> >> First we get the index of the page in the file i.e. the number of
> >> pages
> >> >> preceding the page which holds the address of PTEP. Then we get the
> >> lower
> >> >> 12
> >> >> bits of this page. Then we shift that these bits to the left by 12
> >> again
> >> >> and
> >> >> to it we add the above index. What is this doing?
> >> >> 
> >> >> There are other things in this function that I do not understand.
> I'd
> >> be
> >> >> glad if someone could give me a heads up on this.
> >> > 
> >> > Ben: It's gross, the point is to rebuild the virtual address. You
> >> should
> >> > *REALLY* update to a more recent kernel, that ancient code is broken
> in
> >> > many ways as far as I can tell.
> >> > Peg: Well Ben, if I could I would..but you do know the higher
> ups..and
> >> the
> >> > way those baldies think now don't u? Its hard as such to work with
> >> > them..helping them to a platter of such goodies would only mean that
> >> one
> >> > is trying to undermine them (or so they'll think)...So Im between a
> >> rock
> >> > and a hard place here....hence..i'd rather go with the hard
> place..and
> >> > hope nice folks like yourself would help me make my life just a lil
> bit
> >> > easier...:handshake:
> > 
> > Are you aware of how old 2.6.10 is ? I know higher ups and I know they
> > are capable of getting it sometimes ... :-)
> > 
> > Cheers,
> > Ben.
> > 
> >> > Thanks again.
> >> > 
> >> > Pegasus
> >> > 
> >> > Cheers,
> >> > Ben.
> >> > 
> >> > 
> >> > _______________________________________________
> >> > Linuxppc-dev mailing list
> >> > Linuxppc-dev@lists.ozlabs.org
> >> > https://lists.ozlabs.org/listinfo/linuxppc-dev
> >> > 
> >> > 
> >> 
> > 
> > 
> > _______________________________________________
> > Linuxppc-dev mailing list
> > Linuxppc-dev@lists.ozlabs.org
> > https://lists.ozlabs.org/listinfo/linuxppc-dev
> > 
> > 
> 


_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev



-- 
View this message in context: http://old.nabble.com/Understanding-how-kernel-updates-MMU-hash-table-tp34760537p34775807.html
Sent from the linuxppc-dev mailing list archive at Nabble.com.

^ permalink raw reply

* Re: [PATCH] powerpc: fix wii_memory_fixups() compile error on 3.0.y tree
From: Shuah Khan @ 2012-12-09  1:21 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: stable, paulus, Greg KH, shuah.khan, linuxppc-dev
In-Reply-To: <1354987359.20838.67.camel@deadeye.wl.decadent.org.uk>

On Sat, Dec 8, 2012 at 10:22 AM, Ben Hutchings <ben@decadent.org.uk> wrote:
> On Fri, 2012-12-07 at 19:07 -0700, Shuah Khan wrote:
>> Fix wii_memory_fixups() the following compile error on 3.0.y tree with
>> wii_defconfig on 3.0.y tree.
>>
>>   CC      arch/powerpc/platforms/embedded6xx/wii.o
>> arch/powerpc/platforms/embedded6xx/wii.c: In function =91wii_memory_fixu=
ps=92:
>> arch/powerpc/platforms/embedded6xx/wii.c:88:2: error: format =91%llx=92 =
expects argument of type =91long long unsigned int=92, but argument 2 has t=
ype =91phys_addr_t=92 [-Werror=3Dformat]
>> arch/powerpc/platforms/embedded6xx/wii.c:88:2: error: format =91%llx=92 =
expects argument of type =91long long unsigned int=92, but argument 3 has t=
ype =91phys_addr_t=92 [-Werror=3Dformat]
>> arch/powerpc/platforms/embedded6xx/wii.c:90:2: error: format =91%llx=92 =
expects argument of type =91long long unsigned int=92, but argument 2 has t=
ype =91phys_addr_t=92 [-Werror=3Dformat]
>> arch/powerpc/platforms/embedded6xx/wii.c:90:2: error: format =91%llx=92 =
expects argument of type =91long long unsigned int=92, but argument 3 has t=
ype =91phys_addr_t=92 [-Werror=3Dformat]
>> cc1: all warnings being treated as errors
>> make[2]: *** [arch/powerpc/platforms/embedded6xx/wii.o] Error 1
>> make[1]: *** [arch/powerpc/platforms/embedded6xx] Error 2
>> make: *** [arch/powerpc/platforms] Error 2
>>
>> Signed-off-by: Shuah Khan <shuah.khan@hp.com>
>> CC: stable@vger.kernel.org 3.0.y
>> ---
>>  arch/powerpc/platforms/embedded6xx/wii.c |    4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/embedded6xx/wii.c b/arch/powerpc/pla=
tforms/embedded6xx/wii.c
>> index 1b5dc1a..13d58dd 100644
>> --- a/arch/powerpc/platforms/embedded6xx/wii.c
>> +++ b/arch/powerpc/platforms/embedded6xx/wii.c
>> @@ -85,9 +85,9 @@ void __init wii_memory_fixups(void)
>>       wii_hole_start =3D p[0].base + p[0].size;
>>       wii_hole_size =3D p[1].base - wii_hole_start;
>>
>> -     pr_info("MEM1: <%08llx %08llx>\n", p[0].base, p[0].size);
>> +     pr_info("MEM1: <%08ulx %08ulx>\n", p[0].base, p[0].size);
>
> "%08ulx" is the conversion specification "%08u" followed by literal
> "lx".  If phys_addr_t is always defined as unsigned long or always
> unsigned int for this platform then you could use "%08lx" or "%08x"
> respectively.  But usually the right thing to do is to cast the
> arguments of type phys_addr_t to unsigned long long (or, equivalently,
> u64).
>
> Ben.

Thanks. I should have gone with my first instinct of typecasting :)  Will
resend the patch with typecast instead of the change I made in a couple
of days.

Thanks,
-- Shuah

^ permalink raw reply

* Re: [RFC] Add IBM Blue Gene/Q Platform
From: Jimi Xenidis @ 2012-12-08 22:22 UTC (permalink / raw)
  To: Michael Neuling
  Cc: Andrew T Tauferner, Kumar Gala, Jay Bryant, Josh Boyer,
	Todd Inglett, linuxppc-dev
In-Reply-To: <F49F5B26-671F-4D3F-BAC3-67F852E03798@pobox.com>


On Dec 7, 2012, at 7:38 AM, Jimi Xenidis <jimix@pobox.com> wrote:

>=20
> On Dec 6, 2012, at 11:54 PM, Michael Neuling <mikey@neuling.org> =
wrote:
>=20
>>> commit 279c0615917b959a652e81f4ad0d886e2d426d85
>>> Author: Jimi Xenidis <jimix@pobox.com>
>>> Date:   Wed Dec 5 13:43:22 2012 -0500
>>>=20
>>>   powerpc/book3e: IBM Blue Gene/Q Quad Processing eXtention (QPX)
>>>=20
>>>   This enables kernel support for the QPX extention and is intended =
for
>>>   processors that support it, usually an IBM Blue Gene processor.
>>>   Turning it on does not effect other processors but it does add =
code
>>>   and will quadruple the per thread save and restore area for the =
FPU
>>>   (hense the name).  If you have enabled VSX it will only double the
>>>   space.
>>>=20
>>>   Signed-off-by: Jimi Xenidis <jimix@pobox.com>
>>=20

<<snip>>
>>=20
>>=20
>>=20
>> +BEGIN_FTR_SECTION							=
\
>> +	SAVE_32VSRS(n,c,base);						=
\
>> +END_FTR_SECTION_IFSET(CPU_FTR_VSX);					=
\
>> +BEGIN_FTR_SECTION							=
\
>> +	SAVE_32QRS(n,c,base);						=
\
>> +END_FTR_SECTION_IFSET(CPU_FTR_QPX);=09
>>=20
>> I don't think we want to do this.  We are going to end up with 64
>> NOPS here somewhere.
>=20
> Excellent point, NOPs are cheap on most processors but not A2 and a =
lot of embedded, I can wrap some branches with the FTR instead.
> Do you have a concern on the code size?

Thought about it a bit and came up with this solution for =
arch/powerpc/kernel/fpu.S.
This should address the following issues
 - MSR_VSX vs MSR_VEC
 - Big chunks of NOPs in the code path
 - Less FTR space fixups at boot time.
 - IMNHSO easier to read especially when disassembled

I did consider using the LR and BLR, but the !CONFIG_SMP case only adds =
one more special block and uses a different register set.
Also if this is agreeable I would like us to consider removing the =
*_32FPVSRS* macros entirely and put the FTR tests in the actual code.
This would allow us to use #ifdefs and reduce the amount of code that =
actually gets compiled.

Thoughts?

diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
index e0ada05..5964067 100644
--- a/arch/powerpc/kernel/fpu.S
+++ b/arch/powerpc/kernel/fpu.S
@@ -25,30 +25,81 @@
 #include <asm/asm-offsets.h>
 #include <asm/ptrace.h>
=20
-#ifdef CONFIG_VSX
-#define __REST_32FPVSRS(n,c,base)					=
\
-BEGIN_FTR_SECTION							=
\
-	b	2f;							=
\
-END_FTR_SECTION_IFSET(CPU_FTR_VSX);					=
\
-	REST_32FPRS(n,base);						=
\
-	b	3f;							=
\
-2:	REST_32VSRS(n,c,base);						=
\
-3:
-
-#define __SAVE_32FPVSRS(n,c,base)					=
\
-BEGIN_FTR_SECTION							=
\
-	b	2f;							=
\
-END_FTR_SECTION_IFSET(CPU_FTR_VSX);					=
\
-	SAVE_32FPRS(n,base);						=
\
-	b	3f;							=
\
-2:	SAVE_32VSRS(n,c,base);						=
\
-3:
-#else
-#define __REST_32FPVSRS(n,b,base)	REST_32FPRS(n, base)
-#define __SAVE_32FPVSRS(n,b,base)	SAVE_32FPRS(n, base)
-#endif
-#define REST_32FPVSRS(n,c,base) =
__REST_32FPVSRS(n,__REG_##c,__REG_##base)
-#define SAVE_32FPVSRS(n,c,base) =
__SAVE_32FPVSRS(n,__REG_##c,__REG_##base)
+
+/*
+ * Restore subroutines, R4 is scratch and R5 is base
+ */
+vsx_restore:
+	REST_32VSRS(0, __REG_R4, __REG_R5)
+	b after_restore
+qpx_restore:
+	REST_32QRS(0, __REG_R4, __REG_R5)
+	b after_restore
+fpu_restore:
+	REST_32FPRS(0, __REG_R5)
+	b after_restore
+
+#define REST_32FPVSRS(n, c, base)		\
+BEGIN_FTR_SECTION				\
+	b vsx_restore;				\
+END_FTR_SECTION_IFSET(CPU_FTR_VSX)		\
+BEGIN_FTR_SECTION				\
+	b qpx_restore;				\
+END_FTR_SECTION_IFSET(CPU_FTR_QPX)		\
+	b fpu_restore;				\
+after_restore:
+
+/*
+ * Save subroutines, R4 is scratch and R3 is base
+ */
+vsx_save:
+	SAVE_32VSRS(0, __REG_R4, __REG_R3)
+	b after_save
+qpx_save:
+	SAVE_32QRS(0, __REG_R4, __REG_R3)
+	b after_save
+fpu_save:
+	SAVE_32FPRS(0, __REG_R3)
+	b after_save
+
+#define SAVE_32FPVSRS(n, c, base)		\
+BEGIN_FTR_SECTION				\
+	b vsx_save;				\
+END_FTR_SECTION_IFSET(CPU_FTR_VSX)		\
+BEGIN_FTR_SECTION				\
+	b qpx_save;				\
+END_FTR_SECTION_IFSET(CPU_FTR_QPX)		\
+	b fpu_save;				\
+after_save:
+
+#ifndef CONFIG_SMP
+/*
+ * we need an extra save set for the !CONFIG_SMP case, see below
+ * Scratch it R5 and base is R4
+ */
+vsx_save_nosmp:
+	SAVE_32VSRS(0,R5,R4)
+	b after_save_nosmp
+
+qpx_save_nosmp:
+	SAVE_32QRS(0,R5,R4)
+	b after_save_nosmp
+
+fpu_save_nosmp:
+	SAVE_32FPRS(0,R4)
+	b after_save_nosmp
+
+#define SAVE_32FPVSRS_NOSMP(n,c,base)		\
+BEGIN_FTR_SECTION				\
+	b vsx_save_nosmp;			\
+END_FTR_SECTION_IFSET(CPU_FTR_VSX)		\
+BEGIN_FTR_SECTION				\
+	b qpx_save_nosmp;			\
+END_FTR_SECTION_IFSET(CPU_FTR_QPX)		\
+	b fpu_save_nosmp;			\
+after_save_nosmp:
+
+#endif /* !CONFIG_SMP */
=20
 /*
  * This task wants to use the FPU now.
@@ -65,6 +116,11 @@ BEGIN_FTR_SECTION
 	oris	r5,r5,MSR_VSX@h
 END_FTR_SECTION_IFSET(CPU_FTR_VSX)
 #endif
+#ifdef CONFIG_PPC_QPX
+BEGIN_FTR_SECTION
+	oris	r5,r5,MSR_VEC@h
+END_FTR_SECTION_IFSET(CPU_FTR_QPX)
+#endif
 	SYNC
 	MTMSRD(r5)			/* enable use of fpu now */
 	isync
@@ -81,7 +137,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
 	beq	1f
 	toreal(r4)
 	addi	r4,r4,THREAD		/* want =
last_task_used_math->thread */
-	SAVE_32FPVSRS(0, R5, R4)
+	SAVE_32FPVSRS_NOSMP(0, R5, R4)
 	mffs	fr0
 	stfd	fr0,THREAD_FPSCR(r4)
 	PPC_LL	r5,PT_REGS(r4)
@@ -94,7 +150,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
 #endif /* CONFIG_SMP */
 	/* enable use of FP after return */
 #ifdef CONFIG_PPC32
-	mfspr	r5,SPRN_SPRG_THREAD		/* current task's THREAD =
(phys) */
+	mfspr	r5,SPRN_SPRG_THREAD	/* current task's THREAD (phys) =
*/
 	lwz	r4,THREAD_FPEXC_MODE(r5)
 	ori	r9,r9,MSR_FP		/* enable FP for current */
 	or	r9,r9,r4
@@ -132,6 +188,11 @@ BEGIN_FTR_SECTION
 	oris	r5,r5,MSR_VSX@h
 END_FTR_SECTION_IFSET(CPU_FTR_VSX)
 #endif
+#ifdef CONFIG_PPC_QPX
+BEGIN_FTR_SECTION
+	oris	r5,r5,MSR_VEC@h
+END_FTR_SECTION_IFSET(CPU_FTR_QPX)
+#endif
 	SYNC_601
 	ISYNC_601
 	MTMSRD(r5)			/* enable use of fpu now */
@@ -148,10 +209,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
 	beq	1f
 	PPC_LL	r4,_MSR-STACK_FRAME_OVERHEAD(r5)
 	li	r3,MSR_FP|MSR_FE0|MSR_FE1
-#ifdef CONFIG_VSX
+#if defined(CONFIG_VSX) || defined(CONFIG_PPC_QPX)
 BEGIN_FTR_SECTION
 	oris	r3,r3,MSR_VSX@h
-END_FTR_SECTION_IFSET(CPU_FTR_VSX)
+END_FTR_SECTION_IFSET(CPU_FTR_VSX | CPU_FTR_QPX)
 #endif
 	andc	r4,r4,r3		/* disable FP for previous task =
*/
 	PPC_STL	r4,_MSR-STACK_FRAME_OVERHEAD(r5)

-jx


>=20
>>=20
>> I'd like to see this patch broken into different parts.
>=20
> I'm not sure how _this_ patch:
>  =
<https://github.com/jimix/linux-bgq/commit/279c0615917b959a652e81f4ad0d886=
e2d426d85>
> could be broken up, please advise.
>=20
>>=20
>> Also, have you boot tested this change on a VSX enabled box?
>=20
> I can try, I may bug you for help.
> Is there a commonly test (or apps) I should run?
>=20
> -jx
>=20
>=20
>>=20
>> Mikey
>=20

^ permalink raw reply related

* Re: [PATCH] powerpc: fix wii_memory_fixups() compile error on 3.0.y tree
From: Ben Hutchings @ 2012-12-08 17:22 UTC (permalink / raw)
  To: shuah.khan; +Cc: stable, paulus, shuahkhan, Greg KH, linuxppc-dev
In-Reply-To: <1354932445.7407.15.camel@lorien2>

[-- Attachment #1: Type: text/plain, Size: 2702 bytes --]

On Fri, 2012-12-07 at 19:07 -0700, Shuah Khan wrote:
> Fix wii_memory_fixups() the following compile error on 3.0.y tree with 
> wii_defconfig on 3.0.y tree.
> 
>   CC      arch/powerpc/platforms/embedded6xx/wii.o
> arch/powerpc/platforms/embedded6xx/wii.c: In function ‘wii_memory_fixups’:
> arch/powerpc/platforms/embedded6xx/wii.c:88:2: error: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 2 has type ‘phys_addr_t’ [-Werror=format]
> arch/powerpc/platforms/embedded6xx/wii.c:88:2: error: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘phys_addr_t’ [-Werror=format]
> arch/powerpc/platforms/embedded6xx/wii.c:90:2: error: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 2 has type ‘phys_addr_t’ [-Werror=format]
> arch/powerpc/platforms/embedded6xx/wii.c:90:2: error: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘phys_addr_t’ [-Werror=format]
> cc1: all warnings being treated as errors
> make[2]: *** [arch/powerpc/platforms/embedded6xx/wii.o] Error 1
> make[1]: *** [arch/powerpc/platforms/embedded6xx] Error 2
> make: *** [arch/powerpc/platforms] Error 2
> 
> Signed-off-by: Shuah Khan <shuah.khan@hp.com>
> CC: stable@vger.kernel.org 3.0.y
> ---
>  arch/powerpc/platforms/embedded6xx/wii.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/embedded6xx/wii.c b/arch/powerpc/platforms/embedded6xx/wii.c
> index 1b5dc1a..13d58dd 100644
> --- a/arch/powerpc/platforms/embedded6xx/wii.c
> +++ b/arch/powerpc/platforms/embedded6xx/wii.c
> @@ -85,9 +85,9 @@ void __init wii_memory_fixups(void)
>  	wii_hole_start = p[0].base + p[0].size;
>  	wii_hole_size = p[1].base - wii_hole_start;
>  
> -	pr_info("MEM1: <%08llx %08llx>\n", p[0].base, p[0].size);
> +	pr_info("MEM1: <%08ulx %08ulx>\n", p[0].base, p[0].size);

"%08ulx" is the conversion specification "%08u" followed by literal
"lx".  If phys_addr_t is always defined as unsigned long or always
unsigned int for this platform then you could use "%08lx" or "%08x"
respectively.  But usually the right thing to do is to cast the
arguments of type phys_addr_t to unsigned long long (or, equivalently,
u64).

Ben.

>  	pr_info("HOLE: <%08lx %08lx>\n", wii_hole_start, wii_hole_size);
> -	pr_info("MEM2: <%08llx %08llx>\n", p[1].base, p[1].size);
> +	pr_info("MEM2: <%08ulx %08ulx>\n", p[1].base, p[1].size);
>  
>  	p[0].size += wii_hole_size + p[1].size;
>  

-- 
Ben Hutchings
Beware of programmers who carry screwdrivers. - Leonard Brandwein

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply

* Re: [GIT PULL 00/32] perf/core improvements and fixes
From: Ingo Molnar @ 2012-12-08 14:27 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Feng Tang, Peter Zijlstra, Frederic Weisbecker, Stephane Eranian,
	David Howells, linuxppc-dev, Paul Mackerras, Jiri Olsa,
	Robert Richter, Borislav Petkov, acme, Sukadev Bhattiprolu,
	Peter Zijlstra, Corey Ashford, Namhyung Kim, Anton Blanchard,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Mike Galbraith, linux-kernel, Pekka Enberg, David Ahern,
	Linus Torvalds
In-Reply-To: <1353248997-30763-1-git-send-email-acme@infradead.org>


Note that I had to do a number of conflict resolutions between 
perf/urgent (now upstream) and perf/core, related to UAPI fixes:

 commit f0b9abfb044649bc452fb2fb975ff2fd599cc6a3
 Merge: adc1ef1 1b3c393
 Author: Ingo Molnar <mingo@kernel.org>
 Date:   Sat Dec 8 15:25:06 2012 +0100

    Merge branch 'linus' into perf/core
    
    Conflicts:
        tools/perf/Makefile
        tools/perf/builtin-test.c
        tools/perf/perf.h
        tools/perf/tests/parse-events.c
        tools/perf/util/evsel.h
    
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

I think I managed to resolve them all correctly - but please 
double check the end result nevertheless.

Thanks,

	Ingo

^ permalink raw reply

* Re: [GIT PULL 00/32] perf/core improvements and fixes
From: Ingo Molnar @ 2012-12-08 14:20 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Feng Tang, Peter Zijlstra, Frederic Weisbecker, Stephane Eranian,
	David Howells, linuxppc-dev, Paul Mackerras, Jiri Olsa,
	Robert Richter, Borislav Petkov, acme, Sukadev Bhattiprolu,
	Peter Zijlstra, Corey Ashford, Namhyung Kim, Anton Blanchard,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Mike Galbraith, linux-kernel, Pekka Enberg, David Ahern,
	Linus Torvalds
In-Reply-To: <1353248997-30763-1-git-send-email-acme@infradead.org>


* Arnaldo Carvalho de Melo <acme@infradead.org> wrote:

> Hi Ingo,
> 
> 	Please consider pulling,
> 
> - Arnaldo
> 
> The following changes since commit ffadcf090d468e9c4938b718649f38dd10cfdb02:
> 
>   perf annotate: Handle XBEGIN like a jump (2012-10-31 12:18:26 -0200)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux perf/core
> 
> for you to fetch changes up to 07ac002f2fcc74c5be47b656d9201d5de84dc53d:
> 
>   perf evsel: Introduce is_group_member method (2012-11-14 16:53:45 -0300)
> 
> ----------------------------------------------------------------
> perf/core improvements and fixes
> 
> . UAPI fixes, from David Howels
> 
> . Separate perf tests into multiple objects, one per test, from Jiri Olsa.
> 
> . Fixes to /proc/pid/maps parsing, preparatory to supporting data maps,
>   from Namhyung Kim
> 
> . Fix compile error for non-NEWT builds, from Namhyung Kim
> 
> . Implement ui_progress for GTK, from Namhyung Kim
> 
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> 
> ----------------------------------------------------------------
> Andi Kleen (1):
>       perf tools: Add arbitary aliases and support names with -
> 
> Arnaldo Carvalho de Melo (13):
>       perf diff: Start moving to support matching more than two hists
>       perf diff: Move hists__match to the hists lib
>       perf hists: Introduce hists__link
>       perf diff: Use hists__link when not pairing just with baseline
>       perf machine: Move more methods to machine.[ch]
>       tools lib traceevent: Add __maybe_unused to unused parameters
>       tools lib traceevent: Avoid comparisions between signed/unsigned
>       tools lib traceevent: No need to check for < 0 on an unsigned enum
>       tools lib traceevent: Handle INVALID_ARG_TYPE errno in pevent_strerror
>       tools lib traceevent: Use 'const' in variables pointing to const strings
>       perf tools: Stop using 'self' in pstack
>       perf hists: Initialize all of he->stat with zeroes
>       perf evsel: Introduce is_group_member method
> 
> Daniel Walter (1):
>       tracing: Replace strict_strto* with kstrto*
> 
> David Howells (3):
>       tools: Define a Makefile function to do subdir processing
>       tools: Honour the O= flag when tool build called from a higher Makefile
>       tools: Pass the target in descend
> 
> David Sharp (2):
>       tracing: Trivial cleanup
>       tracing: Reset ring buffer when changing trace_clocks
> 
> Feng Tang (1):
>       perf browser: Don't show scripts menu for 'perf top'
> 
> Hiraku Toyooka (1):
>       tracing: Change tracer's integer flags to bool
> 
> Ingo Molnar (2):
>       Merge tag 'perf-core-for-mingo' of git://git.kernel.org/.../acme/linux into perf/core
>       Merge branch 'tip/perf/core-2' of git://git.kernel.org/.../rostedt/linux-trace into perf/core
> 
> Jiri Olsa (47):
>       perf tools: Remove BINDIR define from exec_cmd.o compilation
>       perf tests: Move test objects into 'tests' directory
>       perf tests: Add framework for automated perf_event_attr tests
>       perf tests: Add attr record basic test
>       perf tests: Add attr tests under builtin test command
>       perf tests: Add attr record group test
>       perf tests: Add attr record event syntax group test
>       perf tests: Add attr record freq test
>       perf tests: Add attr record count test
>       perf tests: Add attr record graph test
>       perf tests: Add attr record period test
>       perf tests: Add attr record no samples test
>       perf tests: Add attr record no-inherit test
>       perf tests: Add attr record data test
>       perf tests: Add attr record raw test
>       perf tests: Add attr record no delay test
>       perf tests: Add attr record branch any test
>       perf tests: Add attr record branch filter tests
>       perf tests: Add attr stat no-inherit test
>       perf tests: Add attr stat group test
>       perf tests: Add attr stat event syntax group test
>       perf tests: Add attr stat default test
>       perf tests: Add attr stat default test
>       perf tests: Add documentation for attr tests
>       perf tests: Add missing attr stat basic test
>       perf tests: Factor attr tests WRITE_ASS macro
>       perf tests: Fix attr watermark field name typo
>       perf tests: Removing 'optional' field
>       perf tests: Move attr.py temp dir cleanup into finally section
>       perf tools: Add LIBDW_DIR Makefile variable to for alternate libdw
>       perf tests: Move test__vmlinux_matches_kallsyms into separate object
>       perf tests: Move test__open_syscall_event into separate object
>       perf tests: Move test__open_syscall_event_on_all_cpus into separate object
>       perf tests: Move test__basic_mmap into separate object
>       perf tests: Move test__PERF_RECORD into separate object
>       perf tests: Move test__rdpmc into separate object
>       perf tests: Move perf_evsel__roundtrip_name_test into separate object
>       perf tests: Move perf_evsel__tp_sched_test into separate object
>       perf tests: Move test__syscall_open_tp_fields into separate object
>       perf tests: Move pmu tests into separate object
>       perf tests: Final cleanup for builtin-test move
>       perf tests: Check for mkstemp return value in dso-data test
>       perf tools: Fix attributes for '{}' defined event groups
>       perf tools: Fix 'disabled' attribute config for record command
>       perf tools: Ensure single disable call per event in record comand
>       perf tools: Omit group members from perf_evlist__disable/enable
>       perf tools: Add basic event modifier sanity check
> 
> Michal Hocko (1):
>       linux/kernel.h: Remove duplicate trace_printk declaration
> 
> Namhyung Kim (18):
>       perf tools: Use normalized arch name for searching objdump path
>       perf tools: Introduce struct hist_browser_timer
>       perf report: Postpone objdump check until annotation requested
>       perf machine: Set kernel data mapping length
>       perf tools: Fix detection of stack area
>       perf hists: Free branch_info when freeing hist_entry
>       perf tools: Don't try to lookup objdump for live mode
>       perf annotate: Whitespace fixups
>       perf annotate: Don't try to follow jump target on PLT symbols
>       perf annotate: Merge same lines in summary view
>       perf tools: Fix compile error on NO_NEWT=1 build
>       perf tools: Add gtk.<command> config option for launching GTK browser
>       perf tools: Use sscanf for parsing /proc/pid/maps
>       perf ui tui: Move progress.c under ui/tui directory
>       perf ui: Introduce generic ui_progress helper
>       perf ui gtk: Implement ui_progress functions
>       perf ui: Add ui_progress__finish()
>       perf ui: Always compile browser setup code
> 
> Slava Pestov (1):
>       ring-buffer: Add a 'dropped events' counter
> 
> Steven Rostedt (11):
>       tracing: Allow tracers to start at core initcall
>       tracing: Expand ring buffer when trace_printk() is used
>       tracing: Enable comm recording if trace_printk() is used
>       tracing: Have tracing_sched_wakeup_trace() use standard unlock_commit
>       tracing: Cache comms only after an event occurred
>       tracing: Separate open function from set_event and available_events
>       tracing: Remove unused function unregister_tracer()
>       tracing: Make tracing_enabled be equal to tracing_on
>       tracing: Remove deprecated tracing_enabled file
>       tracing: Use irq_work for wake ups and remove *_nowake_*() functions
>       tracing: Add trace_options kernel command line parameter
> 
> Sukadev Bhattiprolu (1):
>       perf powerpc: Use uapi/unistd.h to fix build error
> 
> Vaibhav Nagarnaik (1):
>       tracing: Cleanup unnecessary function declarations
> 
> Yoshihiro YUNOMAE (1):
>       ring-buffer: Change unsigned long type of ring_buffer_oldest_event_ts() to u64
> 
> Zheng Liu (1):
>       perf test: fix a build error on builtin-test
> 
>  Documentation/kernel-parameters.txt                |   16 +
>  Makefile                                           |    6 +-
>  include/linux/ftrace_event.h                       |   14 +-
>  include/linux/kernel.h                             |    7 +-
>  include/linux/ring_buffer.h                        |    3 +-
>  include/trace/ftrace.h                             |    3 +-
>  include/trace/syscall.h                            |   23 -
>  kernel/trace/Kconfig                               |    1 +
>  kernel/trace/ftrace.c                              |    6 +-
>  kernel/trace/ring_buffer.c                         |   51 +-
>  kernel/trace/trace.c                               |  372 ++---
>  kernel/trace/trace.h                               |   14 +-
>  kernel/trace/trace_branch.c                        |    4 +-
>  kernel/trace/trace_events.c                        |   51 +-
>  kernel/trace/trace_events_filter.c                 |    4 +-
>  kernel/trace/trace_functions.c                     |    5 +-
>  kernel/trace/trace_functions_graph.c               |    6 +-
>  kernel/trace/trace_irqsoff.c                       |   14 +-
>  kernel/trace/trace_kprobe.c                        |   10 +-
>  kernel/trace/trace_probe.c                         |   14 +-
>  kernel/trace/trace_sched_switch.c                  |    4 +-
>  kernel/trace/trace_sched_wakeup.c                  |   10 +-
>  kernel/trace/trace_selftest.c                      |   13 +-
>  kernel/trace/trace_syscalls.c                      |   61 +-
>  kernel/trace/trace_uprobe.c                        |    2 +-
>  tools/Makefile                                     |   24 +-
>  tools/lib/traceevent/event-parse.c                 |   22 +-
>  tools/perf/Makefile                                |   50 +-
>  tools/perf/arch/common.c                           |   47 +-
>  tools/perf/builtin-annotate.c                      |    2 +-
>  tools/perf/builtin-diff.c                          |   48 +-
>  tools/perf/builtin-record.c                        |   26 +-
>  tools/perf/builtin-report.c                        |   11 +-
>  tools/perf/builtin-stat.c                          |   12 +-
>  tools/perf/builtin-test.c                          | 1559 --------------------
>  tools/perf/builtin-top.c                           |   10 +-
>  tools/perf/perf.c                                  |   17 +-
>  tools/perf/perf.h                                  |   18 +-
>  tools/perf/tests/attr.c                            |  175 +++
>  tools/perf/tests/attr.py                           |  322 ++++
>  tools/perf/tests/attr/README                       |   64 +
>  tools/perf/tests/attr/base-record                  |   39 +
>  tools/perf/tests/attr/base-stat                    |   39 +
>  tools/perf/tests/attr/test-record-basic            |    5 +
>  tools/perf/tests/attr/test-record-branch-any       |    8 +
>  .../perf/tests/attr/test-record-branch-filter-any  |    8 +
>  .../tests/attr/test-record-branch-filter-any_call  |    8 +
>  .../tests/attr/test-record-branch-filter-any_ret   |    8 +
>  tools/perf/tests/attr/test-record-branch-filter-hv |    8 +
>  .../tests/attr/test-record-branch-filter-ind_call  |    8 +
>  tools/perf/tests/attr/test-record-branch-filter-k  |    8 +
>  tools/perf/tests/attr/test-record-branch-filter-u  |    8 +
>  tools/perf/tests/attr/test-record-count            |    8 +
>  tools/perf/tests/attr/test-record-data             |    8 +
>  tools/perf/tests/attr/test-record-freq             |    6 +
>  tools/perf/tests/attr/test-record-graph-default    |    6 +
>  tools/perf/tests/attr/test-record-graph-dwarf      |   10 +
>  tools/perf/tests/attr/test-record-graph-fp         |    6 +
>  tools/perf/tests/attr/test-record-group            |   18 +
>  tools/perf/tests/attr/test-record-group1           |   19 +
>  tools/perf/tests/attr/test-record-no-delay         |    9 +
>  tools/perf/tests/attr/test-record-no-inherit       |    7 +
>  tools/perf/tests/attr/test-record-no-samples       |    6 +
>  tools/perf/tests/attr/test-record-period           |    7 +
>  tools/perf/tests/attr/test-record-raw              |    7 +
>  tools/perf/tests/attr/test-stat-basic              |    6 +
>  tools/perf/tests/attr/test-stat-default            |   64 +
>  tools/perf/tests/attr/test-stat-detailed-1         |  101 ++
>  tools/perf/tests/attr/test-stat-detailed-2         |  155 ++
>  tools/perf/tests/attr/test-stat-detailed-3         |  173 +++
>  tools/perf/tests/attr/test-stat-group              |   15 +
>  tools/perf/tests/attr/test-stat-group1             |   15 +
>  tools/perf/tests/attr/test-stat-no-inherit         |    7 +
>  tools/perf/tests/builtin-test.c                    |  173 +++
>  .../{util/dso-test-data.c => tests/dso-data.c}     |    8 +-
>  tools/perf/tests/evsel-roundtrip-name.c            |  114 ++
>  tools/perf/tests/evsel-tp-sched.c                  |   84 ++
>  tools/perf/tests/mmap-basic.c                      |  162 ++
>  tools/perf/tests/open-syscall-all-cpus.c           |  120 ++
>  tools/perf/tests/open-syscall-tp-fields.c          |  117 ++
>  tools/perf/tests/open-syscall.c                    |   66 +
>  .../parse-events-test.c => tests/parse-events.c}   |   23 +-
>  tools/perf/tests/perf-record.c                     |  312 ++++
>  tools/perf/tests/pmu.c                             |  178 +++
>  tools/perf/tests/rdpmc.c                           |  175 +++
>  tools/perf/tests/tests.h                           |   22 +
>  tools/perf/tests/util.c                            |   30 +
>  tools/perf/tests/vmlinux-kallsyms.c                |  230 +++
>  tools/perf/ui/browsers/annotate.c                  |   39 +-
>  tools/perf/ui/browsers/hists.c                     |   63 +-
>  tools/perf/ui/gtk/browser.c                        |    4 +-
>  tools/perf/ui/gtk/gtk.h                            |    1 +
>  tools/perf/ui/gtk/progress.c                       |   59 +
>  tools/perf/ui/gtk/setup.c                          |    2 +
>  tools/perf/ui/gtk/util.c                           |   11 -
>  tools/perf/ui/hist.c                               |   10 +-
>  tools/perf/ui/progress.c                           |   44 +-
>  tools/perf/ui/progress.h                           |   10 +
>  tools/perf/ui/tui/progress.c                       |   42 +
>  tools/perf/ui/tui/setup.c                          |    1 +
>  tools/perf/ui/ui.h                                 |   28 +
>  tools/perf/util/annotate.c                         |   69 +-
>  tools/perf/util/annotate.h                         |    9 +-
>  tools/perf/util/cache.h                            |   39 +-
>  tools/perf/util/debug.h                            |    1 +
>  tools/perf/util/dso.c                              |    1 +
>  tools/perf/util/event.c                            |   74 +-
>  tools/perf/util/evlist.c                           |   10 +-
>  tools/perf/util/evsel.c                            |   52 +-
>  tools/perf/util/evsel.h                            |    9 +-
>  tools/perf/util/hist.c                             |   99 ++
>  tools/perf/util/hist.h                             |   36 +-
>  tools/perf/util/machine.c                          |  205 ++-
>  tools/perf/util/machine.h                          |  131 +-
>  tools/perf/util/map.c                              |  181 +--
>  tools/perf/util/map.h                              |   93 --
>  tools/perf/util/parse-events.c                     |   24 +
>  tools/perf/util/parse-events.h                     |    1 -
>  tools/perf/util/parse-events.l                     |    4 +-
>  tools/perf/util/pmu.c                              |  185 +--
>  tools/perf/util/pmu.h                              |    4 +
>  tools/perf/util/pstack.c                           |   46 +-
>  tools/perf/util/session.c                          |    1 +
>  tools/perf/util/session.h                          |    5 +-
>  tools/perf/util/sort.h                             |   27 +-
>  tools/perf/util/symbol.c                           |    1 +
>  tools/perf/util/symbol.h                           |   21 -
>  tools/scripts/Makefile.include                     |   23 +-
>  128 files changed, 4649 insertions(+), 2751 deletions(-)
>  delete mode 100644 tools/perf/builtin-test.c
>  create mode 100644 tools/perf/tests/attr.c
>  create mode 100644 tools/perf/tests/attr.py
>  create mode 100644 tools/perf/tests/attr/README
>  create mode 100644 tools/perf/tests/attr/base-record
>  create mode 100644 tools/perf/tests/attr/base-stat
>  create mode 100644 tools/perf/tests/attr/test-record-basic
>  create mode 100644 tools/perf/tests/attr/test-record-branch-any
>  create mode 100644 tools/perf/tests/attr/test-record-branch-filter-any
>  create mode 100644 tools/perf/tests/attr/test-record-branch-filter-any_call
>  create mode 100644 tools/perf/tests/attr/test-record-branch-filter-any_ret
>  create mode 100644 tools/perf/tests/attr/test-record-branch-filter-hv
>  create mode 100644 tools/perf/tests/attr/test-record-branch-filter-ind_call
>  create mode 100644 tools/perf/tests/attr/test-record-branch-filter-k
>  create mode 100644 tools/perf/tests/attr/test-record-branch-filter-u
>  create mode 100644 tools/perf/tests/attr/test-record-count
>  create mode 100644 tools/perf/tests/attr/test-record-data
>  create mode 100644 tools/perf/tests/attr/test-record-freq
>  create mode 100644 tools/perf/tests/attr/test-record-graph-default
>  create mode 100644 tools/perf/tests/attr/test-record-graph-dwarf
>  create mode 100644 tools/perf/tests/attr/test-record-graph-fp
>  create mode 100644 tools/perf/tests/attr/test-record-group
>  create mode 100644 tools/perf/tests/attr/test-record-group1
>  create mode 100644 tools/perf/tests/attr/test-record-no-delay
>  create mode 100644 tools/perf/tests/attr/test-record-no-inherit
>  create mode 100644 tools/perf/tests/attr/test-record-no-samples
>  create mode 100644 tools/perf/tests/attr/test-record-period
>  create mode 100644 tools/perf/tests/attr/test-record-raw
>  create mode 100644 tools/perf/tests/attr/test-stat-basic
>  create mode 100644 tools/perf/tests/attr/test-stat-default
>  create mode 100644 tools/perf/tests/attr/test-stat-detailed-1
>  create mode 100644 tools/perf/tests/attr/test-stat-detailed-2
>  create mode 100644 tools/perf/tests/attr/test-stat-detailed-3
>  create mode 100644 tools/perf/tests/attr/test-stat-group
>  create mode 100644 tools/perf/tests/attr/test-stat-group1
>  create mode 100644 tools/perf/tests/attr/test-stat-no-inherit
>  create mode 100644 tools/perf/tests/builtin-test.c
>  rename tools/perf/{util/dso-test-data.c => tests/dso-data.c} (95%)
>  create mode 100644 tools/perf/tests/evsel-roundtrip-name.c
>  create mode 100644 tools/perf/tests/evsel-tp-sched.c
>  create mode 100644 tools/perf/tests/mmap-basic.c
>  create mode 100644 tools/perf/tests/open-syscall-all-cpus.c
>  create mode 100644 tools/perf/tests/open-syscall-tp-fields.c
>  create mode 100644 tools/perf/tests/open-syscall.c
>  rename tools/perf/{util/parse-events-test.c => tests/parse-events.c} (97%)
>  create mode 100644 tools/perf/tests/perf-record.c
>  create mode 100644 tools/perf/tests/pmu.c
>  create mode 100644 tools/perf/tests/rdpmc.c
>  create mode 100644 tools/perf/tests/tests.h
>  create mode 100644 tools/perf/tests/util.c
>  create mode 100644 tools/perf/tests/vmlinux-kallsyms.c
>  create mode 100644 tools/perf/ui/gtk/progress.c
>  create mode 100644 tools/perf/ui/tui/progress.c

Pulled, thanks a lot Arnaldo!

	Ingo

^ permalink raw reply

* RE: [PATCH] Revert "crypto: caam - Updated SEC-4.0 device tree binding for ERA information."
From: Garg Vakul-B16394 @ 2012-12-08  4:30 UTC (permalink / raw)
  To: linux-crypto@vger.kernel.org, linuxppc-dev@ozlabs.org,
	devicetree-discuss@lists.ozlabs.org
In-Reply-To: <1354870649-3656-1-git-send-email-vakul@freescale.com>

Kindly ignore this.
I would send another patch updating device tree binding directly with the n=
ew proposed property.

> -----Original Message-----
> From: devicetree-discuss [mailto:devicetree-discuss-
> bounces+vakul=3Dfreescale.com@lists.ozlabs.org] On Behalf Of Vakul Garg
> Sent: Friday, December 07, 2012 2:27 PM
> To: linux-crypto@vger.kernel.org; linuxppc-dev@ozlabs.org; devicetree-
> discuss@lists.ozlabs.org
> Subject: [PATCH] Revert "crypto: caam - Updated SEC-4.0 device tree
> binding for ERA information."
>=20
> This reverts commit a2c0911c09190125f52c9941b9d187f601c2f7be.
>=20
> Signed-off-by: Vakul Garg <vakul@freescale.com>
> ---
> Instead of adding SEC era information in crypto node's compatible, a new
> property 'fsl,sec-era' is being introduced into crypto node.
>=20
>  .../devicetree/bindings/crypto/fsl-sec4.txt        |    5 ++---
>  1 files changed, 2 insertions(+), 3 deletions(-)
>=20
> diff --git a/Documentation/devicetree/bindings/crypto/fsl-sec4.txt
> b/Documentation/devicetree/bindings/crypto/fsl-sec4.txt
> index fc9ce6f..bd7ce12 100644
> --- a/Documentation/devicetree/bindings/crypto/fsl-sec4.txt
> +++ b/Documentation/devicetree/bindings/crypto/fsl-sec4.txt
> @@ -54,8 +54,7 @@ PROPERTIES
>     - compatible
>        Usage: required
>        Value type: <string>
> -      Definition: Must include "fsl,sec-v4.0". Also includes SEC
> -           ERA versions (optional) with which the device is compatible.
> +      Definition: Must include "fsl,sec-v4.0"
>=20
>     - #address-cells
>         Usage: required
> @@ -107,7 +106,7 @@ PROPERTIES
>=20
>  EXAMPLE
>  	crypto@300000 {
> -		compatible =3D "fsl,sec-v4.0", "fsl,sec-era-v2.0";
> +		compatible =3D "fsl,sec-v4.0";
>  		#address-cells =3D <1>;
>  		#size-cells =3D <1>;
>  		reg =3D <0x300000 0x10000>;
> --
> 1.7.7.6
>=20
>=20
> _______________________________________________
> devicetree-discuss mailing list
> devicetree-discuss@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/devicetree-discuss

^ permalink raw reply

* [PATCH] powerpc: fix wii_memory_fixups() compile error on 3.0.y tree
From: Shuah Khan @ 2012-12-08  2:07 UTC (permalink / raw)
  To: benh, paulus, Greg KH; +Cc: linuxppc-dev, shuahkhan, stable

Fix wii_memory_fixups() the following compile error on 3.0.y tree with 
wii_defconfig on 3.0.y tree.

  CC      arch/powerpc/platforms/embedded6xx/wii.o
arch/powerpc/platforms/embedded6xx/wii.c: In function ‘wii_memory_fixups’:
arch/powerpc/platforms/embedded6xx/wii.c:88:2: error: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 2 has type ‘phys_addr_t’ [-Werror=format]
arch/powerpc/platforms/embedded6xx/wii.c:88:2: error: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘phys_addr_t’ [-Werror=format]
arch/powerpc/platforms/embedded6xx/wii.c:90:2: error: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 2 has type ‘phys_addr_t’ [-Werror=format]
arch/powerpc/platforms/embedded6xx/wii.c:90:2: error: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘phys_addr_t’ [-Werror=format]
cc1: all warnings being treated as errors
make[2]: *** [arch/powerpc/platforms/embedded6xx/wii.o] Error 1
make[1]: *** [arch/powerpc/platforms/embedded6xx] Error 2
make: *** [arch/powerpc/platforms] Error 2

Signed-off-by: Shuah Khan <shuah.khan@hp.com>
CC: stable@vger.kernel.org 3.0.y
---
 arch/powerpc/platforms/embedded6xx/wii.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/embedded6xx/wii.c b/arch/powerpc/platforms/embedded6xx/wii.c
index 1b5dc1a..13d58dd 100644
--- a/arch/powerpc/platforms/embedded6xx/wii.c
+++ b/arch/powerpc/platforms/embedded6xx/wii.c
@@ -85,9 +85,9 @@ void __init wii_memory_fixups(void)
 	wii_hole_start = p[0].base + p[0].size;
 	wii_hole_size = p[1].base - wii_hole_start;
 
-	pr_info("MEM1: <%08llx %08llx>\n", p[0].base, p[0].size);
+	pr_info("MEM1: <%08ulx %08ulx>\n", p[0].base, p[0].size);
 	pr_info("HOLE: <%08lx %08lx>\n", wii_hole_start, wii_hole_size);
-	pr_info("MEM2: <%08llx %08llx>\n", p[1].base, p[1].size);
+	pr_info("MEM2: <%08ulx %08ulx>\n", p[1].base, p[1].size);
 
 	p[0].size += wii_hole_size + p[1].size;
 
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH] powerpc: POWER7 optimised memcpy using VMX and enhanced prefetch
From: Jimi Xenidis @ 2012-12-07 23:20 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Kumar Gala, paulus, linuxppc-dev
In-Reply-To: <20120531162209.04bd9bdc@kryten>


On May 31, 2012, at 1:22 AM, Anton Blanchard <anton@samba.org> wrote:

>=20
> Implement a POWER7 optimised memcpy using VMX and enhanced prefetch
> instructions.

<<snip>>

>=20
> Index: linux-build/arch/powerpc/lib/Makefile
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- linux-build.orig/arch/powerpc/lib/Makefile	2012-05-30 =
15:27:30.000000000 +1000
> +++ linux-build/arch/powerpc/lib/Makefile	2012-05-31 =
09:12:27.574372864 +1000
> @@ -17,7 +17,8 @@ obj-$(CONFIG_HAS_IOMEM)	+=3D devres.o
> obj-$(CONFIG_PPC64)	+=3D copypage_64.o copyuser_64.o \
> 			   memcpy_64.o usercopy_64.o mem_64.o string.o \
> 			   checksum_wrappers_64.o hweight_64.o \
> -			   copyuser_power7.o string_64.o =
copypage_power7.o
> +			   copyuser_power7.o string_64.o =
copypage_power7.o \
> +			   memcpy_power7.o

Hi,
I know this is a little late, but shouldn't these power7 specific =
thingies be in "obj-$(CONFIG_PPC_BOOK3S_64)".
The reason I ask is that my compiler pukes on "dcbtst" and as I deal =
with that I wanted to point this out.

-jx


> obj-$(CONFIG_XMON)	+=3D sstep.o ldstfp.o
> obj-$(CONFIG_KPROBES)	+=3D sstep.o ldstfp.o
> obj-$(CONFIG_HAVE_HW_BREAKPOINT)	+=3D sstep.o ldstfp.o
> Index: linux-build/arch/powerpc/lib/memcpy_64.S
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- linux-build.orig/arch/powerpc/lib/memcpy_64.S	2012-05-30 =
09:39:59.000000000 +1000
> +++ linux-build/arch/powerpc/lib/memcpy_64.S	2012-05-31 =
09:12:00.093876936 +1000
> @@ -11,7 +11,11 @@
>=20
> 	.align	7
> _GLOBAL(memcpy)
> +BEGIN_FTR_SECTION
> 	std	r3,48(r1)	/* save destination pointer for return =
value */
> +FTR_SECTION_ELSE
> +	b	memcpy_power7
> +ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY)
> 	PPC_MTOCRF(0x01,r5)
> 	cmpldi	cr1,r5,16
> 	neg	r6,r3		# LS 3 bits =3D # bytes to 8-byte dest =
bdry
> Index: linux-build/arch/powerpc/lib/memcpy_power7.S
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-build/arch/powerpc/lib/memcpy_power7.S	2012-05-31 =
15:28:03.495781127 +1000
> @@ -0,0 +1,650 @@
> +/*
> + * This program is free software; you can redistribute it and/or =
modify
> + * it under the terms of the GNU General Public License as published =
by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA =
02111-1307, USA.
> + *
> + * Copyright (C) IBM Corporation, 2012
> + *
> + * Author: Anton Blanchard <anton@au.ibm.com>
> + */
> +#include <asm/ppc_asm.h>
> +
> +#define STACKFRAMESIZE	256
> +#define STK_REG(i)	(112 + ((i)-14)*8)
> +
> +_GLOBAL(memcpy_power7)
> +#ifdef CONFIG_ALTIVEC
> +	cmpldi	r5,16
> +	cmpldi	cr1,r5,4096
> +
> +	std	r3,48(r1)
> +
> +	blt	.Lshort_copy
> +	bgt	cr1,.Lvmx_copy
> +#else
> +	cmpldi	r5,16
> +
> +	std	r3,48(r1)
> +
> +	blt	.Lshort_copy
> +#endif
> +
> +.Lnonvmx_copy:
> +	/* Get the source 8B aligned */
> +	neg	r6,r4
> +	mtocrf	0x01,r6
> +	clrldi	r6,r6,(64-3)
> +
> +	bf	cr7*4+3,1f
> +	lbz	r0,0(r4)
> +	addi	r4,r4,1
> +	stb	r0,0(r3)
> +	addi	r3,r3,1
> +
> +1:	bf	cr7*4+2,2f
> +	lhz	r0,0(r4)
> +	addi	r4,r4,2
> +	sth	r0,0(r3)
> +	addi	r3,r3,2
> +
> +2:	bf	cr7*4+1,3f
> +	lwz	r0,0(r4)
> +	addi	r4,r4,4
> +	stw	r0,0(r3)
> +	addi	r3,r3,4
> +
> +3:	sub	r5,r5,r6
> +	cmpldi	r5,128
> +	blt	5f
> +
> +	mflr	r0
> +	stdu	r1,-STACKFRAMESIZE(r1)
> +	std	r14,STK_REG(r14)(r1)
> +	std	r15,STK_REG(r15)(r1)
> +	std	r16,STK_REG(r16)(r1)
> +	std	r17,STK_REG(r17)(r1)
> +	std	r18,STK_REG(r18)(r1)
> +	std	r19,STK_REG(r19)(r1)
> +	std	r20,STK_REG(r20)(r1)
> +	std	r21,STK_REG(r21)(r1)
> +	std	r22,STK_REG(r22)(r1)
> +	std	r0,STACKFRAMESIZE+16(r1)
> +
> +	srdi	r6,r5,7
> +	mtctr	r6
> +
> +	/* Now do cacheline (128B) sized loads and stores. */
> +	.align	5
> +4:
> +	ld	r0,0(r4)
> +	ld	r6,8(r4)
> +	ld	r7,16(r4)
> +	ld	r8,24(r4)
> +	ld	r9,32(r4)
> +	ld	r10,40(r4)
> +	ld	r11,48(r4)
> +	ld	r12,56(r4)
> +	ld	r14,64(r4)
> +	ld	r15,72(r4)
> +	ld	r16,80(r4)
> +	ld	r17,88(r4)
> +	ld	r18,96(r4)
> +	ld	r19,104(r4)
> +	ld	r20,112(r4)
> +	ld	r21,120(r4)
> +	addi	r4,r4,128
> +	std	r0,0(r3)
> +	std	r6,8(r3)
> +	std	r7,16(r3)
> +	std	r8,24(r3)
> +	std	r9,32(r3)
> +	std	r10,40(r3)
> +	std	r11,48(r3)
> +	std	r12,56(r3)
> +	std	r14,64(r3)
> +	std	r15,72(r3)
> +	std	r16,80(r3)
> +	std	r17,88(r3)
> +	std	r18,96(r3)
> +	std	r19,104(r3)
> +	std	r20,112(r3)
> +	std	r21,120(r3)
> +	addi	r3,r3,128
> +	bdnz	4b
> +
> +	clrldi	r5,r5,(64-7)
> +
> +	ld	r14,STK_REG(r14)(r1)
> +	ld	r15,STK_REG(r15)(r1)
> +	ld	r16,STK_REG(r16)(r1)
> +	ld	r17,STK_REG(r17)(r1)
> +	ld	r18,STK_REG(r18)(r1)
> +	ld	r19,STK_REG(r19)(r1)
> +	ld	r20,STK_REG(r20)(r1)
> +	ld	r21,STK_REG(r21)(r1)
> +	ld	r22,STK_REG(r22)(r1)
> +	addi	r1,r1,STACKFRAMESIZE
> +
> +	/* Up to 127B to go */
> +5:	srdi	r6,r5,4
> +	mtocrf	0x01,r6
> +
> +6:	bf	cr7*4+1,7f
> +	ld	r0,0(r4)
> +	ld	r6,8(r4)
> +	ld	r7,16(r4)
> +	ld	r8,24(r4)
> +	ld	r9,32(r4)
> +	ld	r10,40(r4)
> +	ld	r11,48(r4)
> +	ld	r12,56(r4)
> +	addi	r4,r4,64
> +	std	r0,0(r3)
> +	std	r6,8(r3)
> +	std	r7,16(r3)
> +	std	r8,24(r3)
> +	std	r9,32(r3)
> +	std	r10,40(r3)
> +	std	r11,48(r3)
> +	std	r12,56(r3)
> +	addi	r3,r3,64
> +
> +	/* Up to 63B to go */
> +7:	bf	cr7*4+2,8f
> +	ld	r0,0(r4)
> +	ld	r6,8(r4)
> +	ld	r7,16(r4)
> +	ld	r8,24(r4)
> +	addi	r4,r4,32
> +	std	r0,0(r3)
> +	std	r6,8(r3)
> +	std	r7,16(r3)
> +	std	r8,24(r3)
> +	addi	r3,r3,32
> +
> +	/* Up to 31B to go */
> +8:	bf	cr7*4+3,9f
> +	ld	r0,0(r4)
> +	ld	r6,8(r4)
> +	addi	r4,r4,16
> +	std	r0,0(r3)
> +	std	r6,8(r3)
> +	addi	r3,r3,16
> +
> +9:	clrldi	r5,r5,(64-4)
> +
> +	/* Up to 15B to go */
> +.Lshort_copy:
> +	mtocrf	0x01,r5
> +	bf	cr7*4+0,12f
> +	lwz	r0,0(r4)	/* Less chance of a reject with word ops =
*/
> +	lwz	r6,4(r4)
> +	addi	r4,r4,8
> +	stw	r0,0(r3)
> +	stw	r6,4(r3)
> +	addi	r3,r3,8
> +
> +12:	bf	cr7*4+1,13f
> +	lwz	r0,0(r4)
> +	addi	r4,r4,4
> +	stw	r0,0(r3)
> +	addi	r3,r3,4
> +
> +13:	bf	cr7*4+2,14f
> +	lhz	r0,0(r4)
> +	addi	r4,r4,2
> +	sth	r0,0(r3)
> +	addi	r3,r3,2
> +
> +14:	bf	cr7*4+3,15f
> +	lbz	r0,0(r4)
> +	stb	r0,0(r3)
> +
> +15:	ld	r3,48(r1)
> +	blr
> +
> +.Lunwind_stack_nonvmx_copy:
> +	addi	r1,r1,STACKFRAMESIZE
> +	b	.Lnonvmx_copy
> +
> +#ifdef CONFIG_ALTIVEC
> +.Lvmx_copy:
> +	mflr	r0
> +	std	r4,56(r1)
> +	std	r5,64(r1)
> +	std	r0,16(r1)
> +	stdu	r1,-STACKFRAMESIZE(r1)
> +	bl	.enter_vmx_copy
> +	cmpwi	r3,0
> +	ld	r0,STACKFRAMESIZE+16(r1)
> +	ld	r3,STACKFRAMESIZE+48(r1)
> +	ld	r4,STACKFRAMESIZE+56(r1)
> +	ld	r5,STACKFRAMESIZE+64(r1)
> +	mtlr	r0
> +
> +	/*
> +	 * We prefetch both the source and destination using enhanced =
touch
> +	 * instructions. We use a stream ID of 0 for the load side and
> +	 * 1 for the store side.
> +	 */
> +	clrrdi	r6,r4,7
> +	clrrdi	r9,r3,7
> +	ori	r9,r9,1		/* stream=3D1 */
> +
> +	srdi	r7,r5,7		/* length in cachelines, capped at 0x3FF =
*/
> +	cmpldi	cr1,r7,0x3FF
> +	ble	cr1,1f
> +	li	r7,0x3FF
> +1:	lis	r0,0x0E00	/* depth=3D7 */
> +	sldi	r7,r7,7
> +	or	r7,r7,r0
> +	ori	r10,r7,1	/* stream=3D1 */
> +
> +	lis	r8,0x8000	/* GO=3D1 */
> +	clrldi	r8,r8,32
> +
> +.machine push
> +.machine "power4"
> +	dcbt	r0,r6,0b01000
> +	dcbt	r0,r7,0b01010
> +	dcbtst	r0,r9,0b01000
> +	dcbtst	r0,r10,0b01010
> +	eieio
> +	dcbt	r0,r8,0b01010	/* GO */
> +.machine pop
> +
> +	beq	.Lunwind_stack_nonvmx_copy
> +
> +	/*
> +	 * If source and destination are not relatively aligned we use a
> +	 * slower permute loop.
> +	 */
> +	xor	r6,r4,r3
> +	rldicl.	r6,r6,0,(64-4)
> +	bne	.Lvmx_unaligned_copy
> +
> +	/* Get the destination 16B aligned */
> +	neg	r6,r3
> +	mtocrf	0x01,r6
> +	clrldi	r6,r6,(64-4)
> +
> +	bf	cr7*4+3,1f
> +	lbz	r0,0(r4)
> +	addi	r4,r4,1
> +	stb	r0,0(r3)
> +	addi	r3,r3,1
> +
> +1:	bf	cr7*4+2,2f
> +	lhz	r0,0(r4)
> +	addi	r4,r4,2
> +	sth	r0,0(r3)
> +	addi	r3,r3,2
> +
> +2:	bf	cr7*4+1,3f
> +	lwz	r0,0(r4)
> +	addi	r4,r4,4
> +	stw	r0,0(r3)
> +	addi	r3,r3,4
> +
> +3:	bf	cr7*4+0,4f
> +	ld	r0,0(r4)
> +	addi	r4,r4,8
> +	std	r0,0(r3)
> +	addi	r3,r3,8
> +
> +4:	sub	r5,r5,r6
> +
> +	/* Get the desination 128B aligned */
> +	neg	r6,r3
> +	srdi	r7,r6,4
> +	mtocrf	0x01,r7
> +	clrldi	r6,r6,(64-7)
> +
> +	li	r9,16
> +	li	r10,32
> +	li	r11,48
> +
> +	bf	cr7*4+3,5f
> +	lvx	vr1,r0,r4
> +	addi	r4,r4,16
> +	stvx	vr1,r0,r3
> +	addi	r3,r3,16
> +
> +5:	bf	cr7*4+2,6f
> +	lvx	vr1,r0,r4
> +	lvx	vr0,r4,r9
> +	addi	r4,r4,32
> +	stvx	vr1,r0,r3
> +	stvx	vr0,r3,r9
> +	addi	r3,r3,32
> +
> +6:	bf	cr7*4+1,7f
> +	lvx	vr3,r0,r4
> +	lvx	vr2,r4,r9
> +	lvx	vr1,r4,r10
> +	lvx	vr0,r4,r11
> +	addi	r4,r4,64
> +	stvx	vr3,r0,r3
> +	stvx	vr2,r3,r9
> +	stvx	vr1,r3,r10
> +	stvx	vr0,r3,r11
> +	addi	r3,r3,64
> +
> +7:	sub	r5,r5,r6
> +	srdi	r6,r5,7
> +
> +	std	r14,STK_REG(r14)(r1)
> +	std	r15,STK_REG(r15)(r1)
> +	std	r16,STK_REG(r16)(r1)
> +
> +	li	r12,64
> +	li	r14,80
> +	li	r15,96
> +	li	r16,112
> +
> +	mtctr	r6
> +
> +	/*
> +	 * Now do cacheline sized loads and stores. By this stage the
> +	 * cacheline stores are also cacheline aligned.
> +	 */
> +	.align	5
> +8:
> +	lvx	vr7,r0,r4
> +	lvx	vr6,r4,r9
> +	lvx	vr5,r4,r10
> +	lvx	vr4,r4,r11
> +	lvx	vr3,r4,r12
> +	lvx	vr2,r4,r14
> +	lvx	vr1,r4,r15
> +	lvx	vr0,r4,r16
> +	addi	r4,r4,128
> +	stvx	vr7,r0,r3
> +	stvx	vr6,r3,r9
> +	stvx	vr5,r3,r10
> +	stvx	vr4,r3,r11
> +	stvx	vr3,r3,r12
> +	stvx	vr2,r3,r14
> +	stvx	vr1,r3,r15
> +	stvx	vr0,r3,r16
> +	addi	r3,r3,128
> +	bdnz	8b
> +
> +	ld	r14,STK_REG(r14)(r1)
> +	ld	r15,STK_REG(r15)(r1)
> +	ld	r16,STK_REG(r16)(r1)
> +
> +	/* Up to 127B to go */
> +	clrldi	r5,r5,(64-7)
> +	srdi	r6,r5,4
> +	mtocrf	0x01,r6
> +
> +	bf	cr7*4+1,9f
> +	lvx	vr3,r0,r4
> +	lvx	vr2,r4,r9
> +	lvx	vr1,r4,r10
> +	lvx	vr0,r4,r11
> +	addi	r4,r4,64
> +	stvx	vr3,r0,r3
> +	stvx	vr2,r3,r9
> +	stvx	vr1,r3,r10
> +	stvx	vr0,r3,r11
> +	addi	r3,r3,64
> +
> +9:	bf	cr7*4+2,10f
> +	lvx	vr1,r0,r4
> +	lvx	vr0,r4,r9
> +	addi	r4,r4,32
> +	stvx	vr1,r0,r3
> +	stvx	vr0,r3,r9
> +	addi	r3,r3,32
> +
> +10:	bf	cr7*4+3,11f
> +	lvx	vr1,r0,r4
> +	addi	r4,r4,16
> +	stvx	vr1,r0,r3
> +	addi	r3,r3,16
> +
> +	/* Up to 15B to go */
> +11:	clrldi	r5,r5,(64-4)
> +	mtocrf	0x01,r5
> +	bf	cr7*4+0,12f
> +	ld	r0,0(r4)
> +	addi	r4,r4,8
> +	std	r0,0(r3)
> +	addi	r3,r3,8
> +
> +12:	bf	cr7*4+1,13f
> +	lwz	r0,0(r4)
> +	addi	r4,r4,4
> +	stw	r0,0(r3)
> +	addi	r3,r3,4
> +
> +13:	bf	cr7*4+2,14f
> +	lhz	r0,0(r4)
> +	addi	r4,r4,2
> +	sth	r0,0(r3)
> +	addi	r3,r3,2
> +
> +14:	bf	cr7*4+3,15f
> +	lbz	r0,0(r4)
> +	stb	r0,0(r3)
> +
> +15:	addi	r1,r1,STACKFRAMESIZE
> +	ld	r3,48(r1)
> +	b	.exit_vmx_copy		/* tail call optimise */
> +
> +.Lvmx_unaligned_copy:
> +	/* Get the destination 16B aligned */
> +	neg	r6,r3
> +	mtocrf	0x01,r6
> +	clrldi	r6,r6,(64-4)
> +
> +	bf	cr7*4+3,1f
> +	lbz	r0,0(r4)
> +	addi	r4,r4,1
> +	stb	r0,0(r3)
> +	addi	r3,r3,1
> +
> +1:	bf	cr7*4+2,2f
> +	lhz	r0,0(r4)
> +	addi	r4,r4,2
> +	sth	r0,0(r3)
> +	addi	r3,r3,2
> +
> +2:	bf	cr7*4+1,3f
> +	lwz	r0,0(r4)
> +	addi	r4,r4,4
> +	stw	r0,0(r3)
> +	addi	r3,r3,4
> +
> +3:	bf	cr7*4+0,4f
> +	lwz	r0,0(r4)	/* Less chance of a reject with word ops =
*/
> +	lwz	r7,4(r4)
> +	addi	r4,r4,8
> +	stw	r0,0(r3)
> +	stw	r7,4(r3)
> +	addi	r3,r3,8
> +
> +4:	sub	r5,r5,r6
> +
> +	/* Get the desination 128B aligned */
> +	neg	r6,r3
> +	srdi	r7,r6,4
> +	mtocrf	0x01,r7
> +	clrldi	r6,r6,(64-7)
> +
> +	li	r9,16
> +	li	r10,32
> +	li	r11,48
> +
> +	lvsl	vr16,0,r4	/* Setup permute control vector */
> +	lvx	vr0,0,r4
> +	addi	r4,r4,16
> +
> +	bf	cr7*4+3,5f
> +	lvx	vr1,r0,r4
> +	vperm	vr8,vr0,vr1,vr16
> +	addi	r4,r4,16
> +	stvx	vr8,r0,r3
> +	addi	r3,r3,16
> +	vor	vr0,vr1,vr1
> +
> +5:	bf	cr7*4+2,6f
> +	lvx	vr1,r0,r4
> +	vperm	vr8,vr0,vr1,vr16
> +	lvx	vr0,r4,r9
> +	vperm	vr9,vr1,vr0,vr16
> +	addi	r4,r4,32
> +	stvx	vr8,r0,r3
> +	stvx	vr9,r3,r9
> +	addi	r3,r3,32
> +
> +6:	bf	cr7*4+1,7f
> +	lvx	vr3,r0,r4
> +	vperm	vr8,vr0,vr3,vr16
> +	lvx	vr2,r4,r9
> +	vperm	vr9,vr3,vr2,vr16
> +	lvx	vr1,r4,r10
> +	vperm	vr10,vr2,vr1,vr16
> +	lvx	vr0,r4,r11
> +	vperm	vr11,vr1,vr0,vr16
> +	addi	r4,r4,64
> +	stvx	vr8,r0,r3
> +	stvx	vr9,r3,r9
> +	stvx	vr10,r3,r10
> +	stvx	vr11,r3,r11
> +	addi	r3,r3,64
> +
> +7:	sub	r5,r5,r6
> +	srdi	r6,r5,7
> +
> +	std	r14,STK_REG(r14)(r1)
> +	std	r15,STK_REG(r15)(r1)
> +	std	r16,STK_REG(r16)(r1)
> +
> +	li	r12,64
> +	li	r14,80
> +	li	r15,96
> +	li	r16,112
> +
> +	mtctr	r6
> +
> +	/*
> +	 * Now do cacheline sized loads and stores. By this stage the
> +	 * cacheline stores are also cacheline aligned.
> +	 */
> +	.align	5
> +8:
> +	lvx	vr7,r0,r4
> +	vperm	vr8,vr0,vr7,vr16
> +	lvx	vr6,r4,r9
> +	vperm	vr9,vr7,vr6,vr16
> +	lvx	vr5,r4,r10
> +	vperm	vr10,vr6,vr5,vr16
> +	lvx	vr4,r4,r11
> +	vperm	vr11,vr5,vr4,vr16
> +	lvx	vr3,r4,r12
> +	vperm	vr12,vr4,vr3,vr16
> +	lvx	vr2,r4,r14
> +	vperm	vr13,vr3,vr2,vr16
> +	lvx	vr1,r4,r15
> +	vperm	vr14,vr2,vr1,vr16
> +	lvx	vr0,r4,r16
> +	vperm	vr15,vr1,vr0,vr16
> +	addi	r4,r4,128
> +	stvx	vr8,r0,r3
> +	stvx	vr9,r3,r9
> +	stvx	vr10,r3,r10
> +	stvx	vr11,r3,r11
> +	stvx	vr12,r3,r12
> +	stvx	vr13,r3,r14
> +	stvx	vr14,r3,r15
> +	stvx	vr15,r3,r16
> +	addi	r3,r3,128
> +	bdnz	8b
> +
> +	ld	r14,STK_REG(r14)(r1)
> +	ld	r15,STK_REG(r15)(r1)
> +	ld	r16,STK_REG(r16)(r1)
> +
> +	/* Up to 127B to go */
> +	clrldi	r5,r5,(64-7)
> +	srdi	r6,r5,4
> +	mtocrf	0x01,r6
> +
> +	bf	cr7*4+1,9f
> +	lvx	vr3,r0,r4
> +	vperm	vr8,vr0,vr3,vr16
> +	lvx	vr2,r4,r9
> +	vperm	vr9,vr3,vr2,vr16
> +	lvx	vr1,r4,r10
> +	vperm	vr10,vr2,vr1,vr16
> +	lvx	vr0,r4,r11
> +	vperm	vr11,vr1,vr0,vr16
> +	addi	r4,r4,64
> +	stvx	vr8,r0,r3
> +	stvx	vr9,r3,r9
> +	stvx	vr10,r3,r10
> +	stvx	vr11,r3,r11
> +	addi	r3,r3,64
> +
> +9:	bf	cr7*4+2,10f
> +	lvx	vr1,r0,r4
> +	vperm	vr8,vr0,vr1,vr16
> +	lvx	vr0,r4,r9
> +	vperm	vr9,vr1,vr0,vr16
> +	addi	r4,r4,32
> +	stvx	vr8,r0,r3
> +	stvx	vr9,r3,r9
> +	addi	r3,r3,32
> +
> +10:	bf	cr7*4+3,11f
> +	lvx	vr1,r0,r4
> +	vperm	vr8,vr0,vr1,vr16
> +	addi	r4,r4,16
> +	stvx	vr8,r0,r3
> +	addi	r3,r3,16
> +
> +	/* Up to 15B to go */
> +11:	clrldi	r5,r5,(64-4)
> +	addi	r4,r4,-16	/* Unwind the +16 load offset */
> +	mtocrf	0x01,r5
> +	bf	cr7*4+0,12f
> +	lwz	r0,0(r4)	/* Less chance of a reject with word ops =
*/
> +	lwz	r6,4(r4)
> +	addi	r4,r4,8
> +	stw	r0,0(r3)
> +	stw	r6,4(r3)
> +	addi	r3,r3,8
> +
> +12:	bf	cr7*4+1,13f
> +	lwz	r0,0(r4)
> +	addi	r4,r4,4
> +	stw	r0,0(r3)
> +	addi	r3,r3,4
> +
> +13:	bf	cr7*4+2,14f
> +	lhz	r0,0(r4)
> +	addi	r4,r4,2
> +	sth	r0,0(r3)
> +	addi	r3,r3,2
> +
> +14:	bf	cr7*4+3,15f
> +	lbz	r0,0(r4)
> +	stb	r0,0(r3)
> +
> +15:	addi	r1,r1,STACKFRAMESIZE
> +	ld	r3,48(r1)
> +	b	.exit_vmx_copy		/* tail call optimise */
> +#endif /* CONFiG_ALTIVEC */
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox