LinuxPPC-Dev Archive on lore.kernel.org


* Re: [PATCH] arch: powerpc: kvm: add signed type cast for comparation
From: Alexander Graf @ 2013-08-28 14:24 UTC
  To: Chen Gang
  Cc: Gleb Natapov, kvm, kvm-ppc, Paul Mackerras, pbonzini,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <51FF3D0B.6050609@asianux.com>


On 05.08.2013, at 07:50, Chen Gang wrote:

> On 08/05/2013 12:34 PM, Paul Mackerras wrote:
>> On Mon, Jul 22, 2013 at 02:32:35PM +0800, Chen Gang wrote:
>>>> 'rmls' is 'unsigned long', and lpcr_rmls() returns a negative number on
>>>> failure, so the comparison needs a signed type cast.
>>>> 
>>>> 'lpid' is 'unsigned long', and kvmppc_alloc_lpid() returns a negative
>>>> number on failure, so the comparison needs a signed type cast.
>>>> 
>>>> 
>>>> Signed-off-by: Chen Gang <gang.chen@asianux.com>
>> Looks right, thanks.
>> 
>> Acked-by: Paul Mackerras <paulus@samba.org>
>> 
>> 
> 
> Thank you very much.

Thanks, applied to kvm-ppc-queue.


Alex
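
For readers of the archive, a minimal userspace sketch of the problem the
patch addresses: a negative error code stored in an 'unsigned long' can no
longer be detected with a plain "< 0" test. The helper names below are
hypothetical stand-ins, not lpcr_rmls() or kvmppc_alloc_lpid(), and the
error value is only an example.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical allocator: returns an ID, or a negative errno-style value. */
static long alloc_id_or_error(int should_fail)
{
	return should_fail ? -12L /* e.g. an -ENOMEM style code */ : 7L;
}

/*
 * The buggy pattern keeps the result in an unsigned long and tests
 * "id < 0", which can never be true for an unsigned type. Casting back
 * to a signed type restores the intended comparison. (Converting an
 * out-of-range value to long is implementation-defined in ISO C; GCC
 * and Clang wrap, which is what this sketch relies on.)
 */
static int id_is_error(unsigned long id)
{
	return (long)id < 0;
}

/* Shows what the unsigned variable really holds after a failure. */
static int looks_like_huge_positive(unsigned long id)
{
	return id > (unsigned long)LONG_MAX;
}
```

Without the cast, the failure value is simply a very large positive number,
so the error path is never taken.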


* [PATCH] powerpc/mpc8xx: Clearer Oops message for Software Emulation Exception
From: Christophe Leroy @ 2013-08-28 14:19 UTC
  To: Benjamin Herrenschmidt, Paul Mackerras; +Cc: linuxppc-dev, linux-kernel

This patch modifies the Oops message in case of Software Emulation Exception.
The existing message is quite confusing because it refers to FPU Emulation
while most often the issue is due to either an unsupported instruction
(not necessarily FPU related) or a stale instruction due to HW issues.
The new message tries to be more generic in order to make the user understand
that the Oops is due to something wrong with an instruction, not necessarily
due to an FPU instruction.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>

diff -ur linux-3.11-rc6/arch/powerpc/kernel/traps.c linux/arch/powerpc/kernel/traps.c
--- linux-3.11-rc6/arch/powerpc/kernel/traps.c	2013-08-25 15:20:33.000000000 +0200
+++ linux/arch/powerpc/kernel/traps.c	2013-08-25 15:31:29.000000000 +0200
@@ -1476,7 +1476,8 @@

 	if (!user_mode(regs)) {
 		debugger(regs);
-		die("Kernel Mode Software FPU Emulation", regs, SIGFPE);
+		die("Kernel Mode Unimplemented Instruction or SW FPU Emulation",
+			regs, SIGFPE);
 	}

 #ifdef CONFIG_MATH_EMULATION


* Re: [PATCH v4 00/31] add COMMON_CLK support for PowerPC MPC512x
From: Gerhard Sittig @ 2013-08-28 13:50 UTC
  To: linuxppc-dev, Anatolij Gustschin, Mike Turquette,
	linux-arm-kernel, devicetree
  Cc: Detlev Zundel, Wolfram Sang, Greg Kroah-Hartman, Rob Herring,
	Mark Brown, Marc Kleine-Budde, David Woodhouse,
	Wolfgang Grandegger, Mauro Carvalho Chehab
In-Reply-To: <1375821851-31609-1-git-send-email-gsi@denx.de>

[ summary for the busy or the impatient:
  this is a status update on the series
  - peripheral driver cleanup considered appropriate for v3.12
  - common clock support introduction isn't ready yet
  - which in turn holds subsequent parts
  - while the overall shape of the series is looking good ]

On Tue, Aug 06, 2013 at 22:43 +0200, Gerhard Sittig wrote:
> 
> this series
> - fixes several drivers that are used in the MPC512x platform (UART,
>   SPI, ethernet, PCI, USB, CAN, NAND flash, video capture) in how they
>   handle clocks (appropriately acquire and setup them, hold references
>   during use, release clocks after use)
> - introduces support for the common clock framework (CCF, COMMON_CLK
>   Kconfig option) in the PowerPC based MPC512x platform, which brings
>   device tree based clock lookup as well
> 
> although the series does touch several subsystems -- tty (serial), spi,
> net (can, fs_enet), mtd (nfc), usb, i2c, media (viu), and dts -- all of
> the patches are strictly clock related or trivial
> 
> it appears most appropriate to take this series through either the clk
> or the powerpc trees after it has passed review and other subsystem
> maintainers ACKed the clock setup related driver modifications

Since the status of this series was questioned recently, I felt
that I should officially and publicly provide a status update in
the absence of a v5 submission update.

The series has undergone some review and has received changes as
concerns were raised and feedback was provided.  I consider the
nature and frequency of the changes totally appropriate:  each
revision addressed all of the issues raised, and did so in an
appropriate manner, but could not foresee what else would be
raised upon re-submission.  Not sending another version before
_all_ concerns are addressed appropriately is what held back
submission of v5.  See the phase overview below for details.


Adding the cleanup of existing code before the introduction of
new features did widen the scope of the series, yet has heavily
improved the series, and the feedback was gratefully accepted and
thoroughly addressed.

Actually this driver cleanup, which only was introduced after
initial submission upon Mark's request, could be considered the
most desirable part of the series at this very point in time.
And as I write this, the patches of the "peripheral driver
cleanup" phase are being picked up for v3.12 after they have
become stable in the review iterations.


Further extension of test coverage for the series after
submission of v4 has led to minimal fixes in CAN, USB, and PCI,
and has revealed one problem in multi platform configurations
which currently is the only remaining blocker for phase 2 and
subsequent steps.  Phase 1 with its obvious cleanup, in contrast,
is stable, has become desirable and acceptable, and currently is
being picked up.


The current status of the v4 series in detail is:

Phase 1, patches 01-14/31, peripheral driver cleanup and DTS
improvement:  has addressed all concerns raised, and can be
applied via any subtree in any order since the parts are
independent from each other, with a few minor additions

- USB 03/31 received another adjustment of the clock lookup 'dev'
  parameter, the applied version works in all three cases of the
  PPC_CLOCK implementation where clock names are global, the CCF
  implementation with clkdev registration (during migration), and
  the CCF implementation with device tree based clock lookup (the
  end result of the series); the v4 patch wasn't broken but just
  in need of an addendum before/within phase 3, which now was
  folded into phase 1

- PCI 09/31 had a compile error on 85xx/86xx due to a
  copy'n'paste bug in an error path; since the (fixed) patch
  still remains a NOP for now and within the whole series, I have
  suggested to leave this patch for v3.12, and to address the
  remaining issue of the PCI driver patch being incomplete later,
  see the followup for 09/31 for details (what gets added in a
  future version is another comment in the PCI driver and a
  workaround in the clock provider backend, because in the given
  implementation the peripheral driver cannot appropriately
  acquire its clock item on some platforms)

- CAN 11/31 could save one more instruction by adding another
  jump label in the error path instead of explicit undo of a
  setup step, Marc's suggestion was implemented and has been
  applied
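
The error path style mentioned for CAN 11/31 (one more jump label instead of
an explicit undo step) can be sketched in plain userspace C as follows. This
is only an illustration under assumptions: the mock_clk type and mock_*
functions are hypothetical and are not the kernel clk API, and the two clock
names are invented.

```c
#include <assert.h>

/* Mock clock object; every name here is hypothetical. */
struct mock_clk {
	int prepared;
	int enabled;
	int fail_enable;	/* set to simulate an enable failure */
};

static int mock_clk_prepare(struct mock_clk *c) { c->prepared = 1; return 0; }
static void mock_clk_unprepare(struct mock_clk *c) { c->prepared = 0; }
static void mock_clk_disable(struct mock_clk *c) { c->enabled = 0; }

static int mock_clk_enable(struct mock_clk *c)
{
	if (c->fail_enable)
		return -1;
	c->enabled = 1;
	return 0;
}

/*
 * Bring up two clocks. Each failure jumps to the label that undoes
 * exactly the steps completed so far, in reverse order.
 */
static int mock_probe(struct mock_clk *bus, struct mock_clk *dev)
{
	int ret;

	ret = mock_clk_prepare(bus);
	if (ret)
		goto err_out;
	ret = mock_clk_enable(bus);
	if (ret)
		goto err_unprepare_bus;
	ret = mock_clk_prepare(dev);
	if (ret)
		goto err_disable_bus;
	ret = mock_clk_enable(dev);
	if (ret)
		goto err_unprepare_dev;
	return 0;

err_unprepare_dev:
	mock_clk_unprepare(dev);
err_disable_bus:
	mock_clk_disable(bus);
err_unprepare_bus:
	mock_clk_unprepare(bus);
err_out:
	return ret;
}
```

With stacked labels, a new setup step only needs one new label at the
matching position; no error branch has to repeat the undo calls of the
steps before it.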

So all parts of phase 1 (with the exception of the PCI driver
change which is and remains a NOP) were applied, and followup
patches for fixup were avoided.  Nothing was broken, no breakage
was introduced, it's all about improvements.

Phase 2, patches 15-18/31, introduction of CCF support for
MPC512x:  works correctly for MPC512x and doesn't break other
platforms, but won't work in multi platform configurations with
MPC52xx (PPC_CLOCK and COMMON_CLK will collide in the linker),
shall not be considered for v3.12, multi platform needs to get
sorted out before consideration for v3.13 (and is the only known
issue of the series feature- or policy-wise)

Phase 3, patches 20,21,23-28/31, adoption of peripheral drivers
to the CCF world:  is complete feature-wise and recently has
received even more test coverage than before, remaining fixes got
folded into phase 1, patches of phase 3 depend on CCF support
which gets introduced in phase 2, and the "workaround removal"
aspects of phase 3 will explicitly be moved to phase 4 while the
content remains unaffected (mere split and re-order)

Phase 4, patches 19,22,29-31/31, removal of migration support
after complete adoption:  is complete feature-wise, but partial
removal of workarounds and compatibility from phase 3 shall move
explicitly to phase 4, to more strictly tell those phases apart
and for collision free application via individual subtrees if
application through a single tree cannot be done, so a mere
re-ordering remains to get communicated while nothing changes in
the content (re-ordering the sequence as well as verifying that
the patches in phase 3 are independent from each other has
already been done internally)


To summarize:
- The series is in a good shape, one multi platform issue needs
  to get addressed, everything else either is already there or
  just needs to get communicated.
- Phase 1 with the obvious cleanup is being considered for v3.12,
  and patches have been queued in their respective subtrees.
- Phase 2 will become acceptable when the multi platform
  configuration has been sorted out.  Each platform works in
  itself, just not the combination of 52xx and 512x, and actually
  MPC52xx could be considered the outlier here (is the only
  remaining user of PPC_CLOCK, and does so with a dummy
  implementation in the absence of a real provider).
- Phases 3 and 4 are "complete" but depend on phase 2.  What
  remains is a re-sort of the CCF adjustment and the migration
  support removal aspects.
  
Thanks to those involved in the feedback and application so far!
In my eyes, changes have been few, and necessary, and always an
improvement.  This holds regardless of the potential for further
improvement that remains, which just happens to be way outside
of the scope of the series (power consumption aspects that have
neither been addressed nor prepared before, or maybe CCF support
for other PowerPC based platforms).


virtually yours
Gerhard Sittig
-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr. 5, D-82194 Groebenzell, Germany
Phone: +49-8142-66989-0 Fax: +49-8142-66989-80  Email: office@denx.de


* Re: [PATCH v8 2/3] DMA: Freescale: Add new 8-channel DMA engine device tree nodes
From: Mark Rutland @ 2013-08-28 12:51 UTC
  To: Hongbo Zhang
  Cc: devicetree@vger.kernel.org, ian.campbell@citrix.com, Pawel Moll,
	swarren@wwwdotorg.org, vinod.koul@intel.com,
	linux-kernel@vger.kernel.org, rob.herring@calxeda.com,
	djbw@fb.com, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <521D9E89.7040700@freescale.com>

On Wed, Aug 28, 2013 at 07:54:01AM +0100, Hongbo Zhang wrote:
> On 08/27/2013 07:35 PM, Mark Rutland wrote:
> > On Tue, Aug 27, 2013 at 11:42:02AM +0100, hongbo.zhang@freescale.com wrote:
> >> From: Hongbo Zhang <hongbo.zhang@freescale.com>
> >>
> >> Freescale QorIQ T4 and B4 introduce new 8-channel DMA engines, this patch adds
> >> the device tree nodes for them.
> >>
> >> Signed-off-by: Hongbo Zhang <hongbo.zhang@freescale.com>
> >> ---
> >>   .../devicetree/bindings/powerpc/fsl/dma.txt        |   66 ++++++++++++++++
> >>   arch/powerpc/boot/dts/fsl/b4si-post.dtsi           |    4 +-
> >>   arch/powerpc/boot/dts/fsl/elo3-dma-0.dtsi          |   81 ++++++++++++++++++++
> >>   arch/powerpc/boot/dts/fsl/elo3-dma-1.dtsi          |   81 ++++++++++++++++++++
> >>   arch/powerpc/boot/dts/fsl/t4240si-post.dtsi        |    4 +-
> >>   5 files changed, 232 insertions(+), 4 deletions(-)
> >>   create mode 100644 arch/powerpc/boot/dts/fsl/elo3-dma-0.dtsi
> >>   create mode 100644 arch/powerpc/boot/dts/fsl/elo3-dma-1.dtsi
> >>
> >> diff --git a/Documentation/devicetree/bindings/powerpc/fsl/dma.txt b/Documentation/devicetree/bindings/powerpc/fsl/dma.txt
> >> index ddf17af..10fd031 100644
> >> --- a/Documentation/devicetree/bindings/powerpc/fsl/dma.txt
> >> +++ b/Documentation/devicetree/bindings/powerpc/fsl/dma.txt
> >> @@ -126,6 +126,72 @@ Example:
> >>                  };
> >>          };
> >>
> >> +** Freescale Elo3 DMA Controller
> >> +   This is EloPlus controller with 8 channels, used in Freescale Txxx and Bxxx
> >> +   series chips, such as t1040, t4240, b4860.
> >> +
> >> +Required properties:
> >> +
> >> +- compatible        : must include "fsl,elo3-dma"
> >> +- reg               : <registers specifier for DMA general status reg>
> >> +- ranges            : describes the mapping between the address space of the
> >> +                      DMA channels and the address space of the DMA controller
> >> +
> >> +- DMA channel nodes:
> >> +        - compatible        : must include "fsl,eloplus-dma-channel"
> >> +        - reg               : <registers specifier for channel>
> >> +        - interrupts        : <interrupt specifier for DMA channel IRQ>
> >> +        - interrupt-parent  : optional, if needed for interrupt mapping
> >> +
> >> +Example:
> >> +dma@100300 {
> >> +       #address-cells = <1>;
> >> +       #size-cells = <1>;
> >> +       compatible = "fsl,elo3-dma";
> >> +       reg = <0x100300 0x4 0x100600 0x4>;
> > Is that one reg entry where #size-cells=2 and #address-cells=2?
> >
> > That's what the binding implies (given it only describes a single reg
> > entry).
> >
> > if it's two entries, we should make that explicit (both in the binding
> > and example):
> >
> > 	reg = <0x100300 0x4>,
> > 	      <0x100600 0x4>;
> Yes they are two entries, I will change it this way.

Ok. Could you make sure you document what the two reg entries correspond
to? That's not clear from "<registers specifier for channel>".
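
To illustrate why the flat form is ambiguous, here is a small userspace
sketch that splits a "reg" cell array into (address, size) entries for a
given pair of #address-cells and #size-cells values. The function and type
names are made up for this example, and it only handles one or two cells
per field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct reg_entry {
	uint64_t addr;
	uint64_t size;
};

/* Combine n 32-bit cells (most significant first); n is 1 or 2 here. */
static uint64_t read_cells(const uint32_t *cells, unsigned int n)
{
	uint64_t v = 0;

	while (n--)
		v = (v << 32) | *cells++;
	return v;
}

/*
 * Split a flat "reg" cell array into (address, size) entries.
 * Returns the number of entries, or -1 if the array length does not
 * fit the cell counts or the output buffer is too small.
 */
static int decode_reg(const uint32_t *cells, size_t ncells,
		      unsigned int addr_cells, unsigned int size_cells,
		      struct reg_entry *out, size_t max_out)
{
	size_t stride = addr_cells + size_cells;
	size_t i, n;

	if (stride == 0 || ncells % stride != 0)
		return -1;
	n = ncells / stride;
	if (n > max_out)
		return -1;
	for (i = 0; i < n; i++) {
		out[i].addr = read_cells(cells + i * stride, addr_cells);
		out[i].size = read_cells(cells + i * stride + addr_cells,
					 size_cells);
	}
	return (int)n;
}
```

The same four cells <0x100300 0x4 0x100600 0x4> decode to two entries with
one address cell and one size cell, but to a single entry with two cells
each, which is why writing the entries as separate <...> groups helps the
reader.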

> >> +       ranges = <0x0 0x100100 0x500>;
> > If it is one reg entry then the example ranges property isn't big enough
> > to contain the parent-bus-address.
> They are two reg entries, so the range is big enough.

Ok.
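
As a worked check of the example values, the sketch below translates a
channel's register block through the ranges triple <0x0 0x100100 0x500>,
assuming one cell each for child address, parent address and size. The
struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One "ranges" triple: child base, parent base, size (one cell each). */
struct range {
	uint32_t child_base;
	uint32_t parent_base;
	uint32_t size;
};

/*
 * Translate a child (channel) register block into the parent address
 * space. Returns 0 and stores the result if [addr, addr + len) lies
 * inside the range, -1 otherwise.
 */
static int translate(const struct range *r, uint32_t addr, uint32_t len,
		     uint32_t *parent_addr)
{
	if (addr < r->child_base)
		return -1;
	if (len > r->size || addr - r->child_base > r->size - len)
		return -1;
	*parent_addr = r->parent_base + (addr - r->child_base);
	return 0;
}
```

With these numbers the last channel in the example (reg = <0x480 0x80>) ends
exactly at the 0x500 limit, so all eight channels fit inside the range.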

> >
> >> +       dma-channel@0 {
> >> +               compatible = "fsl,eloplus-dma-channel";
> >> +               reg = <0x0 0x80>;
> >> +               interrupts = <28 2 0 0>;
> >> +       };
> >> +       dma-channel@80 {
> >> +               compatible = "fsl,eloplus-dma-channel";
> >> +               reg = <0x80 0x80>;
> >> +               interrupts = <29 2 0 0>;
> >> +       };
> >> +       dma-channel@100 {
> >> +               compatible = "fsl,eloplus-dma-channel";
> >> +               reg = <0x100 0x80>;
> >> +               interrupts = <30 2 0 0>;
> >> +       };
> >> +       dma-channel@180 {
> >> +               compatible = "fsl,eloplus-dma-channel";
> >> +               reg = <0x180 0x80>;
> >> +               interrupts = <31 2 0 0>;
> >> +       };
> >> +       dma-channel@300 {
> >> +               compatible = "fsl,eloplus-dma-channel";
> >> +               reg = <0x300 0x80>;
> >> +               interrupts = <76 2 0 0>;
> >> +       };
> >> +       dma-channel@380 {
> >> +               compatible = "fsl,eloplus-dma-channel";
> >> +               reg = <0x380 0x80>;
> >> +               interrupts = <77 2 0 0>;
> >> +       };
> >> +       dma-channel@400 {
> >> +               compatible = "fsl,eloplus-dma-channel";
> >> +               reg = <0x400 0x80>;
> >> +               interrupts = <78 2 0 0>;
> >> +       };
> >> +       dma-channel@480 {
> >> +               compatible = "fsl,eloplus-dma-channel";
> >> +               reg = <0x480 0x80>;
> >> +               interrupts = <79 2 0 0>;
> >> +       };
> >> +};
> >> +
> >>   Note on DMA channel compatible properties: The compatible property must say
> >>   "fsl,elo-dma-channel" or "fsl,eloplus-dma-channel" to be used by the Elo DMA
> >>   driver (fsldma).  Any DMA channel used by fsldma cannot be used by another
> >> diff --git a/arch/powerpc/boot/dts/fsl/b4si-post.dtsi b/arch/powerpc/boot/dts/fsl/b4si-post.dtsi
> >> index 7399154..ea53ea1 100644
> >> --- a/arch/powerpc/boot/dts/fsl/b4si-post.dtsi
> >> +++ b/arch/powerpc/boot/dts/fsl/b4si-post.dtsi
> >> @@ -223,13 +223,13 @@
> >>                  reg = <0xe2000 0x1000>;
> >>          };
> >>
> >> -/include/ "qoriq-dma-0.dtsi"
> >> +/include/ "elo3-dma-0.dtsi"
> >>          dma@100300 {
> >>                  fsl,iommu-parent = <&pamu0>;
> >>                  fsl,liodn-reg = <&guts 0x580>; /* DMA1LIODNR */
> >>          };
> >>
> >> -/include/ "qoriq-dma-1.dtsi"
> >> +/include/ "elo3-dma-1.dtsi"
> >>          dma@101300 {
> >>                  fsl,iommu-parent = <&pamu0>;
> >>                  fsl,liodn-reg = <&guts 0x584>; /* DMA2LIODNR */
> >> diff --git a/arch/powerpc/boot/dts/fsl/elo3-dma-0.dtsi b/arch/powerpc/boot/dts/fsl/elo3-dma-0.dtsi
> >> new file mode 100644
> >> index 0000000..69a3277
> >> --- /dev/null
> >> +++ b/arch/powerpc/boot/dts/fsl/elo3-dma-0.dtsi
> >> @@ -0,0 +1,81 @@
> >> +/*
> >> + * QorIQ DMA device tree stub [ controller @ offset 0x100000 ]
> > Copy-pasted?
> >
> > Presumably should be "Elo3 DMA devicetree stub", or similar?
> >
> > Similarly for elo3-dma-1.dtsi.
> Yes copy-pasted, but QorIQ isn't wrong, it is name of Freescale series 
> chips.
> To be more specific, I'd like to use "QorIQ Elo3 DMA devicetree stub"

That sounds good to me.

Cheers,
Mark.


* Re: [PATCH v8 1/3] DMA: Freescale: revise device tree binding document
From: Mark Rutland @ 2013-08-28 12:48 UTC
  To: Hongbo Zhang
  Cc: devicetree@vger.kernel.org, ian.campbell@citrix.com, Pawel Moll,
	swarren@wwwdotorg.org, vinod.koul@intel.com,
	linux-kernel@vger.kernel.org, rob.herring@calxeda.com,
	djbw@fb.com, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <521DB26F.8010501@freescale.com>

On Wed, Aug 28, 2013 at 09:18:55AM +0100, Hongbo Zhang wrote:
> On 08/27/2013 07:25 PM, Mark Rutland wrote:
> > On Tue, Aug 27, 2013 at 11:42:01AM +0100, hongbo.zhang@freescale.com wrote:
> >> From: Hongbo Zhang <hongbo.zhang@freescale.com>
> >>
> >> This patch updates the description of each type of DMA controller and its
> >> channels in preparation for adding another new DMA controller binding. It
> >> also fixes some indentation defects in the text alignment at the same time.
> >>
> >> Signed-off-by: Hongbo Zhang <hongbo.zhang@freescale.com>
> >> ---
> >>   .../devicetree/bindings/powerpc/fsl/dma.txt        |   62 +++++++++-----------
> >>   1 file changed, 27 insertions(+), 35 deletions(-)
> >>
> >> diff --git a/Documentation/devicetree/bindings/powerpc/fsl/dma.txt b/Documentation/devicetree/bindings/powerpc/fsl/dma.txt
> >> index 2a4b4bc..ddf17af 100644
> >> --- a/Documentation/devicetree/bindings/powerpc/fsl/dma.txt
> >> +++ b/Documentation/devicetree/bindings/powerpc/fsl/dma.txt
> >> @@ -1,33 +1,29 @@
> >> -* Freescale 83xx DMA Controller
> >> +* Freescale DMA Controllers
> >>   
> >> -Freescale PowerPC 83xx have on chip general purpose DMA controllers.
> >> +** Freescale Elo DMA Controller
> >> +   This is a little-endian DMA controller, used in Freescale mpc83xx series
> >> +   chips such as mpc8315, mpc8349, mpc8379 etc.
> >>   
> >>   Required properties:
> >>   
> >> -- compatible        : compatible list, contains 2 entries, first is
> >> -		 "fsl,CHIP-dma", where CHIP is the processor
> >> -		 (mpc8349, mpc8360, etc.) and the second is
> >> -		 "fsl,elo-dma"
> >> -- reg               : <registers mapping for DMA general status reg>
> >> -- ranges		: Should be defined as specified in 1) to describe the
> >> -		  DMA controller channels.
> >> +- compatible        : must include "fsl,elo-dma"
> > We should list the other values that may be in the list also, unless
> > they are really of no consequence, in which case their presence in dt is
> > questionable.
> Hmm.  Stephen questioned here too, it seems this is a default rule.
> Although Scott@freescale had explained our thoughts, I'd like to edit 
> this item like this:
> 
> "must include "fsl,eloplus-dma", and a "fsl,CHIP-dma" is optional, where 
> CHIP is the processor name"
> 
> We don't list all the chip names because there are tens of them, and
> listing them is unnecessary because the new driver doesn't even use
> "fsl,CHIP-dma". Adding "fsl,CHIP-dma" here only addresses the question of
> why it appears in the example and in old dts files.
> 
> I removed the examples in brackets "(mpc8349, mpc8360, etc.)" because the
> real example can be seen below.
> I don't say "if "fsl,CHIP-dma" is present, it should be the first one, and
> "fsl,eloplus-dma" should be the second" because that is a common rule.
> I also think the description language should be clear and concise.

Actually, you've convinced me for the form as you originally converted
it (must include "fsl,elo-dma"), given that the other strings aren't
used to give information anywhere and "fsl,CHIP-dma" doesn't fully
define a valid string.
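
What "must include" means in practice can be sketched as a membership test
on the compatible list: the generic string has to appear somewhere in the
list, and any chip-specific string is irrelevant to the match. This is a
userspace illustration only; compatible_includes() is a made-up helper, not
a kernel or libfdt function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if 'wanted' appears anywhere in the compatible list, else 0. */
static int compatible_includes(const char *const *list, size_t n,
			       const char *wanted)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(list[i], wanted) == 0)
			return 1;
	return 0;
}
```

Under this reading, a node listing "fsl,mpc8349-dma" followed by
"fsl,elo-dma" and a node listing only "fsl,elo-dma" both satisfy the
binding.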

> >> +- reg               : <registers specifier for DMA general status reg>
> >> +- ranges            : describes the mapping between the address space of the
> >> +                      DMA channels and the address space of the DMA controller
> >>   - cell-index        : controller index.  0 for controller @ 0x8100
> >> -- interrupts        : <interrupt mapping for DMA IRQ>
> >> +- interrupts        : <interrupt specifier for DMA IRQ>
> >>   - interrupt-parent  : optional, if needed for interrupt mapping
> >>   
> >> -
> >>   - DMA channel nodes:
> >> -        - compatible        : compatible list, contains 2 entries, first is
> >> -			 "fsl,CHIP-dma-channel", where CHIP is the processor
> >> -			 (mpc8349, mpc8350, etc.) and the second is
> >> -			 "fsl,elo-dma-channel". However, see note below.
> >> -        - reg               : <registers mapping for channel>
> >> +        - compatible        : must include "fsl,elo-dma-channel"
> >> +                              However, see note below.
> > Again, I think we should list the other entries that may be in the list.
> > Otherwise it's not clear what the binding defines. Similarly for the
> > other compatible list definitions below...
> >
> >> +        - reg               : <registers specifier for channel>
> >>           - cell-index        : dma channel index starts at 0.
> > I realise you haven't changed it, but it's unclear what the cell-index
> > property is (and somewhat confusingly there seem to be multiple
> > definitions). It might be worth clarifying it while performing the other
> > cleanup.
> I'm not clear on your point about "multiple definitions"; we really do
> have multiple dma channels for one dma controller.
> cell-index is used as the channel index. This is an old method used by
> the old driver, and my patch didn't touch this part.

Sorry, I'd misunderstood the cell-index property. More noise from me.

Given that, this looks fine to me.

Acked-by: Mark Rutland <mark.rutland@arm.com>

> >>   
> >>   Optional properties:
> >> -        - interrupts        : <interrupt mapping for DMA channel IRQ>
> >> -			  (on 83xx this is expected to be identical to
> >> -			   the interrupts property of the parent node)
> >> +        - interrupts        : <interrupt specifier for DMA channel IRQ>
> >> +                              (on 83xx this is expected to be identical to
> >> +                              the interrupts property of the parent node)
> >>           - interrupt-parent  : optional, if needed for interrupt mapping
> >>   
> >>   Example:
> >> @@ -70,30 +66,26 @@ Example:
> >>   		};
> >>   	};
> >>   
> >> -* Freescale 85xx/86xx DMA Controller
> >> -
> >> -Freescale PowerPC 85xx/86xx have on chip general purpose DMA controllers.
> >> +** Freescale EloPlus DMA Controller
> >> +   This is DMA controller with extended addresses and chaining, mainly used in
> >> +   Freescale mpc85xx/86xx, Pxxx and BSC series chips, such as mpc8540, mpc8641
> >> +   p4080, bsc9131 etc.
> >>   
> >>   Required properties:
> >>   
> >> -- compatible        : compatible list, contains 2 entries, first is
> >> -		 "fsl,CHIP-dma", where CHIP is the processor
> >> -		 (mpc8540, mpc8540, etc.) and the second is
> >> -		 "fsl,eloplus-dma"
> >> -- reg               : <registers mapping for DMA general status reg>
> >> +- compatible        : must include "fsl,eloplus-dma"
> >> +- reg               : <registers specifier for DMA general status reg>
> >>   - cell-index        : controller index.  0 for controller @ 0x21000,
> >>                                            1 for controller @ 0xc000
> >> -- ranges		: Should be defined as specified in 1) to describe the
> >> -		  DMA controller channels.
> >> +- ranges            : describes the mapping between the address space of the
> >> +                      DMA channels and the address space of the DMA controller
> >>   
> >>   - DMA channel nodes:
> >> -        - compatible        : compatible list, contains 2 entries, first is
> >> -			 "fsl,CHIP-dma-channel", where CHIP is the processor
> >> -			 (mpc8540, mpc8560, etc.) and the second is
> >> -			 "fsl,eloplus-dma-channel". However, see note below.
> >> +        - compatible        : must include "fsl,eloplus-dma-channel"
> >> +                              However, see note below.
> >>           - cell-index        : dma channel index starts at 0.
> >> -        - reg               : <registers mapping for channel>
> >> -        - interrupts        : <interrupt mapping for DMA channel IRQ>
> >> +        - reg               : <registers specifier for channel>
> >> +        - interrupts        : <interrupt specifier for DMA channel IRQ>
> >>           - interrupt-parent  : optional, if needed for interrupt mapping
> >>   
> >>   Example:
> >> -- 
> >> 1.7.9.5
> > Thanks,
> > Mark.
> >
> 
> 
> 
> 


* Re: [PATCH v4 09/31] powerpc/fsl-pci: improve clock API use
From: Gerhard Sittig @ 2013-08-28 12:08 UTC
  To: linuxppc-dev, Anatolij Gustschin, linux-arm-kernel; +Cc: Paul Mackerras
In-Reply-To: <1375821851-31609-10-git-send-email-gsi@denx.de>

[ re-created the Cc: list, this is about the PCI clock exclusively ]

Of all the "preparation" patches in the series (parts 01-14/31,
forming the "peripheral driver cleanup" phase before the
introduction of CCF support), this patch remains the last to get
picked up.

But I'd suggest to leave this patch for now (for v3.12, it's
rather late).  Either ignore this message and the patch, or see
below for why application isn't required now, and an update of
this patch is needed and will be appropriate for v3.13.

I'm sorry for the confusion, the potentially perceived
instability is a result of both widening the series' scope after
initial submission as well as a recent extension of test coverage
after the scope has been widened.  Thank you for your patience!

On Tue, Aug 06, 2013 at 22:43 +0200, Gerhard Sittig wrote:
> 
> make the Freescale PCI driver get, prepare and enable the PCI clock
> during probe(); the clock gets put upon device close by the devm approach
> 
> clock lookup is non-fatal as not all platforms may provide clock specs
> in their device tree, but failure to enable specified clocks is fatal
> 
> the driver appears to not have a remove() routine, so no reference to
> the clock is kept during use, and the clock isn't released (the devm
> approach will put the clock, but it won't get disabled or unprepared)
> 
> Signed-off-by: Gerhard Sittig <gsi@denx.de>
> ---
>  arch/powerpc/sysdev/fsl_pci.c |   22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
> index 46ac1dd..549ff08 100644
> --- a/arch/powerpc/sysdev/fsl_pci.c
> +++ b/arch/powerpc/sysdev/fsl_pci.c
> @@ -17,6 +17,8 @@
> ...

What this patch 09/31 does is add a non-fatal device tree based
clock lookup in the fsl_pci_probe() routine, to acquire the PCI
clock item appropriately if there is a provider and a DT spec.
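
The policy described here (a missing clock is tolerated, a clock that is
specified but cannot be enabled is an error) can be sketched in userspace C
as follows. All names are hypothetical mocks; this is not the code of
fsl_pci_probe() and not the kernel clk API.

```c
#include <assert.h>
#include <stddef.h>

struct mock_clk {
	int enabled;
	int fail_enable;	/* set to simulate an enable failure */
};

/* Hypothetical lookup: NULL means the platform describes no clock. */
static struct mock_clk *mock_clk_lookup(struct mock_clk *platform_clk)
{
	return platform_clk;
}

static int mock_clk_prepare_enable(struct mock_clk *c)
{
	if (c->fail_enable)
		return -1;
	c->enabled = 1;
	return 0;
}

static int mock_pci_probe(struct mock_clk *platform_clk)
{
	struct mock_clk *clk = mock_clk_lookup(platform_clk);

	if (!clk)
		return 0;	/* no clock spec: the lookup is optional */
	if (mock_clk_prepare_enable(clk))
		return -1;	/* a specified clock that fails is fatal */
	return 0;
}
```

On a platform without a clock provider the probe behaves as before, which is
the sense in which the patch is a NOP there.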

The patch in v4 has a bug with an obvious fix, but no update has
been sent yet, for neither the patch nor the series.  There is
one more known issue in the series (not with functionality but
with policy, specifically in a multi platform configuration),
and I don't want to resend the series while known issues are
pending.  But this is not the problem here.

First of all the patch is a NOP for the foreseeable future.  It
won't do harm, yet its content isn't urgently needed either, be
it to unbreak stuff or to support upcoming features that were
communicated before.

Further analysis has shown that the patch is incomplete.

The 85xx and 86xx platforms will pass through the fsl_pci_probe()
routine.  That these platforms don't have OF clock providers is
not a problem, the patch will remain a NOP then.  Its function
will kick in when these platforms may grow clock providers
(things will transparently keep working, this was the actual
intent of the patch).  Since the series is about 512x CCF
support, the patch will remain a NOP throughout the whole series,
but won't harm either.

The 83xx and 512x platforms in contrast _don't_ pass through the
fsl_pci_probe() routine, instead they call mpc83xx_add_bridge()
from within the .setup_arch() callback in platform initialization
code, which iterates over the compatible OF nodes, and runs at a
point in time where the platform's clock provider has not yet
been setup and thus is not available.  In this situation any
clock lookup will fail, which is not fatal during PCI setup yet
won't acquire the clock item and thus will have the common
infrastructure disable the "unused" clock much later.

There is a workaround for this lack of proper clock acquisition
in the peripheral driver.  The clock provider needs to pre-enable
the PCI clock item upon its initialization, because the
peripheral driver can't when it initializes.  Checking the same
condition in the provider's pre-enable workaround which the
.setup_arch() routine is checking before the add_bridge() calls
(the presence of compatible nodes) results in correct operation
as well as most appropriate resource use (clock enabled when PCI
hardware was attached to, and clock disabled in the absence of
PCI hardware or driver attachment).
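
The pre-enable workaround can be pictured with the following userspace
sketch, under the assumption that the provider can test for compatible PCI
nodes at its own initialization. The mock names are hypothetical and the
"disable unused" pass is reduced to a simple check.

```c
#include <assert.h>

struct mock_clk {
	int enable_count;
};

/*
 * Hypothetical provider init hook: take a reference on the PCI clock
 * only if PCI hardware is described, mirroring the condition that the
 * platform setup code checks before adding bridges.
 */
static void mock_provider_init(struct mock_clk *pci_clk, int pci_nodes_present)
{
	if (pci_nodes_present)
		pci_clk->enable_count++;
}

/* Stand-in for the late pass that gates clocks nobody has enabled. */
static int mock_would_be_gated(const struct mock_clk *clk)
{
	return clk->enable_count == 0;
}
```

The clock stays on where PCI hardware exists and is still gated where it
does not, matching the resource use described above.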

So the update of this patch 09/31 will contain
- the fix for the copy'n'paste bug in the probe() routine
- an appropriate comment in the add_bridge() routine
- no change in its nature, the idea remains unaffected

The backend (clock provider) will contain the pre-enable
workaround for the PCI clock item.

As a result, the 83xx, 85xx, and 86xx platforms won't see any
change (there is a NOP in probe() and a comment in add_bridge(),
neither of which break any operation).  The 512x platform will
have proper PCI operation in the presence of common clock
support.  Should 8xxx platforms grow CCF support later, they will
transparently keep working (85xx, 86xx), or may add the same
simple yet appropriate workaround (83xx).

So the outline is there, the approach is straightforward and
can easily be implemented, and the resulting code will work for
all platforms while there is no potential for breakage.  The PCI
driver will improve, and all is well. :)  There is no need for
action for v3.12, and v3.13 can include the improvement.

virtually yours
Gerhard Sittig
-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr. 5, D-82194 Groebenzell, Germany
Phone: +49-8142-66989-0 Fax: +49-8142-66989-80  Email: office@denx.de


* [PATCH][RFC][v2] pci: fsl: rework PCIe driver compatible with Layerscape
From: Minghuan Lian @ 2013-08-28 10:42 UTC
  To: linuxppc-dev; +Cc: Scott Wood, Minghuan Lian, Zang Roy-R61911

Freescale's Layerscape series processors will use ARM cores.
The LS1's PCIe controller is the same as the T4240's, so it is better
if the PCIe controller driver can support PowerPC and ARM
simultaneously. This patch serves that purpose. It moves
the common functions from arch/powerpc/sysdev/fsl_pci.c to
drivers/pci/host/pci-fsl.c and leaves several platform-dependent
functions which should be implemented in platform files.

Signed-off-by: Minghuan Lian <Minghuan.Lian@freescale.com>
---
Based on upstream master 3.11-rc7.
Tested on MPC8315ERDB, MPC8572DS, P5020DS, P3041DS and T4240QDS
boards.

Change log:
v2:
1. Use 'pci' instead of 'pcie' in the new file name and file contents.
2. Use iowrite32be()/iowrite32() instead of out_be32()/out_le32().
3. Fix the ppc_md.dma_set_mask setting.
4. Synchronize hose->first_busno and pci->first_busno.
5. Fix the PCI IO space settings.
6. Some small changes according to Scott's comments.
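
As an illustration of the split described in the commit message
(common code in the driver, platform-dependent functions in platform
files), here is a minimal sketch of the weak-symbol pattern that the
new pci-fsl.c uses for hooks such as fsl_pci64_dma_offset() and
fsl_pci_exclude_device().  It is not the driver code: every name with
a sketch_ prefix and the struct are hypothetical, and the kernel's
__weak is spelled out as the GCC/Clang attribute so that the fragment
compiles in user space.

```c
#include <stdint.h>

/* Hypothetical stand-in for the driver's per-controller state. */
struct sketch_fsl_pci {
	int first_busno;
};

/*
 * Default hooks in the common driver.  They are weak, so a platform
 * file that defines a function with the same name wins at link time.
 */
__attribute__((weak)) uint64_t sketch_pci64_dma_offset(void)
{
	return 0;	/* no 64-bit DMA offset unless a platform says so */
}

__attribute__((weak)) int sketch_exclude_device(struct sketch_fsl_pci *pci,
						uint8_t bus, uint8_t devfn)
{
	(void)pci;
	(void)bus;
	(void)devfn;
	return 0;	/* 0: do not exclude the device */
}

/* Common code calls the hook without knowing which definition is linked. */
int sketch_device_visible(struct sketch_fsl_pci *pci, uint8_t bus,
			  uint8_t devfn)
{
	return sketch_exclude_device(pci, bus, devfn) == 0;
}
```

With no override linked in, the defaults above are used.  A platform
file that provides a non-weak sketch_exclude_device() in another
translation unit would replace the default without any change to the
common code.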


 arch/powerpc/Kconfig                               |   1 +
 arch/powerpc/sysdev/fsl_pci.c                      | 610 ++++-------------
 arch/powerpc/sysdev/fsl_pci.h                      |  91 ---
 drivers/edac/mpc85xx_edac.c                        |  10 -
 drivers/pci/host/Kconfig                           |   4 +
 drivers/pci/host/Makefile                          |   1 +
 drivers/pci/host/pci-fsl.c                         | 736 +++++++++++++++++++++
 .../sysdev/fsl_pci.h => include/linux/fsl/pci.h    | 107 ++-
 8 files changed, 932 insertions(+), 628 deletions(-)
 create mode 100644 drivers/pci/host/pci-fsl.c
 copy arch/powerpc/sysdev/fsl_pci.h => include/linux/fsl/pci.h (67%)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 9cf59816d..f78484c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -671,6 +671,7 @@ config FSL_SOC
 
 config FSL_PCI
  	bool
+	select PCI_FSL if FSL_SOC_BOOKE || PPC_86xx
 	select PPC_INDIRECT_PCI
 	select PCI_QUIRKS
 
diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index 46ac1dd..b3ff28b 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -1,7 +1,7 @@
 /*
  * MPC83xx/85xx/86xx PCI/PCIE support routing.
  *
- * Copyright 2007-2012 Freescale Semiconductor, Inc.
+ * Copyright 2007-2013 Freescale Semiconductor, Inc.
  * Copyright 2008-2009 MontaVista Software, Inc.
  *
  * Initial author: Xianghua Xiao <x.xiao@freescale.com>
@@ -26,6 +26,7 @@
 #include <linux/memblock.h>
 #include <linux/log2.h>
 #include <linux/slab.h>
+#include <linux/fsl/pci.h>
 
 #include <asm/io.h>
 #include <asm/prom.h>
@@ -54,60 +55,17 @@ static void quirk_fsl_pcie_header(struct pci_dev *dev)
 	return;
 }
 
-static int fsl_indirect_read_config(struct pci_bus *, unsigned int,
-				    int, int, u32 *);
-
-static int fsl_pcie_check_link(struct pci_controller *hose)
-{
-	u32 val = 0;
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, PCI_ANY_ID, quirk_fsl_pcie_header);
 
-	if (hose->indirect_type & PPC_INDIRECT_TYPE_FSL_CFG_REG_LINK) {
-		if (hose->ops->read == fsl_indirect_read_config) {
-			struct pci_bus bus;
-			bus.number = 0;
-			bus.sysdata = hose;
-			bus.ops = hose->ops;
-			indirect_read_config(&bus, 0, PCIE_LTSSM, 4, &val);
-		} else
-			early_read_config_dword(hose, 0, 0, PCIE_LTSSM, &val);
-		if (val < PCIE_LTSSM_L0)
-			return 1;
-	} else {
-		struct ccsr_pci __iomem *pci = hose->private_data;
-		/* for PCIe IP rev 3.0 or greater use CSR0 for link state */
-		val = (in_be32(&pci->pex_csr0) & PEX_CSR0_LTSSM_MASK)
-				>> PEX_CSR0_LTSSM_SHIFT;
-		if (val != PEX_CSR0_LTSSM_L0)
-			return 1;
-	}
+#if defined(CONFIG_FSL_SOC_BOOKE) || defined(CONFIG_PPC_86xx)
 
-	return 0;
-}
+#define MAX_PHYS_ADDR_BITS	40
 
-static int fsl_indirect_read_config(struct pci_bus *bus, unsigned int devfn,
-				    int offset, int len, u32 *val)
+u64 fsl_pci64_dma_offset(void)
 {
-	struct pci_controller *hose = pci_bus_to_host(bus);
-
-	if (fsl_pcie_check_link(hose))
-		hose->indirect_type |= PPC_INDIRECT_TYPE_NO_PCIE_LINK;
-	else
-		hose->indirect_type &= ~PPC_INDIRECT_TYPE_NO_PCIE_LINK;
-
-	return indirect_read_config(bus, devfn, offset, len, val);
+	return 1ull << MAX_PHYS_ADDR_BITS;
 }
 
-#if defined(CONFIG_FSL_SOC_BOOKE) || defined(CONFIG_PPC_86xx)
-
-static struct pci_ops fsl_indirect_pcie_ops =
-{
-	.read = fsl_indirect_read_config,
-	.write = indirect_write_config,
-};
-
-#define MAX_PHYS_ADDR_BITS	40
-static u64 pci64_dma_offset = 1ull << MAX_PHYS_ADDR_BITS;
-
 static int fsl_pci_dma_set_mask(struct device *dev, u64 dma_mask)
 {
 	if (!dev->dma_mask || !dma_supported(dev, dma_mask))
@@ -121,300 +79,43 @@ static int fsl_pci_dma_set_mask(struct device *dev, u64 dma_mask)
 	if ((dev->bus == &pci_bus_type) &&
 	    dma_mask >= DMA_BIT_MASK(MAX_PHYS_ADDR_BITS)) {
 		set_dma_ops(dev, &dma_direct_ops);
-		set_dma_offset(dev, pci64_dma_offset);
+		set_dma_offset(dev, fsl_pci64_dma_offset());
 	}
 
 	*dev->dma_mask = dma_mask;
 	return 0;
 }
 
-static int setup_one_atmu(struct ccsr_pci __iomem *pci,
-	unsigned int index, const struct resource *res,
-	resource_size_t offset)
+struct fsl_pci *fsl_sys_to_pci(void *sys)
 {
-	resource_size_t pci_addr = res->start - offset;
-	resource_size_t phys_addr = res->start;
-	resource_size_t size = resource_size(res);
-	u32 flags = 0x80044000; /* enable & mem R/W */
-	unsigned int i;
+	struct pci_controller *hose = sys;
+	struct fsl_pci *pci = hose->private_data;
 
-	pr_debug("PCI MEM resource start 0x%016llx, size 0x%016llx.\n",
-		(u64)res->start, (u64)size);
+	/* Update the first bus number */
+	if (pci->first_busno != hose->first_busno)
+		pci->first_busno = hose->first_busno;
 
-	if (res->flags & IORESOURCE_PREFETCH)
-		flags |= 0x10000000; /* enable relaxed ordering */
-
-	for (i = 0; size > 0; i++) {
-		unsigned int bits = min(ilog2(size),
-					__ffs(pci_addr | phys_addr));
-
-		if (index + i >= 5)
-			return -1;
-
-		out_be32(&pci->pow[index + i].potar, pci_addr >> 12);
-		out_be32(&pci->pow[index + i].potear, (u64)pci_addr >> 44);
-		out_be32(&pci->pow[index + i].powbar, phys_addr >> 12);
-		out_be32(&pci->pow[index + i].powar, flags | (bits - 1));
-
-		pci_addr += (resource_size_t)1U << bits;
-		phys_addr += (resource_size_t)1U << bits;
-		size -= (resource_size_t)1U << bits;
-	}
-
-	return i;
+	return pci;
 }
 
-/* atmu setup for fsl pci/pcie controller */
-static void setup_pci_atmu(struct pci_controller *hose)
+struct pci_bus *fsl_fake_pci_bus(struct fsl_pci *pci, int busnr)
 {
-	struct ccsr_pci __iomem *pci = hose->private_data;
-	int i, j, n, mem_log, win_idx = 3, start_idx = 1, end_idx = 4;
-	u64 mem, sz, paddr_hi = 0;
-	u64 offset = 0, paddr_lo = ULLONG_MAX;
-	u32 pcicsrbar = 0, pcicsrbar_sz;
-	u32 piwar = PIWAR_EN | PIWAR_PF | PIWAR_TGI_LOCAL |
-			PIWAR_READ_SNOOP | PIWAR_WRITE_SNOOP;
-	const char *name = hose->dn->full_name;
-	const u64 *reg;
-	int len;
-
-	if (early_find_capability(hose, 0, 0, PCI_CAP_ID_EXP)) {
-		if (in_be32(&pci->block_rev1) >= PCIE_IP_REV_2_2) {
-			win_idx = 2;
-			start_idx = 0;
-			end_idx = 3;
-		}
-	}
-
-	/* Disable all windows (except powar0 since it's ignored) */
-	for(i = 1; i < 5; i++)
-		out_be32(&pci->pow[i].powar, 0);
-	for (i = start_idx; i < end_idx; i++)
-		out_be32(&pci->piw[i].piwar, 0);
-
-	/* Setup outbound MEM window */
-	for(i = 0, j = 1; i < 3; i++) {
-		if (!(hose->mem_resources[i].flags & IORESOURCE_MEM))
-			continue;
-
-		paddr_lo = min(paddr_lo, (u64)hose->mem_resources[i].start);
-		paddr_hi = max(paddr_hi, (u64)hose->mem_resources[i].end);
-
-		/* We assume all memory resources have the same offset */
-		offset = hose->mem_offset[i];
-		n = setup_one_atmu(pci, j, &hose->mem_resources[i], offset);
-
-		if (n < 0 || j >= 5) {
-			pr_err("Ran out of outbound PCI ATMUs for resource %d!\n", i);
-			hose->mem_resources[i].flags |= IORESOURCE_DISABLED;
-		} else
-			j += n;
-	}
-
-	/* Setup outbound IO window */
-	if (hose->io_resource.flags & IORESOURCE_IO) {
-		if (j >= 5) {
-			pr_err("Ran out of outbound PCI ATMUs for IO resource\n");
-		} else {
-			pr_debug("PCI IO resource start 0x%016llx, size 0x%016llx, "
-				 "phy base 0x%016llx.\n",
-				 (u64)hose->io_resource.start,
-				 (u64)resource_size(&hose->io_resource),
-				 (u64)hose->io_base_phys);
-			out_be32(&pci->pow[j].potar, (hose->io_resource.start >> 12));
-			out_be32(&pci->pow[j].potear, 0);
-			out_be32(&pci->pow[j].powbar, (hose->io_base_phys >> 12));
-			/* Enable, IO R/W */
-			out_be32(&pci->pow[j].powar, 0x80088000
-				| (ilog2(hose->io_resource.end
-				- hose->io_resource.start + 1) - 1));
-		}
-	}
-
-	/* convert to pci address space */
-	paddr_hi -= offset;
-	paddr_lo -= offset;
-
-	if (paddr_hi == paddr_lo) {
-		pr_err("%s: No outbound window space\n", name);
-		return;
-	}
-
-	if (paddr_lo == 0) {
-		pr_err("%s: No space for inbound window\n", name);
-		return;
-	}
+	static struct pci_bus bus;
+	static struct pci_controller hose;
 
-	/* setup PCSRBAR/PEXCSRBAR */
-	early_write_config_dword(hose, 0, 0, PCI_BASE_ADDRESS_0, 0xffffffff);
-	early_read_config_dword(hose, 0, 0, PCI_BASE_ADDRESS_0, &pcicsrbar_sz);
-	pcicsrbar_sz = ~pcicsrbar_sz + 1;
+	bus.number = busnr;
+	bus.sysdata = &hose;
+	hose.private_data = pci;
+	bus.ops = pci->ops;
 
-	if (paddr_hi < (0x100000000ull - pcicsrbar_sz) ||
-		(paddr_lo > 0x100000000ull))
-		pcicsrbar = 0x100000000ull - pcicsrbar_sz;
-	else
-		pcicsrbar = (paddr_lo - pcicsrbar_sz) & -pcicsrbar_sz;
-	early_write_config_dword(hose, 0, 0, PCI_BASE_ADDRESS_0, pcicsrbar);
-
-	paddr_lo = min(paddr_lo, (u64)pcicsrbar);
-
-	pr_info("%s: PCICSRBAR @ 0x%x\n", name, pcicsrbar);
-
-	/* Setup inbound mem window */
-	mem = memblock_end_of_DRAM();
-
-	/*
-	 * The msi-address-64 property, if it exists, indicates the physical
-	 * address of the MSIIR register.  Normally, this register is located
-	 * inside CCSR, so the ATMU that covers all of CCSR is used. But if
-	 * this property exists, then we normally need to create a new ATMU
-	 * for it.  For now, however, we cheat.  The only entity that creates
-	 * this property is the Freescale hypervisor, and the address is
-	 * specified in the partition configuration.  Typically, the address
-	 * is located in the page immediately after the end of DDR.  If so, we
-	 * can avoid allocating a new ATMU by extending the DDR ATMU by one
-	 * page.
-	 */
-	reg = of_get_property(hose->dn, "msi-address-64", &len);
-	if (reg && (len == sizeof(u64))) {
-		u64 address = be64_to_cpup(reg);
-
-		if ((address >= mem) && (address < (mem + PAGE_SIZE))) {
-			pr_info("%s: extending DDR ATMU to cover MSIIR", name);
-			mem += PAGE_SIZE;
-		} else {
-			/* TODO: Create a new ATMU for MSIIR */
-			pr_warn("%s: msi-address-64 address of %llx is "
-				"unsupported\n", name, address);
-		}
-	}
-
-	sz = min(mem, paddr_lo);
-	mem_log = ilog2(sz);
-
-	/* PCIe can overmap inbound & outbound since RX & TX are separated */
-	if (early_find_capability(hose, 0, 0, PCI_CAP_ID_EXP)) {
-		/* Size window to exact size if power-of-two or one size up */
-		if ((1ull << mem_log) != mem) {
-			if ((1ull << mem_log) > mem)
-				pr_info("%s: Setting PCI inbound window "
-					"greater than memory size\n", name);
-			mem_log++;
-		}
-
-		piwar |= ((mem_log - 1) & PIWAR_SZ_MASK);
-
-		/* Setup inbound memory window */
-		out_be32(&pci->piw[win_idx].pitar,  0x00000000);
-		out_be32(&pci->piw[win_idx].piwbar, 0x00000000);
-		out_be32(&pci->piw[win_idx].piwar,  piwar);
-		win_idx--;
-
-		hose->dma_window_base_cur = 0x00000000;
-		hose->dma_window_size = (resource_size_t)sz;
-
-		/*
-		 * if we have >4G of memory setup second PCI inbound window to
-		 * let devices that are 64-bit address capable to work w/o
-		 * SWIOTLB and access the full range of memory
-		 */
-		if (sz != mem) {
-			mem_log = ilog2(mem);
-
-			/* Size window up if we dont fit in exact power-of-2 */
-			if ((1ull << mem_log) != mem)
-				mem_log++;
-
-			piwar = (piwar & ~PIWAR_SZ_MASK) | (mem_log - 1);
-
-			/* Setup inbound memory window */
-			out_be32(&pci->piw[win_idx].pitar,  0x00000000);
-			out_be32(&pci->piw[win_idx].piwbear,
-					pci64_dma_offset >> 44);
-			out_be32(&pci->piw[win_idx].piwbar,
-					pci64_dma_offset >> 12);
-			out_be32(&pci->piw[win_idx].piwar,  piwar);
-
-			/*
-			 * install our own dma_set_mask handler to fixup dma_ops
-			 * and dma_offset
-			 */
-			ppc_md.dma_set_mask = fsl_pci_dma_set_mask;
-
-			pr_info("%s: Setup 64-bit PCI DMA window\n", name);
-		}
-	} else {
-		u64 paddr = 0;
-
-		/* Setup inbound memory window */
-		out_be32(&pci->piw[win_idx].pitar,  paddr >> 12);
-		out_be32(&pci->piw[win_idx].piwbar, paddr >> 12);
-		out_be32(&pci->piw[win_idx].piwar,  (piwar | (mem_log - 1)));
-		win_idx--;
-
-		paddr += 1ull << mem_log;
-		sz -= 1ull << mem_log;
-
-		if (sz) {
-			mem_log = ilog2(sz);
-			piwar |= (mem_log - 1);
-
-			out_be32(&pci->piw[win_idx].pitar,  paddr >> 12);
-			out_be32(&pci->piw[win_idx].piwbar, paddr >> 12);
-			out_be32(&pci->piw[win_idx].piwar,  piwar);
-			win_idx--;
-
-			paddr += 1ull << mem_log;
-		}
-
-		hose->dma_window_base_cur = 0x00000000;
-		hose->dma_window_size = (resource_size_t)paddr;
-	}
-
-	if (hose->dma_window_size < mem) {
-#ifndef CONFIG_SWIOTLB
-		pr_err("%s: ERROR: Memory size exceeds PCI ATMU ability to "
-			"map - enable CONFIG_SWIOTLB to avoid dma errors.\n",
-			 name);
-#endif
-		/* adjusting outbound windows could reclaim space in mem map */
-		if (paddr_hi < 0xffffffffull)
-			pr_warning("%s: WARNING: Outbound window cfg leaves "
-				"gaps in memory map. Adjusting the memory map "
-				"could reduce unnecessary bounce buffering.\n",
-				name);
-
-		pr_info("%s: DMA window size is 0x%llx\n", name,
-			(u64)hose->dma_window_size);
-	}
-}
-
-static void __init setup_pci_cmd(struct pci_controller *hose)
-{
-	u16 cmd;
-	int cap_x;
-
-	early_read_config_word(hose, 0, 0, PCI_COMMAND, &cmd);
-	cmd |= PCI_COMMAND_SERR | PCI_COMMAND_MASTER | PCI_COMMAND_MEMORY
-		| PCI_COMMAND_IO;
-	early_write_config_word(hose, 0, 0, PCI_COMMAND, cmd);
-
-	cap_x = early_find_capability(hose, 0, 0, PCI_CAP_ID_PCIX);
-	if (cap_x) {
-		int pci_x_cmd = cap_x + PCI_X_CMD;
-		cmd = PCI_X_CMD_MAX_SPLIT | PCI_X_CMD_MAX_READ
-			| PCI_X_CMD_ERO | PCI_X_CMD_DPERR_E;
-		early_write_config_word(hose, 0, 0, pci_x_cmd, cmd);
-	} else {
-		early_write_config_byte(hose, 0, 0, PCI_LATENCY_TIMER, 0x80);
-	}
+	return &bus;
 }
 
 void fsl_pcibios_fixup_bus(struct pci_bus *bus)
 {
 	struct pci_controller *hose = pci_bus_to_host(bus);
-	int i, is_pcie = 0, no_link;
+	int i, is_pcie, no_link;
+	struct fsl_pci *pci = fsl_sys_to_pci(hose);
 
 	/* The root complex bridge comes up with bogus resources,
 	 * we copy the PHB ones in.
@@ -424,9 +125,8 @@ void fsl_pcibios_fixup_bus(struct pci_bus *bus)
 	 * tricky.
 	 */
 
-	if (fsl_pcie_bus_fixup)
-		is_pcie = early_find_capability(hose, 0, 0, PCI_CAP_ID_EXP);
-	no_link = !!(hose->indirect_type & PPC_INDIRECT_TYPE_NO_PCIE_LINK);
+	is_pcie = pci->is_pcie;
+	no_link = fsl_pci_check_link(pci);
 
 	if (bus->parent == hose->bus && (is_pcie || no_link)) {
 		for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; ++i) {
@@ -448,115 +148,95 @@ void fsl_pcibios_fixup_bus(struct pci_bus *bus)
 	}
 }
 
-int __init fsl_add_bridge(struct platform_device *pdev, int is_primary)
+int fsl_pci_exclude_device(struct fsl_pci *pci, u8 bus, u8 devfn)
 {
-	int len;
-	struct pci_controller *hose;
-	struct resource rsrc;
-	const int *bus_range;
-	u8 hdr_type, progif;
-	struct device_node *dev;
-	struct ccsr_pci __iomem *pci;
-
-	dev = pdev->dev.of_node;
+	struct pci_controller *hose = pci->sys;
 
-	if (!of_device_is_available(dev)) {
-		pr_warning("%s: disabled\n", dev->full_name);
-		return -ENODEV;
-	}
+	if (!hose)
+		return PCIBIOS_SUCCESSFUL;
 
-	pr_debug("Adding PCI host bridge %s\n", dev->full_name);
+	if (ppc_md.pci_exclude_device)
+		if (ppc_md.pci_exclude_device(hose, bus, devfn))
+			return PCIBIOS_DEVICE_NOT_FOUND;
 
-	/* Fetch host bridge registers address */
-	if (of_address_to_resource(dev, 0, &rsrc)) {
-		printk(KERN_WARNING "Can't get pci register base!");
-		return -ENOMEM;
-	}
+	return PCIBIOS_SUCCESSFUL;
+}
 
-	/* Get bus range if any */
-	bus_range = of_get_property(dev, "bus-range", &len);
-	if (bus_range == NULL || len < 2 * sizeof(int))
-		printk(KERN_WARNING "Can't get bus-range for %s, assume"
-			" bus 0\n", dev->full_name);
+int fsl_pci_sys_register(struct fsl_pci *pci)
+{
+	struct pci_controller *hose;
 
 	pci_add_flags(PCI_REASSIGN_ALL_BUS);
-	hose = pcibios_alloc_controller(dev);
+	hose = pcibios_alloc_controller(pci->dn);
 	if (!hose)
 		return -ENOMEM;
 
 	/* set platform device as the parent */
-	hose->parent = &pdev->dev;
-	hose->first_busno = bus_range ? bus_range[0] : 0x0;
-	hose->last_busno = bus_range ? bus_range[1] : 0xff;
-
-	pr_debug("PCI memory map start 0x%016llx, size 0x%016llx\n",
-		 (u64)rsrc.start, (u64)resource_size(&rsrc));
+	hose->private_data = pci;
+	hose->parent = pci->dev;
+	hose->first_busno = pci->first_busno;
+	hose->last_busno = pci->last_busno;
+	hose->ops = pci->ops;
+
+#ifdef CONFIG_PPC32
+	/* On 32 bits, limit I/O space to 16MB */
+	if (pci->pci_io_size > 0x01000000)
+		pci->pci_io_size = 0x01000000;
+
+	/* 32 bits needs to map IOs here */
+	hose->io_base_virt = ioremap(pci->io_base_phys + pci->io_resource.start,
+				     pci->pci_io_size);
+
+	/* Expect trouble if pci_addr is not 0 */
+	if (fsl_pci_primary == pci->dn)
+		isa_io_base = (unsigned long)hose->io_base_virt;
+#endif /* CONFIG_PPC32 */
+
+	hose->pci_io_size = pci->io_resource.start + pci->pci_io_size;
+	hose->io_base_phys = pci->io_base_phys;
+	hose->io_resource = pci->io_resource;
+
+	memcpy(hose->mem_offset, pci->mem_offset, sizeof(hose->mem_offset));
+	memcpy(hose->mem_resources, pci->mem_resources,
+		sizeof(hose->mem_resources));
+	hose->dma_window_base_cur = pci->dma_window_base_cur;
+	hose->dma_window_size = pci->dma_window_size;
+
+	pci->sys = hose;
 
-	pci = hose->private_data = ioremap(rsrc.start, resource_size(&rsrc));
-	if (!hose->private_data)
-		goto no_bridge;
-
-	setup_indirect_pci(hose, rsrc.start, rsrc.start + 0x4,
-			   PPC_INDIRECT_TYPE_BIG_ENDIAN);
-
-	if (in_be32(&pci->block_rev1) < PCIE_IP_REV_3_0)
-		hose->indirect_type |= PPC_INDIRECT_TYPE_FSL_CFG_REG_LINK;
-
-	if (early_find_capability(hose, 0, 0, PCI_CAP_ID_EXP)) {
-		/* use fsl_indirect_read_config for PCIe */
-		hose->ops = &fsl_indirect_pcie_ops;
-		/* For PCIE read HEADER_TYPE to identify controler mode */
-		early_read_config_byte(hose, 0, 0, PCI_HEADER_TYPE, &hdr_type);
-		if ((hdr_type & 0x7f) != PCI_HEADER_TYPE_BRIDGE)
-			goto no_bridge;
-
-	} else {
-		/* For PCI read PROG to identify controller mode */
-		early_read_config_byte(hose, 0, 0, PCI_CLASS_PROG, &progif);
-		if ((progif & 1) == 1)
-			goto no_bridge;
-	}
-
-	setup_pci_cmd(hose);
-
-	/* check PCI express link status */
-	if (early_find_capability(hose, 0, 0, PCI_CAP_ID_EXP)) {
-		hose->indirect_type |= PPC_INDIRECT_TYPE_EXT_REG |
-			PPC_INDIRECT_TYPE_SURPRESS_PRIMARY_BUS;
-		if (fsl_pcie_check_link(hose))
-			hose->indirect_type |= PPC_INDIRECT_TYPE_NO_PCIE_LINK;
-	}
-
-	printk(KERN_INFO "Found FSL PCI host bridge at 0x%016llx. "
-		"Firmware bus number: %d->%d\n",
-		(unsigned long long)rsrc.start, hose->first_busno,
-		hose->last_busno);
+	/*
+	 * Install our own dma_set_mask handler to fixup dma_ops
+	 * and dma_offset when memory is more than dma window size
+	 */
+	if (pci->is_pcie && memblock_end_of_DRAM() > hose->dma_window_size)
+		ppc_md.dma_set_mask = fsl_pci_dma_set_mask;
 
-	pr_debug(" ->Hose at 0x%p, cfg_addr=0x%p,cfg_data=0x%p\n",
-		hose, hose->cfg_addr, hose->cfg_data);
+#ifdef CONFIG_SWIOTLB
+	/*
+	 * if we couldn't map all of DRAM via the dma windows
+	 * we need SWIOTLB to handle buffers located outside of
+	 * dma capable memory region
+	 */
+	if (memblock_end_of_DRAM() - 1 > hose->dma_window_base_cur +
+			hose->dma_window_size)
+		ppc_swiotlb_enable = 1;
+#endif
 
-	/* Interpret the "ranges" property */
-	/* This also maps the I/O region and sets isa_io/mem_base */
-	pci_process_bridge_OF_ranges(hose, dev, is_primary);
+	mpc85xx_pci_err_probe(to_platform_device(pci->dev));
+	return 0;
+}
 
-	/* Setup PEX window registers */
-	setup_pci_atmu(hose);
+void fsl_pci_sys_remove(struct fsl_pci *pci)
+{
+	struct pci_controller *hose = pci->sys;
 
-	return 0;
+	if (!hose)
+		return;
 
-no_bridge:
-	iounmap(hose->private_data);
-	/* unmap cfg_data & cfg_addr separately if not on same page */
-	if (((unsigned long)hose->cfg_data & PAGE_MASK) !=
-	    ((unsigned long)hose->cfg_addr & PAGE_MASK))
-		iounmap(hose->cfg_data);
-	iounmap(hose->cfg_addr);
 	pcibios_free_controller(hose);
-	return -ENODEV;
 }
-#endif /* CONFIG_FSL_SOC_BOOKE || CONFIG_PPC_86xx */
 
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, PCI_ANY_ID, quirk_fsl_pcie_header);
+#endif
 
 #if defined(CONFIG_PPC_83xx) || defined(CONFIG_PPC_MPC512x)
 struct mpc83xx_pcie_priv {
@@ -693,6 +373,19 @@ static struct pci_ops mpc83xx_pcie_ops = {
 	.write = mpc83xx_pcie_write_config,
 };
 
+static int mpc83xx_pcie_check_link(struct pci_controller *hose)
+{
+	u32 val = 0;
+
+#define PCIE_LTSSM	0x0404		/* PCIE Link Training and Status */
+#define PCIE_LTSSM_L0	0x16		/* L0 state */
+
+	early_read_config_dword(hose, 0, 0, PCIE_LTSSM, &val);
+	if (val < PCIE_LTSSM_L0)
+		return 1;
+	return 0;
+}
+
 static int __init mpc83xx_pcie_setup(struct pci_controller *hose,
 				     struct resource *reg)
 {
@@ -727,7 +420,7 @@ static int __init mpc83xx_pcie_setup(struct pci_controller *hose,
 	out_le32(pcie->cfg_type0 + PEX_OUTWIN0_TAH, 0);
 	out_le32(pcie->cfg_type0 + PEX_OUTWIN0_TAL, 0);
 
-	if (fsl_pcie_check_link(hose))
+	if (mpc83xx_pcie_check_link(hose))
 		hose->indirect_type |= PPC_INDIRECT_TYPE_NO_PCIE_LINK;
 
 	return 0;
@@ -869,7 +562,7 @@ u64 fsl_pci_immrbar_base(struct pci_controller *hose)
 }
 
 #if defined(CONFIG_FSL_SOC_BOOKE) || defined(CONFIG_PPC_86xx)
-static const struct of_device_id pci_ids[] = {
+const struct of_device_id fsl_pci_ids[] = {
 	{ .compatible = "fsl,mpc8540-pci", },
 	{ .compatible = "fsl,mpc8548-pcie", },
 	{ .compatible = "fsl,mpc8610-pci", },
@@ -906,7 +599,7 @@ void fsl_pci_assign_primary(void)
 		of_node_put(np);
 		np = fsl_pci_primary;
 
-		if (of_match_node(pci_ids, np) && of_device_is_available(np))
+		if (of_match_node(fsl_pci_ids, np) && of_device_is_available(np))
 			return;
 	}
 
@@ -915,7 +608,7 @@ void fsl_pci_assign_primary(void)
 	 * designate one as primary.  This can go away once
 	 * various bugs with primary-less systems are fixed.
 	 */
-	for_each_matching_node(np, pci_ids) {
+	for_each_matching_node(np, fsl_pci_ids) {
 		if (of_device_is_available(np)) {
 			fsl_pci_primary = np;
 			of_node_put(np);
@@ -924,81 +617,4 @@ void fsl_pci_assign_primary(void)
 	}
 }
 
-static int fsl_pci_probe(struct platform_device *pdev)
-{
-	int ret;
-	struct device_node *node;
-#ifdef CONFIG_SWIOTLB
-	struct pci_controller *hose;
-#endif
-
-	node = pdev->dev.of_node;
-	ret = fsl_add_bridge(pdev, fsl_pci_primary == node);
-
-#ifdef CONFIG_SWIOTLB
-	if (ret == 0) {
-		hose = pci_find_hose_for_OF_device(pdev->dev.of_node);
-
-		/*
-		 * if we couldn't map all of DRAM via the dma windows
-		 * we need SWIOTLB to handle buffers located outside of
-		 * dma capable memory region
-		 */
-		if (memblock_end_of_DRAM() - 1 > hose->dma_window_base_cur +
-				hose->dma_window_size)
-			ppc_swiotlb_enable = 1;
-	}
-#endif
-
-	mpc85xx_pci_err_probe(pdev);
-
-	return 0;
-}
-
-#ifdef CONFIG_PM
-static int fsl_pci_resume(struct device *dev)
-{
-	struct pci_controller *hose;
-	struct resource pci_rsrc;
-
-	hose = pci_find_hose_for_OF_device(dev->of_node);
-	if (!hose)
-		return -ENODEV;
-
-	if (of_address_to_resource(dev->of_node, 0, &pci_rsrc)) {
-		dev_err(dev, "Get pci register base failed.");
-		return -ENODEV;
-	}
-
-	setup_pci_atmu(hose);
-
-	return 0;
-}
-
-static const struct dev_pm_ops pci_pm_ops = {
-	.resume = fsl_pci_resume,
-};
-
-#define PCI_PM_OPS (&pci_pm_ops)
-
-#else
-
-#define PCI_PM_OPS NULL
-
-#endif
-
-static struct platform_driver fsl_pci_driver = {
-	.driver = {
-		.name = "fsl-pci",
-		.pm = PCI_PM_OPS,
-		.of_match_table = pci_ids,
-	},
-	.probe = fsl_pci_probe,
-};
-
-static int __init fsl_pci_init(void)
-{
-	return platform_driver_register(&fsl_pci_driver);
-}
-arch_initcall(fsl_pci_init);
 #endif
diff --git a/arch/powerpc/sysdev/fsl_pci.h b/arch/powerpc/sysdev/fsl_pci.h
index 72b5625..42f3ab6 100644
--- a/arch/powerpc/sysdev/fsl_pci.h
+++ b/arch/powerpc/sysdev/fsl_pci.h
@@ -14,97 +14,6 @@
 #ifndef __POWERPC_FSL_PCI_H
 #define __POWERPC_FSL_PCI_H
 
-struct platform_device;
-
-#define PCIE_LTSSM	0x0404		/* PCIE Link Training and Status */
-#define PCIE_LTSSM_L0	0x16		/* L0 state */
-#define PCIE_IP_REV_2_2		0x02080202 /* PCIE IP block version Rev2.2 */
-#define PCIE_IP_REV_3_0		0x02080300 /* PCIE IP block version Rev3.0 */
-#define PIWAR_EN		0x80000000	/* Enable */
-#define PIWAR_PF		0x20000000	/* prefetch */
-#define PIWAR_TGI_LOCAL		0x00f00000	/* target - local memory */
-#define PIWAR_READ_SNOOP	0x00050000
-#define PIWAR_WRITE_SNOOP	0x00005000
-#define PIWAR_SZ_MASK          0x0000003f
-
-/* PCI/PCI Express outbound window reg */
-struct pci_outbound_window_regs {
-	__be32	potar;	/* 0x.0 - Outbound translation address register */
-	__be32	potear;	/* 0x.4 - Outbound translation extended address register */
-	__be32	powbar;	/* 0x.8 - Outbound window base address register */
-	u8	res1[4];
-	__be32	powar;	/* 0x.10 - Outbound window attributes register */
-	u8	res2[12];
-};
-
-/* PCI/PCI Express inbound window reg */
-struct pci_inbound_window_regs {
-	__be32	pitar;	/* 0x.0 - Inbound translation address register */
-	u8	res1[4];
-	__be32	piwbar;	/* 0x.8 - Inbound window base address register */
-	__be32	piwbear;	/* 0x.c - Inbound window base extended address register */
-	__be32	piwar;	/* 0x.10 - Inbound window attributes register */
-	u8	res2[12];
-};
-
-/* PCI/PCI Express IO block registers for 85xx/86xx */
-struct ccsr_pci {
-	__be32	config_addr;		/* 0x.000 - PCI/PCIE Configuration Address Register */
-	__be32	config_data;		/* 0x.004 - PCI/PCIE Configuration Data Register */
-	__be32	int_ack;		/* 0x.008 - PCI Interrupt Acknowledge Register */
-	__be32	pex_otb_cpl_tor;	/* 0x.00c - PCIE Outbound completion timeout register */
-	__be32	pex_conf_tor;		/* 0x.010 - PCIE configuration timeout register */
-	__be32	pex_config;		/* 0x.014 - PCIE CONFIG Register */
-	__be32	pex_int_status;		/* 0x.018 - PCIE interrupt status */
-	u8	res2[4];
-	__be32	pex_pme_mes_dr;		/* 0x.020 - PCIE PME and message detect register */
-	__be32	pex_pme_mes_disr;	/* 0x.024 - PCIE PME and message disable register */
-	__be32	pex_pme_mes_ier;	/* 0x.028 - PCIE PME and message interrupt enable register */
-	__be32	pex_pmcr;		/* 0x.02c - PCIE power management command register */
-	u8	res3[3016];
-	__be32	block_rev1;	/* 0x.bf8 - PCIE Block Revision register 1 */
-	__be32	block_rev2;	/* 0x.bfc - PCIE Block Revision register 2 */
-
-/* PCI/PCI Express outbound window 0-4
- * Window 0 is the default window and is the only window enabled upon reset.
- * The default outbound register set is used when a transaction misses
- * in all of the other outbound windows.
- */
-	struct pci_outbound_window_regs pow[5];
-	u8	res14[96];
-	struct pci_inbound_window_regs	pmit;	/* 0xd00 - 0xd9c Inbound MSI */
-	u8	res6[96];
-/* PCI/PCI Express inbound window 3-0
- * inbound window 1 supports only a 32-bit base address and does not
- * define an inbound window base extended address register.
- */
-	struct pci_inbound_window_regs piw[4];
-
-	__be32	pex_err_dr;		/* 0x.e00 - PCI/PCIE error detect register */
-	u8	res21[4];
-	__be32	pex_err_en;		/* 0x.e08 - PCI/PCIE error interrupt enable register */
-	u8	res22[4];
-	__be32	pex_err_disr;		/* 0x.e10 - PCI/PCIE error disable register */
-	u8	res23[12];
-	__be32	pex_err_cap_stat;	/* 0x.e20 - PCI/PCIE error capture status register */
-	u8	res24[4];
-	__be32	pex_err_cap_r0;		/* 0x.e28 - PCIE error capture register 0 */
-	__be32	pex_err_cap_r1;		/* 0x.e2c - PCIE error capture register 0 */
-	__be32	pex_err_cap_r2;		/* 0x.e30 - PCIE error capture register 0 */
-	__be32	pex_err_cap_r3;		/* 0x.e34 - PCIE error capture register 0 */
-	u8	res_e38[200];
-	__be32	pdb_stat;		/* 0x.f00 - PCIE Debug Status */
-	u8	res_f04[16];
-	__be32	pex_csr0;		/* 0x.f14 - PEX Control/Status register 0*/
-#define PEX_CSR0_LTSSM_MASK	0xFC
-#define PEX_CSR0_LTSSM_SHIFT	2
-#define PEX_CSR0_LTSSM_L0	0x11
-	__be32	pex_csr1;		/* 0x.f18 - PEX Control/Status register 1*/
-	u8	res_f1c[228];
-
-};
-
-extern int fsl_add_bridge(struct platform_device *pdev, int is_primary);
 extern void fsl_pcibios_fixup_bus(struct pci_bus *bus);
 extern int mpc83xx_add_bridge(struct device_node *dev);
 u64 fsl_pci_immrbar_base(struct pci_controller *hose);
diff --git a/drivers/edac/mpc85xx_edac.c b/drivers/edac/mpc85xx_edac.c
index 3eb32f6..ae603c1 100644
--- a/drivers/edac/mpc85xx_edac.c
+++ b/drivers/edac/mpc85xx_edac.c
@@ -239,7 +239,6 @@ int mpc85xx_pci_err_probe(struct platform_device *op)
 	pdata = pci->pvt_info;
 	pdata->name = "mpc85xx_pci_err";
 	pdata->irq = NO_IRQ;
-	dev_set_drvdata(&op->dev, pci);
 	pci->dev = &op->dev;
 	pci->mod_name = EDAC_MOD_STR;
 	pci->ctl_name = pdata->name;
@@ -259,15 +258,6 @@ int mpc85xx_pci_err_probe(struct platform_device *op)
 
 	/* we only need the error registers */
 	r.start += 0xe00;
-
-	if (!devm_request_mem_region(&op->dev, r.start, resource_size(&r),
-					pdata->name)) {
-		printk(KERN_ERR "%s: Error while requesting mem region\n",
-		       __func__);
-		res = -EBUSY;
-		goto err;
-	}
-
 	pdata->pci_vbase = devm_ioremap(&op->dev, r.start, resource_size(&r));
 	if (!pdata->pci_vbase) {
 		printk(KERN_ERR "%s: Unable to setup PCI err regs\n", __func__);
diff --git a/drivers/pci/host/Kconfig b/drivers/pci/host/Kconfig
index 1184ff6..454484b 100644
--- a/drivers/pci/host/Kconfig
+++ b/drivers/pci/host/Kconfig
@@ -14,4 +14,8 @@ config PCI_EXYNOS
 	select PCIEPORTBUS
 	select PCIE_DW
 
+config PCI_FSL
+	bool "Freescale PCI/PCIe controller"
+	depends on FSL_SOC_BOOKE || PPC_86xx
+
 endmenu
diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile
index 086d850..b6d4564 100644
--- a/drivers/pci/host/Makefile
+++ b/drivers/pci/host/Makefile
@@ -1,2 +1,3 @@
 obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o
 obj-$(CONFIG_PCIE_DW) += pcie-designware.o
+obj-$(CONFIG_PCI_FSL) += pci-fsl.o
diff --git a/drivers/pci/host/pci-fsl.c b/drivers/pci/host/pci-fsl.c
new file mode 100644
index 0000000..a20d57c
--- /dev/null
+++ b/drivers/pci/host/pci-fsl.c
@@ -0,0 +1,736 @@
+/*
+ * 85xx/86xx/LS PCI/PCIE common driver support
+ *
+ * Copyright 2013 Freescale Semiconductor, Inc.
+ *
+ * Moved from arch/powerpc/sysdev/fsl_pci.c
+ *
+ * This program is free software; you can redistribute  it and/or modify it
+ * under  the terms of  the GNU General  Public License as published by the
+ * Free Software Foundation;  either version 2 of the  License, or (at your
+ * option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/pci.h>
+#include <linux/string.h>
+#include <linux/init.h>
+#include <linux/log2.h>
+#include <linux/of.h>
+#include <linux/of_address.h>
+#include <linux/of_pci.h>
+#include <linux/pci_regs.h>
+#include <linux/platform_device.h>
+#include <linux/resource.h>
+#include <linux/types.h>
+#include <linux/memblock.h>
+#include <linux/fsl/pci.h>
+
+/* Indirect type */
+#define INDIRECT_TYPE_EXT_REG			0x00000002
+#define INDIRECT_TYPE_SURPRESS_PRIMARY_BUS	0x00000004
+#define INDIRECT_TYPE_NO_PCIE_LINK		0x00000008
+#define INDIRECT_TYPE_BIG_ENDIAN		0x00000010
+#define INDIRECT_TYPE_FSL_CFG_REG_LINK		0x00000040
+
+u64 __weak fsl_pci64_dma_offset(void)
+{
+	return 0;
+}
+
+struct fsl_pci * __weak fsl_sys_to_pci(void *sys)
+{
+	return NULL;
+}
+
+struct pci_bus * __weak fsl_fake_pci_bus(struct fsl_pci *pci, int busnr)
+{
+	return NULL;
+}
+
+int __weak fsl_pci_exclude_device(struct fsl_pci *pci, u8 bus, u8 devfn)
+{
+	return PCIBIOS_SUCCESSFUL;
+}
+
+static int fsl_pci_read_config(struct fsl_pci *pci, int bus, int devfn,
+				int offset, int len, u32 *val)
+{
+	u32 bus_no, reg, data;
+
+	if (pci->indirect_type & INDIRECT_TYPE_NO_PCIE_LINK) {
+		if (bus != pci->first_busno)
+			return PCIBIOS_DEVICE_NOT_FOUND;
+		if (devfn != 0)
+			return PCIBIOS_DEVICE_NOT_FOUND;
+	}
+
+	if (fsl_pci_exclude_device(pci, bus, devfn))
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
+	bus_no = (bus == pci->first_busno) ? pci->self_busno : bus;
+
+	if (pci->indirect_type & INDIRECT_TYPE_EXT_REG)
+		reg = ((offset & 0xf00) << 16) | (offset & 0xfc);
+	else
+		reg = offset & 0xfc;
+
+	if (pci->indirect_type & INDIRECT_TYPE_BIG_ENDIAN)
+		iowrite32be(0x80000000 | (bus_no << 16) | (devfn << 8) | reg,
+			    &pci->regs->config_addr);
+	else
+		iowrite32(0x80000000 | (bus_no << 16) | (devfn << 8) | reg,
+			  &pci->regs->config_addr);
+
+	/*
+	 * Note: the caller has already checked that offset is
+	 * suitably aligned and that len is 1, 2 or 4.
+	 */
+	data = ioread32(&pci->regs->config_data);
+	switch (len) {
+	case 1:
+		*val = (data >> (8 * (offset & 3))) & 0xff;
+		break;
+	case 2:
+		*val = (data >> (8 * (offset & 3))) & 0xffff;
+		break;
+	default:
+		*val = data;
+		break;
+	}
+
+	return PCIBIOS_SUCCESSFUL;
+}
+
+static int fsl_pci_write_config(struct fsl_pci *pci, int bus, int devfn,
+				 int offset, int len, u32 val)
+{
+	void __iomem *cfg_data;
+	u32 bus_no, reg;
+
+	if (pci->indirect_type & INDIRECT_TYPE_NO_PCIE_LINK) {
+		if (bus != pci->first_busno)
+			return PCIBIOS_DEVICE_NOT_FOUND;
+		if (devfn != 0)
+			return PCIBIOS_DEVICE_NOT_FOUND;
+	}
+
+	if (fsl_pci_exclude_device(pci, bus, devfn))
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
+	bus_no = (bus == pci->first_busno) ?
+			pci->self_busno : bus;
+
+	if (pci->indirect_type & INDIRECT_TYPE_EXT_REG)
+		reg = ((offset & 0xf00) << 16) | (offset & 0xfc);
+	else
+		reg = offset & 0xfc;
+
+	if (pci->indirect_type & INDIRECT_TYPE_BIG_ENDIAN)
+		iowrite32be(0x80000000 | (bus_no << 16) | (devfn << 8) | reg,
+			    &pci->regs->config_addr);
+	else
+		iowrite32(0x80000000 | (bus_no << 16) | (devfn << 8) | reg,
+			  &pci->regs->config_addr);
+
+	/* suppress setting of PCI_PRIMARY_BUS */
+	if (pci->indirect_type & INDIRECT_TYPE_SURPRESS_PRIMARY_BUS)
+		if ((offset == PCI_PRIMARY_BUS) &&
+		    (bus == pci->first_busno))
+			val &= 0xffffff00;
+
+	/*
+	 * Note: the caller has already checked that offset is
+	 * suitably aligned and that len is 1, 2 or 4.
+	 */
+	cfg_data = ((void *) &(pci->regs->config_data)) + (offset & 3);
+	switch (len) {
+	case 1:
+		iowrite8(val, cfg_data);
+		break;
+	case 2:
+		iowrite16(val, cfg_data);
+		break;
+	default:
+		iowrite32(val, cfg_data);
+		break;
+	}
+	return PCIBIOS_SUCCESSFUL;
+}
+
+int fsl_pci_check_link(struct fsl_pci *pci)
+{
+	u32 val = 0;
+
+	if (pci->indirect_type & INDIRECT_TYPE_FSL_CFG_REG_LINK) {
+		fsl_pci_read_config(pci, 0, 0, PCIE_LTSSM, 4, &val);
+		if (val < PCIE_LTSSM_L0)
+			return 1;
+	} else {
+		/* for PCIe IP rev 3.0 or greater use CSR0 for link state */
+		val = (in_be32(&pci->regs->pex_csr0) & PEX_CSR0_LTSSM_MASK)
+				>> PEX_CSR0_LTSSM_SHIFT;
+		if (val != PEX_CSR0_LTSSM_L0)
+			return 1;
+	}
+
+	return 0;
+}
+
+static int fsl_indirect_read_config(struct pci_bus *bus, unsigned int devfn,
+				    int offset, int len, u32 *val)
+{
+	struct fsl_pci *pci = fsl_sys_to_pci(bus->sysdata);
+
+	if (!pci)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
+	if (fsl_pci_check_link(pci))
+		pci->indirect_type |= INDIRECT_TYPE_NO_PCIE_LINK;
+	else
+		pci->indirect_type &= ~INDIRECT_TYPE_NO_PCIE_LINK;
+
+	return fsl_pci_read_config(pci, bus->number, devfn, offset, len, val);
+}
+
+static int fsl_indirect_write_config(struct pci_bus *bus, unsigned int devfn,
+				     int offset, int len, u32 val)
+{
+	struct fsl_pci *pci = fsl_sys_to_pci(bus->sysdata);
+
+	if (!pci)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
+	return fsl_pci_write_config(pci, bus->number, devfn,
+				     offset, len, val);
+}
+
+static struct pci_ops fsl_indirect_pci_ops = {
+	.read = fsl_indirect_read_config,
+	.write = fsl_indirect_write_config,
+};
+
+#define EARLY_FSL_PCI_OP(rw, size, type)				\
+int early_fsl_##rw##_config_##size(struct fsl_pci *pci, int bus,	\
+				   int devfn, int offset, type value)	\
+{									\
+	return pci_bus_##rw##_config_##size(fsl_fake_pci_bus(pci, bus),\
+					    devfn, offset, value);	\
+}
+
+EARLY_FSL_PCI_OP(read, byte, u8 *)
+EARLY_FSL_PCI_OP(read, word, u16 *)
+EARLY_FSL_PCI_OP(read, dword, u32 *)
+EARLY_FSL_PCI_OP(write, byte, u8)
+EARLY_FSL_PCI_OP(write, word, u16)
+EARLY_FSL_PCI_OP(write, dword, u32)
+
+static int early_fsl_find_capability(struct fsl_pci *pci,
+				     int busnr, int devfn, int cap)
+{
+	struct pci_bus *bus = fsl_fake_pci_bus(pci, busnr);
+
+	if (!bus)
+		return 0;
+
+	return pci_bus_find_capability(bus, devfn, cap);
+}
+
+static int setup_one_atmu(struct ccsr_pci __iomem *pci,
+			  unsigned int index, const struct resource *res,
+			  resource_size_t offset)
+{
+	resource_size_t pci_addr = res->start - offset;
+	resource_size_t phys_addr = res->start;
+	resource_size_t size = resource_size(res);
+	u32 flags = 0x80044000; /* enable & mem R/W */
+	unsigned int i;
+
+	pr_debug("PCI MEM resource start 0x%016llx, size 0x%016llx.\n",
+		(u64)res->start, (u64)size);
+
+	if (res->flags & IORESOURCE_PREFETCH)
+		flags |= 0x10000000; /* enable relaxed ordering */
+
+	for (i = 0; size > 0; i++) {
+		unsigned int bits = min(ilog2(size),
+					__ffs(pci_addr | phys_addr));
+
+		if (index + i >= 5)
+			return -1;
+
+		iowrite32be(pci_addr >> 12, &pci->pow[index + i].potar);
+		iowrite32be((u64)pci_addr >> 44, &pci->pow[index + i].potear);
+		iowrite32be(phys_addr >> 12, &pci->pow[index + i].powbar);
+		iowrite32be(flags | (bits - 1), &pci->pow[index + i].powar);
+
+		pci_addr += (resource_size_t)1U << bits;
+		phys_addr += (resource_size_t)1U << bits;
+		size -= (resource_size_t)1U << bits;
+	}
+
+	return i;
+}
+
+/* atmu setup for fsl pci/pcie controller */
+static void setup_pci_atmu(struct fsl_pci *pci)
+{
+	int i, j, n, mem_log, win_idx = 3, start_idx = 1, end_idx = 4;
+	u64 mem, sz, paddr_hi = 0;
+	u64 offset = 0, paddr_lo = ULLONG_MAX;
+	u32 pcicsrbar = 0, pcicsrbar_sz;
+	u32 piwar = PIWAR_EN | PIWAR_PF | PIWAR_TGI_LOCAL |
+			PIWAR_READ_SNOOP | PIWAR_WRITE_SNOOP;
+	const u64 *reg;
+	int len;
+
+	if (pci->is_pcie) {
+		if (in_be32(&pci->regs->block_rev1) >= PCIE_IP_REV_2_2) {
+			win_idx = 2;
+			start_idx = 0;
+			end_idx = 3;
+		}
+	}
+
+	/* Disable all windows (except powar0 since it's ignored) */
+	for (i = 1; i < 5; i++)
+		iowrite32be(0, &pci->regs->pow[i].powar);
+	for (i = start_idx; i < end_idx; i++)
+		iowrite32be(0, &pci->regs->piw[i].piwar);
+
+	/* Setup outbound MEM window */
+	for (i = 0, j = 1; i < 3; i++) {
+		if (!(pci->mem_resources[i].flags & IORESOURCE_MEM))
+			continue;
+
+		paddr_lo = min_t(u64, paddr_lo, pci->mem_resources[i].start);
+		paddr_hi = max_t(u64, paddr_hi, pci->mem_resources[i].end);
+
+		/* We assume all memory resources have the same offset */
+		offset = pci->mem_offset[i];
+		n = setup_one_atmu(pci->regs, j, &pci->mem_resources[i],
+				   offset);
+
+		if (n < 0 || j >= 5) {
+			dev_err(pci->dev, "Ran out of outbound PCI ATMUs"
+				" for resource %d!\n", i);
+			pci->mem_resources[i].flags |= IORESOURCE_DISABLED;
+		} else
+			j += n;
+	}
+
+	/* Setup outbound IO window */
+	if (pci->io_resource.flags & IORESOURCE_IO) {
+		if (j >= 5)
+			dev_err(pci->dev,
+				"Ran out of outbound PCI ATMUs for IO resource\n");
+		else {
+			dev_dbg(pci->dev,
+				 "PCI IO resource start 0x%016llx, "
+				 "size 0x%016llx, phy base 0x%016llx.\n",
+				 (u64)pci->io_resource.start,
+				 (u64)resource_size(&pci->io_resource),
+				 (u64)pci->io_base_phys);
+			iowrite32be(pci->io_resource.start >> 12,
+				    &pci->regs->pow[j].potar);
+			iowrite32be(0, &pci->regs->pow[j].potear);
+			iowrite32be(pci->io_base_phys >> 12,
+				    &pci->regs->pow[j].powbar);
+			/* Enable, IO R/W */
+			iowrite32be(0x80088000 |
+				  (ilog2(resource_size(&pci->io_resource)) - 1),
+				  &pci->regs->pow[j].powar);
+		}
+	}
+
+	/* convert to pci address space */
+	paddr_hi -= offset;
+	paddr_lo -= offset;
+
+	if (paddr_hi == paddr_lo) {
+		dev_err(pci->dev, "No outbound window space\n");
+		return;
+	}
+
+	if (paddr_lo == 0) {
+		dev_err(pci->dev, "No space for inbound window\n");
+		return;
+	}
+
+	/* setup PCSRBAR/PEXCSRBAR */
+	early_fsl_write_config_dword(pci, 0, 0, PCI_BASE_ADDRESS_0,
+				     0xffffffff);
+	early_fsl_read_config_dword(pci, 0, 0, PCI_BASE_ADDRESS_0,
+				    &pcicsrbar_sz);
+	pcicsrbar_sz = ~pcicsrbar_sz + 1;
+
+	if (paddr_hi < (0x100000000ull - pcicsrbar_sz) ||
+	    (paddr_lo > 0x100000000ull))
+		pcicsrbar = 0x100000000ull - pcicsrbar_sz;
+	else
+		pcicsrbar = (paddr_lo - pcicsrbar_sz) & -pcicsrbar_sz;
+	early_fsl_write_config_dword(pci, 0, 0, PCI_BASE_ADDRESS_0,
+				     pcicsrbar);
+
+	paddr_lo = min_t(u64, paddr_lo, pcicsrbar);
+
+	dev_info(pci->dev, "PCICSRBAR @ 0x%x\n", pcicsrbar);
+
+	/* Setup inbound mem window */
+	mem = memblock_end_of_DRAM();
+
+	/*
+	 * The msi-address-64 property, if it exists, indicates the physical
+	 * address of the MSIIR register.  Normally, this register is located
+	 * inside CCSR, so the ATMU that covers all of CCSR is used. But if
+	 * this property exists, then we normally need to create a new ATMU
+	 * for it.  For now, however, we cheat.  The only entity that creates
+	 * this property is the Freescale hypervisor, and the address is
+	 * specified in the partition configuration.  Typically, the address
+	 * is located in the page immediately after the end of DDR.  If so, we
+	 * can avoid allocating a new ATMU by extending the DDR ATMU by one
+	 * page.
+	 */
+	reg = of_get_property(pci->dn, "msi-address-64", &len);
+	if (reg && (len == sizeof(u64))) {
+		u64 address = be64_to_cpup(reg);
+
+		if ((address >= mem) && (address < (mem + PAGE_SIZE))) {
+			dev_info(pci->dev,
+				 "extending DDR ATMU to cover MSIIR\n");
+			mem += PAGE_SIZE;
+		} else {
+			/* TODO: Create a new ATMU for MSIIR */
+			dev_warn(pci->dev,
+				 "msi-address-64 address of %llx is "
+				 "unsupported\n", address);
+		}
+	}
+
+	sz = min(mem, paddr_lo);
+	mem_log = ilog2(sz);
+
+	/* PCIe can overmap inbound & outbound since RX & TX are separated */
+	if (pci->is_pcie) {
+		/* Size window to exact size if power-of-two or one size up */
+		if ((1ull << mem_log) != mem) {
+			mem_log++;
+			if ((1ull << mem_log) > mem)
+				dev_info(pci->dev, "Setting PCI inbound window "
+					 "greater than memory size\n");
+		}
+
+		piwar |= ((mem_log - 1) & PIWAR_SZ_MASK);
+
+		/* Setup inbound memory window */
+		iowrite32be(0, &pci->regs->piw[win_idx].pitar);
+		iowrite32be(0, &pci->regs->piw[win_idx].piwbar);
+		iowrite32be(piwar, &pci->regs->piw[win_idx].piwar);
+		win_idx--;
+
+		pci->dma_window_base_cur = 0x00000000;
+		pci->dma_window_size = (resource_size_t)sz;
+
+		/*
+		 * If we have >4G of memory, set up a second PCI inbound
+		 * window so that 64-bit-capable devices can work without
+		 * SWIOTLB and access the full range of memory.
+		 */
+		if (sz != mem) {
+			mem_log = ilog2(mem);
+
+			/* Size window up if we don't fit in exact power-of-2 */
+			if ((1ull << mem_log) != mem)
+				mem_log++;
+
+			piwar = (piwar & ~PIWAR_SZ_MASK) |
+				(mem_log - 1);
+
+			/* Setup inbound memory window */
+			iowrite32be(0, &pci->regs->piw[win_idx].pitar);
+			iowrite32be(fsl_pci64_dma_offset() >> 44,
+				    &pci->regs->piw[win_idx].piwbear);
+			iowrite32be(fsl_pci64_dma_offset() >> 12,
+				    &pci->regs->piw[win_idx].piwbar);
+			iowrite32be(piwar,
+				    &pci->regs->piw[win_idx].piwar);
+		}
+	} else {
+		u64 paddr = 0;
+
+		/* Setup inbound memory window */
+		iowrite32be(paddr >> 12, &pci->regs->piw[win_idx].pitar);
+		iowrite32be(paddr >> 12, &pci->regs->piw[win_idx].piwbar);
+		iowrite32be((piwar | (mem_log - 1)),
+			    &pci->regs->piw[win_idx].piwar);
+		win_idx--;
+
+		paddr += 1ull << mem_log;
+		sz -= 1ull << mem_log;
+
+		if (sz) {
+			mem_log = ilog2(sz);
+			piwar |= (mem_log - 1);
+
+			iowrite32be(paddr >> 12,
+				    &pci->regs->piw[win_idx].pitar);
+			iowrite32be(paddr >> 12,
+				    &pci->regs->piw[win_idx].piwbar);
+			iowrite32be(piwar,
+				    &pci->regs->piw[win_idx].piwar);
+			win_idx--;
+
+			paddr += 1ull << mem_log;
+		}
+
+		pci->dma_window_base_cur = 0x00000000;
+		pci->dma_window_size = (resource_size_t)paddr;
+	}
+
+	if (pci->dma_window_size < mem) {
+#ifndef CONFIG_SWIOTLB
+		dev_err(pci->dev, "Memory size exceeds PCI ATMU ability to "
+			"map - enable CONFIG_SWIOTLB to avoid dma errors.\n");
+#endif
+		/* adjusting outbound windows could reclaim space in mem map */
+		if (paddr_hi < 0xffffffffull)
+			dev_warn(pci->dev, "Outbound window cfg leaves "
+				"gaps in memory map. Adjusting the memory map "
+				"could reduce unnecessary bounce buffering.\n");
+
+		dev_info(pci->dev, "DMA window size is 0x%llx\n",
+			 (u64)pci->dma_window_size);
+	}
+}
+
+static void __init setup_pci_cmd(struct fsl_pci *pci)
+{
+	u16 cmd;
+	int cap_x;
+
+	early_fsl_read_config_word(pci, 0, 0, PCI_COMMAND, &cmd);
+	cmd |= PCI_COMMAND_SERR | PCI_COMMAND_MASTER |
+	       PCI_COMMAND_MEMORY | PCI_COMMAND_IO;
+	early_fsl_write_config_word(pci, 0, 0, PCI_COMMAND, cmd);
+
+	cap_x = early_fsl_find_capability(pci, 0, 0, PCI_CAP_ID_PCIX);
+	if (cap_x) {
+		int pci_x_cmd = cap_x + PCI_X_CMD;
+		cmd = PCI_X_CMD_MAX_SPLIT | PCI_X_CMD_MAX_READ |
+		      PCI_X_CMD_ERO | PCI_X_CMD_DPERR_E;
+		early_fsl_write_config_word(pci, 0, 0, pci_x_cmd, cmd);
+	} else
+		early_fsl_write_config_byte(pci, 0, 0, PCI_LATENCY_TIMER,
+					    0x80);
+}
+
+static int __init
+fsl_pci_setup(struct platform_device *pdev, struct fsl_pci *pci)
+{
+	struct resource *rsrc;
+	u8 hdr_type, progif;
+	struct device_node *dn;
+	struct of_pci_range range;
+	struct of_pci_range_parser parser;
+	int mem = 0;
+
+	dn = pdev->dev.of_node;
+	pci->dn = dn;
+	pci->dev = &pdev->dev;
+
+	dev_info(&pdev->dev, "Find controller %s\n", dn->full_name);
+
+	/* Fetch host bridge registers address */
+	rsrc = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	if (!rsrc) {
+		dev_err(&pdev->dev, "Can't get pci register base!\n");
+		return -EINVAL;
+	}
+	dev_info(&pdev->dev, "REG 0x%016llx..0x%016llx\n",
+		 (u64)rsrc->start, (u64)rsrc->end);
+
+	/* Parse pci range resources from device tree */
+	if (of_pci_range_parser_init(&parser, dn)) {
+		dev_err(&pdev->dev, "missing ranges property\n");
+		return -EINVAL;
+	}
+
+	/* Get the I/O and memory ranges from device tree */
+	for_each_of_pci_range(&parser, &range) {
+		unsigned long restype = range.flags & IORESOURCE_TYPE_BITS;
+		if (restype == IORESOURCE_IO) {
+			of_pci_range_to_resource(&range, dn,
+						 &pci->io_resource);
+			pci->io_resource.name = "I/O";
+			pci->io_resource.start = range.pci_addr;
+			pci->io_resource.end = range.pci_addr + range.size - 1;
+			pci->pci_io_size = range.size;
+			pci->io_base_phys = range.cpu_addr - range.pci_addr;
+			dev_info(&pdev->dev,
+				 " IO 0x%016llx..0x%016llx -> 0x%016llx\n",
+				 range.cpu_addr,
+				 range.cpu_addr + range.size - 1,
+				 range.pci_addr);
+		}
+		if (restype == IORESOURCE_MEM) {
+			if (mem >= 3)
+				continue;
+			of_pci_range_to_resource(&range, dn,
+						 &pci->mem_resources[mem]);
+			pci->mem_resources[mem].name = "MEM";
+			pci->mem_offset[mem] = range.cpu_addr - range.pci_addr;
+			dev_info(&pdev->dev,
+				 "MEM 0x%016llx..0x%016llx -> 0x%016llx\n",
+				 (u64)pci->mem_resources[mem].start,
+				 (u64)pci->mem_resources[mem].end,
+				 range.pci_addr);
+		}
+	}
+
+	/* Get bus range */
+	if (of_pci_parse_bus_range(dn, &pci->busn)) {
+		dev_err(&pdev->dev, "failed to parse bus-range property\n");
+		pci->first_busno = 0x0;
+		pci->last_busno = 0xff;
+	} else {
+		pci->first_busno = pci->busn.start;
+		pci->last_busno = pci->busn.end;
+	}
+	dev_info(&pdev->dev, "Firmware bus number %d->%d\n",
+		 pci->first_busno, pci->last_busno);
+
+	pci->regs = devm_ioremap_resource(&pdev->dev, rsrc);
+	if (IS_ERR(pci->regs))
+		return PTR_ERR(pci->regs);
+
+	pci->ops = &fsl_indirect_pci_ops;
+	pci->indirect_type = INDIRECT_TYPE_BIG_ENDIAN;
+
+	if (in_be32(&pci->regs->block_rev1) < PCIE_IP_REV_3_0)
+		pci->indirect_type |= INDIRECT_TYPE_FSL_CFG_REG_LINK;
+
+	pci->is_pcie = early_fsl_find_capability(pci, 0, 0, PCI_CAP_ID_EXP);
+	if (pci->is_pcie) {
+		/* For PCIE read HEADER_TYPE to identify controller mode */
+		early_fsl_read_config_byte(pci, 0, 0, PCI_HEADER_TYPE,
+					   &hdr_type);
+		if ((hdr_type & 0x7f) == PCI_HEADER_TYPE_NORMAL)
+			goto no_bridge;
+	} else {
+		/* For PCI read PROG to identify controller mode */
+		early_fsl_read_config_byte(pci, 0, 0, PCI_CLASS_PROG, &progif);
+		if ((progif & 1) == 1)
+			goto no_bridge;
+	}
+
+	setup_pci_cmd(pci);
+
+	/* check PCI express link status */
+	if (pci->is_pcie) {
+		pci->indirect_type |= INDIRECT_TYPE_EXT_REG |
+				       INDIRECT_TYPE_SURPRESS_PRIMARY_BUS;
+		if (fsl_pci_check_link(pci))
+			pci->indirect_type |= INDIRECT_TYPE_NO_PCIE_LINK;
+	}
+
+	/* Setup PEX window registers */
+	setup_pci_atmu(pci);
+
+	platform_set_drvdata(pdev, pci);
+
+	return 0;
+
+no_bridge:
+	dev_info(&pdev->dev, "Controller works in EP mode\n");
+	return -EPERM;
+}
+
+static int __init fsl_pci_probe(struct platform_device *pdev)
+{
+	int ret;
+	struct fsl_pci *pci;
+
+	if (!of_device_is_available(pdev->dev.of_node)) {
+		dev_warn(&pdev->dev, "disabled\n");
+		return -ENODEV;
+	}
+
+	if (!fsl_pci_sys_register) {
+		dev_err(&pdev->dev,
+			"no fsl_pci_sys_register implementation\n");
+		return -EPERM;
+	}
+
+	pci = devm_kzalloc(&pdev->dev, sizeof(*pci), GFP_KERNEL);
+	if (!pci) {
+		dev_err(&pdev->dev, "no memory for fsl_pci\n");
+		return -ENOMEM;
+	}
+
+	ret = fsl_pci_setup(pdev, pci);
+	if (ret)
+		return ret;
+
+	ret = fsl_pci_sys_register(pci);
+	if (ret) {
+		dev_err(&pdev->dev, "failed to register pcie to soc\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static int __exit fsl_pci_remove(struct platform_device *pdev)
+{
+	struct fsl_pci *pci = platform_get_drvdata(pdev);
+
+	if (!pci)
+		return -ENODEV;
+
+	if (fsl_pci_sys_remove)
+		fsl_pci_sys_remove(pci);
+
+	return 0;
+}
+
+#ifdef CONFIG_PM
+static int fsl_pci_resume(struct device *dev)
+{
+	struct fsl_pci *pci = dev_get_drvdata(dev);
+
+	if (!pci)
+		return -ENODEV;
+
+	setup_pci_atmu(pci);
+
+	return 0;
+}
+
+static const struct dev_pm_ops pci_pm_ops = {
+	.resume = fsl_pci_resume,
+};
+
+#define PCI_PM_OPS (&pci_pm_ops)
+
+#else
+
+#define PCI_PM_OPS NULL
+
+#endif
+
+static struct platform_driver fsl_pci_driver = {
+	.driver = {
+		.name = "fsl-pci",
+		.pm = PCI_PM_OPS,
+		.of_match_table = fsl_pci_ids,
+	},
+	.probe = fsl_pci_probe,
+	.remove = fsl_pci_remove,
+};
+
+static int __init fsl_pci_init(void)
+{
+	return platform_driver_register(&fsl_pci_driver);
+}
+
+arch_initcall(fsl_pci_init);
diff --git a/arch/powerpc/sysdev/fsl_pci.h b/include/linux/fsl/pci.h
similarity index 67%
copy from arch/powerpc/sysdev/fsl_pci.h
copy to include/linux/fsl/pci.h
index 72b5625..ba72a89 100644
--- a/arch/powerpc/sysdev/fsl_pci.h
+++ b/include/linux/fsl/pci.h
@@ -1,7 +1,9 @@
 /*
- * MPC85xx/86xx PCI Express structure define
+ * MPC85xx/86xx/LS PCI Express structure define
  *
- * Copyright 2007,2011 Freescale Semiconductor, Inc
+ * Copyright 2013 Freescale Semiconductor, Inc
+ *
+ * Moved from arch/powerpc/sysdev/fsl_pci.h
  *
  * This program is free software; you can redistribute  it and/or modify it
  * under  the terms of  the GNU General  Public License as published by the
@@ -11,10 +13,8 @@
  */
 
 #ifdef __KERNEL__
-#ifndef __POWERPC_FSL_PCI_H
-#define __POWERPC_FSL_PCI_H
-
-struct platform_device;
+#ifndef __PCI_H
+#define __PCI_H
 
 #define PCIE_LTSSM	0x0404		/* PCIE Link Training and Status */
 #define PCIE_LTSSM_L0	0x16		/* L0 state */
@@ -47,7 +47,7 @@ struct pci_inbound_window_regs {
 	u8	res2[12];
 };
 
-/* PCI/PCI Express IO block registers for 85xx/86xx */
+/* PCI/PCI Express IO block registers for 85xx/86xx/LS */
 struct ccsr_pci {
 	__be32	config_addr;		/* 0x.000 - PCI/PCIE Configuration Address Register */
 	__be32	config_data;		/* 0x.004 - PCI/PCIE Configuration Data Register */
@@ -104,27 +104,74 @@ struct ccsr_pci {
 
 };
 
-extern int fsl_add_bridge(struct platform_device *pdev, int is_primary);
-extern void fsl_pcibios_fixup_bus(struct pci_bus *bus);
-extern int mpc83xx_add_bridge(struct device_node *dev);
-u64 fsl_pci_immrbar_base(struct pci_controller *hose);
-
-extern struct device_node *fsl_pci_primary;
-
-#ifdef CONFIG_PCI
-void fsl_pci_assign_primary(void);
-#else
-static inline void fsl_pci_assign_primary(void) {}
-#endif
-
-#ifdef CONFIG_EDAC_MPC85XX
-int mpc85xx_pci_err_probe(struct platform_device *op);
-#else
-static inline int mpc85xx_pci_err_probe(struct platform_device *op)
-{
-	return -ENOTSUPP;
-}
-#endif
-
-#endif /* __POWERPC_FSL_PCI_H */
+/*
+ * Structure of a PCI controller (host bridge)
+ */
+struct fsl_pci {
+	struct list_head node;
+	int is_pcie;
+	struct device_node *dn;
+	struct device *dev;
+
+	int first_busno;
+	int last_busno;
+	int self_busno;
+	struct resource busn;
+
+	struct pci_ops *ops;
+	struct ccsr_pci __iomem *regs;
+
+	u32 indirect_type;
+
+	struct resource io_resource;
+	resource_size_t io_base_phys;
+	resource_size_t pci_io_size;
+
+	struct resource mem_resources[3];
+	resource_size_t mem_offset[3];
+
+	int global_number;	/* PCI domain number */
+
+	resource_size_t dma_window_base_cur;
+	resource_size_t dma_window_size;
+
+	void *sys;
+};
+
+/* Return link status: 0 -> link up, 1 -> no link */
+int fsl_pci_check_link(struct fsl_pci *pci);
+
+/* Return PCI64 DMA offset */
+extern u64 fsl_pci64_dma_offset(void);
+
+/*
+ * PCI dts node compatible is platform dependent.
+ * So, ids should be defined in platform files.
+ */
+extern const struct of_device_id fsl_pci_ids[];
+
+/*
+ * Convert platform-dependent pci controller structure to fsl_pci
+ * PowerPC uses structure pci_controller and ARM uses structure pci_sys_data
+ * to describe pci controller.
+ */
+extern struct fsl_pci *fsl_sys_to_pci(void *sys);
+
+/*
+ * To fake a PCI bus
+ * it is called by early_fsl_*(), at that time the platform-dependent
+ * pci controller and pci bus have not been created.
+ */
+extern struct pci_bus *fsl_fake_pci_bus(struct fsl_pci *pci, int busnr);
+
+/* To avoid touching specified devices */
+extern int fsl_pci_exclude_device(struct fsl_pci *pci, u8 bus, u8 devfn);
+
+/* Register PCI/PCIe controller to platform */
+extern int __weak fsl_pci_sys_register(struct fsl_pci *pci);
+
+/* Remove PCI/PCIe controller from platform */
+extern void __weak fsl_pci_sys_remove(struct fsl_pci *pci);
+
+#endif /* __PCI_H */
 #endif /* __KERNEL__ */
-- 
1.8.1.2
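
The outbound-window loop in setup_one_atmu() in the patch above splits one
resource into naturally aligned power-of-two windows. The following is a
minimal user-space sketch of that splitting only, not the driver code: the
names split_windows(), ilog2_u64() and lowest_set_bit() are hypothetical
stand-ins (the latter two for the kernel's ilog2() and __ffs()), and treating
an all-zero address as "no alignment constraint" is an assumption of the
sketch, since the driver passes the OR of both addresses to __ffs() directly.

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(x)); hypothetical stand-in for the kernel's ilog2() */
static unsigned int ilog2_u64(uint64_t x)
{
	unsigned int n = 0;

	assert(x != 0);
	while (x >>= 1)
		n++;
	return n;
}

/* Index of the lowest set bit; hypothetical stand-in for __ffs() */
static unsigned int lowest_set_bit(uint64_t x)
{
	unsigned int n = 0;

	assert(x != 0);
	while (!(x & 1)) {
		x >>= 1;
		n++;
	}
	return n;
}

/*
 * Split [phys_addr, phys_addr + size) into aligned power-of-two windows.
 * Each window is the largest power of two that fits in the remaining size
 * and keeps both addresses naturally aligned. Returns the number of windows
 * written, or -1 if more than max windows would be needed.
 */
static int split_windows(uint64_t pci_addr, uint64_t phys_addr, uint64_t size,
			 int max, uint64_t *starts, unsigned int *bits_out)
{
	int i;

	for (i = 0; size > 0; i++) {
		unsigned int bits = ilog2_u64(size);
		uint64_t align = pci_addr | phys_addr;

		/* Assumption: address 0 imposes no alignment limit */
		if (align && lowest_set_bit(align) < bits)
			bits = lowest_set_bit(align);

		if (i >= max)
			return -1;

		starts[i] = phys_addr;
		bits_out[i] = bits;

		pci_addr += 1ULL << bits;
		phys_addr += 1ULL << bits;
		size -= 1ULL << bits;
	}

	return i;
}
```

For example, a 768MB resource at 0x80000000 with a zero PCI offset would be
covered by two windows under these rules: 512MB at 0x80000000 and 256MB at
0xA0000000.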

^ permalink raw reply related

* [PATCH v9 13/13] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
From: Alexey Kardashevskiy @ 2013-08-28  8:51 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Gleb Natapov, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, linux-mm, Paul Mackerras, Paolo Bonzini,
	David Gibson
In-Reply-To: <1377679070-3515-1-git-send-email-aik@ozlabs.ru>

This adds special support for huge pages (16MB) in real mode.

Reference counting cannot easily be done for such pages in real
mode (when the MMU is off), so this adds a hash table of huge pages.
It is populated in virtual mode, and get_page is called just once
per huge page. Real mode handlers check whether the requested page is
in the hash table; if it is, no reference counting is done, otherwise
an exit to virtual mode happens. The hash table is released at KVM
exit.

At the moment the fastest card available for tests uses up to 9 huge
pages so walking through this hash table does not cost much.
However this can change and we may want to optimize this.
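
The lookup described above can be sketched in plain user-space C. This is
only an illustration under stated assumptions, not the code in this patch:
hp_table, hp_desc, hp_add() and hp_lookup() are hypothetical names, the
bucket choice is a simple modulo rather than the kernel's hash_32(), there
is no RCU or locking, and returning the host address plus the offset within
the huge page is an assumed translation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HP_SHIFT   24	/* 16MB huge pages */
#define HP_BUCKETS 64

/* Hypothetical descriptor of one huge page already pinned in virtual mode */
struct hp_desc {
	struct hp_desc *next;
	uint64_t gpa;	/* guest physical address, aligned to size */
	uint64_t hpa;	/* host physical address */
	uint64_t size;	/* huge page size in bytes */
};

struct hp_table {
	struct hp_desc *bucket[HP_BUCKETS];
};

/* All addresses inside one 16MB page map to the same bucket */
static unsigned int hp_bucket(uint64_t gpa)
{
	return (unsigned int)((gpa >> HP_SHIFT) % HP_BUCKETS);
}

/* Virtual-mode side: insert a descriptor once per huge page */
static void hp_add(struct hp_table *t, struct hp_desc *d)
{
	unsigned int b = hp_bucket(d->gpa);

	assert((d->gpa & (d->size - 1)) == 0);
	d->next = t->bucket[b];
	t->bucket[b] = d;
}

/*
 * Real-mode side: returns 1 and stores the host address if gpa falls in a
 * known huge page. A return of 0 means the caller would have to exit to
 * virtual mode, where the page can be pinned and added.
 */
static int hp_lookup(const struct hp_table *t, uint64_t gpa, uint64_t *hpa)
{
	const struct hp_desc *d;

	for (d = t->bucket[hp_bucket(gpa)]; d; d = d->next) {
		if (gpa >= d->gpa && gpa < d->gpa + d->size) {
			*hpa = d->hpa + (gpa - d->gpa);
			return 1;
		}
	}

	return 0;
}
```

With only a handful of huge pages in use, each bucket chain stays very
short, which matches the observation that walking the table is cheap.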

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

---

Changes:
2013/07/12:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
for KVM_BOOK3S_64

2013/06/27:
* list of huge pages replaced with a hashtable for better performance
* spinlock removed from real mode and only protects insertion of new
huge page descriptors into the hashtable

2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n

2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).
---
 arch/powerpc/include/asm/kvm_host.h |  25 ++++++++
 arch/powerpc/kernel/iommu.c         |   6 +-
 arch/powerpc/kvm/book3s_64_vio.c    | 122 ++++++++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_64_vio_hv.c |  32 +++++++++-
 4 files changed, 176 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index c1a039d..b970d26 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -31,6 +31,7 @@
 #include <linux/list.h>
 #include <linux/atomic.h>
 #include <linux/tracepoint.h>
+#include <linux/hashtable.h>
 #include <asm/kvm_asm.h>
 #include <asm/processor.h>
 #include <asm/page.h>
@@ -184,9 +185,33 @@ struct kvmppc_spapr_tce_table {
 	struct iommu_group *grp;		/* used for IOMMU groups */
 	struct kvm_create_spapr_tce_iommu_linkage link;
 	struct vfio_group *vfio_grp;		/* used for IOMMU groups */
+	DECLARE_HASHTABLE(hash_tab, ilog2(64));	/* used for IOMMU groups */
+	spinlock_t hugepages_write_lock;	/* used for IOMMU groups */
 	struct page *pages[0];
 };
 
+/*
+ * The KVM guest can be backed with 16MB pages.
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)	hash_32(gpa >> 24, 32)
+
+struct kvmppc_spapr_iommu_hugepage {
+	struct hlist_node hash_node;
+	unsigned long gpa;	/* Guest physical address */
+	unsigned long hpa;	/* Host physical address */
+	struct page *page;	/* page struct of the very first subpage */
+	unsigned long size;	/* Huge page size (always 16MB at the moment) */
+};
+
 struct kvmppc_linear_info {
 	void		*base_virt;
 	unsigned long	 base_pfn;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ff0cd90..d0593c9 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
 			if (!pg) {
 				ret = -EAGAIN;
 			} else if (PageCompound(pg)) {
-				ret = -EAGAIN;
+				/* Hugepages will be released at KVM exit */
+				ret = 0;
 			} else {
 				if (oldtce & TCE_PCI_WRITE)
 					SetPageDirty(pg);
@@ -1010,6 +1011,9 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
 			struct page *pg = pfn_to_page(oldtce >> PAGE_SHIFT);
 			if (!pg) {
 				ret = -EAGAIN;
+			} else if (PageCompound(pg)) {
+				/* Hugepages will be released at KVM exit */
+				ret = 0;
 			} else {
 				if (oldtce & TCE_PCI_WRITE)
 					SetPageDirty(pg);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 95f9e1a..1851778 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -93,6 +93,102 @@ int kvmppc_vfio_external_user_iommu_id(struct vfio_group *group)
 	return ret;
 }
 
+/*
+ * API to support huge pages in real mode
+ */
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+	spin_lock_init(&tt->hugepages_write_lock);
+	hash_init(tt->hash_tab);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+	int bkt;
+	struct kvmppc_spapr_iommu_hugepage *hp;
+	struct hlist_node *tmp;
+
+	spin_lock(&tt->hugepages_write_lock);
+	hash_for_each_safe(tt->hash_tab, bkt, tmp, hp, hash_node) {
+		pr_debug("Release HP liobn=%llx #%u gpa=%lx hpa=%lx size=%ld\n",
+				tt->liobn, bkt, hp->gpa, hp->hpa, hp->size);
+		hlist_del_rcu(&hp->hash_node);
+
+		put_page(hp->page);
+		kfree(hp);
+	}
+	spin_unlock(&tt->hugepages_write_lock);
+}
+
+/* Returns true if a page with GPA is already in the hash table */
+static bool kvmppc_iommu_hugepage_lookup_gpa(struct kvmppc_spapr_tce_table *tt,
+		unsigned long gpa)
+{
+	struct kvmppc_spapr_iommu_hugepage *hp;
+	const unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
+
+	hash_for_each_possible_rcu(tt->hash_tab, hp, hash_node, key) {
+		if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+			continue;
+
+		return true;
+	}
+
+	return false;
+}
+
+/* Returns true if a page with GPA has been added to the hash table */
+static bool kvmppc_iommu_hugepage_add(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
+		unsigned long hva, unsigned long gpa)
+{
+	struct kvmppc_spapr_iommu_hugepage *hp;
+	const unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
+	pte_t *ptep;
+	unsigned int shift = 0;
+	static const int is_write = 1;
+
+	ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+	WARN_ON(!ptep);
+
+	if (!ptep || (shift <= PAGE_SHIFT))
+		return false;
+
+	hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+	if (!hp)
+		return false;
+
+	hp->gpa = gpa & ~((1 << shift) - 1);
+	hp->hpa = (pte_pfn(*ptep) << PAGE_SHIFT);
+	hp->size = 1 << shift;
+
+	if (get_user_pages_fast(hva & ~(hp->size - 1), 1,
+			is_write, &hp->page) != 1) {
+		kfree(hp);
+		return false;
+	}
+	hash_add_rcu(tt->hash_tab, &hp->hash_node, key);
+
+	return true;
+}
+
+/** Returns true if a page with GPA is in the hash table or
+ *  has just been added.
+ */
+static bool kvmppc_iommu_hugepage_try_add(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
+		unsigned long hva, unsigned long gpa)
+{
+	bool ret;
+
+	spin_lock(&tt->hugepages_write_lock);
+	ret = kvmppc_iommu_hugepage_lookup_gpa(tt, gpa) ||
+			kvmppc_iommu_hugepage_add(vcpu, tt, hva, gpa);
+	spin_unlock(&tt->hugepages_write_lock);
+
+	return ret;
+}
+
 static long kvmppc_stt_npages(unsigned long window_size)
 {
 	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -106,6 +202,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 
 	mutex_lock(&kvm->lock);
 	list_del(&stt->list);
+	kvmppc_iommu_hugepages_cleanup(stt);
 	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
 		__free_page(stt->pages[i]);
 	kfree(stt);
@@ -185,6 +282,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	kvm_get_kvm(kvm);
 
 	mutex_lock(&kvm->lock);
+	kvmppc_iommu_hugepages_init(stt);
 	list_add(&stt->list, &kvm->arch.spapr_tce_tables);
 
 	mutex_unlock(&kvm->lock);
@@ -262,6 +360,7 @@ static long kvmppc_spapr_tce_iommu_link(struct kvm_device *dev,
 
 	/* Add the TCE table descriptor to the descriptor list */
 	mutex_lock(&kvm->lock);
+	kvmppc_iommu_hugepages_init(tt);
 	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
 	mutex_unlock(&kvm->lock);
 
@@ -336,6 +435,7 @@ static void kvmppc_spapr_tce_iommu_destroy(struct kvm_device *dev)
 		mutex_lock(&kvm->lock);
 		list_del(&tt->list);
 
+		kvmppc_iommu_hugepages_cleanup(tt);
 		if (tt->vfio_grp)
 			kvmppc_vfio_group_put_external_user(tt->vfio_grp);
 		iommu_group_put(tt->grp);
@@ -360,6 +460,7 @@ struct kvm_device_ops kvmppc_spapr_tce_iommu_ops = {
  * Also returns host physical address which is to put to TCE table.
  */
 static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
 		unsigned long gpa, struct page **pg, unsigned long *phpa)
 {
 	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
@@ -379,6 +480,17 @@ static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
 		*phpa = __pa((unsigned long) page_address(*pg)) |
 				(hva & ~PAGE_MASK);
 
+	if (PageCompound(*pg)) {
+		/** Check if this GPA is taken care of by the hash table.
+		 *  If this is the case, do not show the caller page struct
+		 *  address as huge pages will be released at KVM exit.
+		 */
+		if (kvmppc_iommu_hugepage_try_add(vcpu, tt, hva, gpa)) {
+			put_page(*pg);
+			*pg = NULL;
+		}
+	}
+
 	return (void *) hva;
 }
 
@@ -416,7 +528,7 @@ long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
 		if (iommu_tce_put_param_check(tbl, ioba, tce))
 			return H_PARAMETER;
 
-		hva = kvmppc_gpa_to_hva_and_get(vcpu, tce, &pg, &hpa);
+		hva = kvmppc_gpa_to_hva_and_get(vcpu, tt, tce, &pg, &hpa);
 		if (hva == ERROR_ADDR)
 			return H_HARDWARE;
 	}
@@ -425,7 +537,7 @@ long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
 		return H_SUCCESS;
 
 	pg = pfn_to_page(hpa >> PAGE_SHIFT);
-	if (pg)
+	if (pg && !PageCompound(pg))
 		put_page(pg);
 
 	return H_HARDWARE;
@@ -467,7 +579,7 @@ static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
 					(i << IOMMU_PAGE_SHIFT), gpa))
 			return H_PARAMETER;
 
-		hva = kvmppc_gpa_to_hva_and_get(vcpu, gpa, &pg,
+		hva = kvmppc_gpa_to_hva_and_get(vcpu, tt, gpa, &pg,
 				&vcpu->arch.tce_tmp_hpas[i]);
 		if (hva == ERROR_ADDR)
 			goto putpages_flush_exit;
@@ -482,7 +594,7 @@ putpages_flush_exit:
 	for (--i; i >= 0; --i) {
 		struct page *pg;
 		pg = pfn_to_page(vcpu->arch.tce_tmp_hpas[i] >> PAGE_SHIFT);
-		if (pg)
+		if (pg && !PageCompound(pg))
 			put_page(pg);
 	}
 
@@ -562,7 +674,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
 
-	tces = kvmppc_gpa_to_hva_and_get(vcpu, tce_list, &pg, NULL);
+	tces = kvmppc_gpa_to_hva_and_get(vcpu, tt, tce_list, &pg, NULL);
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index c647990..9488149 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -133,12 +133,30 @@ void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt,
 EXPORT_SYMBOL_GPL(kvmppc_tce_put);
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
+
+static unsigned long kvmppc_rm_hugepage_gpa_to_hpa(
+		struct kvmppc_spapr_tce_table *tt,
+		unsigned long gpa)
+{
+	struct kvmppc_spapr_iommu_hugepage *hp;
+	const unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
+
+	hash_for_each_possible_rcu_notrace(tt->hash_tab, hp, hash_node, key) {
+		if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+			continue;
+		return hp->hpa + (gpa & (hp->size - 1));
+	}
+
+	return ERROR_ADDR;
+}
+
 /*
  * Converts guest physical address to host physical address.
  * Tries to increase page counter via get_page_unless_zero() and
  * returns ERROR_ADDR if failed.
  */
 static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
 		unsigned long gpa, struct page **pg)
 {
 	struct kvm_memory_slot *memslot;
@@ -147,6 +165,14 @@ static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
 	unsigned long gfn = gpa >> PAGE_SHIFT;
 	unsigned shift = 0;
 
+	/* Check if it is a hugepage */
+	hpa = kvmppc_rm_hugepage_gpa_to_hpa(tt, gpa);
+	if (hpa != ERROR_ADDR) {
+		*pg = NULL; /* Tell the caller not to put page */
+		return hpa;
+	}
+
+	/* System page size case */
 	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
 	if (!memslot)
 		return ERROR_ADDR;
@@ -219,7 +245,7 @@ static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
 	if (iommu_tce_put_param_check(tbl, ioba, tce))
 		return H_PARAMETER;
 
-	hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce, &pg);
+	hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tce, &pg);
 	if (hpa != ERROR_ADDR) {
 		ret = iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT,
 				&hpa, 1, true);
@@ -256,7 +282,7 @@ static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
 
 	/* Translate TCEs and go get_page() */
 	for (i = 0; i < npages; ++i) {
-		hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tces[i], &pg);
+		hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tces[i], &pg);
 		if (hpa == ERROR_ADDR) {
 			vcpu->arch.tce_tmp_num = i;
 			vcpu->arch.tce_rm_fail = TCERM_GETPAGE;
@@ -347,7 +373,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
 
-	tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce_list, &pg);
+	tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tce_list, &pg);
 	if (tces == ERROR_ADDR) {
 		ret = H_TOO_HARD;
 		goto put_unlock_exit;
-- 
1.8.4.rc4

^ permalink raw reply related

* [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling
From: Alexey Kardashevskiy @ 2013-08-28  8:50 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Gleb Natapov, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, linux-mm, Paul Mackerras, Paolo Bonzini,
	David Gibson
In-Reply-To: <1377679070-3515-1-git-send-email-aik@ozlabs.ru>

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeting an IOMMU TCE table without passing
them to user space, which saves the time spent switching to user space
and back.

Both real and virtual modes are supported. The kernel first tries to
handle a TCE request in real mode; if that fails, it passes the request
to the virtual mode handler to complete the operation. If the virtual
mode handler fails, the request is passed to user space.
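
The fallback order described above can be sketched in plain C. This is
only an illustration under stated assumptions: the enum names, the
numeric values and dispatch_tce() are hypothetical and are not part of
the patch; the sketch only mirrors the idea that a handler returns
"too hard" to hand the request to the next level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; the names echo the hypercall return values
 * used in the patch, but the numeric values here are made up. */
enum tce_result { TCE_SUCCESS = 0, TCE_TOO_HARD = 1, TCE_HARDWARE = 2 };

/* Where the request ended up being handled. */
enum tce_handled_by { BY_REAL_MODE, BY_VIRT_MODE, BY_USER_SPACE };

typedef enum tce_result (*tce_handler_fn)(unsigned long ioba,
		unsigned long tce);

/*
 * Try the real mode handler first, then the virtual mode handler.
 * TCE_TOO_HARD from a handler means "cannot finish this here";
 * any other value is final.
 */
static enum tce_handled_by dispatch_tce(tce_handler_fn real_mode,
		tce_handler_fn virt_mode,
		unsigned long ioba, unsigned long tce,
		enum tce_result *result)
{
	assert(real_mode && virt_mode && result);

	*result = real_mode(ioba, tce);
	if (*result != TCE_TOO_HARD)
		return BY_REAL_MODE;

	*result = virt_mode(ioba, tce);
	if (*result != TCE_TOO_HARD)
		return BY_VIRT_MODE;

	/* Neither in-kernel handler could complete: go to user space. */
	return BY_USER_SPACE;
}

/* Stub handlers, for illustration only. */
static enum tce_result always_ok(unsigned long ioba, unsigned long tce)
{
	(void)ioba;
	(void)tce;
	return TCE_SUCCESS;
}

static enum tce_result always_too_hard(unsigned long ioba, unsigned long tce)
{
	(void)ioba;
	(void)tce;
	return TCE_TOO_HARD;
}
```

With the stubs above, passing always_too_hard for both handlers ends in
BY_USER_SPACE, while a successful real mode handler short-circuits the
chain.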

The first user of this is VFIO on POWER. Trampolines to the VFIO external
user API functions are required for this patch.

This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
number (LIOBN) with a VFIO IOMMU group fd and enable in-kernel handling
of map/unmap requests. The device supports a single attribute which is
a struct with LIOBN and IOMMU fd. When the attribute is set, the device
establishes the connection between KVM and VFIO.
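
As a rough illustration of the attribute payload, the sketch below
mirrors the linkage struct from the documentation added by this patch,
using <stdint.h> types in place of __u64/__u32 so that it builds
outside the kernel. The struct name spapr_tce_iommu_linkage and the
helper linkage_init() are hypothetical, and leaving flags at zero is an
assumption, since the patch does not define any flag bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field order and widths as kvm_create_spapr_tce_iommu_linkage. */
struct spapr_tce_iommu_linkage {
	uint64_t liobn;
	uint32_t fd;
	uint32_t flags;
};

/*
 * Fill in the payload that user space would point the device
 * attribute at. flags stays zero (assumption: no flags defined yet).
 */
static void linkage_init(struct spapr_tce_iommu_linkage *link,
		uint64_t liobn, int vfio_group_fd)
{
	assert(link != NULL);
	assert(vfio_group_fd >= 0);

	memset(link, 0, sizeof(*link));
	link->liobn = liobn;
	link->fd = (uint32_t)vfio_group_fd;
}
```

User space would then presumably pass the address of this struct in the
addr field of a struct kvm_device_attr whose group is
KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE, which is what the set_attr
handler in this patch reads.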

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

---

Changes:
v9:
* KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
KVM device
* release_spapr_tce_table() is not shared between different TCE types
* reduced the patch size by moving VFIO external API
trampolines to a separate patch
* moved documentation from Documentation/virtual/kvm/api.txt to
Documentation/virtual/kvm/devices/spapr_tce_iommu.txt

v8:
* fixed warnings from checkpatch.pl

2013/07/11:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
for KVM_BOOK3S_64
* kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
for this here but the next patch for hugepages support will use it more.

2013/07/06:
* added realmode arch_spin_lock to protect TCE table from races
in real and virtual modes
* POWERPC IOMMU API is changed to support real mode
* iommu_take_ownership and iommu_release_ownership are protected by
iommu_table's locks
* VFIO external user API use rewritten
* multiple small fixes

2013/06/27:
* tce_list page is referenced now in order to protect it from accidental
invalidation during H_PUT_TCE_INDIRECT execution
* added use of the external user VFIO API

2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number

2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
translated TCEs, tries realmode_get_page() on those and if it fails, it
passes control to the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when user mode
did not register a TCE table in the kernel; in all other cases the virtual
mode handler is expected to do the job
---
 .../virtual/kvm/devices/spapr_tce_iommu.txt        |  37 +++
 arch/powerpc/include/asm/kvm_host.h                |   4 +
 arch/powerpc/kvm/book3s_64_vio.c                   | 310 ++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c                | 122 ++++++++
 arch/powerpc/kvm/powerpc.c                         |   1 +
 include/linux/kvm_host.h                           |   1 +
 virt/kvm/kvm_main.c                                |   5 +
 7 files changed, 477 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt

diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
new file mode 100644
index 0000000..4bc8fc3
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
@@ -0,0 +1,37 @@
+SPAPR TCE IOMMU device
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+
+Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
+
+Groups:
+  KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
+  Attributes: single attribute with pair { LIOBN, IOMMU fd}
+
+This is a completely made-up device which provides an API to link
+a logical bus number (LIOBN) and an IOMMU group. User space has
+to create a new SPAPR TCE IOMMU device per logical bus.
+
+LIOBN is a PCI bus identifier from PPC64-server (sPAPR) DMA hypercalls
+(H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE).
+IOMMU group is a minimal isolated device set which can be passed to
+the user space via VFIO.
+
+Right after creation the device is in an uninitialized state and requires
+a KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute to be set.
+The attribute contains liobn, IOMMU fd and flags:
+
+struct kvm_create_spapr_tce_iommu_linkage {
+	__u64 liobn;
+	__u32 fd;
+	__u32 flags;
+};
+
+The user space creates the SPAPR TCE IOMMU device, obtains
+an IOMMU fd via VFIO ABI and sets the attribute to the SPAPR TCE IOMMU
+device. At the moment of setting the attribute, the SPAPR TCE IOMMU
+device links the LIOBN to the IOMMU group and takes the necessary steps
+to make sure that the VFIO group will not disappear before KVM is destroyed.
+
+The kernel advertises this feature via KVM_CAP_SPAPR_TCE_IOMMU capability.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index a23f132..c1a039d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,9 @@ struct kvmppc_spapr_tce_table {
 	struct kvm *kvm;
 	u64 liobn;
 	u32 window_size;
+	struct iommu_group *grp;		/* used for IOMMU groups */
+	struct kvm_create_spapr_tce_iommu_linkage link;
+	struct vfio_group *vfio_grp;		/* used for IOMMU groups */
 	struct page *pages[0];
 };
 
@@ -612,6 +615,7 @@ struct kvm_vcpu_arch {
 	u64 busy_preempt;
 
 	unsigned long *tce_tmp_hpas;	/* TCE cache for TCE_PUT_INDIRECT */
+	unsigned long tce_tmp_num;	/* Number of handled TCEs in cache */
 	enum {
 		TCERM_NONE,
 		TCERM_GETPAGE,
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 047b94c..95f9e1a 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -29,6 +29,8 @@
 #include <linux/anon_inodes.h>
 #include <linux/module.h>
 #include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/file.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -201,9 +203,164 @@ fail:
 	return ret;
 }
 
-/* Converts guest physical address to host virtual address */
+static int kvmppc_spapr_tce_iommu_create(struct kvm_device *dev, u32 type)
+{
+	return 0;
+}
+
+static long kvmppc_spapr_tce_iommu_link(struct kvm_device *dev,
+		struct kvm_create_spapr_tce_iommu_linkage *args)
+{
+	struct kvm *kvm = dev->kvm;
+	struct kvmppc_spapr_tce_table *tt = NULL;
+	struct iommu_group *grp;
+	struct iommu_table *tbl;
+	struct file *vfio_filp;
+	struct vfio_group *vfio_grp;
+	int ret = -ENXIO, iommu_id;
+
+	/* Check this LIOBN hasn't been previously registered */
+	list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == args->liobn)
+			return -EBUSY;
+	}
+
+	vfio_filp = fget(args->fd);
+	if (!vfio_filp)
+		return -ENXIO;
+
+	/*
+	 * Lock the group. Fails if group is not viable or
+	 * does not have IOMMU set
+	 */
+	vfio_grp = kvmppc_vfio_group_get_external_user(vfio_filp);
+	if (IS_ERR_VALUE((unsigned long)vfio_grp))
+		goto fput_exit;
+
+	/* Get IOMMU ID, find iommu_group and iommu_table*/
+	iommu_id = kvmppc_vfio_external_user_iommu_id(vfio_grp);
+	if (iommu_id < 0)
+		goto grpput_fput_exit;
+
+	grp = iommu_group_get_by_id(iommu_id);
+	if (!grp)
+		goto grpput_fput_exit;
+
+	tbl = iommu_group_get_iommudata(grp);
+	if (!tbl)
+		goto grpput_fput_exit;
+
+	/* Create a TCE table descriptor and add into the descriptor list */
+	tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+	if (!tt)
+		goto grpput_fput_exit;
+
+	tt->liobn = args->liobn;
+	tt->grp = grp;
+	tt->window_size = tbl->it_size << IOMMU_PAGE_SHIFT;
+	tt->vfio_grp = vfio_grp;
+
+	/* Add the TCE table descriptor to the descriptor list */
+	mutex_lock(&kvm->lock);
+	list_add(&tt->list, &kvm->arch.spapr_tce_tables);
+	mutex_unlock(&kvm->lock);
+
+	dev->private = tt;
+	tt->link = *args;
+
+	ret = 0;
+
+	goto fput_exit;
+
+grpput_fput_exit:
+	kvmppc_vfio_group_put_external_user(vfio_grp);
+fput_exit:
+	fput(vfio_filp);
+
+	return ret;
+}
+
+static int kvmppc_spapr_tce_iommu_set_attr(struct kvm_device *dev,
+		struct kvm_device_attr *attr)
+{
+	struct kvmppc_spapr_tce_table *tt = dev->private;
+	struct kvm_create_spapr_tce_iommu_linkage args;
+	void __user *argp = (void __user *) attr->addr;
+
+	switch (attr->group) {
+	case KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE:
+		if (tt)
+			return -EBUSY;
+
+		if (copy_from_user(&args, argp, sizeof(args)))
+			return -EFAULT;
+
+		return kvmppc_spapr_tce_iommu_link(dev, &args);
+	}
+	return -ENXIO;
+}
+
+static int kvmppc_spapr_tce_iommu_get_attr(struct kvm_device *dev,
+		struct kvm_device_attr *attr)
+{
+	struct kvmppc_spapr_tce_table *tt = dev->private;
+	void __user *argp = (void __user *) attr->addr;
+
+	switch (attr->group) {
+	case KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE:
+		if (!tt)
+			return -EFAULT;
+		if (copy_to_user(argp, &tt->link, sizeof(tt->link)))
+			return -EFAULT;
+		return 0;
+	}
+	return -ENXIO;
+}
+
+static int kvmppc_spapr_tce_iommu_has_attr(struct kvm_device *dev,
+		struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE:
+		return 0;
+	}
+	return -ENXIO;
+}
+
+static void kvmppc_spapr_tce_iommu_destroy(struct kvm_device *dev)
+{
+	struct kvmppc_spapr_tce_table *tt = dev->private;
+	struct kvm *kvm = dev->kvm;
+
+	if (tt) {
+		mutex_lock(&kvm->lock);
+		list_del(&tt->list);
+
+		if (tt->vfio_grp)
+			kvmppc_vfio_group_put_external_user(tt->vfio_grp);
+		iommu_group_put(tt->grp);
+
+		kfree(tt);
+		mutex_unlock(&kvm->lock);
+	}
+	kfree(dev);
+}
+
+struct kvm_device_ops kvmppc_spapr_tce_iommu_ops = {
+	.name = "kvm-spapr-tce-iommu",
+	.create = kvmppc_spapr_tce_iommu_create,
+	.set_attr = kvmppc_spapr_tce_iommu_set_attr,
+	.get_attr = kvmppc_spapr_tce_iommu_get_attr,
+	.has_attr = kvmppc_spapr_tce_iommu_has_attr,
+	.destroy = kvmppc_spapr_tce_iommu_destroy,
+};
+
+/*
+ * Converts guest physical address to host virtual address.
+ * Also returns host physical address which is to put to TCE table.
+ */
 static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
-		unsigned long gpa, struct page **pg)
+		unsigned long gpa, struct page **pg, unsigned long *phpa)
 {
 	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
 	struct kvm_memory_slot *memslot;
@@ -218,9 +375,140 @@ static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
 	if (get_user_pages_fast(hva & PAGE_MASK, 1, is_write, pg) != 1)
 		return ERROR_ADDR;
 
+	if (phpa)
+		*phpa = __pa((unsigned long) page_address(*pg)) |
+				(hva & ~PAGE_MASK);
+
 	return (void *) hva;
 }
 
+long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	struct page *pg = NULL;
+	unsigned long hpa;
+	void __user *hva;
+	struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+	if (!tbl)
+		return H_RESCINDED;
+
+	/* Clear TCE */
+	if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT,
+				1, false))
+			return H_HARDWARE;
+
+		return H_SUCCESS;
+	}
+
+	/* Put TCE */
+	if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
+		/* Retry iommu_tce_build if it failed in real mode */
+		vcpu->arch.tce_rm_fail = TCERM_NONE;
+		hpa = vcpu->arch.tce_tmp_hpas[0];
+	} else {
+		if (iommu_tce_put_param_check(tbl, ioba, tce))
+			return H_PARAMETER;
+
+		hva = kvmppc_gpa_to_hva_and_get(vcpu, tce, &pg, &hpa);
+		if (hva == ERROR_ADDR)
+			return H_HARDWARE;
+	}
+
+	if (!iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT, &hpa, 1, false))
+		return H_SUCCESS;
+
+	pg = pfn_to_page(hpa >> PAGE_SHIFT);
+	if (pg)
+		put_page(pg);
+
+	return H_HARDWARE;
+}
+
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt, unsigned long ioba,
+		unsigned long __user *tces, unsigned long npages)
+{
+	long i = 0, start = 0;
+	struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+	if (!tbl)
+		return H_RESCINDED;
+
+	switch (vcpu->arch.tce_rm_fail) {
+	case TCERM_NONE:
+		break;
+	case TCERM_GETPAGE:
+		start = vcpu->arch.tce_tmp_num;
+		break;
+	case TCERM_PUTTCE:
+		goto put_tces;
+	case TCERM_PUTLIST:
+	default:
+		WARN_ON(1);
+		return H_HARDWARE;
+	}
+
+	for (i = start; i < npages; ++i) {
+		struct page *pg = NULL;
+		unsigned long gpa;
+		void __user *hva;
+
+		if (get_user(gpa, tces + i))
+			return H_HARDWARE;
+
+		if (iommu_tce_put_param_check(tbl, ioba +
+					(i << IOMMU_PAGE_SHIFT), gpa))
+			return H_PARAMETER;
+
+		hva = kvmppc_gpa_to_hva_and_get(vcpu, gpa, &pg,
+				&vcpu->arch.tce_tmp_hpas[i]);
+		if (hva == ERROR_ADDR)
+			goto putpages_flush_exit;
+	}
+
+put_tces:
+	if (!iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT,
+			vcpu->arch.tce_tmp_hpas, npages, false))
+		return H_SUCCESS;
+
+putpages_flush_exit:
+	for (--i; i >= 0; --i) {
+		struct page *pg;
+		pg = pfn_to_page(vcpu->arch.tce_tmp_hpas[i] >> PAGE_SHIFT);
+		if (pg)
+			put_page(pg);
+	}
+
+	return H_HARDWARE;
+}
+
+long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	if (!tbl)
+		return H_RESCINDED;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	if (iommu_free_tces(tbl, entry, npages, false))
+		return H_HARDWARE;
+
+	return H_SUCCESS;
+}
+
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce)
@@ -232,6 +520,10 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+	if (tt->grp)
+		return kvmppc_h_put_tce_iommu(vcpu, tt, liobn, ioba, tce);
+
+	/* Emulated IO */
 	if (ioba >= tt->window_size)
 		return H_PARAMETER;
 
@@ -270,13 +562,20 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
 
-	tces = kvmppc_gpa_to_hva_and_get(vcpu, tce_list, &pg);
+	tces = kvmppc_gpa_to_hva_and_get(vcpu, tce_list, &pg, NULL);
 	if (tces == ERROR_ADDR)
 		return H_TOO_HARD;
 
 	if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
 		goto put_list_page_exit;
 
+	if (tt->grp) {
+		ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+			tt, ioba, tces, npages);
+		goto put_list_page_exit;
+	}
+
+	/* Emulated IO */
 	for (i = 0; i < npages; ++i) {
 		if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
 			ret = H_PARAMETER;
@@ -315,6 +614,11 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+	if (tt->grp)
+		return kvmppc_h_stuff_tce_iommu(vcpu, tt, liobn, ioba,
+				tce_value, npages);
+
+	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 2b2ce0a..c647990 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/list.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -191,6 +192,111 @@ static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
 	return hpa;
 }
 
+static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt, unsigned long liobn,
+		unsigned long ioba, unsigned long tce)
+{
+	int ret = 0;
+	struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+	unsigned long hpa;
+	struct page *pg = NULL;
+
+	if (!tbl)
+		return H_RESCINDED;
+
+	/* Clear TCE */
+	if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+		if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+			return H_PARAMETER;
+
+		if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT, 1, true))
+			return H_TOO_HARD;
+
+		return H_SUCCESS;
+	}
+
+	/* Put TCE */
+	if (iommu_tce_put_param_check(tbl, ioba, tce))
+		return H_PARAMETER;
+
+	hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce, &pg);
+	if (hpa != ERROR_ADDR) {
+		ret = iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT,
+				&hpa, 1, true);
+	}
+
+	if (((hpa == ERROR_ADDR) && pg) || ret) {
+		vcpu->arch.tce_tmp_hpas[0] = hpa;
+		vcpu->arch.tce_tmp_num = 0;
+		vcpu->arch.tce_rm_fail = TCERM_PUTTCE;
+		return H_TOO_HARD;
+	}
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt, unsigned long ioba,
+		unsigned long *tces, unsigned long npages)
+{
+	int i, ret;
+	unsigned long hpa;
+	struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+	struct page *pg = NULL;
+
+	if (!tbl)
+		return H_RESCINDED;
+
+	/* Check all TCEs */
+	for (i = 0; i < npages; ++i) {
+		if (iommu_tce_put_param_check(tbl, ioba +
+				(i << IOMMU_PAGE_SHIFT), tces[i]))
+			return H_PARAMETER;
+	}
+
+	/* Translate TCEs and go get_page() */
+	for (i = 0; i < npages; ++i) {
+		hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tces[i], &pg);
+		if (hpa == ERROR_ADDR) {
+			vcpu->arch.tce_tmp_num = i;
+			vcpu->arch.tce_rm_fail = TCERM_GETPAGE;
+			return H_TOO_HARD;
+		}
+		vcpu->arch.tce_tmp_hpas[i] = hpa;
+	}
+
+	/* Put TCEs to the table */
+	ret = iommu_tce_build(tbl, (ioba >> IOMMU_PAGE_SHIFT),
+			vcpu->arch.tce_tmp_hpas, npages, true);
+	if (ret == -EAGAIN) {
+		vcpu->arch.tce_rm_fail = TCERM_PUTTCE;
+		return H_TOO_HARD;
+	} else if (ret) {
+		return H_HARDWARE;
+	}
+
+	return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+		struct kvmppc_spapr_tce_table *tt,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+	if (!tbl)
+		return H_RESCINDED;
+
+	if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+		return H_PARAMETER;
+
+	if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT, npages, true))
+		return H_TOO_HARD;
+
+	return H_SUCCESS;
+}
+
 long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
@@ -201,6 +307,10 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (!tt)
 		return H_TOO_HARD;
 
+	if (tt->grp)
+		return kvmppc_rm_h_put_tce_iommu(vcpu, tt, liobn, ioba, tce);
+
+	/* Emulated IO */
 	if (ioba >= tt->window_size)
 		return H_PARAMETER;
 
@@ -243,6 +353,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		goto put_unlock_exit;
 	}
 
+	if (tt->grp) {
+		ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
+			tt, ioba, (unsigned long *)tces, npages);
+		goto put_unlock_exit;
+	}
+
+	/* Emulated IO */
 	for (i = 0; i < npages; ++i) {
 		ret = kvmppc_tce_validate(((unsigned long *)tces)[i]);
 		if (ret)
@@ -273,6 +390,11 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!tt)
 		return H_TOO_HARD;
 
+	if (tt->grp)
+		return kvmppc_rm_h_stuff_tce_iommu(vcpu, tt, liobn, ioba,
+				tce_value, npages);
+
+	/* Emulated IO */
 	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
 		return H_PARAMETER;
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index ccb578b..ea9af59 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -395,6 +395,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = 1;
 		break;
 	case KVM_CAP_SPAPR_MULTITCE:
+	case KVM_CAP_SPAPR_TCE_IOMMU:
 		r = 1;
 		break;
 #endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7a8c1b3..016df11 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1053,6 +1053,7 @@ struct kvm_device *kvm_device_from_filp(struct file *filp);
 
 extern struct kvm_device_ops kvm_mpic_ops;
 extern struct kvm_device_ops kvm_xics_ops;
+extern struct kvm_device_ops kvmppc_spapr_tce_iommu_ops;
 
 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1580dd4..34c3c22 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2282,6 +2282,11 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
 		ops = &kvm_xics_ops;
 		break;
 #endif
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+	case KVM_DEV_TYPE_SPAPR_TCE_IOMMU:
+		ops = &kvmppc_spapr_tce_iommu_ops;
+		break;
+#endif
 	default:
 		return -ENODEV;
 	}
-- 
1.8.4.rc4

^ permalink raw reply related

* [PATCH v9 11/13] KVM: PPC: add trampolines for VFIO external API
From: Alexey Kardashevskiy @ 2013-08-28  8:50 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Gleb Natapov, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, linux-mm, Paul Mackerras, Paolo Bonzini,
	David Gibson
In-Reply-To: <1377679070-3515-1-git-send-email-aik@ozlabs.ru>

KVM is going to use VFIO's external API. However, KVM can operate even
when VFIO is not compiled or loaded, so KVM is linked to VFIO dynamically.

This adds proxy functions for VFIO external API.
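
The proxy pattern can be sketched outside the kernel as follows. This
is not the kernel's symbol_get()/symbol_put() API: fake_symbol_get(),
fake_symbol_put(), the symbol table and trampoline_iommu_id() are all
hypothetical stand-ins that only show the shape of "look the symbol up,
fail gracefully if it is absent, call it, drop the reference".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

typedef int (*iommu_id_fn)(int group);

/* Hypothetical stand-in for a module's exported symbol table. */
struct fake_symbol {
	const char *name;
	iommu_id_fn fn;
	int refcount;
};

/* Pretend implementation living in the optional module. */
static int fake_iommu_id(int group)
{
	return group + 100;
}

static struct fake_symbol fake_symbols[] = {
	{ "external_user_iommu_id", fake_iommu_id, 0 },
};

/* Look up a symbol by name and take a reference; NULL if not loaded. */
static struct fake_symbol *fake_symbol_get(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(fake_symbols) / sizeof(fake_symbols[0]); i++) {
		if (strcmp(fake_symbols[i].name, name) == 0) {
			fake_symbols[i].refcount++;
			return &fake_symbols[i];
		}
	}
	return NULL;
}

/* Drop the reference taken by fake_symbol_get(). */
static void fake_symbol_put(struct fake_symbol *sym)
{
	sym->refcount--;
}

/* Trampoline: call through the symbol if present, else fail with -EINVAL. */
static int trampoline_iommu_id(const char *name, int group)
{
	int ret;
	struct fake_symbol *sym = fake_symbol_get(name);

	if (!sym)
		return -EINVAL;

	ret = sym->fn(group);
	fake_symbol_put(sym);

	return ret;
}
```

The caller never links against the optional module directly, so it
keeps working (with an error return) when the module is missing.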

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/book3s_64_vio.c | 49 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index cae1099..047b94c 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,8 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/module.h>
+#include <linux/vfio.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -42,6 +44,53 @@
 
 #define ERROR_ADDR      ((void *)~(unsigned long)0x0)
 
+/*
+ * Dynamically linked version of the external user VFIO API.
+ *
+ * As IOMMU group access control is implemented by VFIO,
+ * there is an API to verify that a specific process can own
+ * a group. As KVM may run when VFIO is not loaded, KVM is not
+ * linked statically to VFIO, instead wrappers are used.
+ */
+struct vfio_group *kvmppc_vfio_group_get_external_user(struct file *filep)
+{
+	struct vfio_group *ret;
+	struct vfio_group * (*proc)(struct file *) =
+			symbol_get(vfio_group_get_external_user);
+	if (!proc)
+		return NULL;
+
+	ret = proc(filep);
+	symbol_put(vfio_group_get_external_user);
+
+	return ret;
+}
+
+void kvmppc_vfio_group_put_external_user(struct vfio_group *group)
+{
+	void (*proc)(struct vfio_group *) =
+			symbol_get(vfio_group_put_external_user);
+	if (!proc)
+		return;
+
+	proc(group);
+	symbol_put(vfio_group_put_external_user);
+}
+
+int kvmppc_vfio_external_user_iommu_id(struct vfio_group *group)
+{
+	int ret;
+	int (*proc)(struct vfio_group *) =
+			symbol_get(vfio_external_user_iommu_id);
+	if (!proc)
+		return -EINVAL;
+
+	ret = proc(group);
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
+
 static long kvmppc_stt_npages(unsigned long window_size)
 {
 	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
-- 
1.8.4.rc4

^ permalink raw reply related

* [PATCH v9 10/13] KVM: PPC: remove warning from kvmppc_core_destroy_vm
From: Alexey Kardashevskiy @ 2013-08-28  8:49 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Gleb Natapov, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, linux-mm, Paul Mackerras, Paolo Bonzini,
	David Gibson
In-Reply-To: <1377679070-3515-1-git-send-email-aik@ozlabs.ru>

Book3S KVM implements in-kernel TCE tables via a list of
kvmppc_spapr_tce_table structs (created per KVM). Entries in the list are per LIOBN
(logical bus number) and have a TCE table so DMA hypercalls (such as
H_PUT_TCE) can convert LIOBN to a TCE table.

The entry in the list is created via KVM_CREATE_SPAPR_TCE ioctl which
returns an anonymous fd. This fd is used to map the TCE table to the user
space and it also defines the lifetime of the TCE table in the kernel.
Every list item also holds a link to KVM, so when KVM is about to be
destroyed, all kvmppc_spapr_tce_table objects are expected to be
released and removed from the global list. This is what the warning
verifies.

The upcoming VFIO IOMMU support will extend the use of
kvmppc_spapr_tce_table. Unlike emulated devices, it will create
kvmppc_spapr_tce_table structs via the new KVM device API, which opens
an anonymous fd (as KVM_CREATE_SPAPR_TCE does), but the release callback
does not call the KVM device's destroy callback immediately. Instead,
KVM device destruction is postponed until KVM destruction, which happens
after the arch-specific KVM destroy function, so the warning raises
a false alarm.

This removes the warning as it never happens in real life and there is no
obvious place to put it.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/book3s_hv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9e823ad..5f15ff7 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1974,7 +1974,6 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
 	kvmppc_rtas_tokens_free(kvm);

 	kvmppc_free_hpt(kvm);
-	WARN_ON(!list_empty(&kvm->arch.spapr_tce_tables));
 }

 /* These are stubs for now */
-- 
1.8.4.rc4

^ permalink raw reply related

* [PATCH v9 09/13] powerpc/iommu: rework to support realmode
From: Alexey Kardashevskiy @ 2013-08-28  8:37 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Gleb Natapov, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, linux-mm, Paul Mackerras, Paolo Bonzini,
	David Gibson
In-Reply-To: <1377679070-3515-1-git-send-email-aik@ozlabs.ru>

The TCE tables handling may differ for real and virtual modes so
additional ppc_md.tce_build_rm/ppc_md.tce_free_rm/ppc_md.tce_flush_rm
handlers were introduced earlier.

So this adds the following:
1. support for the new ppc_md calls;
2. ability for iommu_tce_build to process multiple entries per
call;
3. arch_spin_lock to protect TCE table from races in both real and virtual
modes;
4. proper TCE table protection from races with the existing IOMMU code
in iommu_take_ownership/iommu_release_ownership;
5. hwaddr variable renamed to hpa as it better describes what it
actually represents;
6. iommu_tce_direction is static now as it is not called from anywhere else.

This will be used by upcoming real mode support of VFIO on POWER.
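
Items 2 and 3 above (multiple entries per call, guarded by a spinlock)
can be illustrated with a small user-space sketch. Everything here is
an assumption made for illustration: struct sketch_table,
sketch_tce_build(), the fixed table size, the C11 atomic_flag spinlock
and the all-or-nothing -EBUSY policy are hypothetical and do not
reproduce the actual iommu_tce_build() implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_TABLE_SIZE 8

/* Toy TCE table: 0 means "entry free", anything else is a stored HPA. */
struct sketch_table {
	atomic_flag lock;
	unsigned long entries[SKETCH_TABLE_SIZE];
};

static void sketch_table_init(struct sketch_table *tbl)
{
	size_t i;

	assert(tbl != NULL);
	atomic_flag_clear(&tbl->lock);
	for (i = 0; i < SKETCH_TABLE_SIZE; i++)
		tbl->entries[i] = 0;
}

/*
 * Store npages host physical addresses starting at 'entry' under one
 * lock acquisition. If any target entry is already in use, nothing is
 * written and -EBUSY is returned (assumed policy for this sketch).
 */
static int sketch_tce_build(struct sketch_table *tbl, unsigned long entry,
		const unsigned long *hpas, unsigned long npages)
{
	unsigned long i;
	int ret = 0;

	assert(tbl != NULL && hpas != NULL);
	if (entry >= SKETCH_TABLE_SIZE || npages > SKETCH_TABLE_SIZE - entry)
		return -EINVAL;

	/* Busy-wait until the lock is free, as a spinlock would. */
	while (atomic_flag_test_and_set_explicit(&tbl->lock,
			memory_order_acquire))
		;

	for (i = 0; i < npages; i++) {
		if (tbl->entries[entry + i]) {
			ret = -EBUSY;
			break;
		}
	}

	if (!ret) {
		for (i = 0; i < npages; i++)
			tbl->entries[entry + i] = hpas[i];
	}

	atomic_flag_clear_explicit(&tbl->lock, memory_order_release);

	return ret;
}
```

Checking the whole range before writing keeps the table consistent when
a batch fails halfway, which is the main reason to hold a single lock
across the batch rather than locking per entry.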

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v8:
* fixed warnings from checkpatch.pl
---
 arch/powerpc/include/asm/iommu.h |   9 +-
 arch/powerpc/kernel/iommu.c      | 198 ++++++++++++++++++++++++++-------------
 2 files changed, 136 insertions(+), 71 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 19ad77f..71ee525 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,7 @@ struct iommu_table {
 	unsigned long *it_map;       /* A simple allocation bitmap for now */
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
+	arch_spinlock_t it_rm_lock;
 #endif
 };
 
@@ -161,9 +162,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
 extern int iommu_tce_put_param_check(struct iommu_table *tbl,
 		unsigned long ioba, unsigned long tce);
 extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
-		unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
-		unsigned long entry);
+		unsigned long *hpas, unsigned long npages, bool rm);
+extern int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+		unsigned long npages, bool rm);
 extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
 		unsigned long entry, unsigned long pages);
 extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
@@ -173,7 +174,5 @@ extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
 extern void iommu_release_ownership(struct iommu_table *tbl);
 
-extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
-
 #endif /* __KERNEL__ */
 #endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 15f8ca8..ff0cd90 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -903,7 +903,7 @@ void iommu_register_group(struct iommu_table *tbl,
 	kfree(name);
 }
 
-enum dma_data_direction iommu_tce_direction(unsigned long tce)
+static enum dma_data_direction iommu_tce_direction(unsigned long tce)
 {
 	if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
 		return DMA_BIDIRECTIONAL;
@@ -914,7 +914,6 @@ enum dma_data_direction iommu_tce_direction(unsigned long tce)
 	else
 		return DMA_NONE;
 }
-EXPORT_SYMBOL_GPL(iommu_tce_direction);
 
 void iommu_flush_tce(struct iommu_table *tbl)
 {
@@ -972,73 +971,117 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
 
-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
-{
-	unsigned long oldtce;
-	struct iommu_pool *pool = get_pool(tbl, entry);
-
-	spin_lock(&(pool->lock));
-
-	oldtce = ppc_md.tce_get(tbl, entry);
-	if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
-		ppc_md.tce_free(tbl, entry, 1);
-	else
-		oldtce = 0;
-
-	spin_unlock(&(pool->lock));
-
-	return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
 int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
 		unsigned long entry, unsigned long pages)
 {
-	unsigned long oldtce;
-	struct page *page;
-
-	for ( ; pages; --pages, ++entry) {
-		oldtce = iommu_clear_tce(tbl, entry);
-		if (!oldtce)
-			continue;
-
-		page = pfn_to_page(oldtce >> PAGE_SHIFT);
-		WARN_ON(!page);
-		if (page) {
-			if (oldtce & TCE_PCI_WRITE)
-				SetPageDirty(page);
-			put_page(page);
-		}
-	}
-
-	return 0;
+	return iommu_free_tces(tbl, entry, pages, false);
 }
 EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
 
-/*
- * hwaddr is a kernel virtual address here (0xc... bazillion),
- * tce_build converts it to a physical address.
- */
+int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+		unsigned long npages, bool rm)
+{
+	int i, ret = 0, to_free = 0;
+
+	if (rm && !ppc_md.tce_free_rm)
+		return -EAGAIN;
+
+	arch_spin_lock(&tbl->it_rm_lock);
+
+	for (i = 0; i < npages; ++i) {
+		unsigned long oldtce = ppc_md.tce_get(tbl, entry + i);
+		if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
+			continue;
+
+		if (rm) {
+			struct page *pg = realmode_pfn_to_page(
+					oldtce >> PAGE_SHIFT);
+			if (!pg) {
+				ret = -EAGAIN;
+			} else if (PageCompound(pg)) {
+				ret = -EAGAIN;
+			} else {
+				if (oldtce & TCE_PCI_WRITE)
+					SetPageDirty(pg);
+				if (!put_page_unless_one(pg))
+					ret = -EAGAIN;
+			}
+		} else {
+			struct page *pg = pfn_to_page(oldtce >> PAGE_SHIFT);
+			if (!pg) {
+				ret = -EAGAIN;
+			} else {
+				if (oldtce & TCE_PCI_WRITE)
+					SetPageDirty(pg);
+				put_page(pg);
+			}
+		}
+		if (ret)
+			break;
+		to_free = i + 1;
+	}
+
+	if (to_free) {
+		if (rm)
+			ppc_md.tce_free_rm(tbl, entry, to_free);
+		else
+			ppc_md.tce_free(tbl, entry, to_free);
+
+		if (rm && ppc_md.tce_flush_rm)
+			ppc_md.tce_flush_rm(tbl);
+		else if (!rm && ppc_md.tce_flush)
+			ppc_md.tce_flush(tbl);
+	}
+	arch_spin_unlock(&tbl->it_rm_lock);
+
+	/* Make sure updates are seen by hardware */
+	mb();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_free_tces);
+
 int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
-		unsigned long hwaddr, enum dma_data_direction direction)
+		unsigned long *hpas, unsigned long npages, bool rm)
 {
-	int ret = -EBUSY;
-	unsigned long oldtce;
-	struct iommu_pool *pool = get_pool(tbl, entry);
+	int i, ret = 0;
 
-	spin_lock(&(pool->lock));
+	if (rm && !ppc_md.tce_build_rm)
+		return -EAGAIN;
 
-	oldtce = ppc_md.tce_get(tbl, entry);
-	/* Add new entry if it is not busy */
-	if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
-		ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
+	arch_spin_lock(&tbl->it_rm_lock);
 
-	spin_unlock(&(pool->lock));
+	for (i = 0; i < npages; ++i) {
+		if (ppc_md.tce_get(tbl, entry + i) &
+				(TCE_PCI_WRITE | TCE_PCI_READ)) {
+			arch_spin_unlock(&tbl->it_rm_lock);
+			return -EBUSY;
+		}
+	}
 
-	/* if (unlikely(ret))
-		pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
-				__func__, hwaddr, entry << IOMMU_PAGE_SHIFT,
-				hwaddr, ret); */
+	for (i = 0; i < npages; ++i) {
+		unsigned long hva = (unsigned long) __va(hpas[i]);
+		enum dma_data_direction dir = iommu_tce_direction(hva);
+
+		if (rm)
+			ret = ppc_md.tce_build_rm(tbl, entry + i, 1,
+					hva, dir, NULL);
+		else
+			ret = ppc_md.tce_build(tbl, entry + i, 1,
+					hva, dir, NULL);
+		if (ret)
+			break;
+	}
+
+	if (rm && ppc_md.tce_flush_rm)
+		ppc_md.tce_flush_rm(tbl);
+	else if (!rm && ppc_md.tce_flush)
+		ppc_md.tce_flush(tbl);
+
+	arch_spin_unlock(&tbl->it_rm_lock);
+
+	/* Make sure updates are seen by hardware */
+	mb();
 
 	return ret;
 }
@@ -1049,7 +1092,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
 {
 	int ret;
 	struct page *page = NULL;
-	unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK & ~PAGE_MASK;
+	unsigned long hpa, offset = tce & IOMMU_PAGE_MASK & ~PAGE_MASK;
 	enum dma_data_direction direction = iommu_tce_direction(tce);
 
 	ret = get_user_pages_fast(tce & PAGE_MASK, 1,
@@ -1059,9 +1102,9 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
 				tce, entry << IOMMU_PAGE_SHIFT, ret); */
 		return -EFAULT;
 	}
-	hwaddr = (unsigned long) page_address(page) + offset;
+	hpa = __pa((unsigned long) page_address(page)) + offset;
 
-	ret = iommu_tce_build(tbl, entry, hwaddr, direction);
+	ret = iommu_tce_build(tbl, entry, &hpa, 1, false);
 	if (ret)
 		put_page(page);
 
@@ -1075,18 +1118,32 @@ EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
 
 int iommu_take_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+	int ret = 0;
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
 
 	if (tbl->it_offset == 0)
 		clear_bit(0, tbl->it_map);
 
 	if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
 		pr_err("iommu_tce: it_map is not empty");
-		return -EBUSY;
+		ret = -EBUSY;
+		if (tbl->it_offset == 0)
+			clear_bit(1, tbl->it_map);
+
+	} else {
+		memset(tbl->it_map, 0xff, sz);
 	}
 
-	memset(tbl->it_map, 0xff, sz);
-	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
+
+	if (!ret)
+		iommu_free_tces(tbl, tbl->it_offset, tbl->it_size, false);
 
 	return 0;
 }
@@ -1094,14 +1151,23 @@ EXPORT_SYMBOL_GPL(iommu_take_ownership);
 
 void iommu_release_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+
+	iommu_free_tces(tbl, tbl->it_offset, tbl->it_size, false);
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
 
-	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
 	memset(tbl->it_map, 0, sz);
 
 	/* Restore bit#0 set by iommu_init_table() */
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
+
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
-- 
1.8.4.rc4

^ permalink raw reply related

* [PATCH v9 08/13] KVM: PPC: Add support for multiple-TCE hcalls
From: Alexey Kardashevskiy @ 2013-08-28  8:37 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Gleb Natapov, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, linux-mm, Paul Mackerras, Paolo Bonzini,
	David Gibson
In-Reply-To: <1377679070-3515-1-git-send-email-aik@ozlabs.ru>

This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in virtual
mode in case the real mode handler fails for some reason.
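
As a rough user-space sketch of the "copy and verify first, write
afterwards" idea: sk_tce_validate() and sk_put_indirect() are hypothetical
names, and the SK_* masks, the 4K page assumption and the return codes are
illustrative values rather than the kernel's definitions.

```c
#include <stddef.h>

#define SK_PAGE_MASK	(~0xfffUL)	/* assumes 4K IOMMU pages */
#define SK_TCE_READ	0x1UL		/* assumed permission bits */
#define SK_TCE_WRITE	0x2UL
#define SK_MAX_TCES	512
#define SK_H_SUCCESS	0L
#define SK_H_PARAMETER	(-4L)

/* Only the page address and the two permission bits may be set. */
static long sk_tce_validate(unsigned long tce)
{
	if (tce & ~(SK_PAGE_MASK | SK_TCE_READ | SK_TCE_WRITE))
		return SK_H_PARAMETER;
	return SK_H_SUCCESS;
}

/*
 * Copies the guest list into the cache while validating it, and only
 * touches the table once every entry has passed validation.
 */
static long sk_put_indirect(unsigned long *table, const unsigned long *list,
		size_t npages, unsigned long *cache)
{
	size_t i;

	if (npages > SK_MAX_TCES)
		return SK_H_PARAMETER;

	for (i = 0; i < npages; ++i) {
		long ret = sk_tce_validate(list[i]);

		if (ret != SK_H_SUCCESS)
			return ret;	/* table not modified yet */
		cache[i] = list[i];
	}

	for (i = 0; i < npages; ++i)
		table[i] = cache[i];

	return SK_H_SUCCESS;
}
```

Because the table is written from the cache rather than from the guest
list, a later stage could in principle resume from the cached values.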

This adds a function to convert a guest physical address to a host
virtual address in order to parse a TCE list from H_PUT_TCE_INDIRECT.
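
The address arithmetic behind such a conversion can be sketched as follows.
struct sk_memslot and sk_gpa_to_hva() are hypothetical simplifications (a
linear array instead of KVM's memslot lookup), and 4K pages are assumed.

```c
#include <stddef.h>

#define SK_PAGE_SHIFT	12		/* assumes 4K pages */
#define SK_PAGE_SIZE	(1UL << SK_PAGE_SHIFT)
#define SK_ERROR_ADDR	(~0UL)

/* Simplified memory slot: a run of guest frames backed by user memory. */
struct sk_memslot {
	unsigned long base_gfn;		/* first guest frame number */
	unsigned long npages;		/* number of frames in the slot */
	unsigned long userspace_addr;	/* host virtual address of frame 0 */
};

static unsigned long sk_gpa_to_hva(const struct sk_memslot *slots,
		size_t nslots, unsigned long gpa)
{
	unsigned long gfn = gpa >> SK_PAGE_SHIFT;
	size_t i;

	for (i = 0; i < nslots; ++i) {
		const struct sk_memslot *s = &slots[i];

		if (gfn >= s->base_gfn && gfn - s->base_gfn < s->npages)
			/* slot base + frame offset + offset within the page */
			return s->userspace_addr +
				((gfn - s->base_gfn) << SK_PAGE_SHIFT) +
				(gpa & (SK_PAGE_SIZE - 1));
	}

	return SK_ERROR_ADDR;	/* no slot covers this guest address */
}
```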

This also implements the KVM_CAP_PPC_MULTITCE capability. When present,
the hypercalls mentioned above may or may not be processed successfully
in the kernel based fast path. If they can not be handled by the kernel,
they will get passed on to user space. So user space still has to have
an implementation for these despite the in kernel acceleration.
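
The fallback can be pictured with a small sketch. sk_dispatch() and the two
sample handlers are hypothetical, and the SK_* values are illustrative: a
fast-path handler either produces a result for the guest or reports that the
request is too hard, in which case nothing is returned to the guest and the
call goes on to user space.

```c
#define SK_H_SUCCESS	0L
#define SK_H_TOO_HARD	9999L	/* illustrative "cannot handle here" code */

enum sk_resume {
	SK_RESUME_GUEST,	/* handled in the kernel, return to the guest */
	SK_RESUME_HOST,		/* pass the hypercall on to user space */
};

/* Sample fast-path handlers, for illustration only. */
static long sk_fast_ok(unsigned long arg)
{
	(void)arg;
	return SK_H_SUCCESS;
}

static long sk_fast_too_hard(unsigned long arg)
{
	(void)arg;
	return SK_H_TOO_HARD;
}

static enum sk_resume sk_dispatch(long (*fast)(unsigned long),
		unsigned long arg, long *guest_ret)
{
	long ret = fast(arg);

	if (ret == SK_H_TOO_HARD)
		return SK_RESUME_HOST;	/* guest_ret is left untouched */

	*guest_ret = ret;		/* result the guest will see */
	return SK_RESUME_GUEST;
}
```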

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

---
Changelog:
v8:
* fixed warnings from checkpatch.pl

2013/08/01 (v7):
* realmode_get_page/realmode_put_page use was replaced with
get_page_unless_zero/put_page_unless_one

2013/07/11:
* addressed many, many comments from maintainers

2013/07/06:
* fixed a number of wrong get_page()/put_page() calls

2013/06/27:
* fixed clear of BUSY bit in kvmppc_lookup_pte()
* H_PUT_TCE_INDIRECT does realmode_get_page() now
* KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
* updated doc

2013/06/05:
* fixed typo about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start
writing to TCE table if we cannot get TCEs from the user and they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with fallthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
---
 Documentation/virtual/kvm/api.txt       |  26 +++
 arch/powerpc/include/asm/kvm_host.h     |   9 ++
 arch/powerpc/include/asm/kvm_ppc.h      |  16 +-
 arch/powerpc/kvm/book3s_64_vio.c        | 132 +++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c     | 270 ++++++++++++++++++++++++++++----
 arch/powerpc/kvm/book3s_hv.c            |  41 ++++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   8 +-
 arch/powerpc/kvm/book3s_pr_papr.c       |  35 +++++
 arch/powerpc/kvm/powerpc.c              |   3 +
 9 files changed, 506 insertions(+), 34 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ef925ea..1c8942a 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2382,6 +2382,32 @@ calls by the guest for that service will be passed to userspace to be
 handled.
 
 
+4.86 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+User space should expect that its handlers for these hypercalls
+are not going to be called if user space previously registered LIOBN
+in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+user space might have to advertise it for the guest. For example,
+IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+The hypercalls mentioned above may or may not be processed successfully
+in the kernel based fast path. If they can not be handled by the kernel,
+they will get passed on to user space. So user space still has to have
+an implementation for these despite the in kernel acceleration.
+
+This capability is always enabled.
+
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index af326cd..a23f132 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
 #include <linux/kvm_para.h>
 #include <linux/list.h>
 #include <linux/atomic.h>
+#include <linux/tracepoint.h>
 #include <asm/kvm_asm.h>
 #include <asm/processor.h>
 #include <asm/page.h>
@@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
 	spinlock_t tbacct_lock;
 	u64 busy_stolen;
 	u64 busy_preempt;
+
+	unsigned long *tce_tmp_hpas;	/* TCE cache for TCE_PUT_INDIRECT */
+	enum {
+		TCERM_NONE,
+		TCERM_GETPAGE,
+		TCERM_PUTTCE,
+		TCERM_PUTLIST,
+	} tce_rm_fail;			/* failed stage of request processing */
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..0ce4691 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
-			     unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+		struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_tce_validate(unsigned long tce);
+extern void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce);
+extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce);
+extern long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages);
+extern long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages);
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index b2d3f3b..cae1099 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -36,8 +37,10 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
-#define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      ((void *)~(unsigned long)0x0)
 
 static long kvmppc_stt_npages(unsigned long window_size)
 {
@@ -148,3 +151,130 @@ fail:
 	}
 	return ret;
 }
+
+/* Converts guest physical address to host virtual address */
+static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
+		unsigned long gpa, struct page **pg)
+{
+	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+	struct kvm_memory_slot *memslot;
+	const int is_write = 0;
+
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	hva = __gfn_to_hva_memslot(memslot, gfn) | (gpa & ~PAGE_MASK);
+
+	if (get_user_pages_fast(hva & PAGE_MASK, 1, is_write, pg) != 1)
+		return ERROR_ADDR;
+
+	return (void *) hva;
+}
+
+long kvmppc_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	long ret;
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	if (!tt)
+		return H_TOO_HARD;
+
+	if (ioba >= tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_tce_validate(tce);
+	if (ret)
+		return ret;
+
+	kvmppc_tce_put(tt, ioba, tce);
+
+	return H_SUCCESS;
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret = H_SUCCESS;
+	unsigned long __user *tces;
+	struct page *pg = NULL;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	if (!tt)
+		return H_TOO_HARD;
+
+	/*
+	 * The spec says that the maximum size of the list is 512 TCEs
+	 * so the whole table addressed resides in 4K page
+	 */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	if (tce_list & ~IOMMU_PAGE_MASK)
+		return H_PARAMETER;
+
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	tces = kvmppc_gpa_to_hva_and_get(vcpu, tce_list, &pg);
+	if (tces == ERROR_ADDR)
+		return H_TOO_HARD;
+
+	if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
+		goto put_list_page_exit;
+
+	for (i = 0; i < npages; ++i) {
+		if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
+			ret = H_PARAMETER;
+			goto put_list_page_exit;
+		}
+
+		ret = kvmppc_tce_validate(vcpu->arch.tce_tmp_hpas[i]);
+		if (ret)
+			goto put_list_page_exit;
+	}
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_tce_put(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+				vcpu->arch.tce_tmp_hpas[i]);
+put_list_page_exit:
+	if (pg) {
+		put_page(pg);
+		if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
+			vcpu->arch.tce_rm_fail = TCERM_NONE;
+			/* Finish pending put_page() from realmode */
+			put_page(pg);
+		}
+	}
+
+	return ret;
+}
+
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	if (!tt)
+		return H_TOO_HARD;
+
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_tce_validate(tce_value);
+	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_tce_put(tt, ioba, tce_value);
+
+	return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..2b2ce0a 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -35,42 +36,253 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR      (~(unsigned long)0x0)
 
-/* WARNING: This will be called in real-mode on HV KVM and virtual
+/* Finds a TCE table descriptor by LIOBN.
+ *
+ * WARNING: This will be called in real or virtual mode on HV KVM and virtual
  *          mode on PR KVM
  */
-long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+		unsigned long liobn)
+{
+	struct kvmppc_spapr_tce_table *tt;
+
+	list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+		if (tt->liobn == liobn)
+			return tt;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+/*
+ * Validates TCE address.
+ * At the moment only flags are validated.
+ * As the host kernel does not access those addresses (just puts them
+ * to the table and user space is supposed to process them), we can skip
+ * checking other things (such as TCE is a guest RAM address or the page
+ * was actually allocated).
+ *
+ * WARNING: This will be called in real-mode on HV KVM and virtual
+ *          mode on PR KVM
+ */
+long kvmppc_tce_validate(unsigned long tce)
+{
+	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+		return H_PARAMETER;
+
+	return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_tce_validate);
+
+/* Note on the use of page_address() in real mode,
+ *
+ * It is safe to use page_address() in real mode on ppc64 because
+ * page_address() is always defined as lowmem_page_address()
+ * which returns __va(PFN_PHYS(page_to_pfn(page))) which is an arithmetical
+ * operation and does not access page struct.
+ *
+ * Theoretically page_address() could be defined different
+ * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+ * should be enabled.
+ * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+ * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+ * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+ * is not expected to be enabled on ppc32, page_address()
+ * is safe for ppc32 as well.
+ *
+ * WARNING: This will be called in real-mode on HV KVM and virtual
+ *          mode on PR KVM
+ */
+static u64 *kvmppc_page_address(struct page *page)
+{
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+	return (u64 *) page_address(page);
+}
+
+/*
+ * Handles TCE requests for emulated devices.
+ * Puts guest TCE values to the table and expects user space to convert them.
+ * Called in both real and virtual modes.
+ * Cannot fail so kvmppc_tce_validate must be called before it.
+ *
+ * WARNING: This will be called in real-mode on HV KVM and virtual
+ *          mode on PR KVM
+ */
+void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt,
+		unsigned long ioba, unsigned long tce)
+{
+	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+	struct page *page;
+	u64 *tbl;
+
+	page = tt->pages[idx / TCES_PER_PAGE];
+	tbl = kvmppc_page_address(page);
+
+	tbl[idx % TCES_PER_PAGE] = tce;
+}
+EXPORT_SYMBOL_GPL(kvmppc_tce_put);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+/*
+ * Converts guest physical address to host physical address.
+ * Tries to increase page counter via get_page_unless_zero() and
+ * returns ERROR_ADDR if failed.
+ */
+static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
+		unsigned long gpa, struct page **pg)
+{
+	struct kvm_memory_slot *memslot;
+	pte_t *ptep, pte;
+	unsigned long hva, hpa = ERROR_ADDR;
+	unsigned long gfn = gpa >> PAGE_SHIFT;
+	unsigned shift = 0;
+
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return ERROR_ADDR;
+
+	hva = __gfn_to_hva_memslot(memslot, gfn);
+
+	ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+	if (!ptep || !pte_present(*ptep))
+		return ERROR_ADDR;
+	pte = *ptep;
+
+	if (!shift)
+		shift = PAGE_SHIFT;
+
+	/* Avoid handling anything potentially complicated in realmode */
+	if (shift > PAGE_SHIFT)
+		return ERROR_ADDR;
+
+	if (((gpa & TCE_PCI_WRITE) || pte_write(pte)) && !pte_dirty(pte))
+		return ERROR_ADDR;
+
+	if (!pte_young(pte))
+		return ERROR_ADDR;
+
+	/* Increase page counter */
+	*pg = realmode_pfn_to_page(pte_pfn(pte));
+	if (!*pg || PageCompound(*pg) || !get_page_unless_zero(*pg))
+		return ERROR_ADDR;
+
+	hpa = (pte_pfn(pte) << PAGE_SHIFT) + (gpa & ((1 << shift) - 1));
+
+	/*
+	 * Page has gone since we got pte, safer to put
+	 * the request to virt mode
+	 */
+	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+		hpa = ERROR_ADDR;
+		/* Try drop the page, if failed, let virtmode do that */
+		if (put_page_unless_one(*pg))
+			*pg = NULL;
+	}
+
+	return hpa;
+}
+
+long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvm *kvm = vcpu->kvm;
-	struct kvmppc_spapr_tce_table *stt;
-
-	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
-	/* 	    liobn, ioba, tce); */
-
-	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-		if (stt->liobn == liobn) {
-			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-			struct page *page;
-			u64 *tbl;
-
-			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
-			/* 	    liobn, stt, stt->window_size); */
-			if (ioba >= stt->window_size)
-				return H_PARAMETER;
-
-			page = stt->pages[idx / TCES_PER_PAGE];
-			tbl = (u64 *)page_address(page);
-
-			/* FIXME: Need to validate the TCE itself */
-			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
-			tbl[idx % TCES_PER_PAGE] = tce;
-			return H_SUCCESS;
-		}
+	long ret;
+	struct kvmppc_spapr_tce_table *tt;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	if (!tt)
+		return H_TOO_HARD;
+
+	if (ioba >= tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_tce_validate(tce);
+	if (!ret)
+		kvmppc_tce_put(tt, ioba, tce);
+
+	return ret;
+}
+
+long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list,	unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret = H_SUCCESS;
+	unsigned long tces;
+	struct page *pg = NULL;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	if (!tt)
+		return H_TOO_HARD;
+
+	/*
+	 * The spec says that the maximum size of the list is 512 TCEs
+	 * so the whole table addressed resides in 4K page
+	 */
+	if (npages > 512)
+		return H_PARAMETER;
+
+	if (tce_list & ~IOMMU_PAGE_MASK)
+		return H_PARAMETER;
+
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce_list, &pg);
+	if (tces == ERROR_ADDR) {
+		ret = H_TOO_HARD;
+		goto put_unlock_exit;
 	}
 
-	/* Didn't find the liobn, punt it to userspace */
-	return H_TOO_HARD;
+	for (i = 0; i < npages; ++i) {
+		ret = kvmppc_tce_validate(((unsigned long *)tces)[i]);
+		if (ret)
+			goto put_unlock_exit;
+	}
+
+	for (i = 0; i < npages; ++i)
+		kvmppc_tce_put(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+				((unsigned long *)tces)[i]);
+
+put_unlock_exit:
+	if (!ret && pg && !put_page_unless_one(pg)) {
+		vcpu->arch.tce_rm_fail = TCERM_PUTLIST;
+		ret = H_TOO_HARD;
+	}
+
+	return ret;
+}
+
+long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *tt;
+	long i, ret;
+
+	tt = kvmppc_find_tce_table(vcpu, liobn);
+	if (!tt)
+		return H_TOO_HARD;
+
+	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+		return H_PARAMETER;
+
+	ret = kvmppc_tce_validate(tce_value);
+	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+		return H_PARAMETER;
+
+	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+		kvmppc_tce_put(tt, ioba, tce_value);
+
+	return H_SUCCESS;
 }
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 7629cd3..9e823ad 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -567,7 +567,31 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 		if (kvmppc_xics_enabled(vcpu)) {
 			ret = kvmppc_xics_hcall(vcpu, req);
 			break;
-		} /* fallthrough */
+		}
+		return RESUME_HOST;
+	case H_PUT_TCE:
+		ret = kvmppc_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_PUT_TCE_INDIRECT:
+		ret = kvmppc_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_STUFF_TCE:
+		ret = kvmppc_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+						kvmppc_get_gpr(vcpu, 5),
+						kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
 	default:
 		return RESUME_HOST;
 	}
@@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
 	kvmppc_sanity_check(vcpu);
 
+	/*
+	 * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
+	 * half executed, we first read TCEs from the user, check them and
+	 * return error if something went wrong and only then put TCEs into
+	 * the TCE table.
+	 *
+	 * tce_tmp_hpas is a cache for TCEs to avoid stack allocation or
+	 * kmalloc as the whole TCE list can take up to 512 items 8 bytes
+	 * each (4096 bytes).
+	 */
+	vcpu->arch.tce_tmp_hpas = kmalloc(4096, GFP_KERNEL);
+	if (!vcpu->arch.tce_tmp_hpas)
+		goto free_vcpu;
+
 	return vcpu;
 
 free_vcpu:
@@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
 	unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
 	unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
 	spin_unlock(&vcpu->arch.vpa_update_lock);
+	kfree(vcpu->arch.tce_tmp_hpas);
 	kvm_vcpu_uninit(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, vcpu);
 }
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b02f91e..15942bc 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1416,7 +1416,7 @@ hcall_real_table:
 	.long	0		/* 0x14 - H_CLEAR_REF */
 	.long	.kvmppc_h_protect - hcall_real_table
 	.long	0		/* 0x1c - H_GET_TCE */
-	.long	.kvmppc_h_put_tce - hcall_real_table
+	.long	.kvmppc_rm_h_put_tce - hcall_real_table
 	.long	0		/* 0x24 - H_SET_SPRG0 */
 	.long	.kvmppc_h_set_dabr - hcall_real_table
 	.long	0		/* 0x2c */
@@ -1490,6 +1490,12 @@ hcall_real_table:
 	.long	0		/* 0x11c */
 	.long	0		/* 0x120 */
 	.long	.kvmppc_h_bulk_remove - hcall_real_table
+	.long	0		/* 0x128 */
+	.long	0		/* 0x12c */
+	.long	0		/* 0x130 */
+	.long	0		/* 0x134 */
+	.long	.kvmppc_rm_h_stuff_tce - hcall_real_table
+	.long	.kvmppc_rm_h_put_tce_indirect - hcall_real_table
 hcall_real_table_end:
 
 ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index da0e0bc..6bd0d4a 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -227,6 +227,37 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
 	return EMULATE_DONE;
 }
 
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_h_put_tce_indirect(vcpu, liobn, ioba,
+			tce, npages);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
 static int kvmppc_h_pr_xics_hcall(struct kvm_vcpu *vcpu, u32 cmd)
 {
 	long rc = kvmppc_xics_hcall(vcpu, cmd);
@@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 		return kvmppc_h_pr_bulk_remove(vcpu);
 	case H_PUT_TCE:
 		return kvmppc_h_pr_put_tce(vcpu);
+	case H_PUT_TCE_INDIRECT:
+		return kvmppc_h_pr_put_tce_indirect(vcpu);
+	case H_STUFF_TCE:
+		return kvmppc_h_pr_stuff_tce(vcpu);
 	case H_CEDE:
 		vcpu->arch.shared->msr |= MSR_EE;
 		kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6316ee3..ccb578b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -394,6 +394,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_PPC_GET_SMMU_INFO:
 		r = 1;
 		break;
+	case KVM_CAP_SPAPR_MULTITCE:
+		r = 1;
+		break;
 #endif
 	default:
 		r = 0;
-- 
1.8.4.rc4

^ permalink raw reply related

* [PATCH v9 07/13] KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently
From: Alexey Kardashevskiy @ 2013-08-28  8:37 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Gleb Natapov, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, linux-mm, Paul Mackerras, Paolo Bonzini,
	David Gibson
In-Reply-To: <1377679070-3515-1-git-send-email-aik@ozlabs.ru>

It does not make much sense to have KVM in book3s-64bit and
not to have IOMMU bits for PCI pass through support as it costs little
and allows VFIO to function on book3s-kvm.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdefs we could only accelerate user space emulated devices
(but not VFIO), which does not seem very useful.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index c55c538..3b2b761 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -59,6 +59,7 @@ config KVM_BOOK3S_64
 	depends on PPC_BOOK3S_64
 	select KVM_BOOK3S_64_HANDLER
 	select KVM
+	select SPAPR_TCE_IOMMU
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
-- 
1.8.4.rc4

^ permalink raw reply related

* [PATCH v9 06/13] powerpc: add real mode support for dma operations on powernv
From: Alexey Kardashevskiy @ 2013-08-28  8:37 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Gleb Natapov, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, linux-mm, Paul Mackerras, Paolo Bonzini,
	David Gibson
In-Reply-To: <1377679070-3515-1-git-send-email-aik@ozlabs.ru>

The existing TCE machine calls (tce_build and tce_free) only support
virtual mode as they call __raw_writeq for TCE invalidation, which
fails in real mode.

This introduces tce_build_rm and tce_free_rm real mode versions
which do mostly the same but use the "Store Doubleword Caching Inhibited
Indexed" (stdcix) instruction for TCE invalidation.

This new feature is going to be utilized by real mode support of VFIO.
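
As a rough illustration of the structure described above (one common
implementation taking a mode flag, plus thin virtual-mode and real-mode
wrappers installed as separate callbacks), here is a minimal user-space
sketch. All model_* names are hypothetical and the two counters merely
stand in for the two register-write primitives; none of this is the
kernel code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters standing in for the two write primitives. */
static unsigned int vm_writes;	/* models the __raw_writeq() path */
static unsigned int rm_writes;	/* models the stdcix-based path */

/* Walk the range and "write" one invalidation per step. */
static void model_invalidate(uint64_t start, uint64_t end, uint64_t inc,
			     bool rm)
{
	while (start <= end) {
		if (rm)
			rm_writes++;
		else
			vm_writes++;
		start += inc;
	}
}

/* Common implementation; the flag selects the write primitive. */
static void model_tce_free(uint64_t start, uint64_t end, bool rm)
{
	model_invalidate(start, end, 8, rm);
}

/* Thin wrappers so each callback keeps a flag-free signature. */
static void model_tce_free_vm(uint64_t start, uint64_t end)
{
	model_tce_free(start, end, false);
}

static void model_tce_free_rm(uint64_t start, uint64_t end)
{
	model_tce_free(start, end, true);
}

/* Hypothetical analogue of the callback table. */
struct model_machdep_calls {
	void (*tce_free)(uint64_t start, uint64_t end);
	void (*tce_free_rm)(uint64_t start, uint64_t end);
};

static const struct model_machdep_calls model_md = {
	.tce_free = model_tce_free_vm,
	.tce_free_rm = model_tce_free_rm,
};
```

In this sketch, callers of the existing callback are unaffected by the
extra flag because only the wrappers know about it.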

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v8:
* fixed checkpatch.pl warnings

2013/11/07:
* added comment why stdcix cannot be used in virtual mode

2013/08/07:
* tested on p7ioc and fixed a bug with realmode addresses
---
 arch/powerpc/include/asm/machdep.h        | 12 ++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 49 +++++++++++++++++++++++--------
 arch/powerpc/platforms/powernv/pci.c      | 42 ++++++++++++++++++++++----
 arch/powerpc/platforms/powernv/pci.h      |  3 +-
 4 files changed, 87 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 8b48090..07dd3b1 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -78,6 +78,18 @@ struct machdep_calls {
 				    long index);
 	void		(*tce_flush)(struct iommu_table *tbl);
 
+	/* _rm versions are for real mode use only */
+	int		(*tce_build_rm)(struct iommu_table *tbl,
+				     long index,
+				     long npages,
+				     unsigned long uaddr,
+				     enum dma_data_direction direction,
+				     struct dma_attrs *attrs);
+	void		(*tce_free_rm)(struct iommu_table *tbl,
+				    long index,
+				    long npages);
+	void		(*tce_flush_rm)(struct iommu_table *tbl);
+
 	void __iomem *	(*ioremap)(phys_addr_t addr, unsigned long size,
 				   unsigned long flags, void *caller);
 	void		(*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 756bb58..8cba234 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -70,6 +70,16 @@ define_pe_printk_level(pe_err, KERN_ERR);
 define_pe_printk_level(pe_warn, KERN_WARNING);
 define_pe_printk_level(pe_info, KERN_INFO);
 
+/*
+ * stdcix is only supposed to be used in hypervisor real mode as per
+ * the architecture spec
+ */
+static inline void __raw_rm_writeq(u64 val, volatile void __iomem *paddr)
+{
+	__asm__ __volatile__("stdcix %0,0,%1"
+		: : "r" (val), "r" (paddr) : "memory");
+}
+
 static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
 {
 	unsigned long pe;
@@ -454,10 +464,13 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 	}
 }
 
-static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
-					 u64 *startp, u64 *endp)
+static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
+					 struct iommu_table *tbl,
+					 u64 *startp, u64 *endp, bool rm)
 {
-	u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+	u64 __iomem *invalidate = rm ?
+		(u64 __iomem *)pe->tce_inval_reg_phys :
+		(u64 __iomem *)tbl->it_index;
 	unsigned long start, end, inc;
 
 	start = __pa(startp);
@@ -484,7 +497,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
 
         mb(); /* Ensure above stores are visible */
         while (start <= end) {
-                __raw_writeq(start, invalidate);
+		if (rm)
+			__raw_rm_writeq(start, invalidate);
+		else
+			__raw_writeq(start, invalidate);
                 start += inc;
         }
 
@@ -496,10 +512,12 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
 
 static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 					 struct iommu_table *tbl,
-					 u64 *startp, u64 *endp)
+					 u64 *startp, u64 *endp, bool rm)
 {
 	unsigned long start, end, inc;
-	u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+	u64 __iomem *invalidate = rm ?
+		(u64 __iomem *)pe->tce_inval_reg_phys :
+		(u64 __iomem *)tbl->it_index;
 
 	/* We'll invalidate DMA address in PE scope */
 	start = 0x2ul << 60;
@@ -515,22 +533,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 	mb();
 
 	while (start <= end) {
-		__raw_writeq(start, invalidate);
+		if (rm)
+			__raw_rm_writeq(start, invalidate);
+		else
+			__raw_writeq(start, invalidate);
 		start += inc;
 	}
 }
 
 void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
-				 u64 *startp, u64 *endp)
+				 u64 *startp, u64 *endp, bool rm)
 {
 	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
 					      tce32_table);
 	struct pnv_phb *phb = pe->phb;
 
 	if (phb->type == PNV_PHB_IODA1)
-		pnv_pci_ioda1_tce_invalidate(tbl, startp, endp);
+		pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
 	else
-		pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp);
+		pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
 }
 
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
@@ -603,7 +624,9 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		 * bus number, print that out instead.
 		 */
 		tbl->it_busno = 0;
-		tbl->it_index = (unsigned long)ioremap(be64_to_cpup(swinvp), 8);
+		pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
+		tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
+				8);
 		tbl->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE |
 			       TCE_PCI_SWINV_PAIR;
 	}
@@ -681,7 +704,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		 * bus number, print that out instead.
 		 */
 		tbl->it_busno = 0;
-		tbl->it_index = (unsigned long)ioremap(be64_to_cpup(swinvp), 8);
+		pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
+		tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
+				8);
 		tbl->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE;
 	}
 	iommu_init_table(tbl, phb->hose->node);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index c005011..8623529 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -401,7 +401,7 @@ struct pci_ops pnv_pci_ops = {
 
 static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 			 unsigned long uaddr, enum dma_data_direction direction,
-			 struct dma_attrs *attrs)
+			 struct dma_attrs *attrs, bool rm)
 {
 	u64 proto_tce;
 	u64 *tcep, *tces;
@@ -423,12 +423,22 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 	 * of flags if that becomes the case
 	 */
 	if (tbl->it_type & TCE_PCI_SWINV_CREATE)
-		pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
+		pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
 
 	return 0;
 }
 
-static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
+static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
+			    unsigned long uaddr,
+			    enum dma_data_direction direction,
+			    struct dma_attrs *attrs)
+{
+	return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs,
+			false);
+}
+
+static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
+		bool rm)
 {
 	u64 *tcep, *tces;
 
@@ -438,7 +448,12 @@ static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
 		*(tcep++) = 0;
 
 	if (tbl->it_type & TCE_PCI_SWINV_FREE)
-		pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
+		pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+}
+
+static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
+{
+	pnv_tce_free(tbl, index, npages, false);
 }
 
 static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
@@ -446,6 +461,19 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
 	return ((u64 *)tbl->it_base)[index - tbl->it_offset];
 }
 
+static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
+			    unsigned long uaddr,
+			    enum dma_data_direction direction,
+			    struct dma_attrs *attrs)
+{
+	return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
+}
+
+static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
+{
+	pnv_tce_free(tbl, index, npages, true);
+}
+
 void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 			       void *tce_mem, u64 tce_size,
 			       u64 dma_offset)
@@ -610,8 +638,10 @@ void __init pnv_pci_init(void)
 
 	/* Configure IOMMU DMA hooks */
 	ppc_md.pci_dma_dev_setup = pnv_pci_dma_dev_setup;
-	ppc_md.tce_build = pnv_tce_build;
-	ppc_md.tce_free = pnv_tce_free;
+	ppc_md.tce_build = pnv_tce_build_vm;
+	ppc_md.tce_free = pnv_tce_free_vm;
+	ppc_md.tce_build_rm = pnv_tce_build_rm;
+	ppc_md.tce_free_rm = pnv_tce_free_rm;
 	ppc_md.tce_get = pnv_tce_get;
 	ppc_md.pci_probe_mode = pnv_pci_probe_mode;
 	set_pci_dma_ops(&dma_iommu_ops);
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index d633c64..170dd98 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -52,6 +52,7 @@ struct pnv_ioda_pe {
 	int			tce32_seg;
 	int			tce32_segcount;
 	struct iommu_table	tce32_table;
+	phys_addr_t		tce_inval_reg_phys;
 
 	/* XXX TODO: Add support for additional 64-bit iommus */
 
@@ -193,6 +194,6 @@ extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
 extern void pnv_pci_init_ioda_hub(struct device_node *np);
 extern void pnv_pci_init_ioda2_phb(struct device_node *np);
 extern void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
-					u64 *startp, u64 *endp);
+					u64 *startp, u64 *endp, bool rm);
 
 #endif /* __POWERNV_PCI_H */
-- 
1.8.4.rc4

^ permalink raw reply related

* [PATCH v9 05/13] powerpc: Prepare to support kernel handling of IOMMU map/unmap
From: Alexey Kardashevskiy @ 2013-08-28  8:37 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Gleb Natapov, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, linux-mm, Paul Mackerras, Paolo Bonzini,
	Andrew Morton, David Gibson
In-Reply-To: <1377679070-3515-1-git-send-email-aik@ozlabs.ru>

The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API function realmode_pfn_to_page() to get the page struct
when the MMU is off.
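
The lookup idea can be sketched in user space roughly as follows: walk a
list of backing blocks linearly, find the one covering the virtual
address, refuse objects that straddle the end of a block, and otherwise
return the matching physical address. The struct and function names are
hypothetical, and unlike the patch this sketch checks both block bounds
and reports failure as 0 for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified analogue of a list of backing blocks. */
struct backing_block {
	struct backing_block *next;
	uintptr_t virt_addr;	/* start of the block in virtual space */
	uintptr_t phys;		/* start of the same block in physical space */
};

/*
 * Translate the virtual address of an object of obj_size bytes into a
 * physical address by walking the list linearly. Returns 0 when no block
 * covers the address or when the object is split across two blocks.
 */
static uintptr_t model_virt_to_phys(const struct backing_block *list,
				    uintptr_t va, size_t obj_size,
				    size_t block_size)
{
	const struct backing_block *b;

	for (b = list; b; b = b->next) {
		/* Skip blocks that do not cover this address. */
		if (va < b->virt_addr || va >= b->virt_addr + block_size)
			continue;

		/* The object must lie entirely within one block. */
		if (va + obj_size > b->virt_addr + block_size)
			return 0;

		return b->phys + (va - b->virt_addr);
	}

	return 0;
}
```

A caller that gets the failure value would be expected to give up and
let the request be handled outside real mode.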

This adds to MM a new function put_page_unless_one() which drops a page
reference if the reference count is greater than 1. It is going to be
used when the MMU is off (for example, real mode on PPC64), where we want
to make sure that the page is never actually released in real mode, as
that may crash the kernel.
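
The semantics of the two helpers can be modelled in user space with C11
atomics, as in the sketch below. This is only an illustration under the
assumption that "add unless" means "add a to the counter unless it
currently equals u"; the model_* names are hypothetical and the kernel
uses its own atomic_t primitives, not <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the refcount is modelled. */
struct model_page {
	atomic_int count;
};

/*
 * Add a to *v unless *v == u.
 * Returns true if the addition was performed.
 */
static bool model_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return true;
	}

	return false;
}

/* Take a reference unless the count is already zero. */
static bool model_get_page_unless_zero(struct model_page *page)
{
	return model_add_unless(&page->count, 1, 0);
}

/* Drop a reference unless that would leave the count at zero. */
static bool model_put_page_unless_one(struct model_page *page)
{
	return model_add_unless(&page->count, -1, 1);
}
```

With this shape the last reference is never dropped by the "unless one"
variant, so the final release is always left to some other code path.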

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

Cc: linux-mm@kvack.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

---

Changes:
2013/07/25 (v7):
* removed realmode_put_page and added put_page_unless_one() instead.
The name has been chosen to conform to the already existing
get_page_unless_zero().
* removed realmode_get_page. Instead, get_page_unless_zero() should be used

2013/07/10:
* adjusted comment (removed sentence about virtual mode)
* get_page_unless_zero replaced with atomic_inc_not_zero to minimize
effect of a possible get_page_unless_zero() rework (if it ever happens).

2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If failed,
the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.

2013/05/20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
---
 arch/powerpc/include/asm/pgtable-ppc64.h |  2 ++
 arch/powerpc/mm/init_64.c                | 50 +++++++++++++++++++++++++++++++-
 include/linux/mm.h                       | 14 +++++++++
 include/linux/page-flags.h               |  4 ++-
 4 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 46db094..4a191c4 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -394,6 +394,8 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
 	hpte_slot_array[index] = hidx << 4 | 0x1 << 3;
 }
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+
 static inline char *get_hpte_slot_array(pmd_t *pmdp)
 {
 	/*
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index d0cd9e4..8cf345a 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -300,5 +300,53 @@ void vmemmap_free(unsigned long start, unsigned long end)
 {
 }
 
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fallback to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * realmode_pfn_to_page functions can fail due to:
+ * 1) As real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct vmemmap_backing *vmem_back;
+	struct page *page;
+	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+	unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
 
+	for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+		if (pg_va < vmem_back->virt_addr)
+			continue;
+
+		/* Check that page struct is not split between real pages */
+		if ((pg_va + sizeof(struct page)) >
+				(vmem_back->virt_addr + page_size))
+			return NULL;
+
+		page = (struct page *) (vmem_back->phys + pg_va -
+				vmem_back->virt_addr);
+		return page;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f022460..dcc99b5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -290,12 +290,26 @@ static inline int put_page_testzero(struct page *page)
 /*
  * Try to grab a ref unless the page has a refcount of zero, return false if
  * that is the case.
+ * This can be called when MMU is off so it must not access
+ * any of the virtual mappings.
  */
 static inline int get_page_unless_zero(struct page *page)
 {
 	return atomic_inc_not_zero(&page->_count);
 }
 
+/*
+ * Try to drop a ref unless the page has a refcount of one, return false if
+ * that is the case.
+ * This is to make sure that the refcount won't become zero after this drop.
+ * This can be called when MMU is off so it must not access
+ * any of the virtual mappings.
+ */
+static inline int put_page_unless_one(struct page *page)
+{
+	return atomic_add_unless(&page->_count, -1, 1);
+}
+
 extern int page_is_ram(unsigned long pfn);
 
 /* Support for virtually mapped pages */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..98ada58 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -329,7 +329,9 @@ static inline void set_page_writeback(struct page *page)
  * System with lots of page flags available. This allows separate
  * flags for PageHead() and PageTail() checks of compound pages so that bit
  * tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths.
+ * generally not used in hot code paths except arch/powerpc/mm/init_64.c
+ * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
+ * and avoid handling those in real mode.
  */
 __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
 __PAGEFLAG(Tail, tail)
-- 
1.8.4.rc4

^ permalink raw reply related

* [PATCH v9 04/13] KVM: PPC: reserve a capability and KVM device type for realmode VFIO
From: Alexey Kardashevskiy @ 2013-08-28  8:37 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Gleb Natapov, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, linux-mm, Paul Mackerras, Paolo Bonzini,
	David Gibson
In-Reply-To: <1377679070-3515-1-git-send-email-aik@ozlabs.ru>

This reserves a capability number for upcoming support
of VFIO-IOMMU DMA operations in real mode.

This reserves a number for a new "SPAPR TCE IOMMU" KVM device
which is going to manage the lifetime of the SPAPR TCE IOMMU object.

This defines an attribute of the "SPAPR TCE IOMMU" KVM device
which is going to be used for initialization.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

---
Changes:
v9:
* KVM ioctl is replaced with "SPAPR TCE IOMMU" KVM device type with
KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute

2013/08/15:
* fixed typo in comments
* fixed commit message which says what uses ioctls 0xad and 0xae

2013/07/16:
* changed the number

2013/07/11:
* changed order in a file, added comment about a gap in ioctl number
---
 arch/powerpc/include/uapi/asm/kvm.h | 8 ++++++++
 include/uapi/linux/kvm.h            | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..c1ae1e5 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -511,4 +511,12 @@ struct kvm_get_htab_header {
 #define  KVM_XICS_MASKED		(1ULL << 41)
 #define  KVM_XICS_PENDING		(1ULL << 42)
 
+/* SPAPR TCE IOMMU device specification */
+struct kvm_create_spapr_tce_iommu_linkage {
+	__u64 liobn;
+	__u32 fd;
+	__u32 flags;
+};
+#define KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE	0
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 99c2533..9d20630 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_ARM_EL1_32BIT 93
 #define KVM_CAP_SPAPR_MULTITCE 94
+#define KVM_CAP_SPAPR_TCE_IOMMU 95
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -843,6 +844,7 @@ struct kvm_device_attr {
 #define KVM_DEV_TYPE_FSL_MPIC_20	1
 #define KVM_DEV_TYPE_FSL_MPIC_42	2
 #define KVM_DEV_TYPE_XICS		3
+#define KVM_DEV_TYPE_SPAPR_TCE_IOMMU	4
 
 /*
  * ioctls for VM fds
-- 
1.8.4.rc4

^ permalink raw reply related

* [PATCH v9 03/13] KVM: PPC: reserve a capability number for multitce support
From: Alexey Kardashevskiy @ 2013-08-28  8:37 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Gleb Natapov, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, linux-mm, Paul Mackerras, Paolo Bonzini,
	David Gibson
In-Reply-To: <1377679070-3515-1-git-send-email-aik@ozlabs.ru>

This is to reserve a capability number for upcoming support
of H_PUT_TCE_INDIRECT and H_STUFF_TCE pseries hypercalls
which support multiple DMA map/unmap operations per call.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
2013/07/16:
* changed the number
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index acccd08..99c2533 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_ARM_EL1_32BIT 93
+#define KVM_CAP_SPAPR_MULTITCE 94
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.8.4.rc4

^ permalink raw reply related

* [PATCH v9 02/13] hashtable: add hash_for_each_possible_rcu_notrace()
From: Alexey Kardashevskiy @ 2013-08-28  8:37 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Gleb Natapov, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, linux-mm, Paul Mackerras, Paolo Bonzini,
	David Gibson
In-Reply-To: <1377679070-3515-1-git-send-email-aik@ozlabs.ru>

This adds hash_for_each_possible_rcu_notrace() which is basically
a notrace clone of hash_for_each_possible_rcu() which cannot be
used in real mode due to its tracing/debugging capability.
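
For readers unfamiliar with the "for each possible" iteration style, the
following user-space sketch shows the general shape: hash the key to one
bucket and walk only that bucket's chain, leaving the key comparison to
the loop body. It contains no RCU at all, and every model_* name and the
trivial hash function are hypothetical.

```c
#include <stddef.h>

#define MODEL_HASH_BITS 4
#define MODEL_HASH_SIZE (1u << MODEL_HASH_BITS)

struct model_node {
	struct model_node *next;
	unsigned long key;
	int value;
};

struct model_table {
	struct model_node *buckets[MODEL_HASH_SIZE];
};

/* Trivial hash for illustration: the low bits of the key. */
static unsigned int model_hash(unsigned long key)
{
	return (unsigned int)(key & (MODEL_HASH_SIZE - 1));
}

/* Visit every node in the one bucket that key hashes to. */
#define model_for_each_possible(tbl, pos, key) \
	for ((pos) = (tbl)->buckets[model_hash(key)]; (pos); \
	     (pos) = (pos)->next)

/* Insert a node at the head of its bucket. */
static void model_add(struct model_table *tbl, struct model_node *node)
{
	unsigned int b = model_hash(node->key);

	node->next = tbl->buckets[b];
	tbl->buckets[b] = node;
}

/* Different keys may share a bucket, so the body compares keys. */
static struct model_node *model_find(struct model_table *tbl,
				     unsigned long key)
{
	struct model_node *pos;

	model_for_each_possible(tbl, pos, key)
		if (pos->key == key)
			return pos;

	return NULL;
}
```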

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

---
Changes:
v8:
* fixed warnings from checkpatch.pl
---
 include/linux/hashtable.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index a9df51f..519b6e2 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -174,6 +174,21 @@ static inline void hash_del_rcu(struct hlist_node *node)
 		member)
 
 /**
+ * hash_for_each_possible_rcu_notrace - iterate over all possible objects hashing
+ * to the same bucket in an rcu enabled hashtable
+ * @name: hashtable to iterate
+ * @obj: the type * to use as a loop cursor for each entry
+ * @member: the name of the hlist_node within the struct
+ * @key: the key of the objects to iterate over
+ *
+ * This is the same as hash_for_each_possible_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define hash_for_each_possible_rcu_notrace(name, obj, member, key) \
+	hlist_for_each_entry_rcu_notrace(obj, \
+		&name[hash_min(key, HASH_BITS(name))], member)
+
+/**
  * hash_for_each_possible_safe - iterate over all possible objects hashing to the
  * same bucket safe against removals
  * @name: hashtable to iterate
-- 
1.8.4.rc4

^ permalink raw reply related

* [PATCH v9 01/13] KVM: PPC: POWERNV: move iommu_add_device earlier
From: Alexey Kardashevskiy @ 2013-08-28  8:37 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Gleb Natapov, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, linux-mm, Paul Mackerras, Paolo Bonzini,
	David Gibson
In-Reply-To: <1377679070-3515-1-git-send-email-aik@ozlabs.ru>

The current implementation of IOMMU on sPAPR does not use iommu_ops
and therefore does not call IOMMU API's bus_set_iommu() which
1) sets iommu_ops for a bus
2) registers a bus notifier
Instead, PCI devices are added to IOMMU groups from
subsys_initcall_sync(tce_iommu_init) which does basically the same
thing without using iommu_ops callbacks.

However, the Freescale PAMU driver (https://lkml.org/lkml/2013/7/1/158)
implements iommu_ops, and by the time tce_iommu_init is called every PCI
device has already been added to some group, so there is a conflict.

This patch does 2 things:
1. removes the loop in which PCI devices were added to groups and
adds explicit iommu_add_device() calls to add devices as soon as they get
the iommu_table pointer assigned to them.
2. moves a bus notifier to powernv code in order to avoid conflict with
the notifier from Freescale driver.

iommu_add_device() and iommu_del_device() are public now.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v8:
* added the check for iommu_group!=NULL before removing device from a group
as suggested by Wei Yang <weiyang@linux.vnet.ibm.com>

v2:
* added a helper - set_iommu_table_base_and_group - which does
set_iommu_table_base() and iommu_add_device()
---
 arch/powerpc/include/asm/iommu.h            |  9 +++++++
 arch/powerpc/kernel/iommu.c                 | 41 +++--------------------------
 arch/powerpc/platforms/powernv/pci-ioda.c   |  8 +++---
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  2 +-
 arch/powerpc/platforms/powernv/pci.c        | 33 ++++++++++++++++++++++-
 arch/powerpc/platforms/pseries/iommu.c      |  8 +++---
 6 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index c34656a..19ad77f 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -103,6 +103,15 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 					    int nid);
 extern void iommu_register_group(struct iommu_table *tbl,
 				 int pci_domain_number, unsigned long pe_num);
+extern int iommu_add_device(struct device *dev);
+extern void iommu_del_device(struct device *dev);
+
+static inline void set_iommu_table_base_and_group(struct device *dev,
+						  void *base)
+{
+	set_iommu_table_base(dev, base);
+	iommu_add_device(dev);
+}
 
 extern int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 			struct scatterlist *sglist, int nelems,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b20ff17..15f8ca8 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1105,7 +1105,7 @@ void iommu_release_ownership(struct iommu_table *tbl)
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
-static int iommu_add_device(struct device *dev)
+int iommu_add_device(struct device *dev)
 {
 	struct iommu_table *tbl;
 	int ret = 0;
@@ -1134,46 +1134,13 @@ static int iommu_add_device(struct device *dev)
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(iommu_add_device);
 
-static void iommu_del_device(struct device *dev)
+void iommu_del_device(struct device *dev)
 {
 	iommu_group_remove_device(dev);
 }
-
-static int iommu_bus_notifier(struct notifier_block *nb,
-			      unsigned long action, void *data)
-{
-	struct device *dev = data;
-
-	switch (action) {
-	case BUS_NOTIFY_ADD_DEVICE:
-		return iommu_add_device(dev);
-	case BUS_NOTIFY_DEL_DEVICE:
-		iommu_del_device(dev);
-		return 0;
-	default:
-		return 0;
-	}
-}
-
-static struct notifier_block tce_iommu_bus_nb = {
-	.notifier_call = iommu_bus_notifier,
-};
-
-static int __init tce_iommu_init(void)
-{
-	struct pci_dev *pdev = NULL;
-
-	BUILD_BUG_ON(PAGE_SIZE < IOMMU_PAGE_SIZE);
-
-	for_each_pci_dev(pdev)
-		iommu_add_device(&pdev->dev);
-
-	bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
-	return 0;
-}
-
-subsys_initcall_sync(tce_iommu_init);
+EXPORT_SYMBOL_GPL(iommu_del_device);
 
 #else
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index d8140b1..756bb58 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -440,7 +440,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 		return;
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
-	set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
 }
 
 static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
@@ -448,7 +448,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 	struct pci_dev *dev;
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
-		set_iommu_table_base(&dev->dev, &pe->tce32_table);
+		set_iommu_table_base_and_group(&dev->dev, &pe->tce32_table);
 		if (dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate);
 	}
@@ -611,7 +611,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	iommu_register_group(tbl, pci_domain_nr(pe->pbus), pe->pe_number);
 
 	if (pe->pdev)
-		set_iommu_table_base(&pe->pdev->dev, tbl);
+		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
 	else
 		pnv_ioda_setup_bus_dma(pe, pe->pbus);
 
@@ -687,7 +687,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	iommu_init_table(tbl, phb->hose->node);
 
 	if (pe->pdev)
-		set_iommu_table_base(&pe->pdev->dev, tbl);
+		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
 	else
 		pnv_ioda_setup_bus_dma(pe, pe->pbus);
 
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index b68db63..ede341b 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -92,7 +92,7 @@ static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
 				pci_domain_nr(phb->hose->bus), phb->opal_id);
 	}
 
-	set_iommu_table_base(&pdev->dev, &phb->p5ioc2.iommu_table);
+	set_iommu_table_base_and_group(&pdev->dev, &phb->p5ioc2.iommu_table);
 }
 
 static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index a28d3b5..c005011 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -504,7 +504,7 @@ static void pnv_pci_dma_fallback_setup(struct pci_controller *hose,
 		pdn->iommu_table = pnv_pci_setup_bml_iommu(hose);
 	if (!pdn->iommu_table)
 		return;
-	set_iommu_table_base(&pdev->dev, pdn->iommu_table);
+	set_iommu_table_base_and_group(&pdev->dev, pdn->iommu_table);
 }
 
 static void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
@@ -623,3 +623,34 @@ void __init pnv_pci_init(void)
 	ppc_md.teardown_msi_irqs = pnv_teardown_msi_irqs;
 #endif
 }
+
+static int tce_iommu_bus_notifier(struct notifier_block *nb,
+		unsigned long action, void *data)
+{
+	struct device *dev = data;
+
+	switch (action) {
+	case BUS_NOTIFY_ADD_DEVICE:
+		return iommu_add_device(dev);
+	case BUS_NOTIFY_DEL_DEVICE:
+		if (dev->iommu_group)
+			iommu_del_device(dev);
+		return 0;
+	default:
+		return 0;
+	}
+}
+
+static struct notifier_block tce_iommu_bus_nb = {
+	.notifier_call = tce_iommu_bus_notifier,
+};
+
+static int __init tce_iommu_bus_notifier_init(void)
+{
+	BUILD_BUG_ON(PAGE_SIZE < IOMMU_PAGE_SIZE);
+
+	bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
+	return 0;
+}
+
+subsys_initcall_sync(tce_iommu_bus_notifier_init);
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 23fc1dc..884ae71 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -687,7 +687,8 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 		iommu_table_setparms(phb, dn, tbl);
 		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
 		iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
-		set_iommu_table_base(&dev->dev, PCI_DN(dn)->iommu_table);
+		set_iommu_table_base_and_group(&dev->dev,
+					       PCI_DN(dn)->iommu_table);
 		return;
 	}
 
@@ -699,7 +700,8 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 		dn = dn->parent;
 
 	if (dn && PCI_DN(dn))
-		set_iommu_table_base(&dev->dev, PCI_DN(dn)->iommu_table);
+		set_iommu_table_base_and_group(&dev->dev,
+					       PCI_DN(dn)->iommu_table);
 	else
 		printk(KERN_WARNING "iommu: Device %s has no iommu table\n",
 		       pci_name(dev));
@@ -1193,7 +1195,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
 		pr_debug("  found DMA window, table: %p\n", pci->iommu_table);
 	}
 
-	set_iommu_table_base(&dev->dev, pci->iommu_table);
+	set_iommu_table_base_and_group(&dev->dev, pci->iommu_table);
 }
 
 static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
-- 
1.8.4.rc4

^ permalink raw reply related

* [PATCH v9 00/13] KVM: PPC: IOMMU in-kernel handling of VFIO
From: Alexey Kardashevskiy @ 2013-08-28  8:37 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Gleb Natapov, Alexey Kardashevskiy, Alexander Graf, kvm-ppc,
	linux-kernel, linux-mm, Alex Williamson, Paul Mackerras,
	Paolo Bonzini, David Gibson

This accelerates VFIO DMA operations on POWER by moving them
into the kernel.

This depends on the VFIO external API patch, which is on its way upstream.

Changes:
v9:
* replaced the "link logical bus number to IOMMU group" ioctl to KVM
with a KVM device doing the same thing, i.e. the actual changes are in
these 3 patches:
  KVM: PPC: reserve a capability and KVM device type for realmode VFIO
  KVM: PPC: remove warning from kvmppc_core_destroy_vm
  KVM: PPC: Add support for IOMMU in-kernel handling

* moved some VFIO external API bits to a separate patch to reduce the size
of the "KVM: PPC: Add support for IOMMU in-kernel handling" patch

* fixed code style problems reported by checkpatch.pl.

v8:
* fixed comments about capabilities numbers

v7:
* rebased on v3.11-rc3.
* VFIO external user API will go through VFIO tree so it is
excluded from this series.
* As nobody ever reacted to "hashtable: add hash_for_each_possible_rcu_notrace()",
Ben suggested pushing it via his tree, so I included it in the series.
* realmode_(get|put)_page is reworked.

More details in the individual patch comments.

Alexey Kardashevskiy (13):
  KVM: PPC: POWERNV: move iommu_add_device earlier
  hashtable: add hash_for_each_possible_rcu_notrace()
  KVM: PPC: reserve a capability number for multitce support
  KVM: PPC: reserve a capability and KVM device type for realmode VFIO
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  powerpc: add real mode support for dma operations on powernv
  KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc/iommu: rework to support realmode
  KVM: PPC: remove warning from kvmppc_core_destroy_vm
  KVM: PPC: add trampolines for VFIO external API
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt                  |  26 +
 .../virtual/kvm/devices/spapr_tce_iommu.txt        |  37 ++
 arch/powerpc/include/asm/iommu.h                   |  18 +-
 arch/powerpc/include/asm/kvm_host.h                |  38 ++
 arch/powerpc/include/asm/kvm_ppc.h                 |  16 +-
 arch/powerpc/include/asm/machdep.h                 |  12 +
 arch/powerpc/include/asm/pgtable-ppc64.h           |   2 +
 arch/powerpc/include/uapi/asm/kvm.h                |   8 +
 arch/powerpc/kernel/iommu.c                        | 243 +++++----
 arch/powerpc/kvm/Kconfig                           |   1 +
 arch/powerpc/kvm/book3s_64_vio.c                   | 597 ++++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c                | 408 +++++++++++++-
 arch/powerpc/kvm/book3s_hv.c                       |  42 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S            |   8 +-
 arch/powerpc/kvm/book3s_pr_papr.c                  |  35 ++
 arch/powerpc/kvm/powerpc.c                         |   4 +
 arch/powerpc/mm/init_64.c                          |  50 +-
 arch/powerpc/platforms/powernv/pci-ioda.c          |  57 +-
 arch/powerpc/platforms/powernv/pci-p5ioc2.c        |   2 +-
 arch/powerpc/platforms/powernv/pci.c               |  75 ++-
 arch/powerpc/platforms/powernv/pci.h               |   3 +-
 arch/powerpc/platforms/pseries/iommu.c             |   8 +-
 include/linux/hashtable.h                          |  15 +
 include/linux/kvm_host.h                           |   1 +
 include/linux/mm.h                                 |  14 +
 include/linux/page-flags.h                         |   4 +-
 include/uapi/linux/kvm.h                           |   3 +
 virt/kvm/kvm_main.c                                |   5 +
 28 files changed, 1564 insertions(+), 168 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt

-- 
1.8.4.rc4

^ permalink raw reply

* Re: [PATCH v8 1/3] DMA: Freescale: revise device tree binding document
From: Hongbo Zhang @ 2013-08-28  8:18 UTC (permalink / raw)
  To: Mark Rutland
  Cc: devicetree@vger.kernel.org, ian.campbell@citrix.com, Pawel Moll,
	swarren@wwwdotorg.org, vinod.koul@intel.com,
	linux-kernel@vger.kernel.org, rob.herring@calxeda.com,
	djbw@fb.com, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20130827112509.GH19893@e106331-lin.cambridge.arm.com>

On 08/27/2013 07:25 PM, Mark Rutland wrote:
> On Tue, Aug 27, 2013 at 11:42:01AM +0100, hongbo.zhang@freescale.com wrote:
>> From: Hongbo Zhang <hongbo.zhang@freescale.com>
>>
>> This patch updates the description of each type of DMA controller and its
>> channels, it is preparation for adding another new DMA controller binding, it
>> also fixes some defects of indent for text alignment at the same time.
>>
>> Signed-off-by: Hongbo Zhang <hongbo.zhang@freescale.com>
>> ---
>>   .../devicetree/bindings/powerpc/fsl/dma.txt        |   62 +++++++++-----------
>>   1 file changed, 27 insertions(+), 35 deletions(-)
>>
>> diff --git a/Documentation/devicetree/bindings/powerpc/fsl/dma.txt b/Documentation/devicetree/bindings/powerpc/fsl/dma.txt
>> index 2a4b4bc..ddf17af 100644
>> --- a/Documentation/devicetree/bindings/powerpc/fsl/dma.txt
>> +++ b/Documentation/devicetree/bindings/powerpc/fsl/dma.txt
>> @@ -1,33 +1,29 @@
>> -* Freescale 83xx DMA Controller
>> +* Freescale DMA Controllers
>>   
>> -Freescale PowerPC 83xx have on chip general purpose DMA controllers.
>> +** Freescale Elo DMA Controller
>> +   This is a little-endian DMA controller, used in Freescale mpc83xx series
>> +   chips such as mpc8315, mpc8349, mpc8379 etc.
>>   
>>   Required properties:
>>   
>> -- compatible        : compatible list, contains 2 entries, first is
>> -		 "fsl,CHIP-dma", where CHIP is the processor
>> -		 (mpc8349, mpc8360, etc.) and the second is
>> -		 "fsl,elo-dma"
>> -- reg               : <registers mapping for DMA general status reg>
>> -- ranges		: Should be defined as specified in 1) to describe the
>> -		  DMA controller channels.
>> +- compatible        : must include "fsl,elo-dma"
> We should list the other values that may be in the list also, unless
> they are really of no consequence, in which case their presence in dt is
> questionable.
Hmm.  Stephen raised the same question, so this seems to be the default rule.
Although Scott@freescale has explained our thinking, I'd like to reword
this item as follows:

"must include "fsl,eloplus-dma", and a "fsl,CHIP-dma" is optional, where 
CHIP is the processor name"

We don't list all the chip names because there are tens of them. Listing
them is also unnecessary because the new driver does not even use
"fsl,CHIP-dma"; it is mentioned here only so that its presence in the
example and in old dts files does not look questionable.

I removed the examples in brackets, "(mpc8349, mpc8360, etc.)", because
a real example appears below.
I don't state that, if "fsl,CHIP-dma" is present, it should come first
and "fsl,eloplus-dma" second, because that is the common rule.
I also think the description should be clear and concise.
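
For illustration only, the "must include" wording can be read as a
membership test on the compatible list. The following is a minimal
sketch in standalone C, not the kernel's matching code:
compat_list_contains() is a hypothetical helper, and it assumes the
property is stored as NUL-terminated strings packed back to back,
ordered from most specific to most generic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not a kernel API: report whether a flattened
 * "compatible" property of `len` bytes contains the string `want`.
 */
static bool compat_list_contains(const char *prop, size_t len, const char *want)
{
	size_t off = 0;

	assert(prop != NULL && want != NULL);

	while (off < len) {
		/* Find the terminator without reading past the property. */
		const char *end = memchr(prop + off, '\0', len - off);

		if (end == NULL)
			return false;	/* malformed: entry not terminated */
		if (strcmp(prop + off, want) == 0)
			return true;
		off = (size_t)(end - prop) + 1;	/* skip string and its NUL */
	}
	return false;
}
```

With a list such as "fsl,mpc8540-dma", "fsl,eloplus-dma" (the chip name
is only an example), a lookup for "fsl,eloplus-dma" succeeds whether or
not the chip-specific entry is present, which is why the binding text
only needs to require the generic string.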
>> +- reg               : <registers specifier for DMA general status reg>
>> +- ranges            : describes the mapping between the address space of the
>> +                      DMA channels and the address space of the DMA controller
>>   - cell-index        : controller index.  0 for controller @ 0x8100
>> -- interrupts        : <interrupt mapping for DMA IRQ>
>> +- interrupts        : <interrupt specifier for DMA IRQ>
>>   - interrupt-parent  : optional, if needed for interrupt mapping
>>   
>> -
>>   - DMA channel nodes:
>> -        - compatible        : compatible list, contains 2 entries, first is
>> -			 "fsl,CHIP-dma-channel", where CHIP is the processor
>> -			 (mpc8349, mpc8350, etc.) and the second is
>> -			 "fsl,elo-dma-channel". However, see note below.
>> -        - reg               : <registers mapping for channel>
>> +        - compatible        : must include "fsl,elo-dma-channel"
>> +                              However, see note below.
> Again, I think we should list the other entries that may be in the list.
> Otherwise it's not clear what the binding defines. Similarly for the
> other compatible list definitions below...
>
>> +        - reg               : <registers specifier for channel>
>>           - cell-index        : dma channel index starts at 0.
> I realise you haven't changed it, but it's unclear what the cell-index
> property is (and somewhat confusingly there seem to be multiple
> definitions). It might be worth clarifying it while performing the other
> cleanup.
I'm not sure what you mean by "multiple definitions"; we really do have
multiple DMA channels for one DMA controller.
cell-index is used as the channel index. This is an old method used by
the old driver, and my patch doesn't touch this part.
>>   
>>   Optional properties:
>> -        - interrupts        : <interrupt mapping for DMA channel IRQ>
>> -			  (on 83xx this is expected to be identical to
>> -			   the interrupts property of the parent node)
>> +        - interrupts        : <interrupt specifier for DMA channel IRQ>
>> +                              (on 83xx this is expected to be identical to
>> +                              the interrupts property of the parent node)
>>           - interrupt-parent  : optional, if needed for interrupt mapping
>>   
>>   Example:
>> @@ -70,30 +66,26 @@ Example:
>>   		};
>>   	};
>>   
>> -* Freescale 85xx/86xx DMA Controller
>> -
>> -Freescale PowerPC 85xx/86xx have on chip general purpose DMA controllers.
>> +** Freescale EloPlus DMA Controller
>> +   This is DMA controller with extended addresses and chaining, mainly used in
>> +   Freescale mpc85xx/86xx, Pxxx and BSC series chips, such as mpc8540, mpc8641
>> +   p4080, bsc9131 etc.
>>   
>>   Required properties:
>>   
>> -- compatible        : compatible list, contains 2 entries, first is
>> -		 "fsl,CHIP-dma", where CHIP is the processor
>> -		 (mpc8540, mpc8540, etc.) and the second is
>> -		 "fsl,eloplus-dma"
>> -- reg               : <registers mapping for DMA general status reg>
>> +- compatible        : must include "fsl,eloplus-dma"
>> +- reg               : <registers specifier for DMA general status reg>
>>   - cell-index        : controller index.  0 for controller @ 0x21000,
>>                                            1 for controller @ 0xc000
>> -- ranges		: Should be defined as specified in 1) to describe the
>> -		  DMA controller channels.
>> +- ranges            : describes the mapping between the address space of the
>> +                      DMA channels and the address space of the DMA controller
>>   
>>   - DMA channel nodes:
>> -        - compatible        : compatible list, contains 2 entries, first is
>> -			 "fsl,CHIP-dma-channel", where CHIP is the processor
>> -			 (mpc8540, mpc8560, etc.) and the second is
>> -			 "fsl,eloplus-dma-channel". However, see note below.
>> +        - compatible        : must include "fsl,eloplus-dma-channel"
>> +                              However, see note below.
>>           - cell-index        : dma channel index starts at 0.
>> -        - reg               : <registers mapping for channel>
>> -        - interrupts        : <interrupt mapping for DMA channel IRQ>
>> +        - reg               : <registers specifier for channel>
>> +        - interrupts        : <interrupt specifier for DMA channel IRQ>
>>           - interrupt-parent  : optional, if needed for interrupt mapping
>>   
>>   Example:
>> -- 
>> 1.7.9.5
> Thanks,
> Mark.
>

^ permalink raw reply

* Re: [PATCH v8 2/3] DMA: Freescale: Add new 8-channel DMA engine device tree nodes
From: Hongbo Zhang @ 2013-08-28  6:54 UTC (permalink / raw)
  To: Mark Rutland
  Cc: devicetree@vger.kernel.org, ian.campbell@citrix.com, Pawel Moll,
	swarren@wwwdotorg.org, vinod.koul@intel.com,
	linux-kernel@vger.kernel.org, rob.herring@calxeda.com,
	djbw@fb.com, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20130827113534.GI19893@e106331-lin.cambridge.arm.com>

On 08/27/2013 07:35 PM, Mark Rutland wrote:
> On Tue, Aug 27, 2013 at 11:42:02AM +0100, hongbo.zhang@freescale.com wrote:
>> From: Hongbo Zhang <hongbo.zhang@freescale.com>
>>
>> Freescale QorIQ T4 and B4 introduce new 8-channel DMA engines, this patch adds
>> the device tree nodes for them.
>>
>> Signed-off-by: Hongbo Zhang <hongbo.zhang@freescale.com>
>> ---
>>   .../devicetree/bindings/powerpc/fsl/dma.txt        |   66 ++++++++++++++++
>>   arch/powerpc/boot/dts/fsl/b4si-post.dtsi           |    4 +-
>>   arch/powerpc/boot/dts/fsl/elo3-dma-0.dtsi          |   81 ++++++++++++++++++++
>>   arch/powerpc/boot/dts/fsl/elo3-dma-1.dtsi          |   81 ++++++++++++++++++++
>>   arch/powerpc/boot/dts/fsl/t4240si-post.dtsi        |    4 +-
>>   5 files changed, 232 insertions(+), 4 deletions(-)
>>   create mode 100644 arch/powerpc/boot/dts/fsl/elo3-dma-0.dtsi
>>   create mode 100644 arch/powerpc/boot/dts/fsl/elo3-dma-1.dtsi
>>
>> diff --git a/Documentation/devicetree/bindings/powerpc/fsl/dma.txt b/Documentation/devicetree/bindings/powerpc/fsl/dma.txt
>> index ddf17af..10fd031 100644
>> --- a/Documentation/devicetree/bindings/powerpc/fsl/dma.txt
>> +++ b/Documentation/devicetree/bindings/powerpc/fsl/dma.txt
>> @@ -126,6 +126,72 @@ Example:
>>                  };
>>          };
>>
>> +** Freescale Elo3 DMA Controller
>> +   This is EloPlus controller with 8 channels, used in Freescale Txxx and Bxxx
>> +   series chips, such as t1040, t4240, b4860.
>> +
>> +Required properties:
>> +
>> +- compatible        : must include "fsl,elo3-dma"
>> +- reg               : <registers specifier for DMA general status reg>
>> +- ranges            : describes the mapping between the address space of the
>> +                      DMA channels and the address space of the DMA controller
>> +
>> +- DMA channel nodes:
>> +        - compatible        : must include "fsl,eloplus-dma-channel"
>> +        - reg               : <registers specifier for channel>
>> +        - interrupts        : <interrupt specifier for DMA channel IRQ>
>> +        - interrupt-parent  : optional, if needed for interrupt mapping
>> +
>> +Example:
>> +dma@100300 {
>> +       #address-cells = <1>;
>> +       #size-cells = <1>;
>> +       compatible = "fsl,elo3-dma";
>> +       reg = <0x100300 0x4 0x100600 0x4>;
> Is that one reg entry where #size-cells=2 and #address-cells=2?
>
> That's what the binding implies (given it only describes a single reg
> entry).
>
> if it's two entries, we should make that explicit (both in the binding
> and example):
>
> 	reg = <0x100300 0x4>,
> 	      <0x100600 0x4>;
Yes, they are two entries; I will change it this way.
>> +       ranges = <0x0 0x100100 0x500>;
> If it is one reg entry then the example ranges property isn't big enough
> to contain the parent-bus-address.
They are two reg entries, so the range is big enough.
>
>> +       dma-channel@0 {
>> +               compatible = "fsl,eloplus-dma-channel";
>> +               reg = <0x0 0x80>;
>> +               interrupts = <28 2 0 0>;
>> +       };
>> +       dma-channel@80 {
>> +               compatible = "fsl,eloplus-dma-channel";
>> +               reg = <0x80 0x80>;
>> +               interrupts = <29 2 0 0>;
>> +       };
>> +       dma-channel@100 {
>> +               compatible = "fsl,eloplus-dma-channel";
>> +               reg = <0x100 0x80>;
>> +               interrupts = <30 2 0 0>;
>> +       };
>> +       dma-channel@180 {
>> +               compatible = "fsl,eloplus-dma-channel";
>> +               reg = <0x180 0x80>;
>> +               interrupts = <31 2 0 0>;
>> +       };
>> +       dma-channel@300 {
>> +               compatible = "fsl,eloplus-dma-channel";
>> +               reg = <0x300 0x80>;
>> +               interrupts = <76 2 0 0>;
>> +       };
>> +       dma-channel@380 {
>> +               compatible = "fsl,eloplus-dma-channel";
>> +               reg = <0x380 0x80>;
>> +               interrupts = <77 2 0 0>;
>> +       };
>> +       dma-channel@400 {
>> +               compatible = "fsl,eloplus-dma-channel";
>> +               reg = <0x400 0x80>;
>> +               interrupts = <78 2 0 0>;
>> +       };
>> +       dma-channel@480 {
>> +               compatible = "fsl,eloplus-dma-channel";
>> +               reg = <0x480 0x80>;
>> +               interrupts = <79 2 0 0>;
>> +       };
>> +};
>> +
>>   Note on DMA channel compatible properties: The compatible property must say
>>   "fsl,elo-dma-channel" or "fsl,eloplus-dma-channel" to be used by the Elo DMA
>>   driver (fsldma).  Any DMA channel used by fsldma cannot be used by another
>> diff --git a/arch/powerpc/boot/dts/fsl/b4si-post.dtsi b/arch/powerpc/boot/dts/fsl/b4si-post.dtsi
>> index 7399154..ea53ea1 100644
>> --- a/arch/powerpc/boot/dts/fsl/b4si-post.dtsi
>> +++ b/arch/powerpc/boot/dts/fsl/b4si-post.dtsi
>> @@ -223,13 +223,13 @@
>>                  reg = <0xe2000 0x1000>;
>>          };
>>
>> -/include/ "qoriq-dma-0.dtsi"
>> +/include/ "elo3-dma-0.dtsi"
>>          dma@100300 {
>>                  fsl,iommu-parent = <&pamu0>;
>>                  fsl,liodn-reg = <&guts 0x580>; /* DMA1LIODNR */
>>          };
>>
>> -/include/ "qoriq-dma-1.dtsi"
>> +/include/ "elo3-dma-1.dtsi"
>>          dma@101300 {
>>                  fsl,iommu-parent = <&pamu0>;
>>                  fsl,liodn-reg = <&guts 0x584>; /* DMA2LIODNR */
>> diff --git a/arch/powerpc/boot/dts/fsl/elo3-dma-0.dtsi b/arch/powerpc/boot/dts/fsl/elo3-dma-0.dtsi
>> new file mode 100644
>> index 0000000..69a3277
>> --- /dev/null
>> +++ b/arch/powerpc/boot/dts/fsl/elo3-dma-0.dtsi
>> @@ -0,0 +1,81 @@
>> +/*
>> + * QorIQ DMA device tree stub [ controller @ offset 0x100000 ]
> Copy-pasted?
>
> Presumably should be "Elo3 DMA devicetree stub", or similar?
>
> Similarly for elo3-dma-1.dtsi.
Yes, it was copy-pasted, but QorIQ isn't wrong: it is the name of a
Freescale chip series.
To be more specific, I'd like to use "QorIQ Elo3 DMA devicetree stub".
> Thanks,
> Mark.
>

^ permalink raw reply

* Re: [PATCH v8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
From: Gleb Natapov @ 2013-08-28  6:38 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, linux-kernel,
	Paul Mackerras, linuxppc-dev, David Gibson
In-Reply-To: <1377653191.3819.146.camel@pasglop>

On Wed, Aug 28, 2013 at 11:26:31AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2013-08-28 at 10:51 +1000, Alexey Kardashevskiy wrote:
> > The ioctl I made up is basically a copy of KVM_CREATE_SPAPR_TCE which does
> > the same thing for emulated devices and it is there for quite a while but
> > it is not really extensible. And these two ioctls share some bits of code.
> > Now we will have 2 pieces of code which do almost the same thing but in a
> > different way. Kinda sucks :(
> 
> Right. Thus the question, Gleb, we can either:
> 
>  - Keep Alexey patch as-is allowing us to *finally* merge that stuff
> that's been around for months
> 
>  - Convert *both* existing TCE objects to the new 
> KVM_CREATE_DEVICE, and have some backward compat code for the old one.
> 
> I don't think it makes sense to have the "emulated TCE" and "IOMMU TCE"
> objects use a fundamentally different API and infrastructure.
> 
As a general rule we are not going to mandate converting old devices to
the new API, but if it makes sense to do so here I would much prefer that
over adding another special ioctl.

> > >> So my stuff is not going to upstream again. Heh. Ok. I'll implement it.
> > >>
> > > Thanks! Should I keep KVM_CAP_SPAPR_MULTITCE capability patch or can I
> > > drop it for now?
> > 
> > Please keep it, it is unrelated to the IOMMU-VFIO thing.
> 

--
			Gleb.

^ permalink raw reply

* RE: [PATCH v2 2/3] powerpc/85xx: add hardware automatically enter altivec idle state
From: Wang Dongsheng-B40534 @ 2013-08-28  6:08 UTC (permalink / raw)
  To: Wang Dongsheng-B40534, Wood Scott-B07421,
	galak@kernel.crashing.org
  Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <1377592900-5020-2-git-send-email-dongsheng.wang@freescale.com>



> -----Original Message-----
> From: Wang Dongsheng-B40534
> Sent: Tuesday, August 27, 2013 4:42 PM
> To: Wood Scott-B07421; galak@kernel.crashing.org
> Cc: linuxppc-dev@lists.ozlabs.org; Wang Dongsheng-B40534
> Subject: [PATCH v2 2/3] powerpc/85xx: add hardware automatically enter
> altivec idle state
>
> From: Wang Dongsheng <dongsheng.wang@freescale.com>
>
> Each core's AltiVec unit may be placed into a power savings mode
> by turning off power to the unit. Core hardware will automatically
> power down the AltiVec unit after no AltiVec instructions have
> executed in N cycles. The AltiVec power-control is triggered by hardware.
>
> Signed-off-by: Wang Dongsheng <dongsheng.wang@freescale.com>
> ---
> *v2:
> Remove:
> delete setup_idle_hw_governor function.
> delete "Fix erratum" for rev1.
>
> Move:
> move setup_* into __setup/restore_cpu_e6500.
>
> diff --git a/arch/powerpc/include/asm/reg_booke.h
> b/arch/powerpc/include/asm/reg_booke.h
> index 86ede76..8364bbe 100644
> --- a/arch/powerpc/include/asm/reg_booke.h
> +++ b/arch/powerpc/include/asm/reg_booke.h
> @@ -217,6 +217,9 @@
>  #define	CCR1_DPC	0x00000100 /* Disable L1 I-Cache/D-Cache parity
> checking */
>  #define	CCR1_TCS	0x00000080 /* Timer Clock Select */
>
> +/* Bit definitions for PWRMGTCR0. */
> +#define PWRMGTCR0_ALTIVEC_IDLE	(1 << 22) /* Altivec idle enable */
> +
>  /* Bit definitions for the MCSR. */
>  #define MCSR_MCS	0x80000000 /* Machine Check Summary */
>  #define MCSR_IB		0x40000000 /* Instruction PLB Error */
> diff --git a/arch/powerpc/kernel/cpu_setup_fsl_booke.S
> b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
> index bfb18c7..90bbb46 100644
> --- a/arch/powerpc/kernel/cpu_setup_fsl_booke.S
> +++ b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
> @@ -58,6 +58,7 @@ _GLOBAL(__setup_cpu_e6500)
>  #ifdef CONFIG_PPC64
>  	bl	.setup_altivec_ivors
>  #endif
> +	bl	setup_altivec_idle
>  	bl	__setup_cpu_e5500
>  	mtlr	r6
>  	blr
> @@ -119,6 +120,7 @@ _GLOBAL(__setup_cpu_e5500)
>  _GLOBAL(__restore_cpu_e6500)
>  	mflr	r5
>  	bl	.setup_altivec_ivors
> +	bl	setup_altivec_idle
>  	bl	__restore_cpu_e5500
>  	mtlr	r5
>  	blr
> diff --git a/arch/powerpc/platforms/85xx/common.c
> b/arch/powerpc/platforms/85xx/common.c
> index d0861a0..93b563b 100644
> --- a/arch/powerpc/platforms/85xx/common.c
> +++ b/arch/powerpc/platforms/85xx/common.c
> @@ -11,6 +11,16 @@
>
>  #include "mpc85xx.h"
>
> +#define MAX_BIT				64
> +
This should be changed to 63; I will fix this in the next patch.

- dongsheng
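
The rest of the patch is not quoted above, so how MAX_BIT is used is
not shown here. Assuming it denotes the highest bit index of a 64-bit
value, the sketch below shows why 63 is the right bound: the bits are
numbered 0 to 63, so 64 is never a valid index. highest_set_bit() is a
hypothetical helper written for this illustration in standalone C.

```c
#include <assert.h>
#include <stdint.h>

/* Highest valid bit index in a 64-bit value (bits are numbered 0..63). */
#define MAX_BIT 63

/*
 * Hypothetical helper: index of the most significant set bit of v.
 * v must be non-zero; the result is always in the range 0..MAX_BIT.
 */
static unsigned int highest_set_bit(uint64_t v)
{
	unsigned int bit = 0;

	assert(v != 0);
	while (v >>= 1)
		bit++;
	return bit;
}
```

Even for an all-ones input the result is MAX_BIT, and shifting a 64-bit
value by 64 would be undefined behaviour in C, which is another reason
not to use 64 as a bit number.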

^ permalink raw reply
