LinuxPPC-Dev Archive on lore.kernel.org

* Re: [PATCH] tty/powerpc: fix build break with ehv_bytechan.c on allyesconfig
From: Timur Tabi @ 2011-08-25 18:02 UTC
  To: Greg KH; +Cc: sfr, linux-next, linux-kernel, linuxppc-dev
In-Reply-To: <20110825163234.GA31629@kroah.com>

Greg KH wrote:
> tested doesn't mean that it shouldn't still build properly for other
> platforms, right?

The problem is the dependency on MSR_GS, which is defined only for Book-E
PowerPC chips, not all PowerPC.

So I gave it some more thought, and technically ePAPR extends beyond Book-E, so
it's wrong for the driver to depend on anything specific to Book-E.  I've
removed the code that breaks:

	/* Check if we're running as a guest of a hypervisor */
	if (!(mfmsr() & MSR_GS))
		return;

> What is keeping the driver from building on all PPC, or even all arches
> today?

I've made a few changes, and it builds on all PPC now.  I'll post a new patch.

-- 
Timur Tabi
Linux kernel developer at Freescale

* Re: kvm PCI assignment & VFIO ramblings
From: Alex Williamson @ 2011-08-25 17:20 UTC
  To: Roedel, Joerg
  Cc: chrisw, Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	qemu-devel, Aaron Fabbri, iommu, Avi Kivity, Anthony Liguori,
	linux-pci@vger.kernel.org, linuxppc-dev, benve@cisco.com
In-Reply-To: <20110825105402.GB1923@amd.com>

On Thu, 2011-08-25 at 12:54 +0200, Roedel, Joerg wrote:
> Hi Alex,
> 
> On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:
> > Is this roughly what you're thinking of for the iommu_group component?
> > Adding a dev_to_group iommu ops callback lets us consolidate the sysfs
> > support in the iommu base.  Would AMD-Vi do something similar (or
> > exactly the same) for group #s?  Thanks,
> 
> The concept looks good, I have some comments, though. On AMD-Vi the
> implementation would look a bit different because there is a
> data-structure where the information can be gathered from, so no need for
> PCI bus scanning there.
> 
> > diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
> > index 6e6b6a1..6b54c1a 100644
> > --- a/drivers/base/iommu.c
> > +++ b/drivers/base/iommu.c
> > @@ -17,20 +17,56 @@
> >   */
> >  
> >  #include <linux/bug.h>
> > +#include <linux/device.h>
> >  #include <linux/types.h>
> >  #include <linux/module.h>
> >  #include <linux/slab.h>
> >  #include <linux/errno.h>
> >  #include <linux/iommu.h>
> > +#include <linux/pci.h>
> >  
> >  static struct iommu_ops *iommu_ops;
> >  
> > +static ssize_t show_iommu_group(struct device *dev,
> > +				struct device_attribute *attr, char *buf)
> > +{
> > +	return sprintf(buf, "%lx", iommu_dev_to_group(dev));
> 
> Probably add a 0x prefix so userspace knows the format?

I think I'll probably change it to %u.  Seems common to have decimal in
sysfs and doesn't get confusing if we cat it with a string.  As a bonus,
it abstracts that vt-d is just stuffing a PCI device address in there,
which nobody should ever rely on.
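
A rough userspace sketch of the decimal-format idea follows. It is not the sysfs code from the patch: the helper name format_group(), the bounded snprintf() and the trailing newline are assumptions made for illustration, whereas the quoted hunk uses sprintf(buf, "%lx", ...).

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: format an IOMMU group number the way a sysfs
 * "show" callback might, as plain decimal followed by a newline.
 * Returns the number of characters written (excluding the NUL), or -1
 * if the buffer is too small to hold the whole string.
 */
int format_group(char *buf, size_t len, unsigned int group)
{
	int n = snprintf(buf, len, "%u\n", group);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

Printing the value as an opaque decimal number also avoids hinting that the bits encode a PCI address.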

> > +}
> > +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
> > +
> > +static int add_iommu_group(struct device *dev, void *unused)
> > +{
> > +	if (iommu_dev_to_group(dev) >= 0)
> > +		return device_create_file(dev, &dev_attr_iommu_group);
> > +
> > +	return 0;
> > +}
> > +
> > +static int device_notifier(struct notifier_block *nb,
> > +			   unsigned long action, void *data)
> > +{
> > +	struct device *dev = data;
> > +
> > +	if (action == BUS_NOTIFY_ADD_DEVICE)
> > +		return add_iommu_group(dev, NULL);
> > +
> > +	return 0;
> > +}
> > +
> > +static struct notifier_block device_nb = {
> > +	.notifier_call = device_notifier,
> > +};
> > +
> >  void register_iommu(struct iommu_ops *ops)
> >  {
> >  	if (iommu_ops)
> >  		BUG();
> >  
> >  	iommu_ops = ops;
> > +
> > +	/* FIXME - non-PCI, really want for_each_bus() */
> > +	bus_register_notifier(&pci_bus_type, &device_nb);
> > +	bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
> >  }
> 
> We need to solve this differently. ARM is starting to use the iommu-api
> too and this definitely does not work there. One possible solution might
> be to make the iommu-ops per-bus.

That sounds good.  Is anyone working on it?  It seems like it doesn't
hurt to use this in the interim, we may just be watching the wrong bus
and never add any sysfs group info.

> >  bool iommu_found(void)
> > @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
> >  
> > +long iommu_dev_to_group(struct device *dev)
> > +{
> > +	if (iommu_ops->dev_to_group)
> > +		return iommu_ops->dev_to_group(dev);
> > +	return -ENODEV;
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_dev_to_group);
> 
> Please rename this to iommu_device_group(). The dev_to_group name
> suggests a conversion but it is actually just a property of the device.

Ok.

> Also the return type should not be long but something that fits into
> 32bit on all platforms. Since you use -ENODEV, probably s32 is a good
> choice.

The convenience of using seg|bus|dev|fn was too much to resist, too bad
it requires a full 32bits.  Maybe I'll change it to:
        int iommu_device_group(struct device *dev, unsigned int *group)
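
The trade-off can be sketched in plain userspace C. Once the group ID needs all 32 bits, a signed return value can no longer carry both the ID and a negative errno, so the status and the ID travel separately. struct demo_dev, pack_group() and demo_device_group() are hypothetical stand-ins, and the shift-based packing is a substitute for the union in the quoted patch (on a little-endian machine the union yields the same seg|bus|devfn ordering).

```c
#include <errno.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for the PCI address of a device. */
struct demo_dev {
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
	int behind_iommu;	/* 0 means no IOMMU translates this device */
};

/* Pack seg|bus|devfn into 32 bits with shifts, independent of byte order. */
uint32_t pack_group(uint16_t segment, uint8_t bus, uint8_t devfn)
{
	return ((uint32_t)segment << 16) | ((uint32_t)bus << 8) | devfn;
}

/*
 * Return 0 and store the group ID in *group on success, or a negative
 * errno on failure.  The ID may use all 32 bits because it never shares
 * a return value with the error code.
 */
int demo_device_group(const struct demo_dev *dev, unsigned int *group)
{
	if (dev == NULL || group == NULL)
		return -EINVAL;
	if (!dev->behind_iommu)
		return -ENODEV;

	*group = pack_group(dev->segment, dev->bus, dev->devfn);
	return 0;
}
```

With a segment of 0x8000 the packed value has its top bit set, which is the case the "seg # >= 0x8000 on 32b" FIXME in the quoted patch is about: returned through a 32-bit signed type it would look like an error.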

> > +
> >  int iommu_map(struct iommu_domain *domain, unsigned long iova,
> >  	      phys_addr_t paddr, int gfp_order, int prot)
> >  {
> > diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
> > index f02c34d..477259c 100644
> > --- a/drivers/pci/intel-iommu.c
> > +++ b/drivers/pci/intel-iommu.c
> > @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
> >  static int dmar_forcedac;
> >  static int intel_iommu_strict;
> >  static int intel_iommu_superpage = 1;
> > +static int intel_iommu_no_mf_groups;
> >  
> >  #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
> >  static DEFINE_SPINLOCK(device_domain_lock);
> > @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
> >  			printk(KERN_INFO
> >  				"Intel-IOMMU: disable supported super page\n");
> >  			intel_iommu_superpage = 0;
> > +		} else if (!strncmp(str, "no_mf_groups", 12)) {
> > +			printk(KERN_INFO
> > +				"Intel-IOMMU: disable separate groups for multifunction devices\n");
> > +			intel_iommu_no_mf_groups = 1;
> 
> This should really be a global iommu option and not be VT-d specific.

You think?  It's meaningless on benh's power systems.
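
For readers following the quoted hunk, the effect of no_mf_groups on the group ID can be sketched in userspace. The DEMO_* macros are local copies written for this example that mirror the usual 5-bit slot / 3-bit function devfn encoding, and group_devfn() is a hypothetical helper, not a function from the patch.

```c
#include <stdint.h>

/* devfn layout: bits 7..3 = slot (device), bits 2..0 = function. */
#define DEMO_PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define DEMO_PCI_SLOT(devfn)       (((devfn) >> 3) & 0x1f)
#define DEMO_PCI_FUNC(devfn)       ((devfn) & 0x07)

/*
 * Hypothetical helper: pick the devfn that goes into the group ID.
 * With no_mf_groups set, every function of a multifunction device
 * collapses onto function 0, so they all share one group.  Virtual
 * functions are left alone and keep their own group.
 */
uint8_t group_devfn(uint8_t devfn, int is_virtfn, int no_mf_groups)
{
	if (!is_virtfn && no_mf_groups)
		return (uint8_t)DEMO_PCI_DEVFN(DEMO_PCI_SLOT(devfn), 0);
	return devfn;
}
```

Slot 3 function 2 and slot 3 function 0 therefore map to the same value only when the option is set and the device is not a virtual function.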

> >  
> >  		str += strcspn(str, ",");
> > @@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
> >  	return 0;
> >  }
> >  
> > +/* Group numbers are arbitrary.  Devices with the same group number
> > + * indicate the iommu cannot differentiate between them.  To avoid
> > + * tracking used groups we just use the seg|bus|devfn of the lowest
> > + * level we're able to differentiate devices */
> > +static long intel_iommu_dev_to_group(struct device *dev)
> > +{
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +	struct pci_dev *bridge;
> > +	union {
> > +		struct {
> > +			u8 devfn;
> > +			u8 bus;
> > +			u16 segment;
> > +		} pci;
> > +		u32 group;
> > +	} id;
> > +
> > +	if (iommu_no_mapping(dev))
> > +		return -ENODEV;
> > +
> > +	id.pci.segment = pci_domain_nr(pdev->bus);
> > +	id.pci.bus = pdev->bus->number;
> > +	id.pci.devfn = pdev->devfn;
> > +
> > +	if (!device_to_iommu(id.pci.segment, id.pci.bus, id.pci.devfn))
> > +		return -ENODEV;
> > +
> > +	bridge = pci_find_upstream_pcie_bridge(pdev);
> > +	if (bridge) {
> > +		if (pci_is_pcie(bridge)) {
> > +			id.pci.bus = bridge->subordinate->number;
> > +			id.pci.devfn = 0;
> > +		} else {
> > +			id.pci.bus = bridge->bus->number;
> > +			id.pci.devfn = bridge->devfn;
> > +		}
> > +	}
> > +
> > +	/* Virtual functions always get their own group */
> > +	if (!pdev->is_virtfn && intel_iommu_no_mf_groups)
> > +		id.pci.devfn = PCI_DEVFN(PCI_SLOT(id.pci.devfn), 0);
> > +
> > +	/* FIXME - seg # >= 0x8000 on 32b */
> > +	return id.group;
> > +}
> 
> This looks like code duplication in the VT-d driver. It doesn't need to
> be generalized now, but we should keep in mind to do a more general
> solution later.
> Maybe it is beneficial if the IOMMU drivers only setup the number in
> dev->arch.iommu.groupid and the iommu-api fetches it from there then.
> But as I said, this is some more work and does not need to be done for
> this patch(-set).

The iommu-api reaches into dev->arch.iommu.groupid?  I figured we should
at least start out with a lightweight, optional interface without the
overhead of predefining groupids setup by bus notification callbacks in
each iommu driver.  Thanks,

Alex

> 
> > +
> >  static struct iommu_ops intel_iommu_ops = {
> >  	.domain_init	= intel_iommu_domain_init,
> >  	.domain_destroy = intel_iommu_domain_destroy,
> > @@ -3911,6 +3962,7 @@ static struct iommu_ops intel_iommu_ops = {
> >  	.unmap		= intel_iommu_unmap,
> >  	.iova_to_phys	= intel_iommu_iova_to_phys,
> >  	.domain_has_cap = intel_iommu_domain_has_cap,
> > +	.dev_to_group	= intel_iommu_dev_to_group,
> >  };
> >  
> >  static void __devinit quirk_iommu_rwbf(struct pci_dev *dev)
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 0a2ba40..90c1a86 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -45,6 +45,7 @@ struct iommu_ops {
> >  				    unsigned long iova);
> >  	int (*domain_has_cap)(struct iommu_domain *domain,
> >  			      unsigned long cap);
> > +	long (*dev_to_group)(struct device *dev);
> >  };
> >  
> >  #ifdef CONFIG_IOMMU_API
> > @@ -65,6 +66,7 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
> >  				      unsigned long iova);
> >  extern int iommu_domain_has_cap(struct iommu_domain *domain,
> >  				unsigned long cap);
> > +extern long iommu_dev_to_group(struct device *dev);
> >  
> >  #else /* CONFIG_IOMMU_API */
> >  
> > @@ -121,6 +123,10 @@ static inline int domain_has_cap(struct iommu_domain *domain,
> >  	return 0;
> >  }
> >  
> > +static inline long iommu_dev_to_group(struct device *dev)
> > +{
> > +	return -ENODEV;
> > +}
> >  #endif /* CONFIG_IOMMU_API */
> >  
> >  #endif /* __LINUX_IOMMU_H */
> > 
> > 
> > 
> 

* Re: kvm PCI assignment & VFIO ramblings
From: Roedel, Joerg @ 2011-08-25 16:46 UTC
  To: Don Dutile
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	qemu-devel, iommu, chrisw, Alex Williamson, Avi Kivity,
	Anthony Liguori, linux-pci@vger.kernel.org, linuxppc-dev,
	benve@cisco.com
In-Reply-To: <4E566C61.9060105@redhat.com>

On Thu, Aug 25, 2011 at 11:38:09AM -0400, Don Dutile wrote:

> On 08/25/2011 06:54 AM, Roedel, Joerg wrote:
> > We need to solve this differently. ARM is starting to use the iommu-api
> > too and this definitely does not work there. One possible solution might
> > be to make the iommu-ops per-bus.
> >
> When you think of a system where there isn't just one bus-type
> with iommu support, it makes more sense.
> Additionally, it also allows the long-term architecture to use different types
> of IOMMUs on each bus segment -- think per-PCIe-switch/bridge IOMMUs --
> esp. 'tuned' IOMMUs -- ones better geared for networks, ones better geared
> for direct-attach disk hba's.

Not sure how likely it is to have different types of IOMMUs within a
given bus-type. But if they become reality we can multiplex in the
iommu-api without much hassle :)
For now, something like bus_set_iommu() or bus_register_iommu() would
provide a nice way to do bus-specific setups for a given iommu
implementation.
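
A userspace sketch of what such a per-bus registration could look like. Everything here is hypothetical: struct pb_bus, struct pb_iommu_ops and pb_bus_set_iommu() only illustrate hanging the ops pointer off the bus type instead of a single global iommu_ops; the thread does not define the real interface.

```c
#include <errno.h>
#include <stddef.h>

struct pb_dev;

/* Hypothetical ops table holding only the callback discussed here. */
struct pb_iommu_ops {
	int (*device_group)(struct pb_dev *dev, unsigned int *group);
};

/* Hypothetical bus type that carries its own IOMMU ops pointer. */
struct pb_bus {
	const char *name;
	const struct pb_iommu_ops *iommu_ops;
};

struct pb_dev {
	struct pb_bus *bus;
	unsigned int id;
};

/* Attach ops to one bus type; refuse to replace an existing set. */
int pb_bus_set_iommu(struct pb_bus *bus, const struct pb_iommu_ops *ops)
{
	if (bus == NULL || ops == NULL)
		return -EINVAL;
	if (bus->iommu_ops != NULL)
		return -EBUSY;
	bus->iommu_ops = ops;
	return 0;
}

/* Dispatch through the device's bus rather than a single global. */
int pb_device_group(struct pb_dev *dev, unsigned int *group)
{
	const struct pb_iommu_ops *ops = dev->bus->iommu_ops;

	if (ops == NULL || ops->device_group == NULL)
		return -ENODEV;
	return ops->device_group(dev, group);
}

/* Example backend: every device is its own group. */
int pb_identity_group(struct pb_dev *dev, unsigned int *group)
{
	*group = dev->id;
	return 0;
}
```

In this model a device on a bus without registered ops simply reports -ENODEV, and two bus types could carry different ops tables without touching each other.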

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

* Re: [PATCH] tty/powerpc: fix build break with ehv_bytechan.c on allyesconfig
From: Greg KH @ 2011-08-25 16:32 UTC
  To: Timur Tabi; +Cc: sfr, linux-next, linux-kernel, linuxppc-dev
In-Reply-To: <1314289245-14946-1-git-send-email-timur@freescale.com>

On Thu, Aug 25, 2011 at 11:20:45AM -0500, Timur Tabi wrote:
> The Kconfig for the ePAPR hypervisor byte channel driver has a "depends on PPC",
> which means it would compile on all PowerPC platforms, even though it's
> only been tested on Freescale platforms.  Change the Kconfig to depend on
> FSL_SOC instead.

tested doesn't mean that it shouldn't still build properly for other
platforms, right?

What is keeping the driver from building on all PPC, or even all arches
today?

greg k-h

* [PATCH] tty/powerpc: fix build break with ehv_bytechan.c on allyesconfig
From: Timur Tabi @ 2011-08-25 16:20 UTC
  To: greg, sfr, linux-next, linux-kernel, linuxppc-dev

The Kconfig for the ePAPR hypervisor byte channel driver has a "depends on PPC",
which means it would compile on all PowerPC platforms, even though it's
only been tested on Freescale platforms.  Change the Kconfig to depend on
FSL_SOC instead.

Signed-off-by: Timur Tabi <timur@freescale.com>
---
 drivers/tty/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/tty/Kconfig b/drivers/tty/Kconfig
index f1ea59b..535af0a 100644
--- a/drivers/tty/Kconfig
+++ b/drivers/tty/Kconfig
@@ -353,7 +353,7 @@ config TRACE_SINK
 
 config PPC_EPAPR_HV_BYTECHAN
 	tristate "ePAPR hypervisor byte channel driver"
-	depends on PPC
+	depends on FSL_SOC
 	help
 	  This driver creates /dev entries for each ePAPR hypervisor byte
 	  channel, thereby allowing applications to communicate with byte
-- 
1.7.3.4

* Re: linux-next: build failure after merge of the final tree (tty tree related)
From: Arnaud Lacombe @ 2011-08-25 16:09 UTC
  To: Stephen Rothwell; +Cc: Greg KH, linux-next, ppc-dev, Timur Tabi, linux-kernel
In-Reply-To: <20110826015111.49af16f792d5554fd931d230@canb.auug.org.au>

Hi,

On Thu, Aug 25, 2011 at 11:51 AM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> Hi Timur,
>
> On Thu, 25 Aug 2011 10:22:05 -0500 Timur Tabi <timur@freescale.com> wrote:
>>
>> Is there some trick to building allyesconfig on PowerPC?  When I do try that, I
>> get all sorts of weird build errors, and it dies long before it gets to my
>> driver.  I get stuff like:
>>
>>   LD      arch/powerpc/sysdev/xics/built-in.o
>> WARNING: arch/powerpc/sysdev/xics/built-in.o(.text+0x1310): Section mismatch in
>> reference from the function .icp_native_init() to the function
>> .init.text:.icp_native_init_one_node()
>> The function .icp_native_init() references
>> the function __init .icp_native_init_one_node().
>> This is often because .icp_native_init lacks a __init
>> annotation or the annotation of .icp_native_init_one_node is wrong.
>
> We get lots of those in many builds. :-(  Just a warning.
>
If you could provide an exhaustive list of them, I'd be interested. Do
you account/reference them in the report you make on each new -next
tree ?

 - Arnaud

>> and
>>
>>   AS      arch/powerpc/kernel/head_64.o
>> arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
>> arch/powerpc/kernel/exceptions-64s.S:1151: Error: attempt to move .org backwards
>> arch/powerpc/kernel/exceptions-64s.S:1160: Error: attempt to move .org backwards
>
> There is a patch for that pending with either the kvm guys or the powerpc guys.
>
>> I guess I don't have the right compiler.
>
> Yours seems to be OK.  If you pass -k to make it will get further.  Or
> you could configure it and then just try building your driver rather than
> the whole tree.
>
>> Anyway, I think I know how to fix the break that Stephen is seeing.  I will post
>> a v4 patch in a few minutes.
>
> Thanks.
> --
> Cheers,
> Stephen Rothwell                    sfr@canb.auug.org.au
> http://www.canb.auug.org.au/~sfr/
>

* [PATCH] xics/icp_natives: add __init to marker icp_native_init()
From: Arnaud Lacombe @ 2011-08-25 16:07 UTC
  To: linuxppc-dev; +Cc: timur, Arnaud Lacombe, linux-kernel

This should fix the following warning:

 LD      arch/powerpc/sysdev/xics/built-in.o
WARNING: arch/powerpc/sysdev/xics/built-in.o(.text+0x1310): Section mismatch in
reference from the function .icp_native_init() to the function
.init.text:.icp_native_init_one_node()
The function .icp_native_init() references
the function __init .icp_native_init_one_node().
This is often because .icp_native_init lacks a __init
annotation or the annotation of .icp_native_init_one_node is wrong.

icp_native_init() is only referenced in `arch/powerpc/sysdev/xics/xics-common.c'
by xics_init() which is itself marked with __init.
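
The reasoning can be illustrated with a small userspace analogue. The DEMO_INIT macro and both function names are invented for this sketch. It places functions in a separate ELF section much as the kernel's __init places them in .init.text, but nothing here is discarded at run time the way the kernel's init memory is.

```c
/*
 * Userspace analogue of the kernel's __init annotation (assumption:
 * a GCC-compatible compiler targeting ELF; elsewhere the macro is empty).
 */
#if defined(__GNUC__) && defined(__ELF__)
#define DEMO_INIT __attribute__((section(".demo.init.text")))
#else
#define DEMO_INIT
#endif

/* Helper that lives in the init section. */
DEMO_INIT int demo_init_one_node(int index)
{
	return index + 1;
}

/*
 * The caller is only used during initialisation too, so it carries the
 * same annotation.  Without it, a function in ordinary .text would
 * reference code in the init section, which is the situation the
 * "Section mismatch" warning reports.
 */
DEMO_INIT int demo_native_init(void)
{
	int nodes = 0;
	int i;

	for (i = 0; i < 3; i++)
		nodes = demo_init_one_node(nodes);
	return nodes;
}
```

In the kernel the concern is that init sections are freed after boot, so a non-init function that can still be called later must not reference code placed there.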

= not built-tested =

Reported-by: Timur Tabi <timur@freescale.com>
Signed-off-by: Arnaud Lacombe <lacombar@gmail.com>
---
 arch/powerpc/sysdev/xics/icp-native.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/sysdev/xics/icp-native.c b/arch/powerpc/sysdev/xics/icp-native.c
index 50e32af..4c79b6f 100644
--- a/arch/powerpc/sysdev/xics/icp-native.c
+++ b/arch/powerpc/sysdev/xics/icp-native.c
@@ -276,7 +276,7 @@ static const struct icp_ops icp_native_ops = {
 #endif
 };

-int icp_native_init(void)
+int __init icp_native_init(void)
 {
 	struct device_node *np;
 	u32 indx = 0;
-- 
1.7.6.153.g78432

* Re: [PATCH] [v4] tty/powerpc: introduce the ePAPR embedded hypervisor byte channel driver
From: Timur Tabi @ 2011-08-25 16:03 UTC
  To: Greg KH; +Cc: sfr, linux-next, linux-kernel, linuxppc-dev
In-Reply-To: <20110825155050.GA10084@kroah.com>

Greg KH wrote:
> No, this doesn't work, I need just a fix, as I took your previous patch
> already.

Sorry, coming right up.

-- 
Timur Tabi
Linux kernel developer at Freescale

* Re: [PATCH] [v4] tty/powerpc: introduce the ePAPR embedded hypervisor byte channel driver
From: Greg KH @ 2011-08-25 15:50 UTC
  To: Timur Tabi; +Cc: sfr, linux-next, linux-kernel, linuxppc-dev
In-Reply-To: <1314286345-27056-1-git-send-email-timur@freescale.com>

On Thu, Aug 25, 2011 at 10:32:25AM -0500, Timur Tabi wrote:
> The ePAPR embedded hypervisor specification provides an API for "byte
> channels", which are serial-like virtual devices for sending and receiving
> streams of bytes.  This driver provides Linux kernel support for byte
> channels via three distinct interfaces:
> 
> 1) An early-console (udbg) driver.  This provides early console output
> through a byte channel.  The byte channel handle must be specified in a
> Kconfig option.
> 
> 2) A normal console driver.  Output is sent to the byte channel designated
> for stdout in the device tree.  The console driver is for handling kernel
> printk calls.
> 
> 3) A tty driver, which is used to handle user-space input and output.  The
> byte channel used for the console is designated as the default tty.
> 
> Signed-off-by: Timur Tabi <timur@freescale.com>

No, this doesn't work, I need just a fix, as I took your previous patch
already.

greg k-h

* Re: linux-next: build failure after merge of the final tree (tty tree related)
From: Stephen Rothwell @ 2011-08-25 15:51 UTC
  To: Timur Tabi; +Cc: Greg KH, linux-next, ppc-dev, linux-kernel
In-Reply-To: <4E56689D.3080202@freescale.com>

Hi Timur,

On Thu, 25 Aug 2011 10:22:05 -0500 Timur Tabi <timur@freescale.com> wrote:
>
> Is there some trick to building allyesconfig on PowerPC?  When I do try that, I
> get all sorts of weird build errors, and it dies long before it gets to my
> driver.  I get stuff like:
> 
>   LD      arch/powerpc/sysdev/xics/built-in.o
> WARNING: arch/powerpc/sysdev/xics/built-in.o(.text+0x1310): Section mismatch in
> reference from the function .icp_native_init() to the function
> .init.text:.icp_native_init_one_node()
> The function .icp_native_init() references
> the function __init .icp_native_init_one_node().
> This is often because .icp_native_init lacks a __init
> annotation or the annotation of .icp_native_init_one_node is wrong.

We get lots of those in many builds. :-(  Just a warning.

> and
> 
>   AS      arch/powerpc/kernel/head_64.o
> arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
> arch/powerpc/kernel/exceptions-64s.S:1151: Error: attempt to move .org backwards
> arch/powerpc/kernel/exceptions-64s.S:1160: Error: attempt to move .org backwards

There is a patch for that pending with either the kvm guys or the powerpc guys.

> I guess I don't have the right compiler.

Yours seems to be OK.  If you pass -k to make it will get further.  Or
you could configure it and then just try building your driver rather than
the whole tree.

> Anyway, I think I know how to fix the break that Stephen is seeing.  I will post
> a v4 patch in a few minutes.

Thanks.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

* Re: kvm PCI assignment & VFIO ramblings
From: Don Dutile @ 2011-08-25 15:38 UTC
  To: Roedel, Joerg
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	qemu-devel, iommu, chrisw, Alex Williamson, Avi Kivity,
	Anthony Liguori, linux-pci@vger.kernel.org, linuxppc-dev,
	benve@cisco.com
In-Reply-To: <20110825105402.GB1923@amd.com>

On 08/25/2011 06:54 AM, Roedel, Joerg wrote:
> Hi Alex,
>
> On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:
>> Is this roughly what you're thinking of for the iommu_group component?
>> Adding a dev_to_group iommu ops callback lets us consolidate the sysfs
>> support in the iommu base.  Would AMD-Vi do something similar (or
>> exactly the same) for group #s?  Thanks,
>
> The concept looks good, I have some comments, though. On AMD-Vi the
> implementation would look a bit different because there is a
> data-structure where the information can be gathered from, so no need for
> PCI bus scanning there.
>
>> diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
>> index 6e6b6a1..6b54c1a 100644
>> --- a/drivers/base/iommu.c
>> +++ b/drivers/base/iommu.c
>> @@ -17,20 +17,56 @@
>>    */
>>
>>   #include <linux/bug.h>
>> +#include <linux/device.h>
>>   #include <linux/types.h>
>>   #include <linux/module.h>
>>   #include <linux/slab.h>
>>   #include <linux/errno.h>
>>   #include <linux/iommu.h>
>> +#include <linux/pci.h>
>>
>>   static struct iommu_ops *iommu_ops;
>>
>> +static ssize_t show_iommu_group(struct device *dev,
>> +				struct device_attribute *attr, char *buf)
>> +{
>> +	return sprintf(buf, "%lx", iommu_dev_to_group(dev));
>
> Probably add a 0x prefix so userspace knows the format?
>
>> +}
>> +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
>> +
>> +static int add_iommu_group(struct device *dev, void *unused)
>> +{
>> +	if (iommu_dev_to_group(dev) >= 0)
>> +		return device_create_file(dev, &dev_attr_iommu_group);
>> +
>> +	return 0;
>> +}
>> +
>> +static int device_notifier(struct notifier_block *nb,
>> +			   unsigned long action, void *data)
>> +{
>> +	struct device *dev = data;
>> +
>> +	if (action == BUS_NOTIFY_ADD_DEVICE)
>> +		return add_iommu_group(dev, NULL);
>> +
>> +	return 0;
>> +}
>> +
>> +static struct notifier_block device_nb = {
>> +	.notifier_call = device_notifier,
>> +};
>> +
>>   void register_iommu(struct iommu_ops *ops)
>>   {
>>   	if (iommu_ops)
>>   		BUG();
>>
>>   	iommu_ops = ops;
>> +
>> +	/* FIXME - non-PCI, really want for_each_bus() */
>> +	bus_register_notifier(&pci_bus_type, &device_nb);
>> +	bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
>>   }
>
> We need to solve this differently. ARM is starting to use the iommu-api
> too and this definitely does not work there. One possible solution might
> be to make the iommu-ops per-bus.
>
When you think of a system where there isn't just one bus-type
with iommu support, it makes more sense.
Additionally, it also allows the long-term architecture to use different types
of IOMMUs on each bus segment -- think per-PCIe-switch/bridge IOMMUs --
esp. 'tuned' IOMMUs -- ones better geared for networks, ones better geared
for direct-attach disk hba's.


>>   bool iommu_found(void)
>> @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
>>
>> +long iommu_dev_to_group(struct device *dev)
>> +{
>> +	if (iommu_ops->dev_to_group)
>> +		return iommu_ops->dev_to_group(dev);
>> +	return -ENODEV;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_dev_to_group);
>
> Please rename this to iommu_device_group(). The dev_to_group name
> suggests a conversion but it is actually just a property of the device.
> Also the return type should not be long but something that fits into
> 32bit on all platforms. Since you use -ENODEV, probably s32 is a good
> choice.
>
>> +
>>   int iommu_map(struct iommu_domain *domain, unsigned long iova,
>>   	      phys_addr_t paddr, int gfp_order, int prot)
>>   {
>> diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
>> index f02c34d..477259c 100644
>> --- a/drivers/pci/intel-iommu.c
>> +++ b/drivers/pci/intel-iommu.c
>> @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
>>   static int dmar_forcedac;
>>   static int intel_iommu_strict;
>>   static int intel_iommu_superpage = 1;
>> +static int intel_iommu_no_mf_groups;
>>
>>   #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
>>   static DEFINE_SPINLOCK(device_domain_lock);
>> @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
>>   			printk(KERN_INFO
>>   				"Intel-IOMMU: disable supported super page\n");
>>   			intel_iommu_superpage = 0;
>> +		} else if (!strncmp(str, "no_mf_groups", 12)) {
>> +			printk(KERN_INFO
>> +				"Intel-IOMMU: disable separate groups for multifunction devices\n");
>> +			intel_iommu_no_mf_groups = 1;
>
> This should really be a global iommu option and not be VT-d specific.
>
>>
>>   		str += strcspn(str, ",");
>> @@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
>>   	return 0;
>>   }
>>
>> +/* Group numbers are arbitrary.  Devices with the same group number
>> + * indicate the iommu cannot differentiate between them.  To avoid
>> + * tracking used groups we just use the seg|bus|devfn of the lowest
>> + * level we're able to differentiate devices */
>> +static long intel_iommu_dev_to_group(struct device *dev)
>> +{
>> +	struct pci_dev *pdev = to_pci_dev(dev);
>> +	struct pci_dev *bridge;
>> +	union {
>> +		struct {
>> +			u8 devfn;
>> +			u8 bus;
>> +			u16 segment;
>> +		} pci;
>> +		u32 group;
>> +	} id;
>> +
>> +	if (iommu_no_mapping(dev))
>> +		return -ENODEV;
>> +
>> +	id.pci.segment = pci_domain_nr(pdev->bus);
>> +	id.pci.bus = pdev->bus->number;
>> +	id.pci.devfn = pdev->devfn;
>> +
>> +	if (!device_to_iommu(id.pci.segment, id.pci.bus, id.pci.devfn))
>> +		return -ENODEV;
>> +
>> +	bridge = pci_find_upstream_pcie_bridge(pdev);
>> +	if (bridge) {
>> +		if (pci_is_pcie(bridge)) {
>> +			id.pci.bus = bridge->subordinate->number;
>> +			id.pci.devfn = 0;
>> +		} else {
>> +			id.pci.bus = bridge->bus->number;
>> +			id.pci.devfn = bridge->devfn;
>> +		}
>> +	}
>> +
>> +	/* Virtual functions always get their own group */
>> +	if (!pdev->is_virtfn && intel_iommu_no_mf_groups)
>> +		id.pci.devfn = PCI_DEVFN(PCI_SLOT(id.pci.devfn), 0);
>> +
>> +	/* FIXME - seg # >= 0x8000 on 32b */
>> +	return id.group;
>> +}
>
> This looks like code duplication in the VT-d driver. It doesn't need to
> be generalized now, but we should keep in mind to do a more general
> solution later.
> Maybe it is beneficial if the IOMMU drivers only setup the number in
> dev->arch.iommu.groupid and the iommu-api fetches it from there then.
> But as I said, this is some more work and does not need to be done for
> this patch(-set).
>
>> +
>>   static struct iommu_ops intel_iommu_ops = {
>>   	.domain_init	= intel_iommu_domain_init,
>>   	.domain_destroy = intel_iommu_domain_destroy,
>> @@ -3911,6 +3962,7 @@ static struct iommu_ops intel_iommu_ops = {
>>   	.unmap		= intel_iommu_unmap,
>>   	.iova_to_phys	= intel_iommu_iova_to_phys,
>>   	.domain_has_cap = intel_iommu_domain_has_cap,
>> +	.dev_to_group	= intel_iommu_dev_to_group,
>>   };
>>
>>   static void __devinit quirk_iommu_rwbf(struct pci_dev *dev)
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 0a2ba40..90c1a86 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -45,6 +45,7 @@ struct iommu_ops {
>>   				    unsigned long iova);
>>   	int (*domain_has_cap)(struct iommu_domain *domain,
>>   			      unsigned long cap);
>> +	long (*dev_to_group)(struct device *dev);
>>   };
>>
>>   #ifdef CONFIG_IOMMU_API
>> @@ -65,6 +66,7 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
>>   				      unsigned long iova);
>>   extern int iommu_domain_has_cap(struct iommu_domain *domain,
>>   				unsigned long cap);
>> +extern long iommu_dev_to_group(struct device *dev);
>>
>>   #else /* CONFIG_IOMMU_API */
>>
>> @@ -121,6 +123,10 @@ static inline int domain_has_cap(struct iommu_domain *domain,
>>   	return 0;
>>   }
>>
>> +static inline long iommu_dev_to_group(struct device *dev)
>> +{
>> +	return -ENODEV;
>> +}
>>   #endif /* CONFIG_IOMMU_API */
>>
>>   #endif /* __LINUX_IOMMU_H */
>>
>>
>>
>

* [PATCH] [v4] tty/powerpc: introduce the ePAPR embedded hypervisor byte channel driver
From: Timur Tabi @ 2011-08-25 15:32 UTC
  To: greg, sfr, linux-next, linux-kernel, linuxppc-dev

The ePAPR embedded hypervisor specification provides an API for "byte
channels", which are serial-like virtual devices for sending and receiving
streams of bytes.  This driver provides Linux kernel support for byte
channels via three distinct interfaces:

1) An early-console (udbg) driver.  This provides early console output
through a byte channel.  The byte channel handle must be specified in a
Kconfig option.

2) A normal console driver.  Output is sent to the byte channel designated
for stdout in the device tree.  The console driver is for handling kernel
printk calls.

3) A tty driver, which is used to handle user-space input and output.  The
byte channel used for the console is designated as the default tty.

Signed-off-by: Timur Tabi <timur@freescale.com>
---
 arch/powerpc/include/asm/udbg.h |    1 +
 arch/powerpc/kernel/udbg.c      |    2 +
 drivers/tty/Kconfig             |   34 ++
 drivers/tty/Makefile            |    1 +
 drivers/tty/ehv_bytechan.c      |  888 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 926 insertions(+), 0 deletions(-)
 create mode 100644 drivers/tty/ehv_bytechan.c

diff --git a/arch/powerpc/include/asm/udbg.h b/arch/powerpc/include/asm/udbg.h
index 93e05d1..5354ae9 100644
--- a/arch/powerpc/include/asm/udbg.h
+++ b/arch/powerpc/include/asm/udbg.h
@@ -54,6 +54,7 @@ extern void __init udbg_init_40x_realmode(void);
 extern void __init udbg_init_cpm(void);
 extern void __init udbg_init_usbgecko(void);
 extern void __init udbg_init_wsp(void);
+extern void __init udbg_init_ehv_bc(void);
 
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_UDBG_H */
diff --git a/arch/powerpc/kernel/udbg.c b/arch/powerpc/kernel/udbg.c
index faa82c1..b4607a9 100644
--- a/arch/powerpc/kernel/udbg.c
+++ b/arch/powerpc/kernel/udbg.c
@@ -67,6 +67,8 @@ void __init udbg_early_init(void)
 	udbg_init_usbgecko();
 #elif defined(CONFIG_PPC_EARLY_DEBUG_WSP)
 	udbg_init_wsp();
+#elif defined(CONFIG_PPC_EARLY_DEBUG_EHV_BC)
+	udbg_init_ehv_bc();
 #endif
 
 #ifdef CONFIG_PPC_EARLY_DEBUG
diff --git a/drivers/tty/Kconfig b/drivers/tty/Kconfig
index bd7cc05..535af0a 100644
--- a/drivers/tty/Kconfig
+++ b/drivers/tty/Kconfig
@@ -350,3 +350,37 @@ config TRACE_SINK
 
 	  If you select this option, you need to select
 	  "Trace data router for MIPI P1149.7 cJTAG standard".
+
+config PPC_EPAPR_HV_BYTECHAN
+	tristate "ePAPR hypervisor byte channel driver"
+	depends on FSL_SOC
+	help
+	  This driver creates /dev entries for each ePAPR hypervisor byte
+	  channel, thereby allowing applications to communicate with byte
+	  channels as if they were serial ports.
+
+config PPC_EARLY_DEBUG_EHV_BC
+	bool "Early console (udbg) support for ePAPR hypervisors"
+	depends on PPC_EPAPR_HV_BYTECHAN
+	help
+	  Select this option to enable early console (a.k.a. "udbg") support
+	  via an ePAPR byte channel.  You also need to choose the byte channel
+	  handle below.
+
+config PPC_EARLY_DEBUG_EHV_BC_HANDLE
+	int "Byte channel handle for early console (udbg)"
+	depends on PPC_EARLY_DEBUG_EHV_BC
+	default 0
+	help
+	  If you want early console (udbg) output through a byte channel,
+	  specify the handle of the byte channel to use.
+
+	  For this to work, the byte channel driver must be compiled
+	  in-kernel, not as a module.
+
+	  Note that only one early console driver can be enabled, so don't
+	  enable any others if you enable this one.
+
+	  If the number you specify is not a valid byte channel handle, then
+	  there simply will be no early console output.  This is true also
+	  if you don't boot under a hypervisor at all.
diff --git a/drivers/tty/Makefile b/drivers/tty/Makefile
index ea89b0b..2953059 100644
--- a/drivers/tty/Makefile
+++ b/drivers/tty/Makefile
@@ -26,5 +26,6 @@ obj-$(CONFIG_ROCKETPORT)	+= rocket.o
 obj-$(CONFIG_SYNCLINK_GT)	+= synclink_gt.o
 obj-$(CONFIG_SYNCLINKMP)	+= synclinkmp.o
 obj-$(CONFIG_SYNCLINK)		+= synclink.o
+obj-$(CONFIG_PPC_EPAPR_HV_BYTECHAN) += ehv_bytechan.o
 
 obj-y += ipwireless/
diff --git a/drivers/tty/ehv_bytechan.c b/drivers/tty/ehv_bytechan.c
new file mode 100644
index 0000000..e67f70b
--- /dev/null
+++ b/drivers/tty/ehv_bytechan.c
@@ -0,0 +1,888 @@
+/* ePAPR hypervisor byte channel device driver
+ *
+ * Copyright 2009-2011 Freescale Semiconductor, Inc.
+ *
+ * Author: Timur Tabi <timur@freescale.com>
+ *
+ * This file is licensed under the terms of the GNU General Public License
+ * version 2.  This program is licensed "as is" without any warranty of any
+ * kind, whether express or implied.
+ *
+ * This driver supports three distinct interfaces, all of which are related to
+ * ePAPR hypervisor byte channels.
+ *
+ * 1) An early-console (udbg) driver.  This provides early console output
+ * through a byte channel.  The byte channel handle must be specified in a
+ * Kconfig option.
+ *
+ * 2) A normal console driver.  Output is sent to the byte channel designated
+ * for stdout in the device tree.  The console driver is for handling kernel
+ * printk calls.
+ *
+ * 3) A tty driver, which is used to handle user-space input and output.  The
+ * byte channel used for the console is designated as the default tty.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/interrupt.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <asm/epapr_hcalls.h>
+#include <linux/of.h>
+#include <linux/platform_device.h>
+#include <linux/cdev.h>
+#include <linux/console.h>
+#include <linux/tty.h>
+#include <linux/tty_flip.h>
+#include <linux/circ_buf.h>
+#include <asm/udbg.h>
+
+/* The size of the transmit circular buffer.  This must be a power of two. */
+#define BUF_SIZE	2048
+
+/* Per-byte channel private data */
+struct ehv_bc_data {
+	struct device *dev;
+	struct tty_port port;
+	uint32_t handle;
+	unsigned int rx_irq;
+	unsigned int tx_irq;
+
+	spinlock_t lock;	/* lock for transmit buffer */
+	unsigned char buf[BUF_SIZE];	/* transmit circular buffer */
+	unsigned int head;	/* circular buffer head */
+	unsigned int tail;	/* circular buffer tail */
+
+	int tx_irq_enabled;	/* true == TX interrupt is enabled */
+};
+
+/* Array of byte channel objects */
+static struct ehv_bc_data *bcs;
+
+/* Byte channel handle for stdout (and stdin), taken from device tree */
+static unsigned int stdout_bc;
+
+/* Virtual IRQ for the byte channel handle for stdin, taken from device tree */
+static unsigned int stdout_irq;
+
+/**************************** SUPPORT FUNCTIONS ****************************/
+
+/*
+ * Enable the transmit interrupt
+ *
+ * Unlike a serial device, byte channels have no mechanism for disabling their
+ * own receive or transmit interrupts.  To emulate that feature, we toggle
+ * the IRQ in the kernel.
+ *
+ * We cannot just blindly call enable_irq() or disable_irq(), because these
+ * calls are reference counted.  This means that we cannot call enable_irq()
+ * if interrupts are already enabled.  This can happen in two situations:
+ *
+ * 1. The tty layer makes two back-to-back calls to ehv_bc_tty_write()
+ * 2. A transmit interrupt occurs while executing ehv_bc_tx_dequeue()
+ *
+ * To work around this, we keep a flag to tell us if the IRQ is enabled or not.
+ */
+static void enable_tx_interrupt(struct ehv_bc_data *bc)
+{
+	if (!bc->tx_irq_enabled) {
+		enable_irq(bc->tx_irq);
+		bc->tx_irq_enabled = 1;
+	}
+}
+
+static void disable_tx_interrupt(struct ehv_bc_data *bc)
+{
+	if (bc->tx_irq_enabled) {
+		disable_irq_nosync(bc->tx_irq);
+		bc->tx_irq_enabled = 0;
+	}
+}
+
+/*
+ * find the byte channel handle to use for the console
+ *
+ * The byte channel to be used for the console is specified via a
+ * "stdout-path" property in the /chosen node.
+ *
+ * For compatibility with legacy device trees, we also look for a "stdout" alias.
+ */
+static int find_console_handle(void)
+{
+	struct device_node *np, *np2;
+	const char *sprop = NULL;
+	const uint32_t *iprop;
+
+	np = of_find_node_by_path("/chosen");
+	if (np)
+		sprop = of_get_property(np, "stdout-path", NULL);
+
+	if (!np || !sprop) {
+		of_node_put(np);
+		np = of_find_node_by_name(NULL, "aliases");
+		if (np)
+			sprop = of_get_property(np, "stdout", NULL);
+	}
+
+	if (!sprop) {
+		of_node_put(np);
+		return 0;
+	}
+
+	/* We don't care what the aliased node is actually called.  We only
+	 * care if it's compatible with "epapr,hv-byte-channel", because that
+	 * indicates that it's a byte channel node.  We use a temporary
+	 * variable, 'np2', because we can't release 'np' until we're done with
+	 * 'sprop'.
+	 */
+	np2 = of_find_node_by_path(sprop);
+	of_node_put(np);
+	np = np2;
+	if (!np) {
+		pr_warning("ehv-bc: stdout node '%s' does not exist\n", sprop);
+		return 0;
+	}
+
+	/* Is it a byte channel? */
+	if (!of_device_is_compatible(np, "epapr,hv-byte-channel")) {
+		of_node_put(np);
+		return 0;
+	}
+
+	stdout_irq = irq_of_parse_and_map(np, 0);
+	if (stdout_irq == NO_IRQ) {
+		pr_err("ehv-bc: no 'interrupts' property in %s node\n", sprop);
+		of_node_put(np);
+		return 0;
+	}
+
+	/*
+	 * The 'hv-handle' property contains the handle for this byte channel.
+	 */
+	iprop = of_get_property(np, "hv-handle", NULL);
+	if (!iprop) {
+		pr_err("ehv-bc: no 'hv-handle' property in %s node\n",
+		       np->name);
+		of_node_put(np);
+		return 0;
+	}
+	stdout_bc = be32_to_cpu(*iprop);
+
+	of_node_put(np);
+	return 1;
+}
+
+/*************************** EARLY CONSOLE DRIVER ***************************/
+
+#ifdef CONFIG_PPC_EARLY_DEBUG_EHV_BC
+
+/*
+ * send a byte to a byte channel, wait if necessary
+ *
+ * This function sends a byte to a byte channel, and it waits and
+ * retries if the byte channel is full.  It returns if the character
+ * has been sent, or if some error has occurred.
+ *
+ */
+static void byte_channel_spin_send(const char data)
+{
+	int ret, count;
+
+	do {
+		count = 1;
+		ret = ev_byte_channel_send(CONFIG_PPC_EARLY_DEBUG_EHV_BC_HANDLE,
+					   &count, &data);
+	} while (ret == EV_EAGAIN);
+}
+
+/*
+ * The udbg subsystem calls this function to display a single character.
+ * We convert LF to CR/LF.
+ */
+static void ehv_bc_udbg_putc(char c)
+{
+	if (c == '\n')
+		byte_channel_spin_send('\r');
+
+	byte_channel_spin_send(c);
+}
+
+/*
+ * early console initialization
+ *
+ * PowerPC kernels support an early printk console, also known as udbg.
+ * This function must be called via the ppc_md.init_early function pointer.
+ * At this point, the device tree has been unflattened, so we can obtain the
+ * byte channel handle for stdout.
+ *
+ * We only support displaying of characters (putc).  We do not support
+ * keyboard input.
+ */
+void __init udbg_init_ehv_bc(void)
+{
+	unsigned int rx_count, tx_count;
+	unsigned int ret;
+
+	/* Check if we're running as a guest of a hypervisor */
+	if (!(mfmsr() & MSR_GS))
+		return;
+
+	/* Verify the byte channel handle */
+	ret = ev_byte_channel_poll(CONFIG_PPC_EARLY_DEBUG_EHV_BC_HANDLE,
+				   &rx_count, &tx_count);
+	if (ret)
+		return;
+
+	udbg_putc = ehv_bc_udbg_putc;
+	register_early_udbg_console();
+
+	udbg_printf("ehv-bc: early console using byte channel handle %u\n",
+		    CONFIG_PPC_EARLY_DEBUG_EHV_BC_HANDLE);
+}
+
+#endif
+
+/****************************** CONSOLE DRIVER ******************************/
+
+static struct tty_driver *ehv_bc_driver;
+
+/*
+ * Byte channel console sending worker function.
+ *
+ * For consoles, if the output buffer is full, we should just spin until it
+ * clears.
+ */
+static int ehv_bc_console_byte_channel_send(unsigned int handle, const char *s,
+			     unsigned int count)
+{
+	unsigned int len;
+	int ret = 0;
+
+	while (count) {
+		len = min_t(unsigned int, count, EV_BYTE_CHANNEL_MAX_BYTES);
+		do {
+			ret = ev_byte_channel_send(handle, &len, s);
+		} while (ret == EV_EAGAIN);
+		count -= len;
+		s += len;
+	}
+
+	return ret;
+}
+
+/*
+ * write a string to the console
+ *
+ * This function gets called to write a string from the kernel, typically from
+ * a printk().  This function spins until all data is written.
+ *
+ * We copy the data to a temporary buffer because we need to insert a \r in
+ * front of every \n.  It's more efficient to copy the data to the buffer than
+ * it is to make multiple hcalls for each character or each newline.
+ */
+static void ehv_bc_console_write(struct console *co, const char *s,
+				 unsigned int count)
+{
+	unsigned int handle = (unsigned int)co->data;
+	char s2[EV_BYTE_CHANNEL_MAX_BYTES];
+	unsigned int i, j = 0;
+	char c;
+
+	for (i = 0; i < count; i++) {
+		c = *s++;
+
+		if (c == '\n')
+			s2[j++] = '\r';
+
+		s2[j++] = c;
+		if (j >= (EV_BYTE_CHANNEL_MAX_BYTES - 1)) {
+			if (ehv_bc_console_byte_channel_send(handle, s2, j))
+				return;
+			j = 0;
+		}
+	}
+
+	if (j)
+		ehv_bc_console_byte_channel_send(handle, s2, j);
+}
+
+/*
+ * When /dev/console is opened, the kernel iterates the console list looking
+ * for one with ->device and then calls that method. On success, it expects
+ * the passed-in int* to contain the minor number to use.
+ */
+static struct tty_driver *ehv_bc_console_device(struct console *co, int *index)
+{
+	*index = co->index;
+
+	return ehv_bc_driver;
+}
+
+static struct console ehv_bc_console = {
+	.name		= "ttyEHV",
+	.write		= ehv_bc_console_write,
+	.device		= ehv_bc_console_device,
+	.flags		= CON_PRINTBUFFER | CON_ENABLED,
+};
+
+/*
+ * Console initialization
+ *
+ * This is the first function that is called after the device tree is
+ * available, so here is where we determine the byte channel handle and IRQ for
+ * stdout/stdin, even though that information is used by the tty and character
+ * drivers.
+ */
+static int __init ehv_bc_console_init(void)
+{
+	if (!find_console_handle()) {
+		pr_debug("ehv-bc: stdout is not a byte channel\n");
+		return -ENODEV;
+	}
+
+#ifdef CONFIG_PPC_EARLY_DEBUG_EHV_BC
+	/* Print a friendly warning if the user chose the wrong byte channel
+	 * handle for udbg.
+	 */
+	if (stdout_bc != CONFIG_PPC_EARLY_DEBUG_EHV_BC_HANDLE)
+		pr_warning("ehv-bc: udbg handle %u is not the stdout handle\n",
+			   CONFIG_PPC_EARLY_DEBUG_EHV_BC_HANDLE);
+#endif
+
+	ehv_bc_console.data = (void *)stdout_bc;
+
+	/* add_preferred_console() must be called before register_console(),
+	   otherwise it won't work.  However, we don't want to enumerate all the
+	   byte channels here, either, since we only care about one. */
+
+	add_preferred_console(ehv_bc_console.name, ehv_bc_console.index, NULL);
+	register_console(&ehv_bc_console);
+
+	pr_info("ehv-bc: registered console driver for byte channel %u\n",
+		stdout_bc);
+
+	return 0;
+}
+console_initcall(ehv_bc_console_init);
+
+/******************************** TTY DRIVER ********************************/
+
+/*
+ * byte channel receive interrupt handler
+ *
+ * This ISR is called whenever data is available on a byte channel.
+ */
+static irqreturn_t ehv_bc_tty_rx_isr(int irq, void *data)
+{
+	struct ehv_bc_data *bc = data;
+	struct tty_struct *ttys = tty_port_tty_get(&bc->port);
+	unsigned int rx_count, tx_count, len;
+	int count;
+	char buffer[EV_BYTE_CHANNEL_MAX_BYTES];
+	int ret;
+
+	/* ttys could be NULL during a hangup */
+	if (!ttys)
+		return IRQ_HANDLED;
+
+	/* Find out how much data needs to be read, and then ask the TTY layer
+	 * if it can handle that much.  We want to ensure that every byte we
+	 * read from the byte channel will be accepted by the TTY layer.
+	 */
+	ev_byte_channel_poll(bc->handle, &rx_count, &tx_count);
+	count = tty_buffer_request_room(ttys, rx_count);
+
+	/* 'count' is the maximum amount of data the TTY layer can accept at
+	 * this time.  However, during testing, I was never able to get 'count'
+	 * to be less than 'rx_count'.  I'm not sure whether I'm calling it
+	 * correctly.
+	 */
+
+	while (count > 0) {
+		len = min_t(unsigned int, count, sizeof(buffer));
+
+		/* Read some data from the byte channel.  This function will
+		 * never return more than EV_BYTE_CHANNEL_MAX_BYTES bytes.
+		 */
+		ev_byte_channel_receive(bc->handle, &len, buffer);
+
+		/* 'len' is now the amount of data that's been received. 'len'
+		 * can't be zero, and most likely it's equal to one.
+		 */
+
+		/* Pass the received data to the tty layer. */
+		ret = tty_insert_flip_string(ttys, buffer, len);
+
+		/* 'ret' is the number of bytes that the TTY layer accepted.
+		 * If it's not equal to 'len', then it means the buffer is
+		 * full, which should never happen.  If it does happen, we can
+		 * exit gracefully, but we drop the last 'len - ret' characters
+		 * that we read from the byte channel.
+		 */
+		if (ret != len)
+			break;
+
+		count -= len;
+	}
+
+	/* Tell the tty layer that we're done. */
+	tty_flip_buffer_push(ttys);
+
+	tty_kref_put(ttys);
+
+	return IRQ_HANDLED;
+}
+
+/*
+ * dequeue the transmit buffer to the hypervisor
+ *
+ * This function, which can be called in interrupt context, dequeues as much
+ * data as possible from the transmit buffer to the byte channel.
+ */
+static void ehv_bc_tx_dequeue(struct ehv_bc_data *bc)
+{
+	unsigned int count;
+	unsigned int len, ret;
+	unsigned long flags;
+
+	do {
+		spin_lock_irqsave(&bc->lock, flags);
+		len = min_t(unsigned int,
+			    CIRC_CNT_TO_END(bc->head, bc->tail, BUF_SIZE),
+			    EV_BYTE_CHANNEL_MAX_BYTES);
+
+		ret = ev_byte_channel_send(bc->handle, &len, bc->buf + bc->tail);
+
+		/* 'len' is valid only if the return code is 0 or EV_EAGAIN */
+		if (!ret || (ret == EV_EAGAIN))
+			bc->tail = (bc->tail + len) & (BUF_SIZE - 1);
+
+		count = CIRC_CNT(bc->head, bc->tail, BUF_SIZE);
+		spin_unlock_irqrestore(&bc->lock, flags);
+	} while (count && !ret);
+
+	spin_lock_irqsave(&bc->lock, flags);
+	if (CIRC_CNT(bc->head, bc->tail, BUF_SIZE))
+		/*
+		 * If we haven't emptied the buffer, then enable the TX IRQ.
+		 * We'll get an interrupt when there's more room in the
+		 * hypervisor's output buffer.
+		 */
+		enable_tx_interrupt(bc);
+	else
+		disable_tx_interrupt(bc);
+	spin_unlock_irqrestore(&bc->lock, flags);
+}
+
+/*
+ * byte channel transmit interrupt handler
+ *
+ * This ISR is called whenever space becomes available for transmitting
+ * characters on a byte channel.
+ */
+static irqreturn_t ehv_bc_tty_tx_isr(int irq, void *data)
+{
+	struct ehv_bc_data *bc = data;
+	struct tty_struct *ttys = tty_port_tty_get(&bc->port);
+
+	ehv_bc_tx_dequeue(bc);
+	if (ttys) {
+		tty_wakeup(ttys);
+		tty_kref_put(ttys);
+	}
+
+	return IRQ_HANDLED;
+}
+
+/*
+ * This function is called when the tty layer has data for us to send.  We store
+ * the data first in a circular buffer, and then dequeue as much of that data
+ * as possible.
+ *
+ * We don't need to worry about whether there is enough room in the buffer for
+ * all the data.  The purpose of ehv_bc_tty_write_room() is to tell the tty
+ * layer how much data it can safely send to us.  We guarantee that
+ * ehv_bc_tty_write_room() will never lie, so the tty layer will never send us
+ * too much data.
+ */
+static int ehv_bc_tty_write(struct tty_struct *ttys, const unsigned char *s,
+			    int count)
+{
+	struct ehv_bc_data *bc = ttys->driver_data;
+	unsigned long flags;
+	unsigned int len;
+	unsigned int written = 0;
+
+	while (1) {
+		spin_lock_irqsave(&bc->lock, flags);
+		len = CIRC_SPACE_TO_END(bc->head, bc->tail, BUF_SIZE);
+		if (count < len)
+			len = count;
+		if (len) {
+			memcpy(bc->buf + bc->head, s, len);
+			bc->head = (bc->head + len) & (BUF_SIZE - 1);
+		}
+		spin_unlock_irqrestore(&bc->lock, flags);
+		if (!len)
+			break;
+
+		s += len;
+		count -= len;
+		written += len;
+	}
+
+	ehv_bc_tx_dequeue(bc);
+
+	return written;
+}
+
+/*
+ * This function can be called multiple times for a given tty_struct, which is
+ * why we initialize bc->ttys in ehv_bc_tty_port_activate() instead.
+ *
+ * The tty layer will still call this function even if the device was not
+ * registered (i.e. tty_register_device() was not called).  This happens
+ * because tty_register_device() is optional and some legacy drivers don't
+ * use it.  So we need to check for that.
+ */
+static int ehv_bc_tty_open(struct tty_struct *ttys, struct file *filp)
+{
+	struct ehv_bc_data *bc = &bcs[ttys->index];
+
+	if (!bc->dev)
+		return -ENODEV;
+
+	return tty_port_open(&bc->port, ttys, filp);
+}
+
+/*
+ * Amazingly, if ehv_bc_tty_open() returns an error code, the tty layer will
+ * still call this function to close the tty device.  So we can't assume that
+ * the tty port has been initialized.
+ */
+static void ehv_bc_tty_close(struct tty_struct *ttys, struct file *filp)
+{
+	struct ehv_bc_data *bc = &bcs[ttys->index];
+
+	if (bc->dev)
+		tty_port_close(&bc->port, ttys, filp);
+}
+
+/*
+ * Return the amount of space in the output buffer
+ *
+ * This is actually a contract between the driver and the tty layer outlining
+ * how much write room the driver can guarantee will be sent OR BUFFERED.  This
+ * driver MUST honor the return value.
+ */
+static int ehv_bc_tty_write_room(struct tty_struct *ttys)
+{
+	struct ehv_bc_data *bc = ttys->driver_data;
+	unsigned long flags;
+	int count;
+
+	spin_lock_irqsave(&bc->lock, flags);
+	count = CIRC_SPACE(bc->head, bc->tail, BUF_SIZE);
+	spin_unlock_irqrestore(&bc->lock, flags);
+
+	return count;
+}
+
+/*
+ * Stop sending data to the tty layer
+ *
+ * This function is called when the tty layer's input buffers are getting full,
+ * so the driver should stop sending it data.  The easiest way to do this is to
+ * disable the RX IRQ, which will prevent ehv_bc_tty_rx_isr() from being
+ * called.
+ *
+ * The hypervisor will continue to queue up any incoming data.  If there is any
+ * data in the queue when the RX interrupt is enabled, we'll immediately get an
+ * RX interrupt.
+ */
+static void ehv_bc_tty_throttle(struct tty_struct *ttys)
+{
+	struct ehv_bc_data *bc = ttys->driver_data;
+
+	disable_irq(bc->rx_irq);
+}
+
+/*
+ * Resume sending data to the tty layer
+ *
+ * This function is called after previously calling ehv_bc_tty_throttle().  The
+ * tty layer's input buffers now have more room, so the driver can resume
+ * sending it data.
+ */
+static void ehv_bc_tty_unthrottle(struct tty_struct *ttys)
+{
+	struct ehv_bc_data *bc = ttys->driver_data;
+
+	/* If there is any data in the queue when the RX interrupt is enabled,
+	 * we'll immediately get an RX interrupt.
+	 */
+	enable_irq(bc->rx_irq);
+}
+
+static void ehv_bc_tty_hangup(struct tty_struct *ttys)
+{
+	struct ehv_bc_data *bc = ttys->driver_data;
+
+	ehv_bc_tx_dequeue(bc);
+	tty_port_hangup(&bc->port);
+}
+
+/*
+ * TTY driver operations
+ *
+ * If we could ask the hypervisor how much data is still in the TX buffer, or
+ * at least how big the TX buffers are, then we could implement the
+ * .wait_until_sent and .chars_in_buffer functions.
+ */
+static const struct tty_operations ehv_bc_ops = {
+	.open		= ehv_bc_tty_open,
+	.close		= ehv_bc_tty_close,
+	.write		= ehv_bc_tty_write,
+	.write_room	= ehv_bc_tty_write_room,
+	.throttle	= ehv_bc_tty_throttle,
+	.unthrottle	= ehv_bc_tty_unthrottle,
+	.hangup		= ehv_bc_tty_hangup,
+};
+
+/*
+ * initialize the TTY port
+ *
+ * This function will only be called once, no matter how many times
+ * ehv_bc_tty_open() is called.  That's why we register the ISR here, and also
+ * why we initialize tty_struct-related variables here.
+ */
+static int ehv_bc_tty_port_activate(struct tty_port *port,
+				    struct tty_struct *ttys)
+{
+	struct ehv_bc_data *bc = container_of(port, struct ehv_bc_data, port);
+	int ret;
+
+	ttys->driver_data = bc;
+
+	ret = request_irq(bc->rx_irq, ehv_bc_tty_rx_isr, 0, "ehv-bc", bc);
+	if (ret < 0) {
+		dev_err(bc->dev, "could not request rx irq %u (ret=%i)\n",
+		       bc->rx_irq, ret);
+		return ret;
+	}
+
+	/* request_irq also enables the IRQ */
+	bc->tx_irq_enabled = 1;
+
+	ret = request_irq(bc->tx_irq, ehv_bc_tty_tx_isr, 0, "ehv-bc", bc);
+	if (ret < 0) {
+		dev_err(bc->dev, "could not request tx irq %u (ret=%i)\n",
+		       bc->tx_irq, ret);
+		free_irq(bc->rx_irq, bc);
+		return ret;
+	}
+
+	/* The TX IRQ is enabled only when we can't write all the data to the
+	 * byte channel at once, so by default it's disabled.
+	 */
+	disable_tx_interrupt(bc);
+
+	return 0;
+}
+
+static void ehv_bc_tty_port_shutdown(struct tty_port *port)
+{
+	struct ehv_bc_data *bc = container_of(port, struct ehv_bc_data, port);
+
+	free_irq(bc->tx_irq, bc);
+	free_irq(bc->rx_irq, bc);
+}
+
+static const struct tty_port_operations ehv_bc_tty_port_ops = {
+	.activate = ehv_bc_tty_port_activate,
+	.shutdown = ehv_bc_tty_port_shutdown,
+};
+
+static int __devinit ehv_bc_tty_probe(struct platform_device *pdev)
+{
+	struct device_node *np = pdev->dev.of_node;
+	struct ehv_bc_data *bc;
+	const uint32_t *iprop;
+	unsigned int handle;
+	int ret;
+	static unsigned int index = 1;
+	unsigned int i;
+
+	iprop = of_get_property(np, "hv-handle", NULL);
+	if (!iprop) {
+		dev_err(&pdev->dev, "no 'hv-handle' property in %s node\n",
+			np->name);
+		return -ENODEV;
+	}
+
+	/* We already told the console layer that the index for the console
+	 * device is zero, so we need to make sure that we use that index when
+	 * we probe the console byte channel node.
+	 */
+	handle = be32_to_cpu(*iprop);
+	i = (handle == stdout_bc) ? 0 : index++;
+	bc = &bcs[i];
+
+	bc->handle = handle;
+	bc->head = 0;
+	bc->tail = 0;
+	spin_lock_init(&bc->lock);
+
+	bc->rx_irq = irq_of_parse_and_map(np, 0);
+	bc->tx_irq = irq_of_parse_and_map(np, 1);
+	if ((bc->rx_irq == NO_IRQ) || (bc->tx_irq == NO_IRQ)) {
+		dev_err(&pdev->dev, "no 'interrupts' property in %s node\n",
+			np->name);
+		ret = -ENODEV;
+		goto error;
+	}
+
+	bc->dev = tty_register_device(ehv_bc_driver, i, &pdev->dev);
+	if (IS_ERR(bc->dev)) {
+		ret = PTR_ERR(bc->dev);
+		dev_err(&pdev->dev, "could not register tty (ret=%i)\n", ret);
+		goto error;
+	}
+
+	tty_port_init(&bc->port);
+	bc->port.ops = &ehv_bc_tty_port_ops;
+
+	dev_set_drvdata(&pdev->dev, bc);
+
+	dev_info(&pdev->dev, "registered /dev/%s%u for byte channel %u\n",
+		ehv_bc_driver->name, i, bc->handle);
+
+	return 0;
+
+error:
+	irq_dispose_mapping(bc->tx_irq);
+	irq_dispose_mapping(bc->rx_irq);
+
+	memset(bc, 0, sizeof(struct ehv_bc_data));
+	return ret;
+}
+
+static int ehv_bc_tty_remove(struct platform_device *pdev)
+{
+	struct ehv_bc_data *bc = dev_get_drvdata(&pdev->dev);
+
+	tty_unregister_device(ehv_bc_driver, bc - bcs);
+
+	irq_dispose_mapping(bc->tx_irq);
+	irq_dispose_mapping(bc->rx_irq);
+
+	return 0;
+}
+
+static const struct of_device_id ehv_bc_tty_of_ids[] = {
+	{ .compatible = "epapr,hv-byte-channel" },
+	{}
+};
+
+static struct platform_driver ehv_bc_tty_driver = {
+	.driver = {
+		.owner = THIS_MODULE,
+		.name = "ehv-bc",
+		.of_match_table = ehv_bc_tty_of_ids,
+	},
+	.probe		= ehv_bc_tty_probe,
+	.remove		= ehv_bc_tty_remove,
+};
+
+/**
+ * ehv_bc_init - ePAPR hypervisor byte channel driver initialization
+ *
+ * This function is called when this module is loaded.
+ */
+static int __init ehv_bc_init(void)
+{
+	struct device_node *np;
+	unsigned int count = 0; /* Number of elements in bcs[] */
+	int ret;
+
+	pr_info("ePAPR hypervisor byte channel driver\n");
+
+	/* Count the number of byte channels */
+	for_each_compatible_node(np, NULL, "epapr,hv-byte-channel")
+		count++;
+
+	if (!count)
+		return -ENODEV;
+
+	/* The array index of an element in bcs[] is the same as the tty index
+	 * for that element.  If you know the address of an element in the
+	 * array, then you can use pointer math (e.g. "bc - bcs") to get its
+	 * tty index.
+	 */
+	bcs = kzalloc(count * sizeof(struct ehv_bc_data), GFP_KERNEL);
+	if (!bcs)
+		return -ENOMEM;
+
+	ehv_bc_driver = alloc_tty_driver(count);
+	if (!ehv_bc_driver) {
+		ret = -ENOMEM;
+		goto error;
+	}
+
+	ehv_bc_driver->owner = THIS_MODULE;
+	ehv_bc_driver->driver_name = "ehv-bc";
+	ehv_bc_driver->name = ehv_bc_console.name;
+	ehv_bc_driver->type = TTY_DRIVER_TYPE_CONSOLE;
+	ehv_bc_driver->subtype = SYSTEM_TYPE_CONSOLE;
+	ehv_bc_driver->init_termios = tty_std_termios;
+	ehv_bc_driver->flags = TTY_DRIVER_REAL_RAW | TTY_DRIVER_DYNAMIC_DEV;
+	tty_set_operations(ehv_bc_driver, &ehv_bc_ops);
+
+	ret = tty_register_driver(ehv_bc_driver);
+	if (ret) {
+		pr_err("ehv-bc: could not register tty driver (ret=%i)\n", ret);
+		goto error;
+	}
+
+	ret = platform_driver_register(&ehv_bc_tty_driver);
+	if (ret) {
+		pr_err("ehv-bc: could not register platform driver (ret=%i)\n",
+		       ret);
+		goto error;
+	}
+
+	return 0;
+
+error:
+	if (ehv_bc_driver) {
+		tty_unregister_driver(ehv_bc_driver);
+		put_tty_driver(ehv_bc_driver);
+	}
+
+	kfree(bcs);
+
+	return ret;
+}
+
+
+/**
+ * ehv_bc_exit - ePAPR hypervisor byte channel driver termination
+ *
+ * This function is called when this driver is unloaded.
+ */
+static void __exit ehv_bc_exit(void)
+{
+	tty_unregister_driver(ehv_bc_driver);
+	put_tty_driver(ehv_bc_driver);
+	kfree(bcs);
+}
+
+module_init(ehv_bc_init);
+module_exit(ehv_bc_exit);
+
+MODULE_AUTHOR("Timur Tabi <timur@freescale.com>");
+MODULE_DESCRIPTION("ePAPR hypervisor byte channel driver");
+MODULE_LICENSE("GPL v2");
-- 
1.7.3.4
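
The transmit path in the patch above relies on a power-of-two circular
buffer, indexed with the CIRC_* helpers from <linux/circ_buf.h>.  What
follows is a minimal userspace sketch of that bookkeeping, not the driver
code itself.  The helper functions mirror my understanding of the kernel
macros, and 'struct ring', ring_write() and ring_read() are hypothetical
names used only for illustration.  BUF_SIZE is shrunk to 8 so the
wrap-around is easy to follow.

```c
#include <assert.h>
#include <string.h>

#define BUF_SIZE 8	/* must be a power of two; the driver uses 2048 */

/* Hypothetical stand-in for the head/tail/buf fields of ehv_bc_data. */
struct ring {
	unsigned char buf[BUF_SIZE];
	unsigned int head;	/* next slot to write */
	unsigned int tail;	/* next slot to read */
};

/* Number of bytes currently stored (cf. CIRC_CNT). */
static unsigned int circ_cnt(unsigned int head, unsigned int tail)
{
	return (head - tail) & (BUF_SIZE - 1);
}

/* Stored bytes that can be read without wrapping (cf. CIRC_CNT_TO_END). */
static unsigned int circ_cnt_to_end(unsigned int head, unsigned int tail)
{
	unsigned int end = BUF_SIZE - tail;
	unsigned int n = (head + end) & (BUF_SIZE - 1);

	return n < end ? n : end;
}

/* Free bytes that can be written without wrapping (cf. CIRC_SPACE_TO_END).
 * One slot always stays empty so that head == tail means "empty".
 */
static unsigned int circ_space_to_end(unsigned int head, unsigned int tail)
{
	unsigned int end = BUF_SIZE - 1 - head;
	unsigned int n = (end + tail) & (BUF_SIZE - 1);

	return n <= end ? n : end + 1;
}

/* Copy in as much of 's' as fits; returns the number of bytes accepted. */
static unsigned int ring_write(struct ring *r, const unsigned char *s,
			       unsigned int count)
{
	unsigned int len, written = 0;

	while (1) {
		len = circ_space_to_end(r->head, r->tail);
		if (count < len)
			len = count;
		if (!len)
			break;
		memcpy(r->buf + r->head, s, len);
		r->head = (r->head + len) & (BUF_SIZE - 1);
		s += len;
		count -= len;
		written += len;
	}
	return written;
}

/* Drain up to 'max' bytes into 'out'; returns the number of bytes read. */
static unsigned int ring_read(struct ring *r, unsigned char *out,
			      unsigned int max)
{
	unsigned int len, got = 0;

	while (got < max) {
		len = circ_cnt_to_end(r->head, r->tail);
		if (len > max - got)
			len = max - got;
		if (!len)
			break;
		memcpy(out + got, r->buf + r->tail, len);
		r->tail = (r->tail + len) & (BUF_SIZE - 1);
		got += len;
	}
	return got;
}
```

With BUF_SIZE set to 8, a 10-byte write is truncated to 7 bytes, because
one slot is always kept free.  That is the same quantity that
ehv_bc_tty_write_room() reports through CIRC_SPACE in the patch.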

^ permalink raw reply related
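
The comment on ehv_bc_console_write() in the patch above explains that
the data is copied into a temporary buffer so that a '\r' can be
inserted before every '\n' without issuing one hcall per character.
Here is a rough userspace sketch of that batching, assuming a mock sink
in place of the real hypervisor call.  MAX_BYTES, mock_send(), sink and
console_write() are illustrative names, not part of the driver or of
the ePAPR hcall API.

```c
#include <assert.h>
#include <string.h>

#define MAX_BYTES 16	/* stand-in for EV_BYTE_CHANNEL_MAX_BYTES */

/* Hypothetical sink that records what would have been sent. */
static char sink[256];
static unsigned int sink_len;
static unsigned int send_calls;

/* Mock of the "send a chunk" step; returns 0 on success. */
static int mock_send(const char *s, unsigned int len)
{
	if (sink_len + len > sizeof(sink))
		return -1;
	memcpy(sink + sink_len, s, len);
	sink_len += len;
	send_calls++;
	return 0;
}

/* Expand LF to CR/LF and send the result in chunks of at most MAX_BYTES. */
static void console_write(const char *s, unsigned int count)
{
	char s2[MAX_BYTES];
	unsigned int i, j = 0;

	for (i = 0; i < count; i++) {
		char c = s[i];

		if (c == '\n')
			s2[j++] = '\r';
		s2[j++] = c;

		/* Flush while there is still room for a worst-case
		 * two-byte "\r\n" pair on the next iteration.
		 */
		if (j >= MAX_BYTES - 1) {
			if (mock_send(s2, j))
				return;
			j = 0;
		}
	}

	if (j)
		mock_send(s2, j);
}
```

The flush threshold is MAX_BYTES - 1 rather than MAX_BYTES because a
single input character can expand to two output bytes, so the buffer
must never start an iteration with fewer than two free slots.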

* Re: linux-next: build failure after merge of the final tree (tty tree related)
From: Timur Tabi @ 2011-08-25 15:22 UTC (permalink / raw)
  To: Greg KH; +Cc: Stephen Rothwell, linux-next, ppc-dev, linux-kernel
In-Reply-To: <20110825140820.GA9126@kroah.com>

Greg KH wrote:
>> > MSR_GS is defined in arch/powerpc/include/asm/reg_booke.h which is
>> > included by arch/powerpc/include/asm/reg.h but only when defined
>> > (CONFIG_BOOKE) || defined(CONFIG_40x).
> Thanks for the report.
> 
> Timur, care to send a fixup patch for this so this gets resolved?

Is there some trick to building allyesconfig on PowerPC?  When I do try that, I
get all sorts of weird build errors, and it dies long before it gets to my
driver.  I get stuff like:

  LD      arch/powerpc/sysdev/xics/built-in.o
WARNING: arch/powerpc/sysdev/xics/built-in.o(.text+0x1310): Section mismatch in
reference from the function .icp_native_init() to the function
.init.text:.icp_native_init_one_node()
The function .icp_native_init() references
the function __init .icp_native_init_one_node().
This is often because .icp_native_init lacks a __init
annotation or the annotation of .icp_native_init_one_node is wrong.

and

  AS      arch/powerpc/kernel/head_64.o
arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
arch/powerpc/kernel/exceptions-64s.S:1151: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1160: Error: attempt to move .org backwards
make[1]: *** [arch/powerpc/kernel/head_64.o] Error 1

I guess I don't have the right compiler.

Anyway, I think I know how to fix the break that Stephen is seeing.  I will post
a v4 patch in a few minutes.

-- 
Timur Tabi
Linux kernel developer at Freescale

^ permalink raw reply

* Re: linux-next: build failure after merge of the final tree (tty tree related)
From: Greg KH @ 2011-08-25 14:08 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: linux-next, ppc-dev, linux-kernel, Timur Tabi
In-Reply-To: <20110825161843.daf46b6b4023926fcfeec387@canb.auug.org.au>

On Thu, Aug 25, 2011 at 04:18:43PM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the final tree, today's linux-next build (powerpc
> allyesconfig) failed like this:
> 
> drivers/tty/ehv_bytechan.c: In function 'udbg_init_ehv_bc':
> drivers/tty/ehv_bytechan.c:230:18: error: 'MSR_GS' undeclared (first use in this function)
> drivers/tty/ehv_bytechan.c: In function 'ehv_bc_console_write':
> drivers/tty/ehv_bytechan.c:289:24: warning: cast from pointer to integer of different size
> drivers/tty/ehv_bytechan.c: In function 'ehv_bc_console_init':
> drivers/tty/ehv_bytechan.c:355:24: warning: cast to pointer from integer of different size
> 
> Caused by commit dcd83aaff1c8 ("tty/powerpc: introduce the ePAPR embedded
> hypervisor byte channel driver").
> 
> MSR_GS is defined in arch/powerpc/include/asm/reg_booke.h which is
> included by arch/powerpc/include/asm/reg.h but only when defined
> (CONFIG_BOOKE) || defined(CONFIG_40x).

Thanks for the report.

Timur, care to send a fixup patch for this so this gets resolved?

greg k-h
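
For reference, the two "different size" warnings quoted above come from
casting directly between a pointer and a 32-bit integer on a 64-bit
build.  A common way to silence that class of warning is to go through
a pointer-sized integer type first.  The sketch below is a standalone
illustration of that idiom only; it is not necessarily what the
follow-up patch does, and it does not address the MSR_GS error.
'struct fake_console', store_handle() and load_handle() are
hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a structure carrying an opaque 'data' cookie. */
struct fake_console {
	void *data;
};

/* Store a 32-bit handle in the pointer-sized cookie. */
static void store_handle(struct fake_console *co, unsigned int handle)
{
	/* Widen to uintptr_t first so the integer matches the pointer size. */
	co->data = (void *)(uintptr_t)handle;
}

/* Recover the handle from the cookie. */
static unsigned int load_handle(const struct fake_console *co)
{
	/* Convert to uintptr_t first, then narrow explicitly. */
	return (unsigned int)(uintptr_t)co->data;
}
```

The round trip is lossless as long as the stored value fits in an
unsigned int, which holds for a 32-bit byte channel handle.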

^ permalink raw reply

* Re: linux-next: build failure after merge of the final tree (tty tree related)
From: Timur Tabi @ 2011-08-25 14:28 UTC (permalink / raw)
  To: Greg KH
  Cc: Stephen Rothwell, <linux-next@vger.kernel.org>, ppc-dev,
	<linux-kernel@vger.kernel.org>
In-Reply-To: <20110825140820.GA9126@kroah.com>



On Aug 25, 2011, at 9:08 AM, Greg KH <greg@kroah.com> wrote:

> On Thu, Aug 25, 2011 at 04:18:43PM +1000, Stephen Rothwell wrote:
>> 
> 
> Thanks for the report.
> 
> Timur, care to send a fixup patch for this so this gets resolved?

Yes, I will do it ASAP, probably within the next two hours.
> 

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Alexander Graf @ 2011-08-25 13:25 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	qemu-devel, iommu, chrisw, Alex Williamson, Avi Kivity,
	Anthony Liguori, linux-pci@vger.kernel.org, linuxppc-dev,
	benve@cisco.com
In-Reply-To: <20110825123146.GD1923@amd.com>



On 25.08.2011, at 07:31, Roedel, Joerg wrote:

> On Wed, Aug 24, 2011 at 11:07:46AM -0400, Alex Williamson wrote:
>> On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote:
> 

[...]

>> We need to try the polite method of attempting to hot unplug the device
>> from qemu first, which the current vfio code already implements.  We can
>> then escalate if it doesn't respond.  The current code calls abort in
>> qemu if the guest doesn't respond, but I agree we should also be
>> enforcing this at the kernel interface.  I think the problem with the
>> hard-unplug is that we don't have a good revoke mechanism for the mmio
>> mmaps.
> 
> For mmio we could stop the guest and replace the mmio region with a
> region that is filled with 0xff, no?

Sure, but that happens in user space. The question is how does kernel space enforce an MMIO region to not be mapped after the hotplug event occurred? Keep in mind that user space is pretty much untrusted here - it doesn't have to be QEMU. It could just as well be a generic user space driver. And that can just ignore hotplug events.


Alex



^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Roedel, Joerg @ 2011-08-25 12:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	qemu-devel, chrisw, iommu, Avi Kivity, Anthony Liguori,
	linux-pci@vger.kernel.org, linuxppc-dev, benve@cisco.com
In-Reply-To: <1314198467.2859.192.camel@bling.home>

On Wed, Aug 24, 2011 at 11:07:46AM -0400, Alex Williamson wrote:
> On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote:
> > On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote:
> > > On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:
> > 
> > > > Handling it through fds is a good idea. This makes sure that everything
> > > > belongs to one process. I am not really sure yet if we go the way to
> > > > just bind plain groups together or if we create meta-groups. The
> > > > meta-groups thing seems somewhat cleaner, though.
> > > 
> > > I'm leaning towards binding because we need to make it dynamic, but I
> > > don't really have a good picture of the lifecycle of a meta-group.
> > 
> > In my view the life-cycle of the meta-group is a subrange of the
> > qemu-instance's life-cycle.
> 
> I guess I mean the lifecycle of a super-group that's actually exposed as
> a new group in sysfs.  Who creates it?  How?  How are groups dynamically
> added and removed from the super-group?  The group merging makes sense
> to me because it's largely just an optimization that qemu will try to
> merge groups.  If it works, great.  If not, it manages them separately.
> When all the devices from a group are unplugged, unmerge the group if
> necessary.

Right. The super-group thing is an optimization.

> We need to try the polite method of attempting to hot unplug the device
> from qemu first, which the current vfio code already implements.  We can
> then escalate if it doesn't respond.  The current code calls abort in
> qemu if the guest doesn't respond, but I agree we should also be
> enforcing this at the kernel interface.  I think the problem with the
> hard-unplug is that we don't have a good revoke mechanism for the mmio
> mmaps.

For mmio we could stop the guest and replace the mmio region with a
region that is filled with 0xff, no?

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply

* [PATCH] powerpc/time: When starting the decrementer don't zero the other bits in TCR
From: Laurentiu Tudor @ 2011-08-25 12:19 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Laurentiu Tudor

Clearing the other TCR bits might break code that sets them (e.g. to set up
the watchdog or fixed interval timer) before start_cpu_decrementer() gets
called.

Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
---
 arch/powerpc/kernel/time.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 03b29a6..e8b5cdc 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -721,7 +721,7 @@ void start_cpu_decrementer(void)
 	mtspr(SPRN_TSR, TSR_ENW | TSR_WIS | TSR_DIS | TSR_FIS);
 
 	/* Enable decrementer interrupt */
-	mtspr(SPRN_TCR, TCR_DIE);
+	mtspr(SPRN_TCR, mfspr(SPRN_TCR) | TCR_DIE);
 #endif /* defined(CONFIG_BOOKE) || defined(CONFIG_40x) */
 }
 
-- 
1.7.1
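
The difference between the two writes can be sketched outside the kernel as
below. This is an illustration only: the register is modelled as a plain
value, the two function names are hypothetical, and the bit values are
assumed to match the Book-E definitions in the kernel's reg_booke.h.

```c
#include <stdint.h>

/* Book-E TCR bits; values are assumptions, see the note above. */
#define TCR_WIE 0x08000000u /* watchdog interrupt enable */
#define TCR_DIE 0x04000000u /* decrementer interrupt enable */
#define TCR_FIE 0x00800000u /* fixed-interval timer interrupt enable */

/* Old behaviour: overwrite the register, dropping every other bit. */
static uint32_t tcr_enable_dec_clobber(uint32_t tcr)
{
	(void)tcr; /* the previous contents are ignored */
	return TCR_DIE;
}

/* Patched behaviour: read-modify-write keeps the bits that were set. */
static uint32_t tcr_enable_dec_rmw(uint32_t tcr)
{
	return tcr | TCR_DIE;
}
```

If the watchdog or the fixed-interval timer was enabled before the call, only
the second variant leaves TCR_WIE and TCR_FIE intact.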

^ permalink raw reply related

* When set mtu 9600 by gfar_change_mtu, the maxfrm register is greater than 9600
From: Rongqing Li @ 2011-08-25  9:24 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: netdev

Hi:

When the MTU is set to 9600 by gfar_change_mtu() in gianfar.c, the maxfrm
register is set to 9728, which is greater than 9600.

But the MPC8315 Reference Manual says the value of maxfrm cannot be
greater than 9600.

Is this a defect? Do we need to fix it?
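
One way the value 9728 could arise is sketched below. It is an assumption,
not something checked against the driver source: the sketch supposes that the
driver adds the 14-byte Ethernet header to the MTU and then rounds up to a
512-byte buffer increment before writing maxfrm. The constant names and
round_frame_size() are hypothetical.

```c
/* Assumed constants, see the note above. */
#define ETH_HLEN_BYTES          14u   /* Ethernet header length */
#define INCREMENTAL_BUFFER_SIZE 512u  /* assumed rounding granularity */

/*
 * Bump the frame size to the next multiple of INCREMENTAL_BUFFER_SIZE.
 * A size that is already aligned still moves up by one full increment.
 */
static unsigned int round_frame_size(unsigned int mtu)
{
	unsigned int frame_size = mtu + ETH_HLEN_BYTES;

	return (frame_size & ~(INCREMENTAL_BUFFER_SIZE - 1)) +
	       INCREMENTAL_BUFFER_SIZE;
}
```

Under these assumptions an MTU of 9600 gives 9614, which is masked down to
9216 and then raised by 512 to 9728. VLAN tags and any extra padding are
ignored in this sketch.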

-- 
Best Regards,
Roy | RongQing Li

^ permalink raw reply

* Re: [PATCH v3] mtd/nand : workaround for Freescale FCM to support large-page Nand chip
From: Matthieu CASTET @ 2011-08-25 11:25 UTC (permalink / raw)
  To: LiuShuo
  Cc: Scott Wood, linuxppc-dev@ozlabs.org, dwmw2@infradead.org,
	Li Yang-R58472, linux-mtd@lists.infradead.org
In-Reply-To: <4E546672.3070100@freescale.com>

Hi,

LiuShuo wrote:
> On 2011-08-23 18:02, Matthieu CASTET wrote:
>> LiuShuo wrote:
>>> On 2011-08-19 00:25, Scott Wood wrote:
>>>> On 08/17/2011 09:33 PM, b35362@freescale.com wrote:
>>>>> From: Liu Shuo<b35362@freescale.com>
>>>>>
>>>>> Freescale FCM controller has a 2K size limitation of buffer RAM. In order
>>>>> to support the Nand flash chip whose page size is larger than 2K bytes,
>>>>> we divide a page into multi-2K pages for MTD layer driver. In that case,
>>>>> we force to set the page size to 2K bytes. We convert the page address of
>>>>> MTD layer driver to a real page address in flash chips and a column index
>>>>> in fsl_elbc driver. We can issue any column address by UA instruction of
>>>>> elbc controller.
>>>>>
>>>>> NOTE: Because of the limit on the 'Number of Partial Program Cycles in
>>>>> the Same Page (NOP)', a flash chip supported by this workaround
>>>>> has to meet the conditions below.
>>>>> 	1. page size is not greater than 4KB
>>>>> 	2.	1) if main area and spare area have independent NOPs:
>>>>> 			  main  area NOP    :>=3
>>>>> 			  spare area NOP    :>=2?
>>>> How often are the NOPs split like this?
>>>>
>>>>> 		2) if main area and spare area have a common NOP:
>>>>> 			  NOP               :>=4
>>>> This depends on how the flash is used.  If you treat it as a NOP1 flash
>>>> (e.g. run ubifs rather than jffs2), then you need NOP2 for a 4K chip and
>>>> NOP4 for an 8K chip.  OTOH, if you would be making full use of NOP4 on a
>>>> real 2K chip, you'll need NOP8 for a 4K chip.
>>>>
>>>> The NOP restrictions should be documented in the code itself, not just
>>>> in the git changelog.  Maybe print it to the console when this hack is
>>>> used, along with the NOP value read from the ID.
>>> We can't read the NOP from the ID on any chip. Some chips don't
>>> give this information (e.g. Micron MT29F4G08BAC).
>> Doesn't the Micron chip provide it in the ONFI info?
> Sorry, I did not express that clearly.
> We can get the NOP info from the datasheet, but can't get it with the READID
> command in code.
> 
OK, I was thinking the Micron chip was a 4K NAND, but it is an old 2K one. Why
do you want the NOP from it?


Also, can you reply to my question about the sequence you use when trying to
read 4K with one command?


Thanks


Matthieu

^ permalink raw reply

* Re: [PATCH v3] mtd/nand : workaround for Freescale FCM to support large-page Nand chip
From: Artem Bityutskiy @ 2011-08-25 11:18 UTC (permalink / raw)
  To: Scott Wood
  Cc: Li Yang-R58472, LiuShuo, Matthieu CASTET, linuxppc-dev@ozlabs.org,
	linux-mtd@lists.infradead.org, dwmw2@infradead.org
In-Reply-To: <4E53D15D.2050807@freescale.com>

On Tue, 2011-08-23 at 11:12 -0500, Scott Wood wrote:
> On 08/23/2011 05:02 AM, Matthieu CASTET wrote:
> > LiuShuo wrote:
> >> We can't read the NOP from the ID on any chip. Some chips don't
> >> give this information (e.g. Micron MT29F4G08BAC).
> 
> Are there any 4K+ chips (especially ones with insufficient NOP) that
> don't have the info?
> 
> This chip is 2K and NOP8.
> 
> Is there an easy way (without needing to have every datasheet for every
> chip ever made) to determine at runtime which chips supply this information?
> 
> > Doesn't the Micron chip provide it in the ONFI info?
> 
> This chip doesn't appear to be ONFI.

A few quick thoughts.

1. I think that if the driver is able to detect the flash NOP parameter and
refuse flashes with too low a NOP, then your change is OK.
2. For ONFI flashes, can we take the NOP from the ONFI info?
3. For non-ONFI chips, is it fair to conclude that MLCs _all_ have NOP 1?
Can we distinguish between MLC and SLC? If not, can this table help:
http://www.linux-mtd.infradead.org/nand-data/nanddata.html ? If needed,
can we put "bits-per-cell" data into 'struct nand_flash_dev
nand_flash_ids'?
4. Can we add a NOP field to the 'struct nand_flash_dev nand_flash_ids'
array?
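
A rough sketch of the check from point 1, using the arithmetic Scott gave
earlier in the thread (a 4K chip treated as NOP1 needs NOP2, an 8K chip needs
NOP4, and full use of NOP4 on 2K virtual pages needs NOP8 on a 4K chip). The
helper names are hypothetical, and how the chip's NOP value is obtained (ONFI
info, a table entry) is left open here.

```c
#define FCM_BUF_SIZE 2048u /* FCM buffer RAM limit from the patch description */

/*
 * Hypothetical helper: number of partial programs one real page must
 * tolerate when it is exposed as several 2K virtual pages, each of which
 * may be programmed up to nop_per_virtual_page times.
 */
static unsigned int required_nop(unsigned int real_page_size,
				 unsigned int nop_per_virtual_page)
{
	return (real_page_size / FCM_BUF_SIZE) * nop_per_virtual_page;
}

/* Returns 1 if the chip's NOP is sufficient, 0 if it should be refused. */
static int chip_nop_ok(unsigned int chip_nop, unsigned int real_page_size,
		       unsigned int nop_per_virtual_page)
{
	return chip_nop >= required_nop(real_page_size, nop_per_virtual_page);
}
```

A driver could run such a check at probe time and refuse the chip, or at
least print the values to the console, when it fails.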

-- 
Best Regards,
Artem Bityutskiy

^ permalink raw reply

* Re: Kernel boot up
From: Gary Thomas @ 2011-08-25 11:11 UTC (permalink / raw)
  To: smitha.vanga; +Cc: linuxppc-dev
In-Reply-To: <07ACDFB8ECA8EF47863A613BC01BBB22035E3A70@HYD-MKD-MBX02.wipro.com>

On 2011-08-25 01:57, smitha.vanga@wipro.com wrote:
> Hi Scott,
>
> I am currently trying to bring up the 2.6.39 kernel on a target based on the MPC8247
> processor, using the attached .dts file. I get the logs below while the kernel is booting.
> I see that the unflattening of the device tree and the initial loading of the kernel and ramdisk file system are happening correctly. Can you point me to where exactly I can look for
> this issue? I am attaching the .config and .dts file I am using.

There doesn't seem to be anything wrong with the kernel in this log.
The failure is in the user code - in particular, udev is giving up
with these errors:
   /sbin/udevstart: '/lib/libc.so.6' library contains unsupported TLS
After that, all is lost as there will be no console, etc, for the
rest of the system to use...

You need to examine how you built the root file system and why you
are getting these errors.  This problem seems to be unique to uClibc.

>
>
> bootm 1000000 2000000 c00000
> ## Current stack ends at 0x03e93cc8
> * kernel: cmdline image address = 0x01000000
> ## Booting kernel from Legacy Image at 01000000 ...
> Image Name: Linux-2.6.39
> Image Type: PowerPC Linux Kernel Image (gzip compressed)
> Data Size: 1766015 Bytes = 1.7 MiB
> Load Address: 00000000
> Entry Point: 00000000
> Verifying Checksum ... OK
> kernel data at 0x01000040, len = 0x001af27f (1766015)
> * ramdisk: cmdline image address = 0x02000000
> ## Loading init Ramdisk from Legacy Image at 02000000 ...
> Image Name:
> Image Type: PowerPC Linux RAMDisk Image (gzip compressed)
> Data Size: 2211111 Bytes = 2.1 MiB
> Load Address: 00000000
> Entry Point: 00000000
> Verifying Checksum ... OK
> ramdisk start = 0x02000040, ramdisk end = 0x0221bd67
> * fdt: cmdline image address = 0x00c00000
> ## Checking for 'FDT'/'FDT Image' at 00c00000
> * fdt: raw FDT blob
> ## Flattened Device Tree blob at 00c00000
> Booting using the fdt blob at 0xc00000
> of_flat_tree at 0x00c00000 size 0x00000f12
> Uncompressing Kernel Image ... OK
> kernel loaded at 0x00000000, end = 0x00389d20
> ## initrd_high = 0xffffffff, copy_to_ram = 1
> Loading Ramdisk to 03c76000, end 03e91d27 ... OK
> ramdisk load start = 0x03c76000, ramdisk load end = 0x03e91d27
> ## device tree at 00c00000 ... 00c00f11 (len=16146 [0x3F12])
> Loading Device Tree to 007fc000, end 007fff11 ... OK
> Updating property 'clock-frequency' = 00 fe 70 b8
> Updating property 'bus-frequency' = 03 f9 c2 e0
> Updating property 'timebase-frequency' = 00 7f 38 5c
> Updating property 'clock-frequency' = 09 f0 67 30
> ## Transferring control to Linux (at address 00000000) ...
> Booting using OF flat tree...
> Using Freescale MPC8272 ADS machine description
> Linux version 2.6.39 (2.6.39) (ktuser@ktuser) (gcc version 4.4.5 (Buildroot 2011
> .02) ) #5 Wed Aug 24 15:02:07 IST 2011
> Found initrd at 0xc3c76000:0xc3e91d27
> No bcsr in device tree
> Zone PFN ranges:
> DMA 0x00000000 -> 0x00004000
> Normal empty
> Movable zone start PFN for each node
> early_node_map[1] active PFN ranges
> 0: 0x00000000 -> 0x00004000
> Built 1 zonelists in Zone order, mobility grouping on. Total pages: 16256
> Kernel command line: mem=64M root=/dev/ram rw
> PID hash table entries: 256 (order: -2, 1024 bytes)
> Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
> Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
> Memory: 57972k/65536k available (3524k kernel code, 7564k reserved, 100k data, 1
> 137k bss, 168k init)
> Kernel virtual memory layout:
> * 0xfffdf000..0xfffff000 : fixmap
> * 0xfdfb6000..0xfe000000 : early ioremap
> * 0xc5000000..0xfdfb6000 : vmalloc & ioremap
> SLUB: Genslabs=15, HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
> NR_IRQS:512 nr_irqs:512 16
> No pci pic node in device tree.
> clocksource: timebase mult[1dfc2974] shift[22] registered
> console [ttyCPM0] enabled
> pid_max: default: 32768 minimum: 301
> Mount-cache hash table entries: 512
> NET: Registered protocol family 16
> PCI: Probing PCI hardware
> bio: create slab <bio-0> at 0
> vgaarb: loaded
> Switching to clocksource timebase
>
> brd: module loaded
> loop: module loaded
> of-flash ff800000.flash: do_map_probe() failed
> PPP generic driver version 2.4.2
> PPP Deflate Compression module registered
> tun: Universal TUN/TAP device driver, 1.6
> tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
> eth0: fs_enet: 00:00:00:00:00:00
> eth1: fs_enet: 00:00:00:00:00:00
> CPM2 Bitbanged MII: probed
> mousedev: PS/2 mouse device common for all mice
> TCP cubic registered
> NET: Registered protocol family 10
> IPv6 over IPv4 tunneling driver
> NET: Registered protocol family 17
> Freeing unused kernel memory: 168k init
> Populating /dev using udev: /sbin/udevd: '/lib/libc.so.6' library contains unsup
> ported TLS
> /sbin/udevd: '/lib/libc.so.6' library contains unsupported TLS
> /sbin/udevd: can't load library 'libc.so.6'
> FAIL
> /sbin/udevstart: '/lib/libc.so.6' library contains unsupported TLS
> /sbin/udevstart: '/lib/libc.so.6' library contains unsupported TLS
> /sbin/udevstart: can't load library 'libc.so.6'
> FAIL
> done
> Starting network...
>
> Regards,
>
> Smitha
>

-- 
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------

^ permalink raw reply

* Re: [PATCH v3] mtd/nand : workaround for Freescale FCM to support large-page Nand chip
From: Artem Bityutskiy @ 2011-08-25 11:06 UTC (permalink / raw)
  To: Scott Wood
  Cc: linuxppc-dev@ozlabs.org, linux-mtd@lists.infradead.org, LiuShuo,
	dwmw2@infradead.org, Matthieu CASTET
In-Reply-To: <4E527C91.6080009@freescale.com>

On Mon, 2011-08-22 at 10:58 -0500, Scott Wood wrote:
> On 08/22/2011 05:58 AM, Artem Bityutskiy wrote:
> > On Fri, 2011-08-19 at 13:10 -0500, Scott Wood wrote:
> >> On 08/19/2011 03:57 AM, Matthieu CASTET wrote:
> >>> How are the bad block markers handled with this remapping?
> >>
> >> It has to be migrated prior to first use (this needs to be documented,
> >> and ideally a U-Boot command provided to do this), or else special
> >> handling would be needed when building the BBT.  The only way around
> >> this would be to do ECC in software, and do the buffering needed to let
> >> MTD treat it as a 4K chip.
> > 
> > It really feels like a special hack which would better not go to
> > mainline - am I the only one with such feeling? If yes, probably I am
> > wrong...
> 
> While the implementation is (of necessity) a hack, the feature is
> something that multiple people have been asking for (it's not a special
> case for a specific user).  They say 2K chips are getting more difficult
> to obtain.  It doesn't change anything for people using 512/2K chips,
> and (in its current form) doesn't introduce significant complexity to
> the driver.  I'm not sure how maintaining it out of tree would be a
> better situation for anyone.

I am just afraid that (a) other drivers will do this and (b) this will start
causing various weird bug reports...

-- 
Best Regards,
Artem Bityutskiy

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Roedel, Joerg @ 2011-08-25 11:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, David Gibson, chrisw,
	iommu, Avi Kivity, Anthony Liguori, linuxppc-dev, benve@cisco.com
In-Reply-To: <1314197775.2859.182.camel@bling.home>

On Wed, Aug 24, 2011 at 10:56:13AM -0400, Alex Williamson wrote:
> On Wed, 2011-08-24 at 10:43 +0200, Joerg Roedel wrote:
> > A side-note: Might it be better to expose assigned devices in a guest on
> > a separate bus? This will make it easier to emulate an IOMMU for the
> > guest inside qemu.
> 
> I think we want that option, sure.  A lot of guests aren't going to
> support hotplugging buses though, so I think our default, map the entire
> guest model should still be using bus 0.  The ACPI gets a lot more
> complicated for that model too; dynamic SSDTs?  Thanks,

Ok, if only AMD-Vi is to be emulated then it is not strictly
necessary. For this IOMMU we can specify that devices on the same bus
belong to different IOMMUs. So we can implement one IOMMU that handles
internal qemu devices and one that handles pass-through devices.
I am not sure if this is possible with VT-d too. VT-d emulation would
also require emulation of a PCIe bridge for the devices, no?

Joerg


^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Roedel, Joerg @ 2011-08-25 10:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	qemu-devel, Aaron Fabbri, iommu, Avi Kivity, Anthony Liguori,
	linux-pci@vger.kernel.org, linuxppc-dev, benve@cisco.com
In-Reply-To: <1314220434.2859.203.camel@bling.home>

Hi Alex,

On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:
> Is this roughly what you're thinking of for the iommu_group component?
> Adding a dev_to_group iommu ops callback lets us consolidate the sysfs
> support in the iommu base.  Would AMD-Vi do something similar (or
> exactly the same) for group #s?  Thanks,

The concept looks good, I have some comments, though. On AMD-Vi the
implementation would look a bit different because there is a
data structure where the information can be gathered from, so no need for
PCI bus scanning there.

> diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
> index 6e6b6a1..6b54c1a 100644
> --- a/drivers/base/iommu.c
> +++ b/drivers/base/iommu.c
> @@ -17,20 +17,56 @@
>   */
>  
>  #include <linux/bug.h>
> +#include <linux/device.h>
>  #include <linux/types.h>
>  #include <linux/module.h>
>  #include <linux/slab.h>
>  #include <linux/errno.h>
>  #include <linux/iommu.h>
> +#include <linux/pci.h>
>  
>  static struct iommu_ops *iommu_ops;
>  
> +static ssize_t show_iommu_group(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%lx", iommu_dev_to_group(dev));

Probably add a 0x prefix so userspace knows the format?
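
A small sketch of why the prefix helps, written as plain user-space C rather
than kernel code. format_group() and parse_group() are hypothetical names; the
point is only that strtoul() with base 0 picks hex when it sees "0x", while an
unprefixed string of digits is read as decimal.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical formatter mirroring a sysfs show() routine with a 0x prefix. */
static int format_group(char *buf, size_t len, unsigned long group)
{
	return snprintf(buf, len, "0x%lx\n", group);
}

/* Base 0 lets strtoul() select hex from the 0x prefix. */
static unsigned long parse_group(const char *buf)
{
	return strtoul(buf, NULL, 0);
}
```

Without the prefix, a group printed as "310" would parse as decimal 310
instead of 0x310 unless userspace already knows the format.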

> +}
> +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
> +
> +static int add_iommu_group(struct device *dev, void *unused)
> +{
> +	if (iommu_dev_to_group(dev) >= 0)
> +		return device_create_file(dev, &dev_attr_iommu_group);
> +
> +	return 0;
> +}
> +
> +static int device_notifier(struct notifier_block *nb,
> +			   unsigned long action, void *data)
> +{
> +	struct device *dev = data;
> +
> +	if (action == BUS_NOTIFY_ADD_DEVICE)
> +		return add_iommu_group(dev, NULL);
> +
> +	return 0;
> +}
> +
> +static struct notifier_block device_nb = {
> +	.notifier_call = device_notifier,
> +};
> +
>  void register_iommu(struct iommu_ops *ops)
>  {
>  	if (iommu_ops)
>  		BUG();
>  
>  	iommu_ops = ops;
> +
> +	/* FIXME - non-PCI, really want for_each_bus() */
> +	bus_register_notifier(&pci_bus_type, &device_nb);
> +	bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
>  }

We need to solve this differently. ARM is starting to use the iommu-api
too and this definitely does not work there. One possible solution might
be to make the iommu-ops per-bus.
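
Roughly what per-bus ops could look like, as a stand-alone mock-up. All types
and function names below are made up for illustration and are not existing
kernel interfaces; the sketch only shows the ops pointer moving from a single
global into the bus structure.

```c
#include <stddef.h>

/* Mock types for illustration only; these are not the kernel's structures. */
struct mock_iommu_ops {
	const char *name;
};

struct mock_bus {
	const char *name;
	const struct mock_iommu_ops *iommu_ops; /* NULL: no IOMMU on this bus */
};

/* Hypothetical per-bus registration replacing the single global pointer. */
static int mock_bus_set_iommu(struct mock_bus *bus,
			      const struct mock_iommu_ops *ops)
{
	if (bus->iommu_ops)
		return -1; /* already registered; refuse instead of BUG() */

	bus->iommu_ops = ops;
	return 0;
}

/* Lookups go through the device's bus instead of a global. */
static const struct mock_iommu_ops *mock_bus_iommu(const struct mock_bus *bus)
{
	return bus->iommu_ops;
}
```

With this shape the PCI bus and, say, a platform bus on ARM can each carry
their own ops, and the notifier registration can be done per bus as well.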

>  bool iommu_found(void)
> @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
>  }
>  EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
>  
> +long iommu_dev_to_group(struct device *dev)
> +{
> +	if (iommu_ops->dev_to_group)
> +		return iommu_ops->dev_to_group(dev);
> +	return -ENODEV;
> +}
> +EXPORT_SYMBOL_GPL(iommu_dev_to_group);

Please rename this to iommu_device_group(). The dev_to_group name
suggests a conversion but it is actually just a property of the device.
Also the return type should not be long but something that fits into
32bit on all platforms. Since you use -ENODEV, probably s32 is a good
choice.
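
To illustrate the interaction between a signed 32-bit return type and the
seg|bus|devfn encoding used in the patch, here is a stand-alone sketch. The
packing order (segment in the top 16 bits) and both helper names are
assumptions made for the example; shifts are used instead of the union so the
result does not depend on endianness.

```c
#include <stdint.h>

/*
 * Hypothetical packing of a PCI address into a 32-bit group number:
 * segment in bits 31..16, bus in bits 15..8, devfn in bits 7..0.
 */
static uint32_t pack_group(uint16_t segment, uint8_t bus, uint8_t devfn)
{
	return ((uint32_t)segment << 16) | ((uint32_t)bus << 8) | devfn;
}

/*
 * Returns 1 if the packed value has its top bit set, i.e. it would be
 * negative when viewed as a signed 32-bit number and could be mistaken
 * for an error code such as -ENODEV.
 */
static int group_looks_like_error(uint32_t group)
{
	return (group & 0x80000000u) != 0;
}
```

With this layout any segment number of 0x8000 or above produces a value that
is indistinguishable from an error, which is the case the FIXME in the quoted
patch points at.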

> +
>  int iommu_map(struct iommu_domain *domain, unsigned long iova,
>  	      phys_addr_t paddr, int gfp_order, int prot)
>  {
> diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
> index f02c34d..477259c 100644
> --- a/drivers/pci/intel-iommu.c
> +++ b/drivers/pci/intel-iommu.c
> @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
>  static int dmar_forcedac;
>  static int intel_iommu_strict;
>  static int intel_iommu_superpage = 1;
> +static int intel_iommu_no_mf_groups;
>  
>  #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
>  static DEFINE_SPINLOCK(device_domain_lock);
> @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
>  			printk(KERN_INFO
>  				"Intel-IOMMU: disable supported super page\n");
>  			intel_iommu_superpage = 0;
> +		} else if (!strncmp(str, "no_mf_groups", 12)) {
> +			printk(KERN_INFO
> +				"Intel-IOMMU: disable separate groups for multifunction devices\n");
> +			intel_iommu_no_mf_groups = 1;

This should really be a global iommu option and not be VT-d specific.

>  
>  		str += strcspn(str, ",");
> @@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
>  	return 0;
>  }
>  
> +/* Group numbers are arbitrary.  Device with the same group number
> + * indicate the iommu cannot differentiate between them.  To avoid
> + * tracking used groups we just use the seg|bus|devfn of the lowest
> + * level we're able to differentiate devices */
> +static long intel_iommu_dev_to_group(struct device *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct pci_dev *bridge;
> +	union {
> +		struct {
> +			u8 devfn;
> +			u8 bus;
> +			u16 segment;
> +		} pci;
> +		u32 group;
> +	} id;
> +
> +	if (iommu_no_mapping(dev))
> +		return -ENODEV;
> +
> +	id.pci.segment = pci_domain_nr(pdev->bus);
> +	id.pci.bus = pdev->bus->number;
> +	id.pci.devfn = pdev->devfn;
> +
> +	if (!device_to_iommu(id.pci.segment, id.pci.bus, id.pci.devfn))
> +		return -ENODEV;
> +
> +	bridge = pci_find_upstream_pcie_bridge(pdev);
> +	if (bridge) {
> +		if (pci_is_pcie(bridge)) {
> +			id.pci.bus = bridge->subordinate->number;
> +			id.pci.devfn = 0;
> +		} else {
> +			id.pci.bus = bridge->bus->number;
> +			id.pci.devfn = bridge->devfn;
> +		}
> +	}
> +
> +	/* Virtual functions always get their own group */
> +	if (!pdev->is_virtfn && intel_iommu_no_mf_groups)
> +		id.pci.devfn = PCI_DEVFN(PCI_SLOT(id.pci.devfn), 0);
> +
> +	/* FIXME - seg # >= 0x8000 on 32b */
> +	return id.group;
> +}

This looks like code duplication in the VT-d driver. It doesn't need to
be generalized now, but we should keep in mind to find a more general
solution later.
Maybe it would be beneficial if the IOMMU drivers only set up the number in
dev->arch.iommu.groupid and the iommu-api then fetches it from there.
But as I said, this is some more work and does not need to be done for
this patch(-set).

> +
>  static struct iommu_ops intel_iommu_ops = {
>  	.domain_init	= intel_iommu_domain_init,
>  	.domain_destroy = intel_iommu_domain_destroy,
> @@ -3911,6 +3962,7 @@ static struct iommu_ops intel_iommu_ops = {
>  	.unmap		= intel_iommu_unmap,
>  	.iova_to_phys	= intel_iommu_iova_to_phys,
>  	.domain_has_cap = intel_iommu_domain_has_cap,
> +	.dev_to_group	= intel_iommu_dev_to_group,
>  };
>  
>  static void __devinit quirk_iommu_rwbf(struct pci_dev *dev)
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 0a2ba40..90c1a86 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -45,6 +45,7 @@ struct iommu_ops {
>  				    unsigned long iova);
>  	int (*domain_has_cap)(struct iommu_domain *domain,
>  			      unsigned long cap);
> +	long (*dev_to_group)(struct device *dev);
>  };
>  
>  #ifdef CONFIG_IOMMU_API
> @@ -65,6 +66,7 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
>  				      unsigned long iova);
>  extern int iommu_domain_has_cap(struct iommu_domain *domain,
>  				unsigned long cap);
> +extern long iommu_dev_to_group(struct device *dev);
>  
>  #else /* CONFIG_IOMMU_API */
>  
> @@ -121,6 +123,10 @@ static inline int domain_has_cap(struct iommu_domain *domain,
>  	return 0;
>  }
>  
> +static inline long iommu_dev_to_group(struct device *dev)
> +{
> +	return -ENODEV;
> +}
>  #endif /* CONFIG_IOMMU_API */
>  
>  #endif /* __LINUX_IOMMU_H */
> 
> 
> 


^ permalink raw reply
