LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: kvm PCI assignment & VFIO ramblings
From: Alex Williamson @ 2011-08-23 19:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, David Gibson, chrisw,
	iommu, Avi Kivity, Anthony Liguori, linuxppc-dev, benve
In-Reply-To: <1314046904.7662.37.camel@pasglop>

On Tue, 2011-08-23 at 07:01 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2011-08-22 at 09:45 -0600, Alex Williamson wrote:
> 
> > Yes, that's the idea.  An open question I have towards the configuration
> > side is whether we might add iommu driver specific options to the
> > groups.  For instance on x86 where we typically have B:D.F granularity,
> > should we have an option not to trust multi-function devices and use a
> > B:D granularity for grouping?
> 
> Or even B or range of busses... if you want to enforce strict isolation
> you really can't trust anything below a bus level :-)
> 
> > Right, we can also combine models.  Binding a device to vfio
> > creates /dev/vfio$GROUP, which only allows a subset of ioctls and no
> > device access until all the group devices are also bound.  I think
> > the /dev/vfio/$GROUP might help provide an enumeration interface as well
> > though, which could be useful.
> 
> Could be tho in what form ? returning sysfs pathes ?

I'm at a loss there, please suggest.  I think we need an ioctl that
returns some kind of array of devices within the group and another that
maybe takes an index from that array and returns an fd for that device.
A sysfs path string might be a reasonable array element, but it sounds
like a pain to work with.

> > 1:1 group<->process is probably too strong.  Not allowing concurrent
> > open()s on the group file enforces a single userspace entity is
> > responsible for that group.  Device fds can be passed to other
> > processes, but only retrieved via the group fd.  I suppose we could even
> > branch off the dma interface into a different fd, but it seems like we
> > would logically want to serialize dma mappings at each iommu group
> > anyway.  I'm open to alternatives, this just seemed an easy way to do
> > it.  Restricting on UID implies that we require isolated qemu instances
> > to run as different UIDs.  I know that's a goal, but I don't know if we
> > want to make it an assumption in the group security model.
> 
> 1:1 process has the advantage of linking to an -mm which makes the whole
> mmu notifier business doable. How do you want to track down mappings and
> do the second level translation in the case of explicit map/unmap (like
> on power) if you are not tied to an mm_struct ?

Right, I threw away the mmu notifier code that was originally part of
vfio because we can't do anything useful with it yet on x86.  I
definitely don't want to prevent it where it makes sense though.  Maybe
we just record current->mm on open and restrict subsequent opens to the
same.

> > Yes.  I'm not sure there's a good ROI to prioritize that model.  We have
> > to assume >1 device per guest is a typical model and that the iotlb is
> > large enough that we might improve thrashing to see both a resource and
> > performance benefit from it.  I'm open to suggestions for how we could
> > include it though.
> 
> Sharing may or may not be possible depending on setups so yes, it's a
> bit tricky.
> 
> My preference is to have a static interface (and that's actually where
> your pet netlink might make some sense :-) to create "synthetic" groups
> made of other groups if the arch allows it. But that might not be the
> best approach. In another email I also proposed an option for a group to
> "capture" another one...

I already made some comments on this in a different thread, so I won't
repeat here.

> > > If that's
> > > not what you're saying, how would the domains - now made up of a
> > > user's selection of groups, rather than individual devices - be
> > > configured?
> > > 
> > > > Hope that captures it, feel free to jump in with corrections and
> > > > suggestions.  Thanks,
> > > 
> 
> Another aspect I don't see discussed is how we represent these things to
> the guest.
> 
> On Power for example, I have a requirement that a given iommu domain is
> represented by a single dma window property in the device-tree. What
> that means is that that property needs to be either in the node of the
> device itself if there's only one device in the group or in a parent
> node (ie a bridge or host bridge) if there are multiple devices.
> 
> Now I do -not- want to go down the path of simulating P2P bridges,
> besides we'll quickly run out of bus numbers if we go there.
> 
> For us the most simple and logical approach (which is also what pHyp
> uses and what Linux handles well) is really to expose a given PCI host
> bridge per group to the guest. Believe it or not, it makes things
> easier :-)

I'm all for easier.  Why does exposing the bridge use less bus numbers
than emulating a bridge?

On x86, I want to maintain that our default assignment is at the device
level.  A user should be able to pick single or multiple devices from
across several groups and have them all show up as individual,
hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
also seen cases where users try to attach a bridge to the guest,
assuming they'll get all the devices below the bridge, so I'd be in
favor of making this "just work" if possible too, though we may have to
prevent hotplug of those.

Given the device requirement on x86 and since everything is a PCI device
on x86, I'd like to keep a qemu command line something like -device
vfio,host=00:19.0.  I assume that some of the iommu properties, such as
dma window size/address, will be query-able through an architecture
specific (or general if possible) ioctl on the vfio group fd.  I hope
that will help the specification, but I don't fully understand what all
remains.  Thanks,

Alex

^ permalink raw reply

* Re: [PATCH part1 v2 1/9] Add udbg driver using the PS3 gelic Ethernet device
From: Geoff Levand @ 2011-08-23 20:53 UTC (permalink / raw)
  To: Andre Heider; +Cc: cbe-oss-dev, Hector Martin, linuxppc-dev
In-Reply-To: <1313091073-4572-2-git-send-email-a.heider@gmail.com>

Hi,

We had some questions as to why we have this totally separate driver
from the gelic driver, so I think it worthwhile to have an
explanation of why in the commit log.   Otherwise, the code looks
OK.

-Geoff

> From: Hector Martin <hector@marcansoft.com>
> 
> Signed-off-by: Hector Martin <hector@marcansoft.com>
> [a.heider: Various cleanups to make checkpatch.pl happy]
> Signed-off-by: Andre Heider <a.heider@gmail.com>
> ---
>  arch/powerpc/Kconfig.debug              |    8 +
>  arch/powerpc/include/asm/udbg.h         |    1 +
>  arch/powerpc/kernel/udbg.c              |    2 +
>  arch/powerpc/platforms/ps3/Kconfig      |   12 ++
>  arch/powerpc/platforms/ps3/Makefile     |    1 +
>  arch/powerpc/platforms/ps3/gelic_udbg.c |  273 +++++++++++++++++++++++++++++++
>  drivers/net/ps3_gelic_net.c             |    3 +
>  drivers/net/ps3_gelic_net.h             |    6 +
>  8 files changed, 306 insertions(+), 0 deletions(-)
>  create mode 100644 arch/powerpc/platforms/ps3/gelic_udbg.c
> 
> diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
> index 067cb84..ab2335f 100644
> --- a/arch/powerpc/Kconfig.debug
> +++ b/arch/powerpc/Kconfig.debug
> @@ -258,6 +258,14 @@ config PPC_EARLY_DEBUG_WSP
>  	depends on PPC_WSP
>  	select PPC_UDBG_16550
>  
> +config PPC_EARLY_DEBUG_PS3GELIC
> +	bool "Early debugging through the PS3 Ethernet port"
> +	depends on PPC_PS3
> +	select PS3GELIC_UDBG
> +	help
> +	  Select this to enable early debugging for the PlayStation3 via
> +	  UDP broadcasts sent out through the Ethernet port.
> +
>  endchoice
>  
>  config PPC_EARLY_DEBUG_HVSI_VTERMNO
> diff --git a/arch/powerpc/include/asm/udbg.h b/arch/powerpc/include/asm/udbg.h
> index 93e05d1..7cf796f 100644
> --- a/arch/powerpc/include/asm/udbg.h
> +++ b/arch/powerpc/include/asm/udbg.h
> @@ -54,6 +54,7 @@ extern void __init udbg_init_40x_realmode(void);
>  extern void __init udbg_init_cpm(void);
>  extern void __init udbg_init_usbgecko(void);
>  extern void __init udbg_init_wsp(void);
> +extern void __init udbg_init_ps3gelic(void);
>  
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_POWERPC_UDBG_H */
> diff --git a/arch/powerpc/kernel/udbg.c b/arch/powerpc/kernel/udbg.c
> index faa82c1..5b3e98e 100644
> --- a/arch/powerpc/kernel/udbg.c
> +++ b/arch/powerpc/kernel/udbg.c
> @@ -67,6 +67,8 @@ void __init udbg_early_init(void)
>  	udbg_init_usbgecko();
>  #elif defined(CONFIG_PPC_EARLY_DEBUG_WSP)
>  	udbg_init_wsp();
> +#elif defined(CONFIG_PPC_EARLY_DEBUG_PS3GELIC)
> +	udbg_init_ps3gelic();
>  #endif
>  
>  #ifdef CONFIG_PPC_EARLY_DEBUG
> diff --git a/arch/powerpc/platforms/ps3/Kconfig b/arch/powerpc/platforms/ps3/Kconfig
> index dfe316b..476d9d9 100644
> --- a/arch/powerpc/platforms/ps3/Kconfig
> +++ b/arch/powerpc/platforms/ps3/Kconfig
> @@ -148,4 +148,16 @@ config PS3_LPM
>  	  profiling support of the Cell processor with programs like
>  	  oprofile and perfmon2, then say Y or M, otherwise say N.
>  
> +config PS3GELIC_UDBG
> +	bool "PS3 udbg output via UDP broadcasts on Ethernet"
> +	depends on PPC_PS3
> +	help
> +	  Enables udbg early debugging output by sending broadcast UDP
> +	  via the Ethernet port (UDP port number 18194).
> +
> +	  This driver uses a trivial implementation and is independent
> +	  from the main network driver.
> +
> +	  If in doubt, say N here.
> +
>  endmenu
> diff --git a/arch/powerpc/platforms/ps3/Makefile b/arch/powerpc/platforms/ps3/Makefile
> index ac1bdf8..02b9e63 100644
> --- a/arch/powerpc/platforms/ps3/Makefile
> +++ b/arch/powerpc/platforms/ps3/Makefile
> @@ -2,6 +2,7 @@ obj-y += setup.o mm.o time.o hvcall.o htab.o repository.o
>  obj-y += interrupt.o exports.o os-area.o
>  obj-y += system-bus.o
>  
> +obj-$(CONFIG_PS3GELIC_UDBG) += gelic_udbg.o
>  obj-$(CONFIG_SMP) += smp.o
>  obj-$(CONFIG_SPU_BASE) += spu.o
>  obj-y += device-init.o
> diff --git a/arch/powerpc/platforms/ps3/gelic_udbg.c b/arch/powerpc/platforms/ps3/gelic_udbg.c
> new file mode 100644
> index 0000000..20b46a1
> --- /dev/null
> +++ b/arch/powerpc/platforms/ps3/gelic_udbg.c
> @@ -0,0 +1,273 @@
> +/*
> + * udbg debug output routine via GELIC UDP broadcasts
> + *
> + * Copyright (C) 2007 Sony Computer Entertainment Inc.
> + * Copyright 2006, 2007 Sony Corporation
> + * Copyright (C) 2010 Hector Martin <hector@marcansoft.com>
> + * Copyright (C) 2011 Andre Heider <a.heider@gmail.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + */
> +
> +#include <asm/io.h>
> +#include <asm/udbg.h>
> +#include <asm/lv1call.h>
> +
> +#define GELIC_BUS_ID 1
> +#define GELIC_DEVICE_ID 0
> +#define GELIC_DEBUG_PORT 18194
> +#define GELIC_MAX_MESSAGE_SIZE 1000
> +
> +#define GELIC_LV1_GET_MAC_ADDRESS 1
> +#define GELIC_LV1_GET_VLAN_ID 4
> +#define GELIC_LV1_VLAN_TX_ETHERNET_0 2
> +
> +#define GELIC_DESCR_DMA_STAT_MASK 0xf0000000
> +#define GELIC_DESCR_DMA_CARDOWNED 0xa0000000
> +
> +#define GELIC_DESCR_TX_DMA_IKE 0x00080000
> +#define GELIC_DESCR_TX_DMA_NO_CHKSUM 0x00000000
> +#define GELIC_DESCR_TX_DMA_FRAME_TAIL 0x00040000
> +
> +#define GELIC_DESCR_DMA_CMD_NO_CHKSUM (GELIC_DESCR_DMA_CARDOWNED | \
> +				       GELIC_DESCR_TX_DMA_IKE | \
> +				       GELIC_DESCR_TX_DMA_NO_CHKSUM)
> +
> +static u64 bus_addr;
> +
> +struct gelic_descr {
> +	/* as defined by the hardware */
> +	__be32 buf_addr;
> +	__be32 buf_size;
> +	__be32 next_descr_addr;
> +	__be32 dmac_cmd_status;
> +	__be32 result_size;
> +	__be32 valid_size;	/* all zeroes for tx */
> +	__be32 data_status;
> +	__be32 data_error;	/* all zeroes for tx */
> +} __attribute__((aligned(32)));
> +
> +struct debug_block {
> +	struct gelic_descr descr;
> +	u8 pkt[1520];
> +} __packed;
> +
> +struct ethhdr {
> +	u8 dest[6];
> +	u8 src[6];
> +	u16 type;
> +} __packed;
> +
> +struct vlantag {
> +	u16 vlan;
> +	u16 subtype;
> +} __packed;
> +
> +struct iphdr {
> +	u8 ver_len;
> +	u8 dscp_ecn;
> +	u16 total_length;
> +	u16 ident;
> +	u16 frag_off_flags;
> +	u8 ttl;
> +	u8 proto;
> +	u16 checksum;
> +	u32 src;
> +	u32 dest;
> +} __packed;
> +
> +struct udphdr {
> +	u16 src;
> +	u16 dest;
> +	u16 len;
> +	u16 checksum;
> +} __packed;
> +
> +static __iomem struct ethhdr *h_eth;
> +static __iomem struct vlantag *h_vlan;
> +static __iomem struct iphdr *h_ip;
> +static __iomem struct udphdr *h_udp;
> +
> +static __iomem char *pmsg;
> +static __iomem char *pmsgc;
> +
> +static __iomem struct debug_block dbg __attribute__((aligned(32)));
> +
> +static int header_size;
> +
> +static void map_dma_mem(int bus_id, int dev_id, void *start, size_t len,
> +			u64 *real_bus_addr)
> +{
> +	s64 result;
> +	u64 real_addr = ((u64)start) & 0x0fffffffffffffffUL;
> +	u64 real_end = real_addr + len;
> +	u64 map_start = real_addr & ~0xfff;
> +	u64 map_end = (real_end + 0xfff) & ~0xfff;
> +	u64 bus_addr = 0;
> +
> +	u64 flags = 0xf800000000000000UL;
> +
> +	result = lv1_allocate_device_dma_region(bus_id, dev_id,
> +						map_end - map_start, 12, 0,
> +						&bus_addr);
> +	if (result)
> +		lv1_panic(0);
> +
> +	result = lv1_map_device_dma_region(bus_id, dev_id, map_start,
> +					   bus_addr, map_end - map_start,
> +					   flags);
> +	if (result)
> +		lv1_panic(0);
> +
> +	*real_bus_addr = bus_addr + real_addr - map_start;
> +}
> +
> +static int unmap_dma_mem(int bus_id, int dev_id, u64 bus_addr, size_t len)
> +{
> +	s64 result;
> +	u64 real_bus_addr;
> +
> +	real_bus_addr = bus_addr & ~0xfff;
> +	len += bus_addr - real_bus_addr;
> +	len = (len + 0xfff) & ~0xfff;
> +
> +	result = lv1_unmap_device_dma_region(bus_id, dev_id, real_bus_addr,
> +					     len);
> +	if (result)
> +		return result;
> +
> +	return lv1_free_device_dma_region(bus_id, dev_id, real_bus_addr);
> +}
> +
> +static void gelic_debug_init(void)
> +{
> +	s64 result;
> +	u64 v2;
> +	u64 mac;
> +	u64 vlan_id;
> +
> +	result = lv1_open_device(GELIC_BUS_ID, GELIC_DEVICE_ID, 0);
> +	if (result)
> +		lv1_panic(0);
> +
> +	map_dma_mem(GELIC_BUS_ID, GELIC_DEVICE_ID, &dbg, sizeof(dbg),
> +		    &bus_addr);
> +
> +	memset(&dbg, 0, sizeof(dbg));
> +
> +	dbg.descr.buf_addr = bus_addr + offsetof(struct debug_block, pkt);
> +
> +	wmb();
> +
> +	result = lv1_net_control(GELIC_BUS_ID, GELIC_DEVICE_ID,
> +				 GELIC_LV1_GET_MAC_ADDRESS, 0, 0, 0,
> +				 &mac, &v2);
> +	if (result)
> +		lv1_panic(0);
> +
> +	mac <<= 16;
> +
> +	h_eth = (struct ethhdr *)dbg.pkt;
> +
> +	memset(&h_eth->dest, 0xff, 6);
> +	memcpy(&h_eth->src, &mac, 6);
> +
> +	header_size = sizeof(struct ethhdr);
> +
> +	result = lv1_net_control(GELIC_BUS_ID, GELIC_DEVICE_ID,
> +				 GELIC_LV1_GET_VLAN_ID,
> +				 GELIC_LV1_VLAN_TX_ETHERNET_0, 0, 0,
> +				 &vlan_id, &v2);
> +	if (!result) {
> +		h_eth->type = 0x8100;
> +
> +		header_size += sizeof(struct vlantag);
> +		h_vlan = (struct vlantag *)(h_eth + 1);
> +		h_vlan->vlan = vlan_id;
> +		h_vlan->subtype = 0x0800;
> +		h_ip = (struct iphdr *)(h_vlan + 1);
> +	} else {
> +		h_eth->type = 0x0800;
> +		h_ip = (struct iphdr *)(h_eth + 1);
> +	}
> +
> +	header_size += sizeof(struct iphdr);
> +	h_ip->ver_len = 0x45;
> +	h_ip->ttl = 10;
> +	h_ip->proto = 0x11;
> +	h_ip->src = 0x00000000;
> +	h_ip->dest = 0xffffffff;
> +
> +	header_size += sizeof(struct udphdr);
> +	h_udp = (struct udphdr *)(h_ip + 1);
> +	h_udp->src = GELIC_DEBUG_PORT;
> +	h_udp->dest = GELIC_DEBUG_PORT;
> +
> +	pmsgc = pmsg = (char *)(h_udp + 1);
> +}
> +
> +static void gelic_debug_shutdown(void)
> +{
> +	if (bus_addr)
> +		unmap_dma_mem(GELIC_BUS_ID, GELIC_DEVICE_ID,
> +			      bus_addr, sizeof(dbg));
> +	lv1_close_device(GELIC_BUS_ID, GELIC_DEVICE_ID);
> +}
> +
> +static void gelic_sendbuf(int msgsize)
> +{
> +	u16 *p;
> +	u32 sum;
> +	int i;
> +
> +	dbg.descr.buf_size = header_size + msgsize;
> +	h_ip->total_length = msgsize + sizeof(struct udphdr) +
> +			     sizeof(struct iphdr);
> +	h_udp->len = msgsize + sizeof(struct udphdr);
> +
> +	h_ip->checksum = 0;
> +	sum = 0;
> +	p = (u16 *)h_ip;
> +	for (i = 0; i < 5; i++)
> +		sum += *p++;
> +	h_ip->checksum = ~(sum + (sum >> 16));
> +
> +	dbg.descr.dmac_cmd_status = GELIC_DESCR_DMA_CMD_NO_CHKSUM |
> +				    GELIC_DESCR_TX_DMA_FRAME_TAIL;
> +	dbg.descr.result_size = 0;
> +	dbg.descr.data_status = 0;
> +
> +	wmb();
> +
> +	lv1_net_start_tx_dma(GELIC_BUS_ID, GELIC_DEVICE_ID, bus_addr, 0);
> +
> +	while ((dbg.descr.dmac_cmd_status & GELIC_DESCR_DMA_STAT_MASK) ==
> +	       GELIC_DESCR_DMA_CARDOWNED)
> +		cpu_relax();
> +}
> +
> +static void ps3gelic_udbg_putc(char ch)
> +{
> +	*pmsgc++ = ch;
> +	if (ch == '\n' || (pmsgc-pmsg) >= GELIC_MAX_MESSAGE_SIZE) {
> +		gelic_sendbuf(pmsgc-pmsg);
> +		pmsgc = pmsg;
> +	}
> +}
> +
> +void __init udbg_init_ps3gelic(void)
> +{
> +	gelic_debug_init();
> +	udbg_putc = ps3gelic_udbg_putc;
> +}
> +
> +void udbg_shutdown_ps3gelic(void)
> +{
> +	udbg_putc = NULL;
> +	gelic_debug_shutdown();
> +}
> +EXPORT_SYMBOL(udbg_shutdown_ps3gelic);
> diff --git a/drivers/net/ps3_gelic_net.c b/drivers/net/ps3_gelic_net.c
> index d82a82d..e743c94 100644
> --- a/drivers/net/ps3_gelic_net.c
> +++ b/drivers/net/ps3_gelic_net.c
> @@ -1674,6 +1674,9 @@ static int __devinit ps3_gelic_driver_probe(struct ps3_system_bus_device *dev)
>  	int result;
>  
>  	pr_debug("%s: called\n", __func__);
> +
> +	udbg_shutdown_ps3gelic();
> +
>  	result = ps3_open_hv_device(dev);
>  
>  	if (result) {
> diff --git a/drivers/net/ps3_gelic_net.h b/drivers/net/ps3_gelic_net.h
> index d3fadfb..a93df6a 100644
> --- a/drivers/net/ps3_gelic_net.h
> +++ b/drivers/net/ps3_gelic_net.h
> @@ -359,6 +359,12 @@ static inline void *port_priv(struct gelic_port *port)
>  	return port->priv;
>  }
>  
> +#ifdef CONFIG_PPC_EARLY_DEBUG_PS3GELIC
> +extern void udbg_shutdown_ps3gelic(void);
> +#else
> +static inline void udbg_shutdown_ps3gelic(void) {}
> +#endif
> +
>  extern int gelic_card_set_irq_mask(struct gelic_card *card, u64 mask);
>  /* shared netdev ops */
>  extern void gelic_card_up(struct gelic_card *card);

^ permalink raw reply

* Re: [PATCH part1 v2 2/9] ps3: Add helper functions to read highmem info from the repository
From: Geoff Levand @ 2011-08-23 20:53 UTC (permalink / raw)
  To: Andre Heider; +Cc: cbe-oss-dev, Hector Martin, linuxppc-dev
In-Reply-To: <1313091073-4572-3-git-send-email-a.heider@gmail.com>

On 08/11/2011 12:31 PM, Andre Heider wrote:
> An earlier step in the boot chain can preallocate the highmem region.
> A boot loader doing so will place the region infos in the repository.
> Provide helper functions to read the required nodes.
> 
> Signed-off-by: Andre Heider <a.heider@gmail.com>
> ---
>  arch/powerpc/platforms/ps3/platform.h   |    3 ++
>  arch/powerpc/platforms/ps3/repository.c |   36 +++++++++++++++++++++++++++++++
>  2 files changed, 39 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/ps3/platform.h b/arch/powerpc/platforms/ps3/platform.h
> index 9a196a8..d9b4ec0 100644
> --- a/arch/powerpc/platforms/ps3/platform.h
> +++ b/arch/powerpc/platforms/ps3/platform.h
> @@ -187,6 +187,9 @@ int ps3_repository_read_rm_size(unsigned int ppe_id, u64 *rm_size);
>  int ps3_repository_read_region_total(u64 *region_total);
>  int ps3_repository_read_mm_info(u64 *rm_base, u64 *rm_size,
>  	u64 *region_total);
> +int ps3_repository_read_highmem_base(u64 *highmem_base);
> +int ps3_repository_read_highmem_size(u64 *highmem_size);
> +int ps3_repository_read_highmem_info(u64 *highmem_base, u64 *highmem_size);

In the general case we could have multiple regions.  If we
add a region arg here we can handle that if needed.
region_index would be {1..}.  ps3_repository_read_highmem_info
could hold how many regions, so:

  int ps3_repository_read_highmem_base(unsigned int region_index, u64 *highmem_base);
  int ps3_repository_read_highmem_size(unsigned int region_index, u64 *highmem_size);
  int ps3_repository_read_highmem_region(unsigned int region_index, u64 *highmem_base, u64 *highmem_size);

>  
>  /* repository pme info */
>  
> diff --git a/arch/powerpc/platforms/ps3/repository.c b/arch/powerpc/platforms/ps3/repository.c
> index 5e304c2..9908d61 100644
> --- a/arch/powerpc/platforms/ps3/repository.c
> +++ b/arch/powerpc/platforms/ps3/repository.c
> @@ -778,6 +778,42 @@ int ps3_repository_read_mm_info(u64 *rm_base, u64 *rm_size, u64 *region_total)
>  		: ps3_repository_read_region_total(region_total);
>  }
>  
> +int ps3_repository_read_highmem_base(u64 *highmem_base)
> +{
> +	return read_node(PS3_LPAR_ID_CURRENT,
> +			 make_first_field("bi", 0),
> +			 make_field("highmem", 0),
> +			 make_field("base", 0),
> +			 0,
> +			 highmem_base, NULL);
> +}

I think something like this seems better:

int ps3_repository_read_highmem_base(unsigned int region_index, u64 *highmem_base)
{
	return read_node(PS3_LPAR_ID_CURRENT,
		make_first_field("highmem", 0),
		make_field("region", region_index),
		make_field("base", 0),
		0,
		highmem_base, NULL);
}

> +
> +int ps3_repository_read_highmem_size(u64 *highmem_size)
> +{
> +	return read_node(PS3_LPAR_ID_CURRENT,
> +			 make_first_field("bi", 0),
> +			 make_field("highmem", 0),
> +			 make_field("size", 0),
> +			 0,
> +			 highmem_size, NULL);
> +}
> +
> +/**
> + * ps3_repository_read_highmem_info - Read high memory info
> + * @highmem_base: High memory base address.
> + * @highmem_size: High mode memory size.
> + */
> +
> +int ps3_repository_read_highmem_info(u64 *highmem_base, u64 *highmem_size)
> +{
> +	int result;
> +
> +	*highmem_base = 0;
> +	result = ps3_repository_read_highmem_base(highmem_base);
> +	return result ? result
> +		: ps3_repository_read_highmem_size(highmem_size);


	ps3_repository_read_highmem_base(1, highmem_base);
	...
> +}
> +
>  /**
>   * ps3_repository_read_num_spu_reserved - Number of physical spus reserved.
>   * @num_spu: Number of physical spus.

-Geoff

^ permalink raw reply

* Re: [PATCH part1 v2 3/9] ps3: Get lv1 high memory region from the repository
From: Geoff Levand @ 2011-08-23 20:53 UTC (permalink / raw)
  To: Andre Heider; +Cc: cbe-oss-dev, Hector Martin, linuxppc-dev
In-Reply-To: <1313091073-4572-4-git-send-email-a.heider@gmail.com>

On 08/11/2011 12:31 PM, Andre Heider wrote:
> This lets the bootloader preallocate the high lv1 region and pass its
> location to the kernel through the repository. Thus, it can be used to
> hold the initrd. If the region info doesn't exist, the kernel retains
> the old behavior and attempts to allocate the region itself.
> 
> Based on the patch
> "[PS3] Get lv1 high memory region from devtree"
> from Hector Martin <hector@marcansoft.com>
> 
> Signed-off-by: Andre Heider <a.heider@gmail.com>
> ---
>  arch/powerpc/platforms/ps3/mm.c |   46 ++++++++++++++++++++++++++++++++++++--
>  1 files changed, 43 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/ps3/mm.c b/arch/powerpc/platforms/ps3/mm.c
> index c204588..983b719 100644
> --- a/arch/powerpc/platforms/ps3/mm.c
> +++ b/arch/powerpc/platforms/ps3/mm.c
> @@ -78,12 +78,14 @@ enum {
>   * @base: base address
>   * @size: size in bytes
>   * @offset: difference between base and rm.size
> + * @destroy: flag if region should be destroyed upon shutdown
>   */
>  
>  struct mem_region {
>  	u64 base;
>  	u64 size;
>  	unsigned long offset;
> +	int destroy;
>  };
>  
>  /**
> @@ -261,6 +263,7 @@ static int ps3_mm_region_create(struct mem_region *r, unsigned long size)
>  		goto zero_region;
>  	}
>  
> +	r->destroy = 1;
>  	r->offset = r->base - map.rm.size;
>  	return result;
>  
> @@ -279,6 +282,12 @@ static void ps3_mm_region_destroy(struct mem_region *r)
>  	int result;
>  
>  	DBG("%s:%d: r->base = %llxh\n", __func__, __LINE__, r->base);
> +
> +	if (!r->destroy) {
> +		DBG("%s:%d: not destroying region\n", __func__, __LINE__);
> +		return;
> +	}
> +
>  	if (r->base) {
>  		result = lv1_release_memory(r->base);
>  		BUG_ON(result);
> @@ -287,6 +296,29 @@ static void ps3_mm_region_destroy(struct mem_region *r)
>  	}
>  }
>  
> +static int ps3_mm_get_repository_highmem(struct mem_region *r)
> +{
> +	int result = ps3_repository_read_highmem_info(&r->base, &r->size);
> +
> +	if (result)
> +		goto zero_region;
> +
> +	if (!r->base || !r->size) {
> +		result = -1;
> +		goto zero_region;
> +	}
> +
> +	r->offset = r->base - map.rm.size;
> +	DBG("%s:%d got high region from repository: %llxh %llxh\n",
> +	    __func__, __LINE__, r->base, r->size);
> +	return 0;
> +
> +zero_region:
> +	DBG("%s:%d no high region in repository...\n", __func__, __LINE__);

Three dots implies something more is on its way.  I
don't think we need them.

> +	r->size = r->base = r->offset = 0;
> +	return result;
> +}
> +
>  /**
>   * ps3_mm_add_memory - hot add memory
>   */
> @@ -303,6 +335,12 @@ static int __init ps3_mm_add_memory(void)
>  
>  	BUG_ON(!mem_init_done);
>  
> +	if (!map.r1.size) {
> +		DBG("%s:%d: no region 1, not adding memory\n",
> +		    __func__, __LINE__);
> +		return 0;
> +	}
> +
>  	start_addr = map.rm.size;
>  	start_pfn = start_addr >> PAGE_SHIFT;
>  	nr_pages = (map.r1.size + PAGE_SIZE - 1) >> PAGE_SHIFT;
> @@ -1217,9 +1255,11 @@ void __init ps3_mm_init(void)
>  	BUG_ON(map.rm.base);
>  	BUG_ON(!map.rm.size);
>  
> -
> -	/* arrange to do this in ps3_mm_add_memory */
> -	ps3_mm_region_create(&map.r1, map.total - map.rm.size);
> +	/* check if we got the highmem region from an earlier boot step */
> +	if (ps3_mm_get_repository_highmem(&map.r1)) {
> +		/* arrange to do this in ps3_mm_add_memory */
> +		ps3_mm_region_create(&map.r1, map.total - map.rm.size);
> +	}
>  
>  	/* correct map.total for the real total amount of memory we use */
>  	map.total = map.rm.size + map.r1.size;

^ permalink raw reply

* Re: [PATCH part1 v2 4/9] Add region 1 memory early
From: Geoff Levand @ 2011-08-23 20:53 UTC (permalink / raw)
  To: Andre Heider; +Cc: cbe-oss-dev, Hector Martin, linuxppc-dev
In-Reply-To: <1313091073-4572-5-git-send-email-a.heider@gmail.com>

On 08/11/2011 12:31 PM, Andre Heider wrote:
> From: Hector Martin <hector@marcansoft.com>
> 
> Real mode memory can be limited and runs out quickly as memory is
> allocated during kernel startup.
> Having region1 available sooner fixes this.
> 
> Signed-off-by: Hector Martin <hector@marcansoft.com>
> [a.heider: Various cleanups to make checkpatch.pl happy]
> Signed-off-by: Andre Heider <a.heider@gmail.com>
> ---
>  arch/powerpc/platforms/ps3/mm.c |   75 +++++++--------------------------------
>  1 files changed, 13 insertions(+), 62 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/ps3/mm.c b/arch/powerpc/platforms/ps3/mm.c
> index 983b719..68b3879 100644
> --- a/arch/powerpc/platforms/ps3/mm.c
> +++ b/arch/powerpc/platforms/ps3/mm.c
> @@ -20,7 +20,6 @@
>  
>  #include <linux/kernel.h>
>  #include <linux/module.h>
> -#include <linux/memory_hotplug.h>
>  #include <linux/memblock.h>
>  #include <linux/slab.h>
>  
> @@ -94,10 +93,8 @@ struct mem_region {
>   * @vas_id - HV virtual address space id
>   * @htab_size: htab size in bytes
>   *
> - * The HV virtual address space (vas) allows for hotplug memory regions.
> - * Memory regions can be created and destroyed in the vas at runtime.

This is still true, so we should keep these comments.  We
are only changing the way we use the feature.

>   * @rm: real mode (bootmem) region
> - * @r1: hotplug memory region(s)
> + * @r1: high memory region

high memory region(s)

>   *
>   * ps3 addresses
>   * virt_addr: a cpu 'translated' effective address
> @@ -223,10 +220,6 @@ void ps3_mm_vas_destroy(void)
>  	}
>  }
>  
> -/*============================================================================*/
> -/* memory hotplug routines                                                    */
> -/*============================================================================*/
> -
>  /**
>   * ps3_mm_region_create - create a memory region in the vas
>   * @r: pointer to a struct mem_region to accept initialized values
> @@ -319,57 +312,6 @@ zero_region:
>  	return result;
>  }
>  
> -/**
> - * ps3_mm_add_memory - hot add memory
> - */
> -
> -static int __init ps3_mm_add_memory(void)
> -{
> -	int result;
> -	unsigned long start_addr;
> -	unsigned long start_pfn;
> -	unsigned long nr_pages;
> -
> -	if (!firmware_has_feature(FW_FEATURE_PS3_LV1))
> -		return -ENODEV;
> -
> -	BUG_ON(!mem_init_done);
> -
> -	if (!map.r1.size) {
> -		DBG("%s:%d: no region 1, not adding memory\n",
> -		    __func__, __LINE__);
> -		return 0;
> -	}
> -
> -	start_addr = map.rm.size;
> -	start_pfn = start_addr >> PAGE_SHIFT;
> -	nr_pages = (map.r1.size + PAGE_SIZE - 1) >> PAGE_SHIFT;
> -
> -	DBG("%s:%d: start_addr %lxh, start_pfn %lxh, nr_pages %lxh\n",
> -		__func__, __LINE__, start_addr, start_pfn, nr_pages);
> -
> -	result = add_memory(0, start_addr, map.r1.size);
> -
> -	if (result) {
> -		pr_err("%s:%d: add_memory failed: (%d)\n",
> -			__func__, __LINE__, result);
> -		return result;
> -	}
> -
> -	memblock_add(start_addr, map.r1.size);
> -	memblock_analyze();
> -
> -	result = online_pages(start_pfn, nr_pages);
> -
> -	if (result)
> -		pr_err("%s:%d: online_pages failed: (%d)\n",
> -			__func__, __LINE__, result);
> -
> -	return result;
> -}
> -
> -device_initcall(ps3_mm_add_memory);
> -
>  /*============================================================================*/
>  /* dma routines                                                               */
>  /*============================================================================*/
> @@ -1256,14 +1198,23 @@ void __init ps3_mm_init(void)
>  	BUG_ON(!map.rm.size);
>  
>  	/* check if we got the highmem region from an earlier boot step */
> -	if (ps3_mm_get_repository_highmem(&map.r1)) {
> -		/* arrange to do this in ps3_mm_add_memory */
> +	if (ps3_mm_get_repository_highmem(&map.r1))
>  		ps3_mm_region_create(&map.r1, map.total - map.rm.size);
> -	}

This should be folded into patch #3.

>  
>  	/* correct map.total for the real total amount of memory we use */
>  	map.total = map.rm.size + map.r1.size;
>  
> +	if (!map.r1.size) {
> +		DBG("%s:%d: no region 1, not adding memory\n",
> +			__func__, __LINE__);
> +	} else {

Remove brackets around a single line conditional.

> +		DBG("%s:%d: adding memory: start %llxh, size %llxh\n",
> +			__func__, __LINE__, map.rm.size, map.r1.size);
> +
> +		memblock_add(map.rm.size, map.r1.size);
> +		memblock_analyze();
> +	}
> +
>  	DBG(" <- %s:%d\n", __func__, __LINE__);
>  }
>  

^ permalink raw reply

* Re: [PATCH part1 v2 5/9] ps3: MEMORY_HOTPLUG is not a requirement anymore
From: Geoff Levand @ 2011-08-23 20:53 UTC (permalink / raw)
  To: Andre Heider; +Cc: cbe-oss-dev, Hector Martin, linuxppc-dev
In-Reply-To: <1313091073-4572-6-git-send-email-a.heider@gmail.com>

On 08/11/2011 12:31 PM, Andre Heider wrote:
> Signed-off-by: Andre Heider <a.heider@gmail.com>
> ---
>  arch/powerpc/platforms/ps3/Kconfig |    1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/ps3/Kconfig b/arch/powerpc/platforms/ps3/Kconfig
> index 476d9d9..84df5c8 100644
> --- a/arch/powerpc/platforms/ps3/Kconfig
> +++ b/arch/powerpc/platforms/ps3/Kconfig
> @@ -7,7 +7,6 @@ config PPC_PS3
>  	select USB_OHCI_BIG_ENDIAN_MMIO
>  	select USB_ARCH_HAS_EHCI
>  	select USB_EHCI_BIG_ENDIAN_MMIO
> -	select MEMORY_HOTPLUG


This one is OK, once the others are fixed up.

-Geoff

^ permalink raw reply

* Re: [PATCH part1 v2 6/9] ps3: Detect the current lpar
From: Geoff Levand @ 2011-08-23 22:08 UTC (permalink / raw)
  To: Andre Heider; +Cc: cbe-oss-dev, Hector Martin, linuxppc-dev
In-Reply-To: <1313091073-4572-7-git-send-email-a.heider@gmail.com>

Hi,

On 08/11/2011 12:31 PM, Andre Heider wrote:
> Detect it by reading the ss laid repository node, and make it
> accessible via ps3_get_ss_laid().

I'm wondering now if we even need this.  It is mainly used by your
later patch 8/9 that modifies ps3flash_init() to test if we should
call ps3_system_bus_driver_register().  If we don't use
ps3_get_ss_laid() and just allow ps3_system_bus_driver_register()
to be called, would the device probe fail and have the same result
as the test?

I would prefer to not have ps3_get_ss_laid() if possible.

-Geoff

^ permalink raw reply

* Re: [PATCH part1 v2 8/9] ps3flash: Refuse to work in lpars other than OtherOS
From: Geoff Levand @ 2011-08-23 22:12 UTC (permalink / raw)
  To: Andre Heider; +Cc: cbe-oss-dev, Hector Martin, linuxppc-dev
In-Reply-To: <1313091073-4572-9-git-send-email-a.heider@gmail.com>

On 08/11/2011 12:31 PM, Andre Heider wrote:
> The driver implements a character and misc device, meant for the
> axed OtherOS to exchange various settings with GameOS.
> Since Firmware 3.21 there is no support anymore to write these
> settings, so test if we're running in OtherOS, and refuse to load
> if that is not the case.

Please see my comments to the v1 patch regarding this text.

> Signed-off-by: Andre Heider <a.heider@gmail.com>
> ---
>  arch/powerpc/platforms/ps3/Kconfig |    2 +-
>  drivers/char/ps3flash.c            |    7 +++++++
>  2 files changed, 8 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/ps3/Kconfig b/arch/powerpc/platforms/ps3/Kconfig
> index 84df5c8..72fdecd 100644
> --- a/arch/powerpc/platforms/ps3/Kconfig
> +++ b/arch/powerpc/platforms/ps3/Kconfig
> @@ -121,7 +121,7 @@ config PS3_FLASH
>  
>  	  This support is required to access the PS3 FLASH ROM, which
>  	  contains the boot loader and some boot options.
> -	  In general, all users will say Y or M.
> +	  In general, all PS3 OtherOS users will say Y or M.
>  
>  	  As this driver needs a fixed buffer of 256 KiB of memory, it can
>  	  be disabled on the kernel command line using "ps3flash=off", to
> diff --git a/drivers/char/ps3flash.c b/drivers/char/ps3flash.c
> index d0c57c2..fc6d867 100644
> --- a/drivers/char/ps3flash.c
> +++ b/drivers/char/ps3flash.c
> @@ -25,6 +25,7 @@
>  
>  #include <asm/lv1call.h>
>  #include <asm/ps3stor.h>
> +#include <asm/firmware.h>
>  
>  
>  #define DEVICE_NAME		"ps3flash"
> @@ -464,6 +465,12 @@ static struct ps3_system_bus_driver ps3flash = {
>  
>  static int __init ps3flash_init(void)
>  {
> +	if (!firmware_has_feature(FW_FEATURE_PS3_LV1))
> +		return -ENODEV;
> +
> +	if (ps3_get_ss_laid() != PS3_SS_LAID_OTHEROS)
> +		return -ENODEV;
> +
>  	return ps3_system_bus_driver_register(&ps3flash);
>  }
>  

^ permalink raw reply

* Re: [Cbe-oss-dev] [PATCH part1 v2 4/9] Add region 1 memory early
From: Antonio Ospite @ 2011-08-23 22:37 UTC (permalink / raw)
  To: Geoff Levand; +Cc: cbe-oss-dev, Andre Heider, linuxppc-dev, Hector Martin
In-Reply-To: <4E54135A.7030509@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 1095 bytes --]

On Tue, 23 Aug 2011 13:53:46 -0700
Geoff Levand <geoff@infradead.org> wrote:

[...]
> >  
> > +	if (!map.r1.size) {
> > +		DBG("%s:%d: no region 1, not adding memory\n",
> > +			__func__, __LINE__);
> > +	} else {
> 
> Remove brackets around a single line conditional.
> 
> > +		DBG("%s:%d: adding memory: start %llxh, size %llxh\n",
> > +			__func__, __LINE__, map.rm.size, map.r1.size);
> > +
> > +		memblock_add(map.rm.size, map.r1.size);
> > +		memblock_analyze();
> > +	}
> > +

In Documentation/CodingStyle I read that if [only] one branch is a
single statement then the parenthesis are OK (and even recommended) for
both branches, I guess this is for style consistency. See Chapter 3,
around line 169 on my copy. I guess the wording on that paragraph can
be made more explicit, I'll try to fix that up.

Regards,
   Antonio

-- 
Antonio Ospite
http://ao2.it

PGP public key ID: 0x4553B001

A: Because it messes up the order in which people normally read text.
   See http://en.wikipedia.org/wiki/Posting_style
Q: Why is top-posting such a bad thing?

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Benjamin Herrenschmidt @ 2011-08-23 23:35 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, iommu, chrisw,
	Alex Williamson, Avi Kivity, Anthony Liguori, linuxppc-dev,
	benve@cisco.com
In-Reply-To: <20110823131819.GO2079@amd.com>

On Tue, 2011-08-23 at 15:18 +0200, Roedel, Joerg wrote:
> On Mon, Aug 22, 2011 at 05:03:53PM -0400, Benjamin Herrenschmidt wrote:
> > 
> > > I am in favour of /dev/vfio/$GROUP. If multiple devices should be
> > > assigned to a guest, there can also be an ioctl to bind a group to an
> > > address-space of another group (certainly needs some care to not allow
> > > that both groups belong to different processes).
> > > 
> > > Btw, a problem we havn't talked about yet entirely is
> > > driver-deassignment. User space can decide to de-assign the device from
> > > vfio while a fd is open on it. With PCI there is no way to let this fail
> > > (the .release function returns void last time i checked). Is this a
> > > problem, and yes, how we handle that?
> > 
> > We can treat it as a hard unplug (like a cardbus gone away).
> > 
> > IE. Dispose of the direct mappings (switch to MMIO emulation) and return
> > all ff's from reads (& ignore writes).
> > 
> > Then send an unplug event via whatever mechanism the platform provides
> > (ACPI hotplug controller on x86 for example, we haven't quite sorted out
> > what to do on power for hotplug yet).
> 
> Hmm, good idea. But as far as I know the hotplug-event needs to be in
> the guest _before_ the device is actually unplugged (so that the guest
> can unbind its driver first). That somehow brings back the sleep-idea
> and the timeout in the .release function.

That's for normal assisted hotplug, but don't we support hard hotplug ?
I mean, things like cardbus, thunderbolt (if we ever support that)
etc... will need it and some platforms do support hard hotplug of PCIe
devices.

(That's why drivers should never spin on MMIO waiting for a 1 bit to
clear without a timeout :-)

Cheers,
Ben.

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Benjamin Herrenschmidt @ 2011-08-23 23:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, David Gibson, aafabbri,
	iommu, Avi Kivity, Anthony Liguori, linuxppc-dev, benve
In-Reply-To: <1314116600.2859.28.camel@bling.home>

On Tue, 2011-08-23 at 10:23 -0600, Alex Williamson wrote:
> 
> Yeah.  Joerg's idea of binding groups internally (pass the fd of one
> group to another via ioctl) is one option.  The tricky part will be
> implementing it to support hot unplug of any group from the
> supergroup.
> I believe Ben had a suggestion that supergroups could be created in
> sysfs, but I don't know what the mechanism to do that looks like.  It
> would also be an extra management step to dynamically bind and unbind
> groups to the supergroup around hotplug.  Thanks, 

I don't really care that much what the method for creating them is, to
be honest, I just prefer this concept of "meta groups" or "super groups"
or "synthetic groups" (whatever you want to name them) to having a
separate uiommu file descriptor.

The one reason I have a slight preference for creating them "statically"
using some kind of separate interface (again, I don't care whether it's
sysfs, netlink, etc...) is that it means things like qemu don't have to
care about them.

In general, apps that want to use vfio can just get passed the path to
such a group or the /dev/ path or the group number (whatever we chose as
the way to identify a group), and don't need to know anything about
"super groups", how to manipulate them, create them, possible
constraints etc...

Now, libvirt might want to know about that other API in order to provide
control on the creation of these things, but that's a different issue.

By "static" I mean they persist, they aren't tied to the lifetime of an
fd.

Now that's purely a preference on my side because I believe it will make
life easier for actual programs wanting to use vfio to not have to care
about those super-groups, but as I said earlier, I don't actually care
that much :-)

Cheers,
Ben.

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Benjamin Herrenschmidt @ 2011-08-23 23:51 UTC (permalink / raw)
  To: Alex Williamson
  Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, David Gibson, chrisw,
	iommu, Avi Kivity, Anthony Liguori, linuxppc-dev, benve
In-Reply-To: <1314127809.2859.121.camel@bling.home>

> > For us the most simple and logical approach (which is also what pHyp
> > uses and what Linux handles well) is really to expose a given PCI host
> > bridge per group to the guest. Believe it or not, it makes things
> > easier :-)
> 
> I'm all for easier.  Why does exposing the bridge use less bus numbers
> than emulating a bridge?

Because a host bridge doesn't look like a PCI to PCI bridge at all for
us. It's an entire separate domain with it's own bus number space
(unlike most x86 setups).

In fact we have some problems afaik in qemu today with the concept of
PCI domains, for example, I think qemu has assumptions about a single
shared IO space domain which isn't true for us (each PCI host bridge
provides a distinct IO space domain starting at 0). We'll have to fix
that, but it's not a huge deal.

So for each "group" we'd expose in the guest an entire separate PCI
domain space with its own IO, MMIO etc... spaces, handed off from a
single device-tree "host bridge" which doesn't itself appear in the
config space, doesn't need any emulation of any config space etc...

> On x86, I want to maintain that our default assignment is at the device
> level.  A user should be able to pick single or multiple devices from
> across several groups and have them all show up as individual,
> hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
> also seen cases where users try to attach a bridge to the guest,
> assuming they'll get all the devices below the bridge, so I'd be in
> favor of making this "just work" if possible too, though we may have to
> prevent hotplug of those.
>
> Given the device requirement on x86 and since everything is a PCI device
> on x86, I'd like to keep a qemu command line something like -device
> vfio,host=00:19.0.  I assume that some of the iommu properties, such as
> dma window size/address, will be query-able through an architecture
> specific (or general if possible) ioctl on the vfio group fd.  I hope
> that will help the specification, but I don't fully understand what all
> remains.  Thanks,

Well, for iommu there's a couple of different issues here but yes,
basically on one side we'll have some kind of ioctl to know what segment
of the device(s) DMA address space is assigned to the group and we'll
need to represent that to the guest via a device-tree property in some
kind of "parent" node of all the devices in that group.

We -might- be able to implement some kind of hotplug of individual
devices of a group under such a PHB (PCI Host Bridge), I don't know for
sure yet, some of that PAPR stuff is pretty arcane, but basically, for
all intend and purpose, we really want a group to be represented as a
PHB in the guest.

We cannot arbitrary have individual devices of separate groups be
represented in the guest as siblings on a single simulated PCI bus.

Cheers,
Ben.

^ permalink raw reply

* Re: [Cbe-oss-dev] [PATCH part1 v2 4/9] Add region 1 memory early
From: Geoff Levand @ 2011-08-24  2:15 UTC (permalink / raw)
  To: Antonio Ospite; +Cc: cbe-oss-dev, Andre Heider, linuxppc-dev, Hector Martin
In-Reply-To: <20110824003720.bc0589b8776467cc42e5d066@studenti.unina.it>

Hi Antonio,

On 08/23/2011 03:37 PM, Antonio Ospite wrote:
> Geoff Levand <geoff@infradead.org> wrote:
>> > +	if (!map.r1.size) {
>> > +		DBG("%s:%d: no region 1, not adding memory\n",
>> > +			__func__, __LINE__);
>> > +	} else {
>> 
>> Remove brackets around a single line conditional.
>> 
>> > +		DBG("%s:%d: adding memory: start %llxh, size %llxh\n",
>> > +			__func__, __LINE__, map.rm.size, map.r1.size);
>> > +
>> > +		memblock_add(map.rm.size, map.r1.size);
>> > +		memblock_analyze();
>> > +	}
>> > +
> 
> In Documentation/CodingStyle I read that if [only] one branch is a
> single statement then the parenthesis are OK (and even recommended) for
> both branches, I guess this is for style consistency. See Chapter 3,
> around line 169 on my copy. I guess the wording on that paragraph can
> be made more explicit, I'll try to fix that up.

Thanks for the comments.  I don't think its such an important change,
mainly for consistency of style within the PS3 files.

-Geoff

^ permalink raw reply

* Re: [PATCH v3] mtd/nand : workaround for Freescale FCM to support large-page Nand chip
From: LiuShuo @ 2011-08-24  2:48 UTC (permalink / raw)
  To: Matthieu CASTET
  Cc: Scott Wood, linuxppc-dev@ozlabs.org, dwmw2@infradead.org,
	Li Yang-R58472, linux-mtd@lists.infradead.org
In-Reply-To: <4E537AC4.6000301@parrot.com>

=E4=BA=8E 2011=E5=B9=B408=E6=9C=8823=E6=97=A5 18:02, Matthieu CASTET =E5=86=
=99=E9=81=93:
> LiuShuo a =C3=A9crit :
>> =E4=BA=8E 2011=E5=B9=B408=E6=9C=8819=E6=97=A5 00:25, Scott Wood =E5=86=
=99=E9=81=93:
>>> On 08/17/2011 09:33 PM, b35362@freescale.com wrote:
>>>> From: Liu Shuo<b35362@freescale.com>
>>>>
>>>> Freescale FCM controller has a 2K size limitation of buffer RAM. In =
order
>>>> to support the Nand flash chip whose page size is larger than 2K byt=
es,
>>>> we divide a page into multi-2K pages for MTD layer driver. In that c=
ase,
>>>> we force to set the page size to 2K bytes. We convert the page addre=
ss of
>>>> MTD layer driver to a real page address in flash chips and a column =
index
>>>> in fsl_elbc driver. We can issue any column address by UA instructio=
n of
>>>> elbc controller.
>>>>
>>>> NOTE: Due to there is a limitation of 'Number of Partial Program Cyc=
les in
>>>> the Same Page (NOP)', the flash chip which is supported by this work=
around
>>>> have to meet below conditions.
>>>> 	1. page size is not greater than 4KB
>>>> 	2.	1) if main area and spare area have independent NOPs:
>>>> 			  main  area NOP    :>=3D3
>>>> 			  spare area NOP    :>=3D2?
>>> How often are the NOPs split like this?
>>>
>>>> 		2) if main area and spare area have a common NOP:
>>>> 			  NOP               :>=3D4
>>> This depends on how the flash is used.  If you treat it as a NOP1 fla=
sh
>>> (e.g. run ubifs rather than jffs2), then you need NOP2 for a 4K chip =
and
>>> NOP4 for an 8K chip.  OTOH, if you would be making full use of NOP4 o=
n a
>>> real 2K chip, you'll need NOP8 for a 4K chip.
>>>
>>> The NOP restrictions should be documented in the code itself, not jus=
t
>>> in the git changelog.  Maybe print it to the console when this hack i=
s
>>> used, along with the NOP value read from the ID.
>> We can't read the NOP from the ID on any chip. Some chips don't
>> give this infomation.(e.g. Micron MT29F4G08BAC)
> Doesn't the micron chip provide it with onfi info ?
Sorry, there is something wrong with my expression.
We can get the NOP info from datasheet, but can't get it by READID=20
command in code.

-LiuShuo
> Matthieu
>

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Alexander Graf @ 2011-08-24  3:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, iommu, David Gibson,
	aafabbri, Alex Williamson, Avi Kivity, Anthony Liguori,
	linuxppc-dev, benve
In-Reply-To: <1314142892.30478.64.camel@pasglop>


On 23.08.2011, at 18:41, Benjamin Herrenschmidt wrote:

> On Tue, 2011-08-23 at 10:23 -0600, Alex Williamson wrote:
>>=20
>> Yeah.  Joerg's idea of binding groups internally (pass the fd of one
>> group to another via ioctl) is one option.  The tricky part will be
>> implementing it to support hot unplug of any group from the
>> supergroup.
>> I believe Ben had a suggestion that supergroups could be created in
>> sysfs, but I don't know what the mechanism to do that looks like.  It
>> would also be an extra management step to dynamically bind and unbind
>> groups to the supergroup around hotplug.  Thanks,=20
>=20
> I don't really care that much what the method for creating them is, to
> be honest, I just prefer this concept of "meta groups" or "super =
groups"
> or "synthetic groups" (whatever you want to name them) to having a
> separate uiommu file descriptor.
>=20
> The one reason I have a slight preference for creating them =
"statically"
> using some kind of separate interface (again, I don't care whether =
it's
> sysfs, netlink, etc...) is that it means things like qemu don't have =
to
> care about them.
>=20
> In general, apps that want to use vfio can just get passed the path to
> such a group or the /dev/ path or the group number (whatever we chose =
as
> the way to identify a group), and don't need to know anything about
> "super groups", how to manipulate them, create them, possible
> constraints etc...
>=20
> Now, libvirt might want to know about that other API in order to =
provide
> control on the creation of these things, but that's a different issue.
>=20
> By "static" I mean they persist, they aren't tied to the lifetime of =
an
> fd.
>=20
> Now that's purely a preference on my side because I believe it will =
make
> life easier for actual programs wanting to use vfio to not have to =
care
> about those super-groups, but as I said earlier, I don't actually care
> that much :-)

Oh I think it's one of the building blocks we need for a sane user space =
device exposure API. If I want to pass user X a few devices that are all =
behind a single IOMMU, I just chown that device node to user X and be =
done with it.

The user space tool actually using the VFIO interface wouldn't be in =
configuration business then - and it really shouldn't. That's what =
system configuration is there for :).

But I'm fairly sure we managed to persuade Alex that this is the right =
path on the BOF :)


Alex

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Alexander Graf @ 2011-08-24  3:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aafabbri, Alexey Kardashevskiy, kvm@vger.kernel.org list,
	Paul Mackerras, linux-pci, qemu-devel Developers, iommu,
	David Gibson, chrisw Wright, Alex Williamson, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve
In-Reply-To: <1314143508.30478.72.camel@pasglop>


On 23.08.2011, at 18:51, Benjamin Herrenschmidt wrote:

>=20
>>> For us the most simple and logical approach (which is also what pHyp
>>> uses and what Linux handles well) is really to expose a given PCI =
host
>>> bridge per group to the guest. Believe it or not, it makes things
>>> easier :-)
>>=20
>> I'm all for easier.  Why does exposing the bridge use less bus =
numbers
>> than emulating a bridge?
>=20
> Because a host bridge doesn't look like a PCI to PCI bridge at all for
> us. It's an entire separate domain with it's own bus number space
> (unlike most x86 setups).
>=20
> In fact we have some problems afaik in qemu today with the concept of
> PCI domains, for example, I think qemu has assumptions about a single
> shared IO space domain which isn't true for us (each PCI host bridge
> provides a distinct IO space domain starting at 0). We'll have to fix
> that, but it's not a huge deal.
>=20
> So for each "group" we'd expose in the guest an entire separate PCI
> domain space with its own IO, MMIO etc... spaces, handed off from a
> single device-tree "host bridge" which doesn't itself appear in the
> config space, doesn't need any emulation of any config space etc...
>=20
>> On x86, I want to maintain that our default assignment is at the =
device
>> level.  A user should be able to pick single or multiple devices from
>> across several groups and have them all show up as individual,
>> hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
>> also seen cases where users try to attach a bridge to the guest,
>> assuming they'll get all the devices below the bridge, so I'd be in
>> favor of making this "just work" if possible too, though we may have =
to
>> prevent hotplug of those.
>>=20
>> Given the device requirement on x86 and since everything is a PCI =
device
>> on x86, I'd like to keep a qemu command line something like -device
>> vfio,host=3D00:19.0.  I assume that some of the iommu properties, =
such as
>> dma window size/address, will be query-able through an architecture
>> specific (or general if possible) ioctl on the vfio group fd.  I hope
>> that will help the specification, but I don't fully understand what =
all
>> remains.  Thanks,
>=20
> Well, for iommu there's a couple of different issues here but yes,
> basically on one side we'll have some kind of ioctl to know what =
segment
> of the device(s) DMA address space is assigned to the group and we'll
> need to represent that to the guest via a device-tree property in some
> kind of "parent" node of all the devices in that group.
>=20
> We -might- be able to implement some kind of hotplug of individual
> devices of a group under such a PHB (PCI Host Bridge), I don't know =
for
> sure yet, some of that PAPR stuff is pretty arcane, but basically, for
> all intend and purpose, we really want a group to be represented as a
> PHB in the guest.
>=20
> We cannot arbitrary have individual devices of separate groups be
> represented in the guest as siblings on a single simulated PCI bus.

So would it make sense for you to go the same route that we need to go =
on embedded power, with a separate VFIO style interface that simply =
exports memory ranges and irq bindings, but doesn't know anything about =
PCI? For e500, we'll be using something like that to pass through a full =
PCI bus into the system.


Alex

^ permalink raw reply

* Re: [PATCH 2/2] [PowerPC Book3E] Introduce new ptrace debug feature flag
From: David Gibson @ 2011-08-24  4:00 UTC (permalink / raw)
  To: K.Prasad; +Cc: linuxppc-dev, Thiago Jung Bauermann, Edjunior Barbosa Machado
In-Reply-To: <20110823092756.GB2962@in.ibm.com>

On Tue, Aug 23, 2011 at 02:57:56PM +0530, K.Prasad wrote:
> On Tue, Aug 23, 2011 at 03:09:31PM +1000, David Gibson wrote:
> > On Fri, Aug 19, 2011 at 01:23:38PM +0530, K.Prasad wrote:
> > > 
> > > While PPC_PTRACE_SETHWDEBUG ptrace flag in PowerPC accepts
> > > PPC_BREAKPOINT_MODE_EXACT mode of breakpoint, the same is not intimated to the
> > > user-space debuggers (like GDB) who may want to use it. Hence we introduce a
> > > new PPC_DEBUG_FEATURE_DATA_BP_EXACT flag which will be populated on the
> > > "features" member of "struct ppc_debug_info" to advertise support for the
> > > same on Book3E PowerPC processors.
> > 
> > I thought the idea was that the BP_EXACT mode was the default - if the
> > new interface was supported at all, then BP_EXACT was always
> > supported.  So, why do you need a new flag?
> > 
> 
> Yes, BP_EXACT was always supported but not advertised through
> PPC_PTRACE_GETHWDBGINFO. We're now doing that.

I can see that.  But you haven't answered why.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply

* Re: [PATCH 1/2] [hw-breakpoint] Use generic hw-breakpoint interfaces for new PPC ptrace flags
From: David Gibson @ 2011-08-24  3:59 UTC (permalink / raw)
  To: K.Prasad; +Cc: linuxppc-dev, Thiago Jung Bauermann, Edjunior Barbosa Machado
In-Reply-To: <20110823092513.GA2962@in.ibm.com>

On Tue, Aug 23, 2011 at 02:55:13PM +0530, K.Prasad wrote:
> On Tue, Aug 23, 2011 at 03:08:50PM +1000, David Gibson wrote:
> > On Fri, Aug 19, 2011 at 01:21:36PM +0530, K.Prasad wrote:
> > > PPC_PTRACE_GETHWDBGINFO, PPC_PTRACE_SETHWDEBUG and PPC_PTRACE_DELHWDEBUG are
> > > PowerPC specific ptrace flags that use the watchpoint register. While they are
> > > targeted primarily towards BookE users, user-space applications such as GDB
> > > have started using them for BookS too.
> > > 
> > > This patch enables the use of generic hardware breakpoint interfaces for these
> > > new flags. The version number of the associated data structures
> > > "ppc_hw_breakpoint" and "ppc_debug_info" is incremented to denote new semantics.
> > 
> > So, the structure itself doesn't seem to have been extended.  I don't
> > understand what the semantic difference is - your patch comment needs
> > to explain this clearly.
> >
> 
> We had a request to extend the structure but thought it was dangerous to
> do so. For instance if the user-space used version1 of the structure,
> while kernel did a copy_to_user() pertaining to version2, then we'd run
> into problems. Unfortunately the ptrace flags weren't designed to accept
> a version number as input from the user through the
> PPC_PTRACE_GETHWDBGINFO flag (which would have solved this issue).

I still don't follow you.

> I'll add a comment w.r.t change in semantics - such as the ability to
> accept 'range' breakpoints in BookS.
>  
> > > Apart from the usual benefits of using generic hw-breakpoint interfaces, these
> > > changes allow debuggers (such as GDB) to use a common set of ptrace flags for
> > > their watchpoint needs and allow more precise breakpoint specification (length
> > > of the variable can be specified).
> > 
> > What is the mechanism for implementing the range breakpoint on book3s?
> > 
> 
> The hw-breakpoint interface, accepts length as an argument in BookS (any
> value <= 8 Bytes) and would filter out extraneous interrupts arising out
> of accesses outside the range comprising <addr, addr + len> inside
> hw_breakpoint_handler function.
> 
> We put that ability to use here.

Ah, so in hardware the breakpoints are always 8 bytes long, but you
filter out false hits on a shorter range?  Of course, the utility of
range breakpoints is questionable when length <=8, but the start must
be aligned on an 8-byte boundary.

[snip]
> > >  	if ((unsigned long)bp_info->addr >= TASK_SIZE)
> > >  		return -EIO;
> > >  
> > > @@ -1398,15 +1400,86 @@ static long ppc_set_hwdebug(struct task_struct *child,
> > >  		dabr |= DABR_DATA_READ;
> > >  	if (bp_info->trigger_type & PPC_BREAKPOINT_TRIGGER_WRITE)
> > >  		dabr |= DABR_DATA_WRITE;
> > > +#ifdef CONFIG_HAVE_HW_BREAKPOINT
> > > +	if (bp_info->version == 1)
> > > +		goto version_one;
> > 
> > There are several legitimate uses of goto in the kernel, but this is
> > definitely not one of them.  You're essentially using it to put the
> > old and new versions of the same function in one block.  Nasty.
> > 
> 
> Maybe it's the label that's causing bother here. It might look elegant
> if it was called something like exit_* or error_* :-)
> 
> The goto here helps reduce code, is similar to the error exits we use
> everywhere.

Rubbish, it is not an exception exit at all, it is two separate code
paths for the different versions which would be much clearer as two
different functions.

> > > +	if (ptrace_get_breakpoints(child) < 0)
> > > +		return -ESRCH;
> > >  
> > > -	child->thread.dabr = dabr;
> > > +	bp = thread->ptrace_bps[0];
> > > +	if (!bp_info->addr) {
> > > +		if (bp) {
> > > +			unregister_hw_breakpoint(bp);
> > > +			thread->ptrace_bps[0] = NULL;
> > > +		}
> > > +		ptrace_put_breakpoints(child);
> > > +		return 0;
> > 
> > Why are you making setting a 0 watchpoint remove the existing one (I
> > think that's what this does).  I thought there was an explicit del
> > breakpoint operation instead.
> 
> We had to define the semantics for what writing a 0 to DABR could mean,
> and I think it is intuitive to consider it as deletion
> request...couldn't think of a case where DABR with addr=0 and RW=1 would
> be required.

When a user space program maps pages at virtual address 0, which it
can do.

> > > +	}
> > > +	/*
> > > +	 * Check if the request is for 'range' breakpoints. We can
> > > +	 * support it if range < 8 bytes.
> > > +	 */
> > > +	if (bp_info->addr_mode == PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE)
> > > +		len = bp_info->addr2 - bp_info->addr;
> > 
> > So you compute the length here, but I don't see you ever test if it is
> > < 8 and return an error.
> > 
> 
> The hw-breakpoint interfaces would fail if the length was > 8.

Ok.

> > > +	else if (bp_info->addr_mode != PPC_BREAKPOINT_MODE_EXACT) {
> > > +			ptrace_put_breakpoints(child);
> > > +			return -EINVAL;
> > > +		}
> > > +	if (bp) {
> > > +		attr = bp->attr;
> > > +		attr.bp_addr = (unsigned long)bp_info->addr & ~HW_BREAKPOINT_ALIGN;
> > > +		arch_bp_generic_fields(dabr &
> > > +					(DABR_DATA_WRITE | DABR_DATA_READ),
> > > +							&attr.bp_type);
> > > +		attr.bp_len = len;
> > > +		ret =  modify_user_hw_breakpoint(bp, &attr);
> > > +		if (ret) {
> > > +			ptrace_put_breakpoints(child);
> > > +			return ret;
> > > +		}
> > > +		thread->ptrace_bps[0] = bp;
> > > +		ptrace_put_breakpoints(child);
> > > +		thread->dabr = dabr;
> > > +		return 0;
> > > +	}
> > >  
> > > +	/* Create a new breakpoint request if one doesn't exist already */
> > > +	hw_breakpoint_init(&attr);
> > > +	attr.bp_addr = (unsigned long)bp_info->addr & ~HW_BREAKPOINT_ALIGN;
> > 
> > You seem to be silently masking the given address, which seems
> > completely wrong.
> > 
> 
> We have two ways of looking at the input address.
> a) Assume that the input address is not multiplexed with the read/write
> bits and return -EINVAL (for not confirming to the 8-byte alignment
> requirement).
> b) Consider the input address to be encoded with the read/write
> watchpoint type request and align the address by default. This is how
> the code behaves presently for the !CONFIG_HAVE_HW_BREAKPOINT case.

Hrm, ok, but this needs commenting.

> I chose to go with b) and discard the last 3-bits from the address.
> 
> Thanks for the detailed review. Looking forward for your comments.
> 
> Thanks,
> K.Prasad
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Joerg Roedel @ 2011-08-24  8:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, David Gibson, chrisw,
	iommu, Avi Kivity, Anthony Liguori, linuxppc-dev, benve@cisco.com
In-Reply-To: <1314127809.2859.121.camel@bling.home>

On Tue, Aug 23, 2011 at 03:30:06PM -0400, Alex Williamson wrote:
> On Tue, 2011-08-23 at 07:01 +1000, Benjamin Herrenschmidt wrote:

> > Could be tho in what form ? returning sysfs pathes ?
> 
> I'm at a loss there, please suggest.  I think we need an ioctl that
> returns some kind of array of devices within the group and another that
> maybe takes an index from that array and returns an fd for that device.
> A sysfs path string might be a reasonable array element, but it sounds
> like a pain to work with.

Limiting to PCI we can just pass the BDF as the argument to optain the
device-fd. For a more generic solution we need a unique identifier in
some way which is unique across all 'struct device' instances in the
system. As far as I know we don't have that yet (besides the sysfs-path)
so we either add that or stick with bus-specific solutions.

> > 1:1 process has the advantage of linking to an -mm which makes the whole
> > mmu notifier business doable. How do you want to track down mappings and
> > do the second level translation in the case of explicit map/unmap (like
> > on power) if you are not tied to an mm_struct ?
> 
> Right, I threw away the mmu notifier code that was originally part of
> vfio because we can't do anything useful with it yet on x86.  I
> definitely don't want to prevent it where it makes sense though.  Maybe
> we just record current->mm on open and restrict subsequent opens to the
> same.

Hmm, I think we need io-page-fault support in the iommu-api then.

> > Another aspect I don't see discussed is how we represent these things to
> > the guest.
> > 
> > On Power for example, I have a requirement that a given iommu domain is
> > represented by a single dma window property in the device-tree. What
> > that means is that that property needs to be either in the node of the
> > device itself if there's only one device in the group or in a parent
> > node (ie a bridge or host bridge) if there are multiple devices.
> > 
> > Now I do -not- want to go down the path of simulating P2P bridges,
> > besides we'll quickly run out of bus numbers if we go there.
> > 
> > For us the most simple and logical approach (which is also what pHyp
> > uses and what Linux handles well) is really to expose a given PCI host
> > bridge per group to the guest. Believe it or not, it makes things
> > easier :-)
> 
> I'm all for easier.  Why does exposing the bridge use less bus numbers
> than emulating a bridge?
> 
> On x86, I want to maintain that our default assignment is at the device
> level.  A user should be able to pick single or multiple devices from
> across several groups and have them all show up as individual,
> hotpluggable devices on bus 0 in the guest.  Not surprisingly, we've
> also seen cases where users try to attach a bridge to the guest,
> assuming they'll get all the devices below the bridge, so I'd be in
> favor of making this "just work" if possible too, though we may have to
> prevent hotplug of those.

A side-note: Might it be better to expose assigned devices in a guest on
a seperate bus? This will make it easier to emulate an IOMMU for the
guest inside qemu.


	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Roedel, Joerg @ 2011-08-24  8:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	qemu-devel, chrisw, iommu, Avi Kivity, Anthony Liguori,
	linux-pci@vger.kernel.org, linuxppc-dev, benve@cisco.com
In-Reply-To: <1314119311.2859.59.camel@bling.home>

On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote:
> On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:

> > Handling it through fds is a good idea. This makes sure that everything
> > belongs to one process. I am not really sure yet if we go the way to
> > just bind plain groups together or if we create meta-groups. The
> > meta-groups thing seems somewhat cleaner, though.
> 
> I'm leaning towards binding because we need to make it dynamic, but I
> don't really have a good picture of the lifecycle of a meta-group.

In my view the life-cycle of the meta-group is a subrange of the
qemu-instance's life-cycle.

> > Putting the process to sleep (which would be uninterruptible) seems bad.
> > The process would sleep until the guest releases the device-group, which
> > can take days or months.
> > The best thing (and the most intrusive :-) ) is to change PCI core to
> > allow unbindings to fail, I think. But this probably further complicates
> > the way to upstream VFIO...
> 
> Yes, it's not ideal but I think it's sufficient for now and if we later
> get support for returning an error from release, we can set a timeout
> after notifying the user to make use of that.  Thanks,

Ben had the idea of just forcing to hard-unplug this device from the
guest. Thats probably the best way to deal with that, I think. VFIO
sends a notification to qemu that the device is gone and qemu informs
the guest in some way about it.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Roedel, Joerg @ 2011-08-24  8:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, iommu, chrisw,
	Alex Williamson, Avi Kivity, Anthony Liguori, linuxppc-dev,
	benve@cisco.com
In-Reply-To: <1314142537.30478.59.camel@pasglop>

On Tue, Aug 23, 2011 at 07:35:37PM -0400, Benjamin Herrenschmidt wrote:
> On Tue, 2011-08-23 at 15:18 +0200, Roedel, Joerg wrote:

> > Hmm, good idea. But as far as I know the hotplug-event needs to be in
> > the guest _before_ the device is actually unplugged (so that the guest
> > can unbind its driver first). That somehow brings back the sleep-idea
> > and the timeout in the .release function.
> 
> That's for normal assisted hotplug, but don't we support hard hotplug ?
> I mean, things like cardbus, thunderbolt (if we ever support that)
> etc... will need it and some platforms do support hard hotplug of PCIe
> devices.
> 
> (That's why drivers should never spin on MMIO waiting for a 1 bit to
> clear without a timeout :-)

Right, thats probably the best semantic for this issue then. The worst
thing that happens is that the admin crashed the guest.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Joerg Roedel @ 2011-08-24  9:10 UTC (permalink / raw)
  To: Aaron Fabbri
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, iommu, chrisw,
	Alex Williamson, Avi Kivity, Anthony Liguori, linuxppc-dev,
	benve@cisco.com
In-Reply-To: <CA79326A.FB97%aafabbri@cisco.com>

On Tue, Aug 23, 2011 at 01:33:14PM -0400, Aaron Fabbri wrote:
> On 8/23/11 10:01 AM, "Alex Williamson" <alex.williamson@redhat.com> wrote:
> > The iommu domain would probably be allocated when the first device is
> > bound to vfio.  As each device is bound, it gets attached to the group.
> > DMAs are done via an ioctl on the group.
> > 
> > I think group + uiommu leads to effectively reliving most of the
> > problems with the current code.  The only benefit is the group
> > assignment to enforce hardware restrictions.  We still have the problem
> > that uiommu open() = iommu_domain_alloc(), whose properties are
> > meaningless without attached devices (groups).  Which I think leads to
> > the same awkward model of attaching groups to define the domain, then we
> > end up doing mappings via the group to enforce ordering.
> 
> Is there a better way to allow groups to share an IOMMU domain?
> 
> Maybe, instead of having an ioctl to allow a group A to inherit the same
> iommu domain as group B, we could have an ioctl to fully merge two groups
> (could be what Ben was thinking):
> 
> A.ioctl(MERGE_TO_GROUP, B)
> 
> The group A now goes away and its devices join group B.  If A ever had an
> iommu domain assigned (and buffers mapped?) we fail.
> 
> Groups cannot get smaller (they are defined as minimum granularity of an
> IOMMU, initially).  They can get bigger if you want to share IOMMU
> resources, though.
> 
> Any downsides to this approach?

As long as this is a 2-way road its fine. There must be a way to split
the groups again after the guest exits. But then we are again at the
super-groups (aka meta-groups, aka uiommu) point.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Roedel, Joerg @ 2011-08-24  9:14 UTC (permalink / raw)
  To: aafabbri
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, chrisw, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve@cisco.com
In-Reply-To: <CA792953.FB87%aafabbri@cisco.com>

On Tue, Aug 23, 2011 at 12:54:27PM -0400, aafabbri wrote:
> On 8/23/11 4:04 AM, "Joerg Roedel" <joerg.roedel@amd.com> wrote:
> > That is makes uiommu basically the same as the meta-groups, right?
> 
> Yes, functionality seems the same, thus my suggestion to keep uiommu
> explicit.  Is there some need for group-groups besides defining sets of
> groups which share IOMMU resources?
> 
> I do all this stuff (bringing up sets of devices which may share IOMMU
> domain) dynamically from C applications.  I don't really want some static
> (boot-time or sysfs fiddling) supergroup config unless there is a good
> reason KVM/power needs it.
> 
> As you say in your next email, doing it all from ioctls is very easy,
> programmatically.

I don't see a reason to make this meta-grouping static. It would harm
flexibility on x86. I think it makes things easier on power but there
are options on that platform to get the dynamic solution too.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply

* RE: [PATCH] fsl-rio: Correct IECSR register clear value
From: Liu Gang-B34182 @ 2011-08-24  9:31 UTC (permalink / raw)
  To: 'akpm@linux-foundation.org',
	'linuxppc-dev@lists.ozlabs.org'
  Cc: Gala Kumar-B11780, Li Yang-R58472, Zang Roy-R61911
In-Reply-To: <1312798419-3912-1-git-send-email-b34182@freescale.com>

Hi, Andrew,

Thank you for applying the patch "[PATCH] rio: Use discovered bit to test i=
f enumeration is complete "!

So far the following patch "[PATCH] fsl-rio: Correct IECSR register clear v=
alue " has no comment or response.

So could you please apply this patch into your tree?

Thanks a lot!

Regards,

Liu Gang
-----Original Message-----
From: Liu Gang-B34182=20
Sent: Monday, August 08, 2011 6:14 PM
To: linuxppc-dev@ozlabs.org
Cc: akpm@linux-foundation.org; Li Yang-R58472; Gala Kumar-B11780; Zang Roy-=
R61911; Liu Gang-B34182; Liu Gang-B34182
Subject: [PATCH] fsl-rio: Correct IECSR register clear value

The RETE bit in IECSR is cleared by writing a 1 to it.

Signed-off-by: Liu Gang <Gang.Liu@freescale.com>
---
 arch/powerpc/sysdev/fsl_rio.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/fsl_rio.c b/arch/powerpc/sysdev/fsl_rio.c =
index b3fd081..cdd765b 100644
--- a/arch/powerpc/sysdev/fsl_rio.c
+++ b/arch/powerpc/sysdev/fsl_rio.c
@@ -54,6 +54,7 @@
 #define ODSR_CLEAR		0x1c00
 #define LTLEECSR_ENABLE_ALL	0xFFC000FC
 #define ESCSR_CLEAR		0x07120204
+#define IECSR_CLEAR		0x80000000
=20
 #define RIO_PORT1_EDCSR		0x0640
 #define RIO_PORT2_EDCSR		0x0680
@@ -1089,11 +1090,11 @@ static void port_error_handler(struct rio_mport *po=
rt, int offset)
=20
 	if (offset =3D=3D 0) {
 		out_be32((u32 *)(rio_regs_win + RIO_PORT1_EDCSR), 0);
-		out_be32((u32 *)(rio_regs_win + RIO_PORT1_IECSR), 0);
+		out_be32((u32 *)(rio_regs_win + RIO_PORT1_IECSR), IECSR_CLEAR);
 		out_be32((u32 *)(rio_regs_win + RIO_ESCSR), ESCSR_CLEAR);
 	} else {
 		out_be32((u32 *)(rio_regs_win + RIO_PORT2_EDCSR), 0);
-		out_be32((u32 *)(rio_regs_win + RIO_PORT2_IECSR), 0);
+		out_be32((u32 *)(rio_regs_win + RIO_PORT2_IECSR), IECSR_CLEAR);
 		out_be32((u32 *)(rio_regs_win + RIO_PORT2_ESCSR), ESCSR_CLEAR);
 	}
 }
--
1.7.3.1

^ permalink raw reply related

* Re: kvm PCI assignment & VFIO ramblings
From: David Gibson @ 2011-08-24  9:33 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: chrisw, Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, aafabbri, iommu,
	Avi Kivity, Anthony Liguori, linuxppc-dev, benve@cisco.com
In-Reply-To: <20110824091425.GE2079@amd.com>

On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
> On Tue, Aug 23, 2011 at 12:54:27PM -0400, aafabbri wrote:
> > On 8/23/11 4:04 AM, "Joerg Roedel" <joerg.roedel@amd.com> wrote:
> > > That is makes uiommu basically the same as the meta-groups, right?
> > 
> > Yes, functionality seems the same, thus my suggestion to keep uiommu
> > explicit.  Is there some need for group-groups besides defining sets of
> > groups which share IOMMU resources?
> > 
> > I do all this stuff (bringing up sets of devices which may share IOMMU
> > domain) dynamically from C applications.  I don't really want some static
> > (boot-time or sysfs fiddling) supergroup config unless there is a good
> > reason KVM/power needs it.
> > 
> > As you say in your next email, doing it all from ioctls is very easy,
> > programmatically.
> 
> I don't see a reason to make this meta-grouping static. It would harm
> flexibility on x86. I think it makes things easier on power but there
> are options on that platform to get the dynamic solution too.

I think several people are misreading what Ben means by "static".  I
would prefer to say 'persistent', in that the meta-groups lifetime is
not tied to an fd, but they can be freely created, altered and removed
during runtime.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox