* Re: [PATCH] PSeries: Cancel RTAS event scan before firmware flash
From: Suzuki Poulose @ 2011-08-30 6:17 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: mikey, Ravi K. Nittala, sbest, antonb, subrata.modak, ranittal,
linuxppc-dev, divya.vikas
In-Reply-To: <1314684237.2488.77.camel@pasglop>
On 08/30/11 11:33, Benjamin Herrenschmidt wrote:
> On Wed, 2011-07-27 at 17:39 +0530, Ravi K. Nittala wrote:
>> The firmware flash update is conducted using an RTAS call, that is serialized
>> by lock_rtas() which uses spin_lock. rtasd keeps scanning for the RTAS events
>> generated on the machine. This is performed via a delayed workqueue, invoking
>> an RTAS call to scan the events.
>>
>> The flash update takes a while to complete and during this time, any other
>> RTAS call has to wait. In this case, rtas_event_scan() waits for a long time
>> on the spin_lock resulting in a soft lockup.
>>
>> Approaches to fix the issue :
>>
>> Approach 1: Stop all the other CPUs before we start flashing the firmware.
>>
>> Before the rtas firmware update starts, all other CPUs should be stopped.
>> Which means no other CPU should be in lock_rtas(). We do not want other CPUs
>> to execute while the FW update is in progress, and the system will be rebooted anyway
>> after the update.
>
> Shouldn't we resume the event scan after the flash ?
>
The flash operation is performed in the reboot path at the very end.
So, even if we restart the event scan, the thread may not be able to process
the events. Hence we thought we would leave it stopped.
Again, we do not have much expertise in deciding which is the best thing to do.
We could resume the event scan, if you think that is needed.
Thanks for the review.
Suzuki
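
For illustration only, the ordering being discussed (cancel the scan, flash,
and restart the scan only if the flash returns without a reboot) can be
modelled in plain user-space C. This is a toy model, not kernel code: struct
scan_work, tick() and flash_firmware() are hypothetical names, and
cancel_event_scan() merely mimics the return convention of
cancel_delayed_work_sync() used in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the rtasd event-scan work item. */
struct scan_work {
	bool pending;    /* true while a scan is queued to run */
	int  rtas_calls; /* number of scan RTAS calls issued so far */
};

/* Mimics cancel_delayed_work_sync(): returns true if work was pending. */
static bool cancel_event_scan(struct scan_work *w)
{
	bool was_pending = w->pending;

	assert(w != 0);
	w->pending = false;
	return was_pending;
}

static void start_event_scan(struct scan_work *w)
{
	w->pending = true;
}

/* One timer tick: the scan issues an RTAS call only if it is still queued. */
static void tick(struct scan_work *w)
{
	if (w->pending)
		w->rtas_calls++;
}

/*
 * flash_ok stands in for the outcome of the firmware call. The scan is
 * restarted only when the flash fails, i.e. when the system does not reboot.
 */
static bool flash_firmware(struct scan_work *w, bool flash_ok, int ticks)
{
	bool resume = cancel_event_scan(w);

	for (int i = 0; i < ticks; i++)
		tick(w); /* no scan contends for the RTAS lock here */

	if (!flash_ok && resume)
		start_event_scan(w);
	return flash_ok;
}
```

In this model no scan call is issued while the flash is in progress, and the
scan comes back only on the failure path, which is the one case where the
system keeps running afterwards.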
^ permalink raw reply
* Re: [PATCH] powerpc: Fix xmon for systems without MSR[RI]
From: Benjamin Herrenschmidt @ 2011-08-30 6:11 UTC (permalink / raw)
To: Jimi Xenidis; +Cc: linuxppc-dev
In-Reply-To: <1312838739-20660-1-git-send-email-jimix@pobox.com>
On Mon, 2011-08-08 at 16:25 -0500, Jimi Xenidis wrote:
> From: David Gibson <dwg@au1.ibm.com>
>
> Based on patch by David Gibson <dwg@au1.ibm.com>
>
> xmon has a longstanding bug on systems which are SMP-capable but lack
> the MSR[RI] bit. In these cases, xmon invoked by IPI on secondary
> CPUs will not properly keep quiet, but will print stuff, thereby
> garbling the primary xmon's output. This patch fixes it, by ignoring
> the RI bit if the processor does not support it.
>
> There's already a version of this for 4xx upstream, which we'll need
> to extend to other RI-lacking CPUs at some point. For now this adds
> BookE processors to the mix.
Don't the Freescale ones have RI?
Cheers,
Ben.
> Signed-off-by: Jimi Xenidis <jimix@pobox.com>
> ---
> arch/powerpc/xmon/xmon.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> index 42541bb..fdb2f7e 100644
> --- a/arch/powerpc/xmon/xmon.c
> +++ b/arch/powerpc/xmon/xmon.c
> @@ -340,8 +340,8 @@ int cpus_are_in_xmon(void)
>
> static inline int unrecoverable_excp(struct pt_regs *regs)
> {
> -#ifdef CONFIG_4xx
> - /* We have no MSR_RI bit on 4xx, so we simply return false */
> +#if defined(CONFIG_4xx) || defined(CONFIG_BOOKE)
> + /* We have no MSR_RI bit on 4xx or Book3e, so we simply return false */
> return 0;
> #else
> return ((regs->msr & MSR_RI) == 0);
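
For illustration only: if some BookE parts do have RI, the check could key off
a per-CPU property instead of a compile-time option. The sketch below is
user-space C, not xmon code. struct cpu_desc and its has_msr_ri field are
hypothetical, and the MSR_RI value is an assumption made so the example
compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MSR_RI 0x2UL /* assumed value of the recoverable-interrupt bit */

/* Hypothetical per-CPU description; not a real kernel structure. */
struct cpu_desc {
	bool has_msr_ri;
};

/* Returns true if the exception must be treated as unrecoverable. */
static bool unrecoverable_excp(const struct cpu_desc *cpu, unsigned long msr)
{
	assert(cpu != NULL);

	if (!cpu->has_msr_ri)
		return false; /* no RI bit on this CPU: nothing to test */

	return (msr & MSR_RI) == 0;
}
```

With a runtime flag like this, a single kernel image could cover both CPUs
that implement RI and CPUs that lack it, at the cost of one extra load on the
xmon entry path.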
^ permalink raw reply
* Re: [PATCH] PSeries: Cancel RTAS event scan before firmware flash
From: Benjamin Herrenschmidt @ 2011-08-30 6:03 UTC (permalink / raw)
To: Ravi K. Nittala
Cc: antonb, sbest, mikey, subrata.modak, suzuki, ranittal,
linuxppc-dev, divya.vikas
In-Reply-To: <20110727120801.10429.7276.stgit@localhost6.localdomain6>
On Wed, 2011-07-27 at 17:39 +0530, Ravi K. Nittala wrote:
> The firmware flash update is conducted using an RTAS call, that is serialized
> by lock_rtas() which uses spin_lock. rtasd keeps scanning for the RTAS events
> generated on the machine. This is performed via a delayed workqueue, invoking
> an RTAS call to scan the events.
>
> The flash update takes a while to complete and during this time, any other
> RTAS call has to wait. In this case, rtas_event_scan() waits for a long time
> on the spin_lock resulting in a soft lockup.
>
> Approaches to fix the issue :
>
> Approach 1: Stop all the other CPUs before we start flashing the firmware.
>
> Before the rtas firmware update starts, all other CPUs should be stopped.
> Which means no other CPU should be in lock_rtas(). We do not want other CPUs
> to execute while the FW update is in progress, and the system will be rebooted anyway
> after the update.
Shouldn't we resume the event scan after the flash ?
Apart from that, no objection to the approach.
Cheers,
Ben.
> --- arch/powerpc/kernel/setup-common.c.orig 2011-07-01 22:41:12.952507971 -0400
> +++ arch/powerpc/kernel/setup-common.c 2011-07-01 22:48:31.182507915 -0400
> @@ -109,11 +109,12 @@ void machine_shutdown(void)
> void machine_restart(char *cmd)
> {
> machine_shutdown();
> - if (ppc_md.restart)
> - ppc_md.restart(cmd);
> #ifdef CONFIG_SMP
> - smp_send_stop();
> + smp_send_stop();
> #endif
> + if (ppc_md.restart)
> + ppc_md.restart(cmd);
> +
> printk(KERN_EMERG "System Halted, OK to turn off power\n");
> local_irq_disable();
> while (1) ;
>
> Problems with this approach:
> Stopping the CPUs suddenly may cause other serious problems depending on what
> was running on them. Hence, this approach cannot be considered.
>
>
> Approach 2: Cancel the rtas_scan_event work before starting the firmware flash.
>
> Just before the flash update is performed, the queued rtas_event_scan() work
> item is cancelled from the work queue so that there is no other RTAS call
> issued while the flash is in progress. After the flash completes, the system
> reboots and the rtas_event_scan() is rescheduled.
>
> Approach 2 looks to be a better solution than Approach 1. Kindly let us know
> your thoughts. Patch attached.
>
>
> Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
> Signed-off-by: Ravi Nittala <ravi.nittala@in.ibm.com>
>
>
> ---
> arch/powerpc/include/asm/rtas.h | 2 ++
> arch/powerpc/kernel/rtas_flash.c | 6 ++++++
> arch/powerpc/kernel/rtasd.c | 6 ++++++
> 3 files changed, 14 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
> index 58625d1..3f26f87 100644
> --- a/arch/powerpc/include/asm/rtas.h
> +++ b/arch/powerpc/include/asm/rtas.h
> @@ -245,6 +245,8 @@ extern int early_init_dt_scan_rtas(unsigned long node,
>
> extern void pSeries_log_error(char *buf, unsigned int err_type, int fatal);
>
> +extern bool rtas_cancel_event_scan(void);
> +
> /* Error types logged. */
> #define ERR_FLAG_ALREADY_LOGGED 0x0
> #define ERR_FLAG_BOOT 0x1 /* log was pulled from NVRAM on boot */
> diff --git a/arch/powerpc/kernel/rtas_flash.c b/arch/powerpc/kernel/rtas_flash.c
> index e037c74..4174b4b 100644
> --- a/arch/powerpc/kernel/rtas_flash.c
> +++ b/arch/powerpc/kernel/rtas_flash.c
> @@ -568,6 +568,12 @@ static void rtas_flash_firmware(int reboot_type)
> }
>
> /*
> + * Just before starting the firmware flash, cancel the event scan work
> + * to avoid any soft lockup issues.
> + */
> + rtas_cancel_event_scan();
> +
> + /*
> * NOTE: the "first" block must be under 4GB, so we create
> * an entry with no data blocks in the reserved buffer in
> * the kernel data segment.
> diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
> index 481ef06..e8f03fa 100644
> --- a/arch/powerpc/kernel/rtasd.c
> +++ b/arch/powerpc/kernel/rtasd.c
> @@ -472,6 +472,12 @@ static void start_event_scan(void)
> &event_scan_work, event_scan_delay);
> }
>
> +/* Cancel the rtas event scan work */
> +bool rtas_cancel_event_scan(void)
> +{
> + return cancel_delayed_work_sync(&event_scan_work);
> +}
> +
> static int __init rtas_init(void)
> {
> struct proc_dir_entry *entry;
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
^ permalink raw reply
* Re: [v3 PATCH 1/1] booke/kprobe: make program exception to use one dedicated exception stack
From: Benjamin Herrenschmidt @ 2011-08-30 5:44 UTC (permalink / raw)
To: tiejun.chen; +Cc: Scott Wood, linuxppc-dev@ozlabs.org
In-Reply-To: <4E2561F0.5040701@windriver.com>
> > As I understand it, the problem comes from the fact that stwu combines the
> > creation of a stack frame with storing into that stack frame. If they were
>
> Yes.
>
> > separate instructions you'd have a new exception frame at a lower address
> > by the time you actually store to the non-exception frame.
>
> So when kprobe we should use a unique stack frame to skip that stack frame the
> kprobed stwu want to create.
I still don't like that patch. Potentially the problem exists for all
variants of powerpc, not just booke, and I'm not sure I like adding yet
another exception stack.
Another (non-great) approach would be to special case stwu to the stack,
and instead of doing the store while emulating the instruction, keep the
store address around and do it later, after the stack has been unwound,
in the exit path (a TIF flag to hit the slow path and then do it in the
slow path).
It sounds hackish, but it makes it easier to fix everybody at once. There
are "issues" with changing stacks, especially on ppc64, and ppc64 would
definitely be affected as well if the stack frame created is larger than
our gap.
Cheers,
Ben.
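
For illustration only, the deferred-store idea can be modelled in user-space
C. This is a toy, not kernel code: struct regs, the mem[] array and
exit_slow_path() are hypothetical, and the store_pending field stands in for
the TIF flag mentioned above. The emulation updates r1 immediately but
postpones the store of the back chain until the exit path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MEM_WORDS 64

struct regs {
	uint32_t gpr[32];
	bool     store_pending; /* stands in for the TIF flag */
	uint32_t pending_ea;    /* where the deferred store goes */
	uint32_t pending_val;   /* what the deferred store writes */
};

static uint32_t mem[MEM_WORDS]; /* toy memory, indexed by ea / 4 */

static void store_word(uint32_t ea, uint32_t val)
{
	assert(ea % 4 == 0 && ea / 4 < MEM_WORDS);
	mem[ea / 4] = val;
}

/* Emulates "stwu rs, d(ra)": EA = (ra) + d; MEM(EA) = (rs); ra = EA. */
static void emulate_stwu(struct regs *r, int rs, int ra, int32_t d)
{
	uint32_t ea = r->gpr[ra] + (uint32_t)d;
	uint32_t val = r->gpr[rs]; /* read before ra is updated */

	if (ra == 1) {
		/* Store to the stack: remember it, do it later. */
		r->pending_ea = ea;
		r->pending_val = val;
		r->store_pending = true;
	} else {
		store_word(ea, val);
	}
	r->gpr[ra] = ea;
}

/* Exit path, after the exception frame has been unwound. */
static void exit_slow_path(struct regs *r)
{
	if (r->store_pending) {
		store_word(r->pending_ea, r->pending_val);
		r->store_pending = false;
	}
}
```

The point of the split is that nothing is written below the old stack pointer
while the exception frame still lives there; the write happens only once that
frame is gone.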
^ permalink raw reply
* Re: [PATCH v9 11/13] powerpc: select HAVE_SECCOMP_FILTER and provide seccomp_execve
From: Benjamin Herrenschmidt @ 2011-08-30 5:28 UTC (permalink / raw)
To: Will Drewry
Cc: linuxppc-dev, fweisbec, scarybeasts, djm, linux-kernel, rostedt,
segoon, tglx, Paul Mackerras, kees.cook, jmorris, torvalds, mingo
In-Reply-To: <1308875813-20122-11-git-send-email-wad@chromium.org>
On Thu, 2011-06-23 at 19:36 -0500, Will Drewry wrote:
> Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
> system call numbering for execve and selecting HAVE_SECCOMP_FILTER.
>
> v9: rebase on to bccaeafd7c117acee36e90d37c7e05c19be9e7bf
>
> Signed-off-by: Will Drewry <wad@chromium.org>
Seen these around for a while ... :-)
I don't see any harm in the patches per se, though I haven't reviewed the
actual seccomp filter stuff and its good (or bad) behaviour on ppc.
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cheers,
Ben.
> ---
> arch/powerpc/Kconfig | 1 +
> arch/powerpc/include/asm/seccomp.h | 2 ++
> 2 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 2729c66..030d392 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -129,6 +129,7 @@ config PPC
> select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
> select HAVE_GENERIC_HARDIRQS
> select HAVE_SPARSE_IRQ
> + select HAVE_SECCOMP_FILTER
> select IRQ_PER_CPU
> select GENERIC_IRQ_SHOW
> select GENERIC_IRQ_SHOW_LEVEL
> diff --git a/arch/powerpc/include/asm/seccomp.h b/arch/powerpc/include/asm/seccomp.h
> index 00c1d91..3cb9cc1 100644
> --- a/arch/powerpc/include/asm/seccomp.h
> +++ b/arch/powerpc/include/asm/seccomp.h
> @@ -7,10 +7,12 @@
> #define __NR_seccomp_write __NR_write
> #define __NR_seccomp_exit __NR_exit
> #define __NR_seccomp_sigreturn __NR_rt_sigreturn
> +#define __NR_seccomp_execve __NR_execve
>
> #define __NR_seccomp_read_32 __NR_read
> #define __NR_seccomp_write_32 __NR_write
> #define __NR_seccomp_exit_32 __NR_exit
> #define __NR_seccomp_sigreturn_32 __NR_sigreturn
> +#define __NR_seccomp_execve_32 __NR_execve
>
> #endif /* _ASM_POWERPC_SECCOMP_H */
^ permalink raw reply
* [PATCH] perf events, powerpc: Add POWER7 stalled-cycles-frontend/backend events
From: Anshuman Khandual @ 2011-08-30 4:43 UTC (permalink / raw)
To: linux-kernel, linuxppc-dev, Paul Mackerras
perf events, powerpc: Add POWER7 stalled-cycles-frontend/backend events
Extend the POWER7 PMU driver with definitions
for generic front-end and back-end stall events.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
diff --git a/arch/powerpc/kernel/power7-pmu.c b/arch/powerpc/kernel/power7-pmu.c
index 593740f..e5d2844 100644
--- a/arch/powerpc/kernel/power7-pmu.c
+++ b/arch/powerpc/kernel/power7-pmu.c
@@ -297,6 +297,8 @@ static void power7_disable_pmc(unsigned int pmc, unsigned long mmcr[])
static int power7_generic_events[] = {
[PERF_COUNT_HW_CPU_CYCLES] = 0x1e,
+ [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x100f8, /* GCT_NOSLOT_CYC */
+ [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x4000a, /* CMPLU_STALL */
[PERF_COUNT_HW_INSTRUCTIONS] = 2,
[PERF_COUNT_HW_CACHE_REFERENCES] = 0xc880, /* LD_REF_L1_LSU*/
[PERF_COUNT_HW_CACHE_MISSES] = 0x400f0, /* LD_MISS_L1 */
--
Anshuman Khandual
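
For illustration only, here is how a table like power7_generic_events[] maps
a generic event index to a raw PMU code. This is user-space C with a local
stand-in enum (the real indices come from the perf_event header), and treating
a zero entry as "not mapped" is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the generic perf event indices. */
enum hw_event {
	HW_CPU_CYCLES,
	HW_INSTRUCTIONS,
	HW_CACHE_REFERENCES,
	HW_CACHE_MISSES,
	HW_STALLED_CYCLES_FRONTEND,
	HW_STALLED_CYCLES_BACKEND,
	HW_EVENT_MAX
};

/* Raw codes copied from the patch; initializer order does not matter. */
static const int power7_events[HW_EVENT_MAX] = {
	[HW_CPU_CYCLES]              = 0x1e,
	[HW_STALLED_CYCLES_FRONTEND] = 0x100f8, /* GCT_NOSLOT_CYC */
	[HW_STALLED_CYCLES_BACKEND]  = 0x4000a, /* CMPLU_STALL */
	[HW_INSTRUCTIONS]            = 2,
	[HW_CACHE_REFERENCES]        = 0xc880,  /* LD_REF_L1_LSU */
	[HW_CACHE_MISSES]            = 0x400f0, /* LD_MISS_L1 */
};

/* Returns the raw code, or -1 if the generic event is out of range or unmapped. */
static int raw_event_code(const int *table, size_t n, unsigned int id)
{
	assert(table != NULL);

	if (id >= n || table[id] == 0)
		return -1;
	return table[id];
}
```

Because the table uses designated initializers, adding the two stall entries
in the middle of the list, as the patch does, leaves every other index
untouched.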
^ permalink raw reply related
* Re: VFIO v2 design plan
From: Alex Williamson @ 2011-08-30 4:24 UTC (permalink / raw)
To: David Gibson
Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras,
Roedel, Joerg, agraf, qemu-devel, chrisw, iommu, Avi Kivity,
Anthony Liguori, linux-pci@vger.kernel.org, linuxppc-dev, benve
In-Reply-To: <20110830030439.GF4254@yookeroo.fritz.box>
On Tue, 2011-08-30 at 13:04 +1000, David Gibson wrote:
> On Fri, Aug 26, 2011 at 11:05:23AM -0600, Alex Williamson wrote:
> >
> > I don't think too much has changed since the previous email went out,
> > but it seems like a good idea to post a summary in case there were
> > suggestions or objections that I missed.
> >
> > VFIO v2 will rely on the platform iommu driver reporting grouping
> > information. Again, a group is a set of devices for which the iommu
> > cannot differentiate transactions. An example would be a set of devices
> > behind a PCI-to-PCI bridge. All transactions appear to be from the
> > bridge itself rather than devices behind the bridge. Platforms are free
> > to have whatever constraints they need to for what constitutes a group.
> >
> > I posted a rough draft of patch to implement that for the base iommu
> > driver and VT-d, adding an iommu_device_group callback on iommu ops.
> > The iommu base driver also populates an iommu_group sysfs file for each
> > device that's part of a group. Members of the same group return the
> > same value via either the sysfs or iommu_device_group. The value
> > returned is arbitrary, should not be assumed to be persistent across
> > boots, and is left to the iommu driver to generate. There are some
> > implementation details around how to do this without favoring one bus
> > over another, but the interface should be bus/device type agnostic in
> > the end.
> >
> > When the vfio module is loaded, character devices will be created for
> > each group in /dev/vfio/$GROUP. Setting file permissions on these files
> > should be sufficient for providing a user with complete access to the
> > group. Opening this device file provides what we'll call the "group
> > fd". The group fd is restricted to only work with a single mm context.
> > Concurrent opens will be denied if the opening process mm does not
> > match. The group fd will provide interfaces for enumerating the devices
> > in the group, returning a file descriptor for each device in the group
> > (the "device fd"), binding groups together, and returning a file
> > descriptor for iommu operations (the "iommu fd").
> >
> > A group is "viable" when all member devices of the group are bound to
> > the vfio driver. Until that point, the group fd only allows enumeration
> > interfaces (ie. listing of group devices). I'm currently thinking
> > enumeration will be done by a simple read() on the device file returning
> > a list of dev_name()s.
>
> Ok. Are you envisaging this interface as a virtual file, or as a
> stream? That is, can you seek around the list of devices like a
> regular file - in which case, what are the precise semantics when the
> list is changed by a bind - or is there no meaningful notion of file
> pointer and read() just gives you the next device - in which case how
> do you rewind to enumerate the group again?
I was implementing it as a virtual file that gets generated on read()
(see example in note[2] below). It is a bit clunky as reading it a byte
at a time could experience races w/ device add/remove. If it's read all
at once, it's an accurate snapshot. Suggestions welcome, this just
seemed easier than trying to stuff it into a struct for an ioctl. For a
while I thought I could do a VFIO_GROUP_GET_NUM_DEVICES +
VFIO_GROUP_GET_DEVICE_INDEX, but that assumes device stability, which I
don't think we can guarantee.
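
For illustration only, a user of the read()-based enumeration could pull the
whole listing in one read() and then parse the snapshot. The sketch below is
user-space C and covers only the parsing step. It assumes the "group:" /
"device:" line format proposed in note [2], which is not a settled interface,
and struct dev_entry and parse_group_listing() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct dev_entry {
	unsigned int group; /* group the device was listed under */
	char name[32];      /* dev_name() string, e.g. a PCI address */
};

/* Returns the number of devices parsed, or -1 on malformed input. */
static int parse_group_listing(const char *text, struct dev_entry *out,
			       size_t max)
{
	unsigned int group = 0;
	int have_group = 0;
	size_t n = 0;

	assert(text != NULL && out != NULL);

	while (*text != '\0') {
		size_t len = strcspn(text, "\n");
		char line[64];
		char name[32];

		if (len >= sizeof(line))
			return -1;
		memcpy(line, text, len);
		line[len] = '\0';

		if (sscanf(line, " group: %u", &group) == 1) {
			have_group = 1;
		} else if (sscanf(line, " device: %31s", name) == 1) {
			if (!have_group || n == max)
				return -1; /* device before any group, or no room */
			out[n].group = group;
			strcpy(out[n].name, name);
			n++;
		} else if (len != 0) {
			return -1; /* unknown non-empty line */
		}

		text += len;
		if (*text == '\n')
			text++;
	}
	return (int)n;
}
```

Parsing a buffer obtained from a single read() sidesteps the byte-at-a-time
race described above, since the parser never goes back to the file.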
> > Once the group is viable, the user may bind the
> > group to another group, retrieve the iommu fd, or retrieve device fds.
> > Internally, each of these operations will result in an iommu domain
> > being allocated and all of the devices attached to the domain.
> >
> > The purpose of binding groups is to share the iommu domain. Groups
> > making use of incompatible iommu domains will fail to bind. Groups
> > making use of different mm's will fail to bind. The vfio driver may
> > reject some binding based on domain capabilities, but final veto power
> > is left to the iommu driver[1]. If a user makes use of a group
> > independently and later wishes to bind it to another group, all the
> > device fds and the iommu fd must first be closed. This prevents using a
> > stale iommu fd or accessing devices while the iommu is being switched.
> > Operations on any group fds of a merged group are performed globally on
> > the group (ie. enumerating the devices lists all devices in the merged
> > group, retrieving the iommu fd from any group fd results in the same fd,
> > device fds from any group can be retrieved from any group fd[2]).
> > Groups can be merged and unmerged dynamically. Unmerging a group
> > requires the device fds for the outgoing group are closed. The iommu fd
> > will remain persistent for the remaining merged group.
>
> As I've said I prefer a persistent group model, rather than this
> transient group model, but it's not a dealbreaker by itself. How are
> unmerges specified?
VFIO_GROUP_UNMERGE ioctl taking a group fd parameter.
> I'm also assuming that in this model closing a
> (bound) group fd will unmerge everything down to atomic groups again.
Yes, it will unmerge the closed group down to the atomic group.
> > If a device within a group is unbound from the vfio driver while it's in
> > use (iommu fd refcnt > 0 || device fd refcnt > 0), vfio will block the
> > release and send netlink remove requests for every opened device in the
> > group (or merged group).
>
> Hrm, I do dislike netlink being yet another aspect of an already
> complex interface. Would it be possible to do kernel->user
> notifications with a poll()/read() interface on one of the existing
> fds instead?
I think it'd have to be a new eventfd, but yes, it would be possible.
Then we'd have to figure out if we filter all requests through that
(remove, PCI AER, suspend/resume, etc..) or do we use a new fd for each
and how we return info for each of those. As much as everyone hates
netlink, it still feels like the right interface for these.
Beyond unbind, we also need to think about hotplug. If a system had
multiple hotplug slots below a P2P bridge and a device was added while
the group is in use, what do we do? Maybe we can somehow disable it or
mark it for vfio in our bus notifier routines(?).
> > If the device fds are not released and
> > subsequently the iommu fd released as well, vfio will kill the user
> > process after some delay.
>
> Ouch, this seems to me a problematic semantic. Whether the user
> process survives depends on whether it processes the remove requests
> fast enough - and a user process could be slowed down by system load
> or other factors not entirely in its control.
I was assuming "ample" time to process a hot remove, but yes, it's an
area of concern. I'm not sure how much of a problem it is in practice
though. Yes you can shoot your VM accidentally as root... don't do
that.
> I'd be more comfortable with a model where there was a distinction
> between a "soft" and "hard" remove. The soft would either simply
> fail, if the device is in use by vfio, or block indefinitely. The
> hard would kill the user process without delay. This effectively
> allows your semantics to be implemented in userspace (soft remove,
> wait, hard remove) - where it's easier to tweak the policy of how long
> to wait.
Your first example is essentially what current vfio does now, request
remove, wait indefinitely and qemu triggers an abort if the guest
doesn't respond. The trouble with moving this policy to userspace is
that we're not protecting the host. You won't like this, but we can
also use whether the vfio user registers with netlink as a signal of
when to do notify-wait-kill, or just kill (yeah, we could do the same
with a notification eventfd too). Thanks,
Alex
> > At some point in the future we may be able to
> > adapt this to perform a hard removal and revoke all device access
> > without killing the user.
> >
> > The iommu fd supports dma mapping and unmapping ioctls as well as some,
> > yet to be defined and possibly architecture specific, iommu description
> > interfaces. At some point we may also make use of read/write/mmap on
> > the iommu fd as means to setup dma.
>
> Ok.
>
> > The device fds will largely support the existing vfio interface, with
> > generalizations to make it non-pci specific. We'll access mmio/pio/pci
> > config using segmented offset into the device fd. Interrupts will use
> > the existing mechanisms (eventfds/irqfd). We'll need to add ioctls to
> > describe the type of device, number, size, and type of each resource and
> > available interrupts.
> >
> > We still have outstanding questions with how devices are exposed in
> > qemu, but I think that's largely a qemu-vfio problem and the vfio kernel
> > interface described here supports all the interesting ways that devices
> > can be exposed as individuals or sets. I'm currently working on code
> > changes to support the above and will post as I complete useful chunks.
> > Thanks,
> >
> > Alex
> >
> > [1] Implementation note: the current iommu ops makes some of this
> > awkward. We'll need to temporarily setup a domain for incoming devices
> > to validate the capabilities of that domain, then tear it down and try
> > to attach devices to the existing domain. In particular I'm thinking of
> > the cache coherence capability and whether we remap existing dma
> > mappings to allow this to change or just reject as incompatible (I'm
> > leaning to the latter).
> >
> > [2] Implementation note: I think a container object makes sense here
> > where reads/ioctls are passed from the group to the container, which
> > performs them across all groups making use of that container (there are
> > no performance critical paths through the group fd). This also implies
> > the enumeration interface should report groups so we can easily see
> > which groups are merged. The group fd could simply read as:
> > group: 1234
> > device: 0000:00:19.0
> > group: 5678
> > device: 0000:01:00.0
> > device: 0000:01:00.1
> > Some might say this is screaming for xml. Do we need to go there? We
> > could also do this via the netlink interface. Suggestions welcome.
> >
^ permalink raw reply
* Re: VFIO v2 design plan
From: David Gibson @ 2011-08-30 3:04 UTC (permalink / raw)
To: Alex Williamson
Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras,
Roedel, Joerg, agraf, qemu-devel, chrisw, iommu, Avi Kivity,
Anthony Liguori, linux-pci@vger.kernel.org, linuxppc-dev, benve
In-Reply-To: <1314378325.2859.307.camel@bling.home>
On Fri, Aug 26, 2011 at 11:05:23AM -0600, Alex Williamson wrote:
>
> I don't think too much has changed since the previous email went out,
> but it seems like a good idea to post a summary in case there were
> suggestions or objections that I missed.
>
> VFIO v2 will rely on the platform iommu driver reporting grouping
> information. Again, a group is a set of devices for which the iommu
> cannot differentiate transactions. An example would be a set of devices
> behind a PCI-to-PCI bridge. All transactions appear to be from the
> bridge itself rather than devices behind the bridge. Platforms are free
> to have whatever constraints they need to for what constitutes a group.
>
> I posted a rough draft of patch to implement that for the base iommu
> driver and VT-d, adding an iommu_device_group callback on iommu ops.
> The iommu base driver also populates an iommu_group sysfs file for each
> device that's part of a group. Members of the same group return the
> same value via either the sysfs or iommu_device_group. The value
> returned is arbitrary, should not be assumed to be persistent across
> boots, and is left to the iommu driver to generate. There are some
> implementation details around how to do this without favoring one bus
> over another, but the interface should be bus/device type agnostic in
> the end.
>
> When the vfio module is loaded, character devices will be created for
> each group in /dev/vfio/$GROUP. Setting file permissions on these files
> should be sufficient for providing a user with complete access to the
> group. Opening this device file provides what we'll call the "group
> fd". The group fd is restricted to only work with a single mm context.
> Concurrent opens will be denied if the opening process mm does not
> match. The group fd will provide interfaces for enumerating the devices
> in the group, returning a file descriptor for each device in the group
> (the "device fd"), binding groups together, and returning a file
> descriptor for iommu operations (the "iommu fd").
>
> A group is "viable" when all member devices of the group are bound to
> the vfio driver. Until that point, the group fd only allows enumeration
> interfaces (ie. listing of group devices). I'm currently thinking
> enumeration will be done by a simple read() on the device file returning
> a list of dev_name()s.
Ok. Are you envisaging this interface as a virtual file, or as a
stream? That is, can you seek around the list of devices like a
regular file - in which case, what are the precise semantics when the
list is changed by a bind - or is there no meaningful notion of file
pointer and read() just gives you the next device - in which case how
do you rewind to enumerate the group again?
> Once the group is viable, the user may bind the
> group to another group, retrieve the iommu fd, or retrieve device fds.
> Internally, each of these operations will result in an iommu domain
> being allocated and all of the devices attached to the domain.
>
> The purpose of binding groups is to share the iommu domain. Groups
> making use of incompatible iommu domains will fail to bind. Groups
> making use of different mm's will fail to bind. The vfio driver may
> reject some binding based on domain capabilities, but final veto power
> is left to the iommu driver[1]. If a user makes use of a group
> independently and later wishes to bind it to another group, all the
> device fds and the iommu fd must first be closed. This prevents using a
> stale iommu fd or accessing devices while the iommu is being switched.
> Operations on any group fds of a merged group are performed globally on
> the group (ie. enumerating the devices lists all devices in the merged
> group, retrieving the iommu fd from any group fd results in the same fd,
> device fds from any group can be retrieved from any group fd[2]).
> Groups can be merged and unmerged dynamically. Unmerging a group
> requires the device fds for the outgoing group are closed. The iommu fd
> will remain persistent for the remaining merged group.
As I've said I prefer a persistent group model, rather than this
transient group model, but it's not a dealbreaker by itself. How are
unmerges specified? I'm also assuming that in this model closing a
(bound) group fd will unmerge everything down to atomic groups again.
> If a device within a group is unbound from the vfio driver while it's in
> use (iommu fd refcnt > 0 || device fd refcnt > 0), vfio will block the
> release and send netlink remove requests for every opened device in the
> group (or merged group).
Hrm, I do dislike netlink being yet another aspect of an already
complex interface. Would it be possible to do kernel->user
notifications with a poll()/read() interface on one of the existing
fds instead?
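
For illustration only, the user side of a poll()/read() notification could
look like the sketch below. It is plain POSIX C and works on any readable fd.
The one-byte event code is a hypothetical format invented for the example,
and wait_for_event() is not part of any vfio interface.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Waits up to timeout_ms for a notification on fd.
 * Returns 1 and fills *event if one byte was read, 0 on timeout,
 * -1 on error.
 */
static int wait_for_event(int fd, int timeout_ms, unsigned char *event)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc;

	assert(event != NULL);

	rc = poll(&pfd, 1, timeout_ms);
	if (rc <= 0)
		return rc; /* 0: timeout, -1: poll() failed */
	if (!(pfd.revents & POLLIN))
		return -1; /* hangup or error condition on the fd */

	return read(fd, event, 1) == 1 ? 1 : -1;
}
```

A pipe is enough to try this out: write one byte to the write end and call
wait_for_event() on the read end.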
> If the device fds are not released and
> subsequently the iommu fd released as well, vfio will kill the user
> process after some delay.
Ouch, this seems to me a problematic semantic. Whether the user
process survives depends on whether it processes the remove requests
fast enough - and a user process could be slowed down by system load
or other factors not entirely in its control.
I'd be more comfortable with a model where there was a distinction
between a "soft" and "hard" remove. The soft would either simply
fail, if the device is in use by vfio, or block indefinitely. The
hard would kill the user process without delay. This effectively
allows your semantics to be implemented in userspace (soft remove,
wait, hard remove) - where it's easier to tweak the policy of how long
to wait.
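
For illustration only, the userspace policy (soft remove, wait, hard remove)
can be sketched as below. Everything here is hypothetical: struct remove_ops
and its callbacks do not exist in vfio, the soft remove is modelled as the
variant that fails when the device is in use, and the fake_* helpers exist
only so the example can be run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical operations; none of these exist in vfio. */
struct remove_ops {
	bool (*soft_remove)(void *dev);   /* fails while the device is in use */
	void (*wait_ms)(unsigned int ms); /* policy-controlled delay */
	void (*hard_remove)(void *dev);   /* kills the user, cannot fail */
};

enum remove_result { REMOVED_SOFT, REMOVED_HARD };

/* Tries the soft remove up to retries + 1 times, then falls back to hard. */
static enum remove_result remove_device(const struct remove_ops *ops,
					void *dev, unsigned int retries,
					unsigned int delay_ms)
{
	assert(ops && ops->soft_remove && ops->wait_ms && ops->hard_remove);

	for (unsigned int i = 0; i <= retries; i++) {
		if (ops->soft_remove(dev))
			return REMOVED_SOFT;
		if (i < retries)
			ops->wait_ms(delay_ms);
	}
	ops->hard_remove(dev);
	return REMOVED_HARD;
}

/* Fake device for demonstration: busy for the first busy_polls attempts. */
struct fake_dev {
	unsigned int busy_polls;
	bool killed;
};

static bool fake_soft(void *d)
{
	struct fake_dev *f = d;

	if (f->busy_polls == 0)
		return true;
	f->busy_polls--;
	return false;
}

static void fake_wait(unsigned int ms)
{
	(void)ms; /* a real policy would sleep here */
}

static void fake_hard(void *d)
{
	((struct fake_dev *)d)->killed = true;
}

static const struct remove_ops fake_ops = { fake_soft, fake_wait, fake_hard };
```

The retry count and delay are the knobs that would be awkward to tune inside
the kernel; here they are ordinary function arguments.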
> At some point in the future we may be able to
> adapt this to perform a hard removal and revoke all device access
> without killing the user.
>
> The iommu fd supports dma mapping and unmapping ioctls as well as some,
> yet to be defined and possibly architecture specific, iommu description
> interfaces. At some point we may also make use of read/write/mmap on
> the iommu fd as means to setup dma.
Ok.
> The device fds will largely support the existing vfio interface, with
> generalizations to make it non-pci specific. We'll access mmio/pio/pci
> config using segmented offset into the device fd. Interrupts will use
> the existing mechanisms (eventfds/irqfd). We'll need to add ioctls to
> describe the type of device, number, size, and type of each resource and
> available interrupts.
>
> We still have outstanding questions with how devices are exposed in
> qemu, but I think that's largely a qemu-vfio problem and the vfio kernel
> interface described here supports all the interesting ways that devices
> can be exposed as individuals or sets. I'm currently working on code
> changes to support the above and will post as I complete useful chunks.
> Thanks,
>
> Alex
>
> [1] Implementation note: the current iommu ops makes some of this
> awkward. We'll need to temporarily setup a domain for incoming devices
> to validate the capabilities of that domain, then tear it down and try
> to attach devices to the existing domain. In particular I'm thinking of
> the cache coherence capability and whether we remap existing dma
> mappings to allow this to change or just reject as incompatible (I'm
> leaning to the latter).
>
> [2] Implementation note: I think a container object makes sense here
> where reads/ioctls are passed from the group to the container, which
> performs them across all groups making use of that container (there are
> no performance critical paths through the group fd). This also implies
> the enumeration interface should report groups so we can easily see
> which groups are merged. The group fd could simply read as:
> group: 1234
> device: 0000:00:19.0
> group: 5678
> device: 0000:01:00.0
> device: 0000:01:00.1
> Some might say this is screaming for xml. Do we need to go there? We
> could also do this via the netlink interface. Suggestions welcome.
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply
* Re: Please pull 'next' branch of 4xx tree
From: Benjamin Herrenschmidt @ 2011-08-30 3:09 UTC (permalink / raw)
To: Josh Boyer; +Cc: linuxppc-dev
In-Reply-To: <CA+5PVA6t+mavcQnm+0QdoLshVc8LVwLZMfBGHyTW8Ai-Kmt3Sg@mail.gmail.com>
On Mon, 2011-08-29 at 09:05 -0400, Josh Boyer wrote:
> On Wed, Aug 10, 2011 at 2:26 PM, Josh Boyer <jwboyer@gmail.com> wrote:
> > Hi Ben,
> >
> > Finally somewhat caught up. Now that -rc1 is out, here are some
> > patches for the next merge window.
> >
> > josh
> >
> > The following changes since commit 53d1e658df6e26d62500410719aaee2b82067c03:
> >
> > Merge branch 'devicetree/merge' of
> > git://git.secretlab.ca/git/linux-2.6 (2011-08-04 06:37:07 -1000)
> >
> > are available in the git repository at:
> >
> > ssh://master.kernel.org/pub/scm/linux/kernel/git/jwboyer/powerpc-4xx.git next
>
> Ben, ping?
Been on & off & travelling. Will deal with this some time this week.
Cheers,
Ben.
^ permalink raw reply
* Re: kvm PCI assignment & VFIO ramblings
From: David Gibson @ 2011-08-30 1:29 UTC (permalink / raw)
To: Aaron Fabbri
Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
linux-pci@vger.kernel.org, Alexander Graf, qemu-devel,
Chris Wright, iommu, Avi Kivity, Anthony Liguori, Roedel, Joerg,
linuxppc-dev, benve@cisco.com
In-Reply-To: <CA7D4D51.FD84%aafabbri@cisco.com>
On Fri, Aug 26, 2011 at 01:17:05PM -0700, Aaron Fabbri wrote:
[snip]
> Yes. In essence, I'd rather not have to run any other admin processes.
> Doing things programmatically, on the fly, from each process, is the
> cleanest model right now.
The "persistent group" model doesn't necessarily prevent that.
There's no reason your program can't use the administrative interface
as well as the "use" interface, and I don't see that making the admin
interface separate and persistent makes this any harder.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply
* Re: linux-next: build failure in Linus' tree
From: James Bottomley @ 2011-08-29 22:50 UTC (permalink / raw)
To: Stephen Rothwell
Cc: J. Bruce Fields, linux-parisc, NeilBrown, Linus, linux-kernel,
linux-next, linuxppc-dev
In-Reply-To: <20110830083218.1819a5d73c3a33e5053e8312@canb.auug.org.au>
On Tue, 2011-08-30 at 08:32 +1000, Stephen Rothwell wrote:
> Hi Linus,
>
> On Mon, 29 Aug 2011 10:44:51 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> >
> > After merging the fixes tree, today's linux-next build (powerpc
> > ppc64_defconfig) failed like this:
> >
> > arch/powerpc/kernel/built-in.o: In function `.sys_call_table':
> > (.text+0xbd00): undefined reference to `.sys_nfsservctl'
> > arch/powerpc/kernel/built-in.o: In function `.sys_call_table':
> > (.text+0xbd08): undefined reference to `.compat_sys_nfsservctl'
> >
> > Caused by commit f5b940997397 ("All Arch: remove linkage for
> > sys_nfsservctl system call") which also missed parisc.
> >
> > I will apply this patch for today:
>
> Will you please apply this? (repeated for ease of inclusion)
>
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Mon, 29 Aug 2011 10:38:57 +1000
> Subject: [PATCH] remove remaining references to nfsservctl
>
> These were missed in commit f5b940997397 "All Arch: remove linkage
> for sys_nfsservctl system call" due to them having no sys_ prefix
> (presumably).
>
> Cc: NeilBrown <neilb@suse.de>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-parisc@vger.kernel.org
> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Thanks for finding this ... definitely acked by me if necessary.
James
^ permalink raw reply
* Re: linux-next: build failure in Linus' tree
From: Stephen Rothwell @ 2011-08-29 22:32 UTC (permalink / raw)
To: Linus
Cc: J. Bruce Fields, linux-parisc, NeilBrown, linux-kernel,
linux-next, linuxppc-dev
In-Reply-To: <20110829104451.1c777e24ff72823d1e399f12@canb.auug.org.au>
Hi Linus,
On Mon, 29 Aug 2011 10:44:51 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> After merging the fixes tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
>
> arch/powerpc/kernel/built-in.o: In function `.sys_call_table':
> (.text+0xbd00): undefined reference to `.sys_nfsservctl'
> arch/powerpc/kernel/built-in.o: In function `.sys_call_table':
> (.text+0xbd08): undefined reference to `.compat_sys_nfsservctl'
>
> Caused by commit f5b940997397 ("All Arch: remove linkage for
> sys_nfsservctl system call") which also missed parisc.
>
> I will apply this patch for today:
Will you please apply this? (repeated for ease of inclusion)
From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 29 Aug 2011 10:38:57 +1000
Subject: [PATCH] remove remaining references to nfsservctl
These were missed in commit f5b940997397 "All Arch: remove linkage
for sys_nfsservctl system call" due to them having no sys_ prefix
(presumably).
Cc: NeilBrown <neilb@suse.de>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-parisc@vger.kernel.org
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
arch/parisc/kernel/syscall_table.S | 2 +-
arch/powerpc/include/asm/systbl.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/parisc/kernel/syscall_table.S b/arch/parisc/kernel/syscall_table.S
index e66366f..3735abd 100644
--- a/arch/parisc/kernel/syscall_table.S
+++ b/arch/parisc/kernel/syscall_table.S
@@ -259,7 +259,7 @@
ENTRY_SAME(ni_syscall) /* query_module */
ENTRY_SAME(poll)
/* structs contain pointers and an in_addr... */
- ENTRY_COMP(nfsservctl)
+ ENTRY_SAME(ni_syscall) /* was nfsservctl */
ENTRY_SAME(setresgid) /* 170 */
ENTRY_SAME(getresgid)
ENTRY_SAME(prctl)
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index f6736b7..fa0d27a 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -171,7 +171,7 @@ SYSCALL_SPU(setresuid)
SYSCALL_SPU(getresuid)
SYSCALL(ni_syscall)
SYSCALL_SPU(poll)
-COMPAT_SYS(nfsservctl)
+SYSCALL(ni_syscall)
SYSCALL_SPU(setresgid)
SYSCALL_SPU(getresgid)
COMPAT_SYS_SPU(prctl)
--
1.7.5.4
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
^ permalink raw reply related
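For readers following the patch above: the retired entry is replaced with ni_syscall rather than deleted because the table is indexed by syscall number, so removing a slot would renumber everything after it. A minimal userspace sketch of that idea follows. Every name in it (demo_table, demo_dispatch, the stub handlers) is hypothetical and none of it is kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*demo_syscall_fn)(void);

/* Stand-in for the kernel's "not implemented" handler. */
static long demo_ni_syscall(void)
{
	return -ENOSYS;
}

/* Hypothetical handlers that just return a recognizable value. */
static long demo_poll(void)
{
	return 100;
}

static long demo_setresgid(void)
{
	return 200;
}

/* The retired slot keeps its position so later numbers do not shift. */
static const demo_syscall_fn demo_table[] = {
	demo_poll,		/* slot 0 */
	demo_ni_syscall,	/* slot 1: was nfsservctl */
	demo_setresgid,		/* slot 2 */
};

#define DEMO_NR_SYSCALLS (sizeof(demo_table) / sizeof(demo_table[0]))

/* Look up a slot by number; out-of-range numbers also report -ENOSYS. */
static long demo_dispatch(size_t nr)
{
	if (nr >= DEMO_NR_SYSCALLS)
		return -ENOSYS;
	assert(demo_table[nr] != NULL);
	return demo_table[nr]();
}
```

Callers of the retired number get -ENOSYS, while the handlers on either side keep their original numbers.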
* [PATCH] powerpc/85xx: clean up FPGA device tree nodes for Freescale QorIQ boards
From: Timur Tabi @ 2011-08-29 19:09 UTC (permalink / raw)
To: kumar.gala, linuxppc-dev, devicetree-discuss
Standardize and document the FPGA nodes used on Freescale QorIQ reference
boards. There are three kinds of FPGAs used on the boards: pixis, qixis, and
cpld. Although there are minor differences among the boards that have one
kind of FPGA, most of the functionality is the same, so it makes sense
to create common compatibility strings.
Signed-off-by: Timur Tabi <timur@freescale.com>
---
Changes for other Freescale boards will be made in future patches.
.../devicetree/bindings/powerpc/fsl/board.txt | 30 ++++++++++++--------
arch/powerpc/boot/dts/p1010rdb.dts | 10 ++----
arch/powerpc/boot/dts/p1020rdb.dts | 7 ++++-
arch/powerpc/boot/dts/p1022ds.dts | 2 +-
arch/powerpc/boot/dts/p2020ds.dts | 5 +++
arch/powerpc/boot/dts/p2020rdb.dts | 4 ++
arch/powerpc/boot/dts/p2040rdb.dts | 8 ++++-
arch/powerpc/boot/dts/p3041ds.dts | 4 +-
arch/powerpc/boot/dts/p4080ds.dts | 8 ++++-
arch/powerpc/boot/dts/p5020ds.dts | 4 +-
10 files changed, 55 insertions(+), 27 deletions(-)
diff --git a/Documentation/devicetree/bindings/powerpc/fsl/board.txt b/Documentation/devicetree/bindings/powerpc/fsl/board.txt
index 39e9415..ba46a7a 100644
--- a/Documentation/devicetree/bindings/powerpc/fsl/board.txt
+++ b/Documentation/devicetree/bindings/powerpc/fsl/board.txt
@@ -1,3 +1,8 @@
+Freescale Reference Board Bindings
+
+This document describes device tree bindings for various devices that
+exist on some Freescale reference boards.
+
* Board Control and Status (BCSR)
Required properties:
@@ -12,25 +17,26 @@ Example:
reg = <f8000000 8000>;
};
-* Freescale on board FPGA
+* Freescale on-board FPGA
This is the memory-mapped registers for on board FPGA.
Required properities:
-- compatible : should be "fsl,fpga-pixis".
-- reg : should contain the address and the length of the FPPGA register
- set.
+- compatible: should be a board-specific string followed by a string
+ indicating the type of FPGA. Example:
+ "fsl,<board>-pixis", "fsl,fpga-pixis"
+- reg: should contain the address and the length of the FPGA register set.
- interrupt-parent: should specify phandle for the interrupt controller.
-- interrupts : should specify event (wakeup) IRQ.
+- interrupts: should specify event (wakeup) IRQ.
-Example (MPC8610HPCD):
+Example (P1022DS):
- board-control@e8000000 {
- compatible = "fsl,fpga-pixis";
- reg = <0xe8000000 32>;
- interrupt-parent = <&mpic>;
- interrupts = <8 8>;
- };
+ board-control@3,0 {
+ compatible = "fsl,p1022ds-pixis", "fsl,fpga-pixis";
+ reg = <3 0 0x30>;
+ interrupt-parent = <&mpic>;
+ interrupts = <8 8 0 0>;
+ };
* Freescale BCSR GPIO banks
diff --git a/arch/powerpc/boot/dts/p1010rdb.dts b/arch/powerpc/boot/dts/p1010rdb.dts
index 6b33b73..7769e40 100644
--- a/arch/powerpc/boot/dts/p1010rdb.dts
+++ b/arch/powerpc/boot/dts/p1010rdb.dts
@@ -116,13 +116,9 @@
};
};
- cpld@3,0 {
- #address-cells = <1>;
- #size-cells = <1>;
- compatible = "fsl,p1010rdb-cpld";
- reg = <0x3 0x0 0x0000020>;
- bank-width = <1>;
- device-width = <1>;
+ board-control@3,0 {
+ compatible = "fsl,p1010rdb-cpld", "fsl,fpga-cpld";
+ reg = <0x3 0x0 0x20>;
};
};
diff --git a/arch/powerpc/boot/dts/p1020rdb.dts b/arch/powerpc/boot/dts/p1020rdb.dts
index d6a8ae4..982d3ea 100644
--- a/arch/powerpc/boot/dts/p1020rdb.dts
+++ b/arch/powerpc/boot/dts/p1020rdb.dts
@@ -34,7 +34,8 @@
/* NOR, NAND Flashes and Vitesse 5 port L2 switch */
ranges = <0x0 0x0 0x0 0xef000000 0x01000000
0x1 0x0 0x0 0xffa00000 0x00040000
- 0x2 0x0 0x0 0xffb00000 0x00020000>;
+ 0x2 0x0 0x0 0xffb00000 0x00020000
+ 0x3 0x0 0x0 0xffdf0000 0x00008000>;
nor@0,0 {
#address-cells = <1>;
@@ -138,6 +139,10 @@
reg = <0x2 0x0 0x20000>;
};
+ board-control@3,0 {
+ compatible = "fsl,p1020rdb-cpld", "fsl,fpga-cpld";
+ reg = <0x3 0x0 0x20>;
+ };
};
soc@ffe00000 {
diff --git a/arch/powerpc/boot/dts/p1022ds.dts b/arch/powerpc/boot/dts/p1022ds.dts
index 1be9743..97a0b87 100644
--- a/arch/powerpc/boot/dts/p1022ds.dts
+++ b/arch/powerpc/boot/dts/p1022ds.dts
@@ -150,7 +150,7 @@
};
board-control@3,0 {
- compatible = "fsl,p1022ds-pixis";
+ compatible = "fsl,p1022ds-pixis", "fsl,fpga-pixis";
reg = <3 0 0x30>;
interrupt-parent = <&mpic>;
/*
diff --git a/arch/powerpc/boot/dts/p2020ds.dts b/arch/powerpc/boot/dts/p2020ds.dts
index dae4031..d1e52f3 100644
--- a/arch/powerpc/boot/dts/p2020ds.dts
+++ b/arch/powerpc/boot/dts/p2020ds.dts
@@ -118,6 +118,11 @@
};
};
+ board-control@3,0 {
+ compatible = "fsl,p2020ds-pixis", "fsl,fpga-pixis";
+ reg = <0x3 0x0 0x30>;
+ };
+
nand@4,0 {
compatible = "fsl,elbc-fcm-nand";
reg = <0x4 0x0 0x40000>;
diff --git a/arch/powerpc/boot/dts/p2020rdb.dts b/arch/powerpc/boot/dts/p2020rdb.dts
index 1d7a05f..1bf9b8c 100644
--- a/arch/powerpc/boot/dts/p2020rdb.dts
+++ b/arch/powerpc/boot/dts/p2020rdb.dts
@@ -138,6 +138,10 @@
reg = <0x2 0x0 0x20000>;
};
+ board-control@3,0 {
+ compatible = "fsl,p2020rdb-cpld", "fsl,fpga-cpld";
+ reg = <0x3 0x0 0x20>;
+ };
};
soc@ffe00000 {
diff --git a/arch/powerpc/boot/dts/p2040rdb.dts b/arch/powerpc/boot/dts/p2040rdb.dts
index 7d84e39..1c72d65 100644
--- a/arch/powerpc/boot/dts/p2040rdb.dts
+++ b/arch/powerpc/boot/dts/p2040rdb.dts
@@ -109,7 +109,8 @@
localbus@ffe124000 {
reg = <0xf 0xfe124000 0 0x1000>;
- ranges = <0 0 0xf 0xe8000000 0x08000000>;
+ ranges = <0 0 0xf 0xe8000000 0x08000000
+ 3 0 0xf 0xffdf0000 0x00008000>;
flash@0,0 {
compatible = "cfi-flash";
@@ -117,6 +118,11 @@
bank-width = <2>;
device-width = <2>;
};
+
+ board-control@3,0 {
+ compatible = "fsl,p2040rdb-cpld", "fsl,fpga-cpld";
+ reg = <3 0 0x20>;
+ };
};
pci0: pcie@ffe200000 {
diff --git a/arch/powerpc/boot/dts/p3041ds.dts b/arch/powerpc/boot/dts/p3041ds.dts
index 69cae67..92937ce 100644
--- a/arch/powerpc/boot/dts/p3041ds.dts
+++ b/arch/powerpc/boot/dts/p3041ds.dts
@@ -147,8 +147,8 @@
};
board-control@3,0 {
- compatible = "fsl,p3041ds-pixis";
- reg = <3 0 0x20>;
+ compatible = "fsl,p3041ds-pixis", "fsl,fpga-pixis";
+ reg = <3 0 0x30>;
};
};
diff --git a/arch/powerpc/boot/dts/p4080ds.dts b/arch/powerpc/boot/dts/p4080ds.dts
index eb11098..a26cf15 100644
--- a/arch/powerpc/boot/dts/p4080ds.dts
+++ b/arch/powerpc/boot/dts/p4080ds.dts
@@ -108,7 +108,8 @@
localbus@ffe124000 {
reg = <0xf 0xfe124000 0 0x1000>;
- ranges = <0 0 0xf 0xe8000000 0x08000000>;
+ ranges = <0 0 0xf 0xe8000000 0x08000000
+ 3 0 0xf 0xffdf0000 0x00008000>;
flash@0,0 {
compatible = "cfi-flash";
@@ -116,6 +117,11 @@
bank-width = <2>;
device-width = <2>;
};
+
+ board-control@3,0 {
+ compatible = "fsl,p4080ds-pixis", "fsl,fpga-pixis";
+ reg = <3 0 0x30>;
+ };
};
pci0: pcie@ffe200000 {
diff --git a/arch/powerpc/boot/dts/p5020ds.dts b/arch/powerpc/boot/dts/p5020ds.dts
index 8366e2f..b959986 100644
--- a/arch/powerpc/boot/dts/p5020ds.dts
+++ b/arch/powerpc/boot/dts/p5020ds.dts
@@ -147,8 +147,8 @@
};
board-control@3,0 {
- compatible = "fsl,p5020ds-pixis";
- reg = <3 0 0x20>;
+ compatible = "fsl,p5020ds-pixis", "fsl,fpga-pixis";
+ reg = <3 0 0x30>;
};
};
--
1.7.3.4
^ permalink raw reply related
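The localbus hunks in the patch above add a "ranges" window for chip select 3 so that the new board-control@3,0 node's reg can be translated to a physical address. As a rough illustration of that translation, here is a sketch that assumes two child address cells (chip select, offset), two parent address cells and one size cell, as in the p2040rdb and p4080ds hunks. demo_range and demo_translate are hypothetical names, not kernel or dtc APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One "ranges" entry: <cs child_off parent_hi parent_lo size>. */
struct demo_range {
	uint32_t cs;		/* chip select (first child address cell) */
	uint32_t child_off;	/* offset within the chip select */
	uint32_t parent_hi;	/* upper 32 bits of the parent address */
	uint32_t parent_lo;	/* lower 32 bits of the parent address */
	uint32_t size;		/* length of the window in bytes */
};

/* The two windows from the p2040rdb/p4080ds localbus hunks. */
static const struct demo_range demo_ranges[] = {
	{ 0, 0, 0xf, 0xe8000000, 0x08000000 },
	{ 3, 0, 0xf, 0xffdf0000, 0x00008000 },
};

/*
 * Translate a child <cs off> region of 'len' bytes into a parent
 * physical address. Returns false if no window covers the region.
 */
static bool demo_translate(const struct demo_range *r, size_t n,
			   uint32_t cs, uint32_t off, uint32_t len,
			   uint64_t *phys)
{
	size_t i;

	assert(r != NULL && phys != NULL);
	for (i = 0; i < n; i++) {
		uint64_t rel;

		if (r[i].cs != cs || off < r[i].child_off)
			continue;
		rel = (uint64_t)off - r[i].child_off;
		if (rel + len > r[i].size)
			continue;	/* region spills past the window */
		*phys = (((uint64_t)r[i].parent_hi << 32) | r[i].parent_lo) + rel;
		return true;
	}
	return false;
}
```

With these values, reg = <3 0 0x30> resolves to 0xfffdf0000, and a region on a chip select with no matching window fails to translate, which is why the patch extends "ranges" alongside adding the node.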
* Re: Please pull 'next' branch of 4xx tree
From: Josh Boyer @ 2011-08-29 13:05 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: linuxppc-dev
In-Reply-To: <CA+5PVA6mHk+vGsY0OibzJs27rpaKjGgaCEJVpJe4fKqrtBEmSA@mail.gmail.com>
On Wed, Aug 10, 2011 at 2:26 PM, Josh Boyer <jwboyer@gmail.com> wrote:
> Hi Ben,
>
> Finally somewhat caught up. Now that -rc1 is out, here are some
> patches for the next merge window.
>
> josh
>
> The following changes since commit 53d1e658df6e26d62500410719aaee2b82067c03:
>
> Merge branch 'devicetree/merge' of
> git://git.secretlab.ca/git/linux-2.6 (2011-08-04 06:37:07 -1000)
>
> are available in the git repository at:
>
> ssh://master.kernel.org/pub/scm/linux/kernel/git/jwboyer/powerpc-4xx.git next
Ben, ping?
josh
^ permalink raw reply
* Re: [PATCH 1/2] arch/powerpc/platforms/cell/iommu.c: add missing of_node_put
From: Arnd Bergmann @ 2011-08-29 11:26 UTC (permalink / raw)
To: Julia Lawall
Cc: cbe-oss-dev, devicetree-discuss, kernel-janitors, linux-kernel,
Paul Mackerras, linuxppc-dev
In-Reply-To: <1313943001-12884-1-git-send-email-julia@diku.dk>
On Sunday 21 August 2011, Julia Lawall wrote:
> From: Julia Lawall <julia@diku.dk>
>
> np is initialized to the result of calling a function that calls
> of_node_get, so of_node_put should be called before the pointer is dropped.
>
> The semantic match that finds this problem is as follows:
> (http://coccinelle.lip6.fr/)
>
> // <smpl>
> @@
> expression e,e1,e2;
> @@
>
> * e = \(of_find_node_by_type\|of_find_node_by_name\)(...)
> ... when != of_node_put(e)
> when != true e == NULL
> when != e2 = e
> e = e1
> // </smpl>
>
> Signed-off-by: Julia Lawall <julia@diku.dk>
>
Acked-by: Arnd Bergmann <arnd@arndb.de>
^ permalink raw reply
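The leak the semantic match above looks for can be illustrated with a small userspace mock. This is a sketch only: demo_node and the demo_* helpers are hypothetical stand-ins for the device tree node reference counting, not the real of_* API.

```c
#include <assert.h>
#include <stddef.h>

/* Mock node carrying only a reference count. */
struct demo_node {
	int refcount;
};

static struct demo_node *demo_node_get(struct demo_node *n)
{
	if (n)
		n->refcount++;
	return n;
}

static void demo_node_put(struct demo_node *n)
{
	if (n) {
		assert(n->refcount > 0);
		n->refcount--;
	}
}

/* Mimics a lookup helper: the result is returned with a reference held. */
static struct demo_node *demo_find_node(struct demo_node *n)
{
	return demo_node_get(n);
}

/* Leaky pattern: np is overwritten while still holding a reference on a. */
static void demo_leaky(struct demo_node *a, struct demo_node *b)
{
	struct demo_node *np = demo_find_node(a);

	np = demo_find_node(b);	/* the reference on a is lost here */
	demo_node_put(np);
}

/* Fixed pattern: drop the reference before the pointer is reassigned. */
static void demo_fixed(struct demo_node *a, struct demo_node *b)
{
	struct demo_node *np = demo_find_node(a);

	demo_node_put(np);
	np = demo_find_node(b);
	demo_node_put(np);
}
```

After demo_leaky() the first node is left with a dangling reference, while demo_fixed() returns both counts to where they started.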
* linux-next: build failure in Linus' tree
From: Stephen Rothwell @ 2011-08-29 0:44 UTC (permalink / raw)
To: Linus
Cc: J. Bruce Fields, linux-parisc, NeilBrown, linux-kernel,
linux-next, linuxppc-dev
Hi Linus,
After merging the fixes tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:
arch/powerpc/kernel/built-in.o: In function `.sys_call_table':
(.text+0xbd00): undefined reference to `.sys_nfsservctl'
arch/powerpc/kernel/built-in.o: In function `.sys_call_table':
(.text+0xbd08): undefined reference to `.compat_sys_nfsservctl'
Caused by commit f5b940997397 ("All Arch: remove linkage for
sys_nfsservctl system call") which also missed parisc.
I will apply this patch for today:
From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 29 Aug 2011 10:38:57 +1000
Subject: [PATCH] remove remaining references to nfsservctl
These were missed in commit f5b940997397 "All Arch: remove linkage
for sys_nfsservctl system call" due to them having no sys_ prefix
(presumably).
Cc: NeilBrown <neilb@suse.de>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-parisc@vger.kernel.org
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
arch/parisc/kernel/syscall_table.S | 2 +-
arch/powerpc/include/asm/systbl.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/parisc/kernel/syscall_table.S b/arch/parisc/kernel/syscall_table.S
index e66366f..3735abd 100644
--- a/arch/parisc/kernel/syscall_table.S
+++ b/arch/parisc/kernel/syscall_table.S
@@ -259,7 +259,7 @@
ENTRY_SAME(ni_syscall) /* query_module */
ENTRY_SAME(poll)
/* structs contain pointers and an in_addr... */
- ENTRY_COMP(nfsservctl)
+ ENTRY_SAME(ni_syscall) /* was nfsservctl */
ENTRY_SAME(setresgid) /* 170 */
ENTRY_SAME(getresgid)
ENTRY_SAME(prctl)
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index f6736b7..fa0d27a 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -171,7 +171,7 @@ SYSCALL_SPU(setresuid)
SYSCALL_SPU(getresuid)
SYSCALL(ni_syscall)
SYSCALL_SPU(poll)
-COMPAT_SYS(nfsservctl)
+SYSCALL(ni_syscall)
SYSCALL_SPU(setresgid)
SYSCALL_SPU(getresgid)
COMPAT_SYS_SPU(prctl)
--
1.7.5.4
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
^ permalink raw reply related
* Re: kvm PCI assignment & VFIO ramblings
From: Avi Kivity @ 2011-08-28 14:04 UTC (permalink / raw)
To: Joerg Roedel
Cc: Alex Williamson, Alexey Kardashevskiy, kvm@vger.kernel.org,
Paul Mackerras, Roedel, Joerg, qemu-devel, Alexander Graf, chrisw,
iommu, Anthony Liguori, linux-pci@vger.kernel.org, linuxppc-dev,
benve@cisco.com
In-Reply-To: <20110828135632.GG8978@8bytes.org>
On 08/28/2011 04:56 PM, Joerg Roedel wrote:
> On Sun, Aug 28, 2011 at 04:14:00PM +0300, Avi Kivity wrote:
> > On 08/26/2011 12:24 PM, Roedel, Joerg wrote:
>
> >> The biggest problem with this approach is that it has to happen in the
> >> context of the given process. Linux can't really modify an mm which
> >> belongs to another context in a safe way.
> >>
> >
> > Is use_mm() insufficient?
>
> Yes, it introduces a set of race conditions when a process that already
> has an mm wants to take over another process's mm temporarily (and when
> use_mm is modified to actually provide this functionality). It is only
> safe when used from kernel-thread context.
>
> One example:
>
> Process A          Process B               Process C
>     .                  .                       .
>     .          <--  takes A->mm                .
>     .               and assigns as B->mm       .
>     .                  .                 -->  Wants to take
>     .                  .                      B->mm, but gets
>                                               A->mm now
Good catch.
>
> This can't be secured by a lock, because it introduces a potential
> A->B<-->B->A lock problem when two processes try to take each other's mm.
> It could probably be solved by a task->real_mm pointer, haven't thought
> about this yet...
>
Or a workqueue - you get a kernel thread context with a bit of boilerplate.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* Re: kvm PCI assignment & VFIO ramblings
From: Joerg Roedel @ 2011-08-28 13:56 UTC (permalink / raw)
To: Avi Kivity
Cc: Alex Williamson, Alexey Kardashevskiy, kvm@vger.kernel.org,
Paul Mackerras, Roedel, Joerg, qemu-devel, Alexander Graf, chrisw,
iommu, Anthony Liguori, linux-pci@vger.kernel.org, linuxppc-dev,
benve@cisco.com
In-Reply-To: <4E5A3F18.7050903@redhat.com>
On Sun, Aug 28, 2011 at 04:14:00PM +0300, Avi Kivity wrote:
> On 08/26/2011 12:24 PM, Roedel, Joerg wrote:
>> The biggest problem with this approach is that it has to happen in the
>> context of the given process. Linux can't really modify an mm which
>> belongs to another context in a safe way.
>>
>
> Is use_mm() insufficient?
Yes, it introduces a set of race conditions when a process that already
has an mm wants to take over another process's mm temporarily (and when
use_mm is modified to actually provide this functionality). It is only
safe when used from kernel-thread context.
One example:
Process A          Process B               Process C
    .                  .                       .
    .          <--  takes A->mm                .
    .               and assigns as B->mm       .
    .                  .                 -->  Wants to take
    .                  .                      B->mm, but gets
                                              A->mm now
This can't be secured by a lock, because it introduces a potential
A->B<-->B->A lock problem when two processes try to take each other's mm.
It could probably be solved by a task->real_mm pointer, haven't thought
about this yet...
Joerg
^ permalink raw reply
* Re: kvm PCI assignment & VFIO ramblings
From: Avi Kivity @ 2011-08-28 13:14 UTC (permalink / raw)
To: Roedel, Joerg
Cc: Alex Williamson, Alexey Kardashevskiy, kvm@vger.kernel.org,
Paul Mackerras, linux-pci@vger.kernel.org, qemu-devel,
Alexander Graf, chrisw, iommu, Anthony Liguori, linuxppc-dev,
benve@cisco.com
In-Reply-To: <20110826092440.GO1923@amd.com>
On 08/26/2011 12:24 PM, Roedel, Joerg wrote:
> >
> > As I see it there are two options: (a) make subsequent accesses from
> > userspace or the guest result in either a SIGBUS that userspace must
> > either deal with or die, or (b) replace the mapping with a dummy RO
> > mapping containing 0xff, with any trapped writes emulated as nops.
>
> The biggest problem with this approach is that it has to happen in the
> context of the given process. Linux can't really modify an mm which
> belongs to another context in a safe way.
>
Is use_mm() insufficient?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* [PATCH] powerpc/fsl-booke: Handle L1 D-cache parity error correctly on e500mc
From: Kumar Gala @ 2011-08-27 11:18 UTC (permalink / raw)
To: linuxppc-dev
If the L1 D-Cache is in write shadow mode the HW will auto-recover the
error. However we might still log the error and cause a machine check
(if L1CSR0[CPE] - Cache error checking enable). We should only treat
the non-write shadow case as non-recoverable.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
---
arch/powerpc/include/asm/reg_booke.h | 3 +++
arch/powerpc/kernel/traps.c | 9 ++++++++-
2 files changed, 11 insertions(+), 1 deletions(-)
diff --git a/arch/powerpc/include/asm/reg_booke.h b/arch/powerpc/include/asm/reg_booke.h
index 2d8c920..9856452 100644
--- a/arch/powerpc/include/asm/reg_booke.h
+++ b/arch/powerpc/include/asm/reg_booke.h
@@ -551,6 +551,9 @@
#define L1CSR1_ICFI 0x00000002 /* Instr Cache Flash Invalidate */
#define L1CSR1_ICE 0x00000001 /* Instr Cache Enable */
+/* Bit definitions for L1CSR2. */
+#define L1CSR2_DCWS 0x40000000 /* Data Cache write shadow */
+
/* Bit definitions for L2CSR0. */
#define L2CSR0_L2E 0x80000000 /* L2 Cache Enable */
#define L2CSR0_L2PE 0x40000000 /* L2 Cache Parity/ECC Enable */
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 1a01414..a1a40f9 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -457,7 +457,14 @@ int machine_check_e500mc(struct pt_regs *regs)
if (reason & MCSR_DCPERR_MC) {
printk("Data Cache Parity Error\n");
- recoverable = 0;
+
+ /*
+ * In write shadow mode we auto-recover from the error, but it
+ * may still get logged and cause a machine check. We should
+ * only treat the non-write shadow case as non-recoverable.
+ */
+ if (!(mfspr(SPRN_L1CSR2) & L1CSR2_DCWS))
+ recoverable = 0;
}
if (reason & MCSR_L2MMU_MHIT) {
--
1.7.3.4
^ permalink raw reply related
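The decision made by the traps.c hunk above can be restated as a small standalone function. This is a sketch under stated assumptions: L1CSR2_DCWS uses the value from the patch, but DEMO_MCSR_DCPERR is a made-up bit that does not reflect the real MCSR layout, and the register contents are passed in as plain integers instead of being read with mfspr().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define L1CSR2_DCWS	0x40000000u	/* Data Cache write shadow (from the patch) */

/* Hypothetical stand-in for the machine check reason bit. */
#define DEMO_MCSR_DCPERR	0x1u

/*
 * Returns true if the machine check may be treated as recoverable.
 * 'reason' plays the role of MCSR and 'l1csr2' the role of the value
 * the kernel would obtain with mfspr(SPRN_L1CSR2).
 */
static bool demo_mcheck_recoverable(uint32_t reason, uint32_t l1csr2)
{
	bool recoverable = true;

	if (reason & DEMO_MCSR_DCPERR) {
		/* Only the non-write-shadow case is treated as fatal. */
		if (!(l1csr2 & L1CSR2_DCWS))
			recoverable = false;
	}
	return recoverable;
}
```

A data cache parity error with DCWS set stays recoverable, and the same error with DCWS clear does not, matching the intent stated in the commit message.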
* [PATCH] powerpc/fsl_msi: clean up and document calculation of MSIIR address
From: Timur Tabi @ 2011-08-27 0:28 UTC (permalink / raw)
To: kumar.gala, linuxppc-dev
Commit 3da34aae (powerpc/fsl: Support unique MSI addresses per PCIe Root
Complex) redefined the meanings of msi->msi_addr_hi and msi->msi_addr_lo to be
an offset rather than an address. To help clarify the code, we make the
following changes:
1) Get rid of msi_addr_hi, which is always zero anyway.
2) Rename msi_addr_lo to ccsr_msiir_offset, to indicate that it's an offset
relative to the beginning of CCSR.
3) Calculate 64-bit addresses using actual 64-bit math.
4) Document some of the code and assumptions we make.
Signed-off-by: Timur Tabi <timur@freescale.com>
---
arch/powerpc/sysdev/fsl_msi.c | 26 ++++++++++++++++++--------
arch/powerpc/sysdev/fsl_msi.h | 3 +--
2 files changed, 19 insertions(+), 10 deletions(-)
diff --git a/arch/powerpc/sysdev/fsl_msi.c b/arch/powerpc/sysdev/fsl_msi.c
index 419a772..d824230 100644
--- a/arch/powerpc/sysdev/fsl_msi.c
+++ b/arch/powerpc/sysdev/fsl_msi.c
@@ -30,7 +30,7 @@ LIST_HEAD(msi_head);
struct fsl_msi_feature {
u32 fsl_pic_ip;
- u32 msiir_offset;
+ u32 msiir_offset; /* offset of MSIIR, relative to start of MSI regs */
};
struct fsl_msi_cascade_data {
@@ -120,16 +120,23 @@ static void fsl_teardown_msi_irqs(struct pci_dev *pdev)
return;
}
+/*
+ * Initialize the address and data fields of an MSI message object
+ */
static void fsl_compose_msi_msg(struct pci_dev *pdev, int hwirq,
struct msi_msg *msg,
- struct fsl_msi *fsl_msi_data)
+ struct fsl_msi *msi_data)
{
- struct fsl_msi *msi_data = fsl_msi_data;
struct pci_controller *hose = pci_bus_to_host(pdev->bus);
- u64 base = fsl_pci_immrbar_base(hose);
- msg->address_lo = msi_data->msi_addr_lo + lower_32_bits(base);
- msg->address_hi = msi_data->msi_addr_hi + upper_32_bits(base);
+ /*
+ * The PCI address of MSIIR is equal to the PCI base address of CCSR
+ * plus the offset of MSIIR.
+ */
+ u64 addr = fsl_pci_immrbar_base(hose) + msi_data->ccsr_msiir_offset;
+
+ msg->address_hi = upper_32_bits(addr);
+ msg->address_lo = lower_32_bits(addr);
msg->data = hwirq;
@@ -359,8 +366,11 @@ static int __devinit fsl_of_msi_probe(struct platform_device *dev)
msi->irqhost->host_data = msi;
- msi->msi_addr_hi = 0x0;
- msi->msi_addr_lo = features->msiir_offset + (res.start & 0xfffff);
+ /*
+ * We assume that the 'reg' property of the MSI node contains an
+ * offset that has five (or fewer) digits, hence the 0xfffff.
+ */
+ msi->ccsr_msiir_offset = features->msiir_offset + (res.start & 0xfffff);
rc = fsl_msi_init_allocator(msi);
if (rc) {
diff --git a/arch/powerpc/sysdev/fsl_msi.h b/arch/powerpc/sysdev/fsl_msi.h
index 624580c..eb68c42 100644
--- a/arch/powerpc/sysdev/fsl_msi.h
+++ b/arch/powerpc/sysdev/fsl_msi.h
@@ -28,8 +28,7 @@ struct fsl_msi {
unsigned long cascade_irq;
- u32 msi_addr_lo;
- u32 msi_addr_hi;
+ u32 ccsr_msiir_offset; /* offset of MSIIR, relative to start of CCSR */
void __iomem *msi_regs;
u32 feature;
int msi_virqs[NR_MSI_REG];
--
1.7.3.4
^ permalink raw reply related
* Re: kvm PCI assignment & VFIO ramblings
From: Chris Wright @ 2011-08-26 21:06 UTC (permalink / raw)
To: Aaron Fabbri
Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
Roedel, Joerg, Alexander Graf, qemu-devel, Chris Wright, iommu,
Avi Kivity, Anthony Liguori, linux-pci@vger.kernel.org,
linuxppc-dev, benve@cisco.com
In-Reply-To: <CA7D4D51.FD84%aafabbri@cisco.com>
* Aaron Fabbri (aafabbri@cisco.com) wrote:
> On 8/26/11 12:35 PM, "Chris Wright" <chrisw@sous-sol.org> wrote:
> > * Aaron Fabbri (aafabbri@cisco.com) wrote:
> >> Each process will open vfio devices on the fly, and they need to be able to
> >> share IOMMU resources.
> >
> > How do you share IOMMU resources w/ multiple processes, are the processes
> > sharing memory?
>
> Sorry, bad wording. I share IOMMU domains *within* each process.
Ah, got it. Thanks.
> E.g. If one process has 3 devices and another has 10, I can get by with two
> iommu domains (and can share buffers among devices within each process).
>
> If I ever need to share devices across processes, the shared memory case
> might be interesting.
>
> >
> >> So I need the ability to dynamically bring up devices and assign them to a
> >> group. The number of actual devices and how they map to iommu domains is
> >> not known ahead of time. We have a single piece of silicon that can expose
> >> hundreds of pci devices.
> >
> > This does not seem fundamentally different from the KVM use case.
> >
> > We have 2 kinds of groupings.
> >
> > 1) low-level system or topology grouping
> >
> > Some may have multiple devices in a single group
> >
> > * the PCIe-PCI bridge example
> > * the POWER partitionable endpoint
> >
> > Many will not
> >
> > * singleton group, e.g. typical x86 PCIe function (majority of
> > assigned devices)
> >
> > Not sure it makes sense to have these administratively defined as
> > opposed to system defined.
> >
> > 2) logical grouping
> >
> > * multiple low-level groups (singleton or otherwise) attached to same
> > process, allowing things like single set of io page tables where
> > applicable.
> >
> > These are nominally administratively defined. In the KVM case, there
> > is likely a privileged task (i.e. libvirtd) involved w/ making the
> > device available to the guest and can do things like group merging.
> > In your userspace case, perhaps it should be directly exposed.
>
> Yes. In essence, I'd rather not have to run any other admin processes.
> Doing things programmatically, on the fly, from each process, is the
> cleanest model right now.
I don't see an issue w/ this. As long as it cannot add devices to the
system defined groups, it's not a privileged operation. So we still
need the iommu domain concept exposed in some form to logically put
groups into a single iommu domain (if desired). In fact, I believe Alex
covered this in his most recent recap:
...The group fd will provide interfaces for enumerating the devices
in the group, returning a file descriptor for each device in the group
(the "device fd"), binding groups together, and returning a file
descriptor for iommu operations (the "iommu fd").
thanks,
-chris
^ permalink raw reply
* [PATCH] powerpc/eeh: fix /proc/ppc64/eeh creation
From: Thadeu Lima de Souza Cascardo @ 2011-08-26 20:36 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: linux-kernel, Paul Mackerras, Thadeu Lima de Souza Cascardo,
linuxppc-dev, Breno Leitao
Since commit 188917e183cf9ad0374b571006d0fc6d48a7f447, /proc/ppc64 is a
symlink to /proc/powerpc/. That means that creating /proc/ppc64/eeh will
end up with an inaccessible file that is not listed under /proc/powerpc/
and, therefore, not listed under /proc/ppc64/.
Creating /proc/powerpc/eeh fixes that problem and maintains the
compatibility intended with the ppc64 symlink.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
---
arch/powerpc/platforms/pseries/eeh.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/eeh.c b/arch/powerpc/platforms/pseries/eeh.c
index ada6e07..d42f37d 100644
--- a/arch/powerpc/platforms/pseries/eeh.c
+++ b/arch/powerpc/platforms/pseries/eeh.c
@@ -1338,7 +1338,7 @@ static const struct file_operations proc_eeh_operations = {
static int __init eeh_init_proc(void)
{
if (machine_is(pseries))
- proc_create("ppc64/eeh", 0, NULL, &proc_eeh_operations);
+ proc_create("powerpc/eeh", 0, NULL, &proc_eeh_operations);
return 0;
}
__initcall(eeh_init_proc);
--
1.7.4.4
^ permalink raw reply related
* Re: kvm PCI assignment & VFIO ramblings
From: Chris Wright @ 2011-08-26 19:35 UTC (permalink / raw)
To: Aaron Fabbri
Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
linux-pci@vger.kernel.org, Alexander Graf, qemu-devel, chrisw,
iommu, Avi Kivity, Anthony Liguori, Roedel, Joerg, linuxppc-dev,
benve@cisco.com
In-Reply-To: <CA7D2B86.FD79%aafabbri@cisco.com>
* Aaron Fabbri (aafabbri@cisco.com) wrote:
> On 8/26/11 7:07 AM, "Alexander Graf" <agraf@suse.de> wrote:
> > Forget the KVM case for a moment and think of a user space device driver. I as
> > a user am not root. But I as a user when having access to /dev/vfioX want to
> > be able to access the device and manage it - and only it. The admin of that
> > box needs to set it up properly for me to be able to access it.
> >
> > So having two steps is really the correct way to go:
> >
> > * create VFIO group
> > * use VFIO group
> >
> > because the two are done by completely different users.
>
> This is not the case for my userspace drivers using VFIO today.
>
> Each process will open vfio devices on the fly, and they need to be able to
> share IOMMU resources.
How do you share IOMMU resources w/ multiple processes, are the processes
sharing memory?
> So I need the ability to dynamically bring up devices and assign them to a
> group. The number of actual devices and how they map to iommu domains is
> not known ahead of time. We have a single piece of silicon that can expose
> hundreds of pci devices.
This does not seem fundamentally different from the KVM use case.
We have 2 kinds of groupings.
1) low-level system or topology grouping
   Some may have multiple devices in a single group
     * the PCIe-PCI bridge example
     * the POWER partitionable endpoint
   Many will not
     * singleton group, e.g. typical x86 PCIe function (majority of
       assigned devices)
   Not sure it makes sense to have these administratively defined as
   opposed to system defined.
2) logical grouping
     * multiple low-level groups (singleton or otherwise) attached to same
       process, allowing things like single set of io page tables where
       applicable.
   These are nominally administratively defined. In the KVM case, there
   is likely a privileged task (i.e. libvirtd) involved w/ making the
   device available to the guest and can do things like group merging.
   In your userspace case, perhaps it should be directly exposed.
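[Editorial sketch, not part of the original message.] The two levels of grouping can be modelled as system-defined low-level groups that start out as singleton logical groups and are later merged. The C below is only an illustration under that assumption: `struct group_table`, `groups_init`, `groups_find`, `groups_merge` and `groups_domains` are hypothetical names, not the VFIO interface, and the merge is a plain union-find rather than anything the thread has agreed on.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GROUPS 16	/* hypothetical limit, chosen only for this sketch */

/* One entry per low-level (system-defined) group. */
struct group_table {
	int parent[MAX_GROUPS];	/* union-find parent links */
	size_t count;		/* number of low-level groups */
};

/* Every low-level group starts as its own logical group. */
void groups_init(struct group_table *t, size_t count)
{
	assert(count <= MAX_GROUPS);
	t->count = count;
	for (size_t i = 0; i < count; i++)
		t->parent[i] = (int)i;
}

/* Return the logical group (root) that low-level group g belongs to. */
int groups_find(struct group_table *t, int g)
{
	assert(g >= 0 && (size_t)g < t->count);
	while (t->parent[g] != g) {
		t->parent[g] = t->parent[t->parent[g]];	/* path halving */
		g = t->parent[g];
	}
	return g;
}

/* Merge the logical groups of a and b so they share one set of io page tables. */
void groups_merge(struct group_table *t, int a, int b)
{
	int ra = groups_find(t, a);
	int rb = groups_find(t, b);

	if (ra != rb)
		t->parent[rb] = ra;
}

/* Number of distinct logical groups, i.e. sets of io page tables needed. */
size_t groups_domains(struct group_table *t)
{
	size_t n = 0;

	for (size_t i = 0; i < t->count; i++)
		if (groups_find(t, (int)i) == (int)i)
			n++;
	return n;
}
```

In this model the low-level table is fixed by the system, while `groups_merge()` stands in for the step that either a privileged task or the application itself would perform.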
> In my case, the only administrative task would be to give my processes/users
> access to the vfio groups (which are initially singletons), and the
> application actually opens them and needs the ability to merge groups
> together to conserve IOMMU resources (assuming we're not going to expose
> uiommu).
I agree, we definitely need to expose _some_ way to do this.
thanks,
-chris
^ permalink raw reply
* Re: kvm PCI assignment & VFIO ramblings
From: Aaron Fabbri @ 2011-08-26 20:17 UTC (permalink / raw)
To: Chris Wright
Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
linux-pci@vger.kernel.org, Alexander Graf, qemu-devel, iommu,
Avi Kivity, Anthony Liguori, Roedel, Joerg, linuxppc-dev,
benve@cisco.com
In-Reply-To: <20110826193559.GD13060@sequoia.sous-sol.org>
On 8/26/11 12:35 PM, "Chris Wright" <chrisw@sous-sol.org> wrote:
> * Aaron Fabbri (aafabbri@cisco.com) wrote:
>> On 8/26/11 7:07 AM, "Alexander Graf" <agraf@suse.de> wrote:
>>> Forget the KVM case for a moment and think of a user space device driver. I
>>> as
>>> a user am not root. But I as a user when having access to /dev/vfioX want to
>>> be able to access the device and manage it - and only it. The admin of that
>>> box needs to set it up properly for me to be able to access it.
>>>
>>> So having two steps is really the correct way to go:
>>>
>>> * create VFIO group
>>> * use VFIO group
>>>
>>> because the two are done by completely different users.
>>
>> This is not the case for my userspace drivers using VFIO today.
>>
>> Each process will open vfio devices on the fly, and they need to be able to
>> share IOMMU resources.
>
> How do you share IOMMU resources w/ multiple processes, are the processes
> sharing memory?
Sorry, bad wording. I share IOMMU domains *within* each process.
E.g. If one process has 3 devices and another has 10, I can get by with two
iommu domains (and can share buffers among devices within each process).
If I ever need to share devices across processes, the shared memory case
might be interesting.
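[Editorial sketch, not part of the original message.] The arithmetic behind the example can be written out in C. It assumes that an unmerged singleton group consumes one iommu domain per device, and that merging leaves one domain per process; the function names `domains_per_device` and `domains_per_process` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Domains needed if every device keeps its own singleton group (assumed: one domain each). */
size_t domains_per_device(const size_t *devs_per_proc, size_t nprocs)
{
	size_t total = 0;

	assert(devs_per_proc != NULL || nprocs == 0);
	for (size_t i = 0; i < nprocs; i++)
		total += devs_per_proc[i];
	return total;
}

/* Domains needed if each process merges all of its devices into one domain. */
size_t domains_per_process(const size_t *devs_per_proc, size_t nprocs)
{
	size_t total = 0;

	assert(devs_per_proc != NULL || nprocs == 0);
	for (size_t i = 0; i < nprocs; i++)
		if (devs_per_proc[i] > 0)	/* a process with no devices needs no domain */
			total++;
	return total;
}
```

With the two processes from the example (3 and 10 devices), the first function gives 13 domains and the second gives 2.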
>
>> So I need the ability to dynamically bring up devices and assign them to a
>> group. The number of actual devices and how they map to iommu domains is
>> not known ahead of time. We have a single piece of silicon that can expose
>> hundreds of pci devices.
>
> This does not seem fundamentally different from the KVM use case.
>
> We have 2 kinds of groupings.
>
> 1) low-level system or topology grouping
>
> Some may have multiple devices in a single group
>
> * the PCIe-PCI bridge example
> * the POWER partitionable endpoint
>
> Many will not
>
> * singleton group, e.g. typical x86 PCIe function (majority of
> assigned devices)
>
> Not sure it makes sense to have these administratively defined as
> opposed to system defined.
>
> 2) logical grouping
>
> * multiple low-level groups (singleton or otherwise) attached to same
> process, allowing things like single set of io page tables where
> applicable.
>
> These are nominally administratively defined. In the KVM case, there
> is likely a privileged task (i.e. libvirtd) involved w/ making the
> device available to the guest and can do things like group merging.
> In your userspace case, perhaps it should be directly exposed.
Yes. In essence, I'd rather not have to run any other admin processes.
Doing things programmatically, on the fly, from each process, is the
cleanest model right now.
>
>> In my case, the only administrative task would be to give my processes/users
>> access to the vfio groups (which are initially singletons), and the
>> application actually opens them and needs the ability to merge groups
>> together to conserve IOMMU resources (assuming we're not going to expose
>> uiommu).
>
> I agree, we definitely need to expose _some_ way to do this.
>
> thanks,
> -chris
^ permalink raw reply