From: Alex Williamson <alex.williamson@redhat.com>
To: "Christian Benvenuti (benve)" <benve@cisco.com>
Cc: "Aaron Fabbri (aafabbri)" <aafabbri@cisco.com>,
aik@au1.ibm.com, kvm@vger.kernel.org, pmac@au1.ibm.com,
qemu-devel@nongnu.org, joerg.roedel@amd.com,
konrad.wilk@oracle.com, agraf@suse.de, dwg@au1.ibm.com,
chrisw@sous-sol.org, B08248@freescale.com,
iommu@lists.linux-foundation.org, avi@redhat.com,
linux-pci@vger.kernel.org, B07421@freescale.com
Subject: Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
Date: Fri, 11 Nov 2011 11:04:07 -0700
Message-ID: <1321034647.2682.98.camel@bling.home>
In-Reply-To: <184D23435BECB444AB6B9D4630C8EC8302D5FE38@XMB-RCD-303.cisco.com>
On Wed, 2011-11-09 at 18:57 -0600, Christian Benvenuti (benve) wrote:
> Here are few minor comments on vfio_iommu.c ...
Sorry, I've been poking sticks at trying to figure out a clean way to
solve the force vfio driver attach problem.
> > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> > new file mode 100644
> > index 0000000..029dae3
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_iommu.c
<snip>
> > +
> > +#include "vfio_private.h"
>
> Doesn't the 'dma_' prefix belong to the generic DMA code?
Sure, we could make these more vfio-centric.
> > +struct dma_map_page {
> > + struct list_head list;
> > + dma_addr_t daddr;
> > + unsigned long vaddr;
> > + int npage;
> > + int rdwr;
> > +};
> > +
> > +/*
> > + * This code handles mapping and unmapping of user data buffers
> > + * into DMA'ble space using the IOMMU
> > + */
> > +
> > +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT)
> > +
> > +struct vwork {
> > + struct mm_struct *mm;
> > + int npage;
> > + struct work_struct work;
> > +};
> > +
> > +/* delayed decrement for locked_vm */
> > +static void vfio_lock_acct_bg(struct work_struct *work)
> > +{
> > + struct vwork *vwork = container_of(work, struct vwork, work);
> > + struct mm_struct *mm;
> > +
> > + mm = vwork->mm;
> > + down_write(&mm->mmap_sem);
> > + mm->locked_vm += vwork->npage;
> > + up_write(&mm->mmap_sem);
> > + mmput(mm); /* unref mm */
> > + kfree(vwork);
> > +}
> > +
> > +static void vfio_lock_acct(int npage)
> > +{
> > + struct vwork *vwork;
> > + struct mm_struct *mm;
> > +
> > + if (!current->mm) {
> > + /* process exited */
> > + return;
> > + }
> > + if (down_write_trylock(&current->mm->mmap_sem)) {
> > + current->mm->locked_vm += npage;
> > + up_write(&current->mm->mmap_sem);
> > + return;
> > + }
> > + /*
> > + * Couldn't get mmap_sem lock, so must setup to decrement
> ^^^^^^^^^
>
> Increment?
Yep
<snip>
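(For context: the snipped remainder of vfio_lock_acct() defers the
accounting to the workqueue, roughly as below -- a sketch reconstructed
from struct vwork and vfio_lock_acct_bg() above, not a quote of the
patch, so error handling details may differ:

        vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
        if (!vwork)
                return;
        /* mm reference is dropped by mmput() in the worker */
        mm = get_task_mm(current);
        if (!mm) {
                kfree(vwork);
                return;
        }
        INIT_WORK(&vwork->work, vfio_lock_acct_bg);
        vwork->mm = mm;
        vwork->npage = npage;
        schedule_work(&vwork->work);
)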
> > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> > + size_t size, struct dma_map_page *mlp)
> > +{
> > + struct dma_map_page *split;
> > + int npage_lo, npage_hi;
> > +
> > + /* Existing dma region is completely covered, unmap all */
>
> This works. However, given how vfio_dma_map_dm implements the merging
> logic, I think it is impossible to have
>
> (start < mlp->daddr &&
> start + size > mlp->daddr + NPAGE_TO_SIZE(mlp->npage))
It's quite possible. This allows userspace to create a sparse mapping,
then blow it all away with a single unmap from 0 to ~0.
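For instance, userspace could do something like the following sketch.
The field names mirror this RFC's struct vfio_dma_map as used by
vfio_dma_map_dm(); the struct layout and the ioctl request number here
are placeholders, not the definitions from vfio.h:

        #include <stdint.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/ioctl.h>

        /* Guessed layout -- only the field names come from the patch. */
        struct vfio_dma_map {
                uint64_t vaddr;         /* process virtual address */
                uint64_t dmaaddr;       /* IOVA */
                uint64_t size;          /* bytes, page aligned */
                uint64_t flags;
        };

        /* Placeholder request code; the real one comes from vfio.h. */
        #define VFIO_IOMMU_UNMAP_DMA    _IO(';', 102)

        static int unmap_everything(int iommu_fd)
        {
                struct vfio_dma_map dm;

                memset(&dm, 0, sizeof(dm));
                dm.dmaaddr = 0;                         /* bottom of the IOVA space... */
                dm.size = ~(uint64_t)0 & ~0xfffULL;     /* ...to (nearly) the top */

                /*
                 * Every entry on the iommu's dm_list that intersects this
                 * range gets removed or trimmed by vfio_remove_dma_overlap(),
                 * however sparsely the mappings were originally created.
                 */
                return ioctl(iommu_fd, VFIO_IOMMU_UNMAP_DMA, &dm);
        }

So the "completely covered" case is reachable whenever the unmap range
is simply bigger than the mapping it hits.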
> > + if (start <= mlp->daddr &&
> > + start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > + list_del(&mlp->list);
> > + npage_lo = mlp->npage;
> > + kfree(mlp);
> > + return npage_lo;
> > + }
> > +
> > + /* Overlap low address of existing range */
>
> Same as above (ie, '<' is impossible)
existing: |<--- A --->| |<--- B --->|
unmap: |<--- C --->|
Maybe not good practice from userspace, but we shouldn't count on
userspace to be well behaved.
> > + if (start <= mlp->daddr) {
> > + size_t overlap;
> > +
> > + overlap = start + size - mlp->daddr;
> > + npage_lo = overlap >> PAGE_SHIFT;
> > + npage_hi = mlp->npage - npage_lo;
> > +
> > + vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> > + mlp->daddr += overlap;
> > + mlp->vaddr += overlap;
> > + mlp->npage -= npage_lo;
> > + return npage_lo;
> > + }
>
> Same as above (ie, '>' is impossible).
Same example as above.
> > + /* Overlap high address of existing range */
> > + if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > + size_t overlap;
> > +
> > + overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> > + npage_hi = overlap >> PAGE_SHIFT;
> > + npage_lo = mlp->npage - npage_hi;
> > +
> > + vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> > + mlp->npage -= npage_hi;
> > + return npage_hi;
> > + }
<snip>
> > +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > +{
> > + int npage;
> > + struct dma_map_page *mlp, *mmlp = NULL;
> > + dma_addr_t daddr = dmp->dmaaddr;
> > + unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> > + size_t size = dmp->size;
> > + int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> > +
> > + if (vaddr & (PAGE_SIZE-1))
> > + return -EINVAL;
> > + if (daddr & (PAGE_SIZE-1))
> > + return -EINVAL;
> > + if (size & (PAGE_SIZE-1))
> > + return -EINVAL;
> > +
> > + npage = size >> PAGE_SHIFT;
> > + if (!npage)
> > + return -EINVAL;
> > +
> > + if (!iommu)
> > + return -EINVAL;
> > +
> > + mutex_lock(&iommu->dgate);
> > +
> > + if (vfio_find_dma(iommu, daddr, size)) {
> > + ret = -EBUSY;
> > + goto out_lock;
> > + }
> > +
> > + /* account for locked pages */
> > + locked = current->mm->locked_vm + npage;
> > + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> > + printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> > + __func__, rlimit(RLIMIT_MEMLOCK));
> > + ret = -ENOMEM;
> > + goto out_lock;
> > + }
> > +
> > + ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);
> > + if (ret)
> > + goto out_lock;
> > +
> > + /* Check if we abut a region below */
>
> Is !daddr possible?
Sure, an IOVA of 0x0. There's no region below if we start at zero.
> > + if (daddr) {
> > + mlp = vfio_find_dma(iommu, daddr - 1, 1);
> > + if (mlp && mlp->rdwr == rdwr &&
> > + mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {
> > +
> > + mlp->npage += npage;
> > + daddr = mlp->daddr;
> > + vaddr = mlp->vaddr;
> > + npage = mlp->npage;
> > + size = NPAGE_TO_SIZE(npage);
> > +
> > + mmlp = mlp;
> > + }
> > + }
>
> Is !(daddr + size) possible?
Same, there's no region above if this region goes to the top of the
address space, i.e. 0xffffffff_fffff000 + 0x1000.
Hmm, wonder if I'm missing a check for wrapping.
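Something like this next to the existing alignment checks at the top of
vfio_dma_map_dm() would catch it (untested, just a sketch of the check
being mulled over, not part of the posted patch):

        /* Reject ranges that wrap past the top of the address space,
         * e.g. daddr = 0xffffffff_fffff000 with size = 0x2000. */
        if (daddr + size < daddr)
                return -EINVAL;
        if (vaddr + size < vaddr)
                return -EINVAL;

Since size is already known to be non-zero and page aligned, a wrapped
range shows up as the sum coming back around below the base.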
> > + if (daddr + size) {
> > + mlp = vfio_find_dma(iommu, daddr + size, 1);
> > + if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size) {
> > +
> > + mlp->npage += npage;
> > + mlp->daddr = daddr;
> > + mlp->vaddr = vaddr;
> > +
> > + /* If merged above and below, remove previously
> > + * merged entry. New entry covers it. */
> > + if (mmlp) {
> > + list_del(&mmlp->list);
> > + kfree(mmlp);
> > + }
> > + mmlp = mlp;
> > + }
> > + }
> > +
> > + if (!mmlp) {
> > + mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> > + if (!mlp) {
> > + ret = -ENOMEM;
> > + vfio_dma_unmap(iommu, daddr, npage, rdwr);
> > + goto out_lock;
> > + }
> > +
> > + mlp->npage = npage;
> > + mlp->daddr = daddr;
> > + mlp->vaddr = vaddr;
> > + mlp->rdwr = rdwr;
> > + list_add(&mlp->list, &iommu->dm_list);
> > + }
> > +
> > +out_lock:
> > + mutex_unlock(&iommu->dgate);
> > + return ret;
> > +}
> > +
> > +static int vfio_iommu_release(struct inode *inode, struct file *filep)
> > +{
> > + struct vfio_iommu *iommu = filep->private_data;
> > +
> > + vfio_release_iommu(iommu);
> > + return 0;
> > +}
> > +
> > +static long vfio_iommu_unl_ioctl(struct file *filep,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + struct vfio_iommu *iommu = filep->private_data;
> > + int ret = -ENOSYS;
>
> Any reason for not using "switch" ?
It got ugly in vfio_main, so I decided to be consistent w/ it in the
driver and use if/else here too. I don't like the aesthetics of extra
{}s to declare variables within a switch, nor do I like declaring all
the variables for each case for the whole function. Personal quirk.
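Purely for comparison, the switch form with the scoping braces I'm
referring to would look roughly like this (MAP_DMA body condensed to a
single call, so it's a sketch rather than a drop-in replacement):

        static long vfio_iommu_unl_ioctl(struct file *filep,
                                         unsigned int cmd, unsigned long arg)
        {
                struct vfio_iommu *iommu = filep->private_data;

                switch (cmd) {
                case VFIO_IOMMU_GET_FLAGS: {
                        u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;

                        /* same put_user() as the if/else version below */
                        return put_user(flags, (u64 __user *)arg);
                }
                case VFIO_IOMMU_MAP_DMA: {
                        struct vfio_dma_map dm;

                        if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
                                return -EFAULT;
                        return vfio_dma_map_dm(iommu, &dm);
                }
                default:
                        return -ENOSYS;
                }
        }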
> > + if (cmd == VFIO_IOMMU_GET_FLAGS) {
> > + u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> > +
> > + ret = put_user(flags, (u64 __user *)arg);
> > +
> > + } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> > + struct vfio_dma_map dm;
> > +
> > + if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> > + return -EFAULT;
>
> What does the "_dm" suffix stand for?
Inherited from Tom, but I figure _dma_map_dm = action(dma map),
object(dm), which is a vfio_Dma_Map.
Thanks,
Alex